# Why is my program slow when looping over exactly 8192 elements?

The difference is caused by the same super-alignment issue from the following related questions:

• Why is transposing a matrix of 512×512 much slower than transposing a matrix of 513×513?
• Matrix multiplication: Small difference in matrix size, large difference in timings

But that’s only because there’s one other problem with the code.

Starting from the original loop:

for(i=1;i<SIZE-1;i++)
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
for(k=-1;k<2;k++)
for(l=-1;l<2;l++)
res[j][i] += img[j+l][i+k];
res[j][i] /= 9;
}


First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j  ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i  ];
res[j][i] += img[j  ][i  ];
res[j][i] += img[j+1][i  ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j  ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}


So that leaves the two outer-loops that we’re interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.

To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
for(i=1;i<SIZE-1;i++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j  ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i  ];
res[j][i] += img[j  ][i  ];
res[j][i] += img[j+1][i  ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j  ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}


This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.

Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds


Interchanged Outer-Loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds


Categories c++