The difference is caused by the same super-alignment issue from the following related questions:

- Why is transposing a matrix of 512×512 much slower than transposing a matrix of 513×513?
- Matrix multiplication: Small difference in matrix size, large difference in timings

But that’s only because there’s one other problem with the code.

Starting from the original loop:

```
for(i=1;i<SIZE-1;i++)
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
for(k=-1;k<2;k++)
for(l=-1;l<2;l++)
res[j][i] += img[j+l][i+k];
res[j][i] /= 9;
}
```

First notice that the two inner loops are trivial. They can be unrolled as follows:

```
for(i=1;i<SIZE-1;i++) {
for(j=1;j<SIZE-1;j++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}
```

So that leaves the two outer-loops that we’re interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.

To solve this problem, you should interchange the two loops.

```
for(j=1;j<SIZE-1;j++) {
for(i=1;i<SIZE-1;i++) {
res[j][i]=0;
res[j][i] += img[j-1][i-1];
res[j][i] += img[j ][i-1];
res[j][i] += img[j+1][i-1];
res[j][i] += img[j-1][i ];
res[j][i] += img[j ][i ];
res[j][i] += img[j+1][i ];
res[j][i] += img[j-1][i+1];
res[j][i] += img[j ][i+1];
res[j][i] += img[j+1][i+1];
res[j][i] /= 9;
}
}
```

This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.

**Core i7 920 @ 3.5 GHz**

Original code:

```
8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds
```

Interchanged Outer-Loops:

```
8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds
```