C code loop performance

I noticed in the comments that:

The loop takes 5 cycles to execute.
It’s “supposed” to take 4 cycles (since there are 4 adds and 4 multiplies).

However, your assembly shows 5 SSE movssl instructions. According to Agner Fog’s tables, all floating-point SSE move instructions have a reciprocal throughput of at least 1 cycle per instruction on Nehalem.

Since you have 5 of them, you can’t do better than 5 cycles/iteration.

So to reach peak performance, you need to reduce the number of loads. I can’t immediately see how to do that in this particular case, but it might be possible.

One common approach is tiling, where you add nesting levels to improve locality. Although it’s used mostly to improve cache access, it can also be applied at the register level to reduce the number of loads/stores needed.

Ultimately, your goal is to make the number of loads smaller than the number of adds/muls, so this might be the way to go.
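
To illustrate register tiling (this is not your loop, just a minimal sketch under my own assumptions), take a small FIR-style kernel where y[i] is the sum over k of c[k]*x[i+k]. The names fir_naive and fir_tiled4 and the tile size of 4 are hypothetical choices for the example. The naive version issues 2 loads for every multiply-add. The tiled version computes 4 outputs at once and keeps a sliding window of x values in locals, so each inner iteration needs 2 loads for 4 multiplies and 4 adds.

```c
#include <stddef.h>

/* Baseline: each multiply-add loads both c[k] and x[i + k].
   x must hold n + taps - 1 elements (assumed layout for this sketch). */
void fir_naive(const float *x, const float *c, float *y,
               size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += c[k] * x[i + k];
        y[i] = acc;
    }
}

/* Register-tiled: 4 outputs per outer iteration.
   Per inner iteration: 2 loads (c[k] and one new x), 4 muls, 4 adds. */
void fir_tiled4(const float *x, const float *c, float *y,
                size_t n, size_t taps)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        float y0 = 0.0f, y1 = 0.0f, y2 = 0.0f, y3 = 0.0f;
        /* Preload the first three window values; the fourth is
           loaded inside the loop. */
        float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2];

        for (size_t k = 0; k < taps; k++) {
            float ck = c[k];          /* load 1: coefficient, reused 4 times */
            float x3 = x[i + k + 3];  /* load 2: the only new input value    */

            y0 += ck * x0;
            y1 += ck * x1;
            y2 += ck * x2;
            y3 += ck * x3;

            /* Slide the window by one element, register to register. */
            x0 = x1;
            x1 = x2;
            x2 = x3;
        }

        y[i]     = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }

    /* Leftover outputs when n is not a multiple of 4. */
    for (; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += c[k] * x[i + k];
        y[i] = acc;
    }
}
```

Whether the compiler really keeps the window and the accumulators in registers depends on the compiler and flags, so it is worth checking the generated assembly before trusting the load count.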
