C code loop performance

I noticed in the comments that the loop takes 5 cycles to execute, while it is “supposed” to take 4 cycles (since there are 4 adds and 4 multiplies). However, your assembly shows 5 SSE movssl instructions. According to Agner Fog’s tables, all floating-point SSE move instructions have a reciprocal throughput of at least 1 cycle per instruction on Nehalem. Since you …

How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?

PAUSE notifies the CPU that this is a spinlock wait loop, so memory and cache accesses may be optimized. See also “pause instruction in x86” for more details about avoiding the memory-order mis-speculation when leaving the spin loop. PAUSE may actually stop the CPU for some time to save power. Older CPUs decode it as REP …
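To make the spin-wait idea concrete, here is a minimal sketch of a test-and-test-and-set spinlock that issues PAUSE while it waits. It assumes GCC or Clang with C11 atomics; the names `demo_spinlock`, `spin_init`, `spin_trylock`, `spin_lock`, `spin_unlock`, and `cpu_relax` are hypothetical and chosen only for illustration. `_mm_pause()` is the compiler intrinsic for the PAUSE instruction, and on non-x86 targets the sketch falls back to a no-op.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* On x86, _mm_pause() emits the PAUSE instruction.
 * On other targets this sketch simply does nothing while spinning. */
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)
#endif

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_bool locked;
} demo_spinlock;

static void spin_init(demo_spinlock *l) {
    atomic_init(&l->locked, false);
}

/* Try once: returns true if the lock was free and is now held by us. */
static bool spin_trylock(demo_spinlock *l) {
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void spin_lock(demo_spinlock *l) {
    for (;;) {
        if (spin_trylock(l)) {
            return;
        }
        /* Wait with plain loads so the cache line is not bounced around
         * by repeated writes, and hint the CPU that this is a spin loop. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            cpu_relax();
        }
    }
}

static void spin_unlock(demo_spinlock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A caller would wrap a short critical section in `spin_lock(&l)` and `spin_unlock(&l)`. The inner read-only loop is where PAUSE goes: it runs only while another thread holds the lock.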

Are there any smart cases of runtime code modification?

There are many valid cases for code modification. Generating code at run time can be useful in several situations. Some virtual machines use JIT compilation to improve performance. Generating specialized functions on the fly has long been common in computer graphics. See e.g. Rob Pike, Bart Locanthi, and John Reiser, Hardware Software Tradeoffs for Bitmap Graphics …

Why do ARM chips have an instruction with Javascript in the name (FJCVTZS)?

This is because JS uses double precision for its numbers, but performing bitwise operations on them is nontrivial, so a dedicated instruction that converts a JS double into an integer makes this easier. This ARM link explains it very well: https://community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions To add more information regarding fuz’s comment, the …