Why is __int128_t faster than long long on x86-64 GCC?

The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case. Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, … Read more

Divide by 10 using bit shifts?

Editor’s note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182 (Godbolt). It is exact for inputs smaller than 0x40000005, though, which may be sufficient for some uses. Compilers (including MSVC) do use fixed-point multiplicative inverses for constant … Read more

Assembly Language – How to do Modulo?

If your modulus / divisor is a known constant, and you care about performance, see this and this. A multiplicative inverse is even possible for loop-invariant values that aren’t known until runtime, e.g. see https://libdivide.com/ (But without JIT code-gen, that’s less efficient than hard-coding just the steps necessary for one constant.) Never use div for … Read more