Why does vectorizing a loop over 64-bit elements not improve performance on large buffers?

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date. See the end of this answer for the 2017 update. Original Answer (2013): Because you’re bottlenecked by memory bandwidth. While vectorization and other micro-optimizations can improve the speed of computation, … Read more
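
As a minimal sketch of that bottleneck (illustrative names; the buffer is assumed to be much larger than the last-level cache), a scalar and an SSE2 version of a simple pass over 64-bit elements end up limited by the same memory bandwidth:

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Scalar: add a constant to every 64-bit element. */
void add_scalar(uint64_t *buf, size_t n, uint64_t k)
{
    for (size_t i = 0; i < n; i++)
        buf[i] += k;
}

/* SSE2: two 64-bit adds per instruction. For a buffer much larger than
   the last-level cache, both versions are limited by how fast memory can
   stream data in and out, so the SIMD version shows little or no gain. */
void add_sse2(uint64_t *buf, size_t n, uint64_t k)
{
    __m128i vk = _mm_set1_epi64x((long long)k);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_add_epi64(v, vk));
    }
    for (; i < n; i++)   /* scalar tail */
        buf[i] += k;
}
```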

CPU SIMD vs GPU SIMD?

Both CPUs & GPUs provide SIMD, with the most standard conceptual unit being 16 bytes/128 bits; for example, a vector of 4 floats (x,y,z,w). Simplifying: CPUs then parallelize more through pipelining future instructions so they proceed faster through a program. The next step is multiple cores, which run independent programs. GPUs on the other hand … Read more
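
A small illustration of that 128-bit unit using SSE intrinsics (the function name is just for illustration): one instruction operates on all four floats of an (x, y, z, w) vector at once.

```c
#include <xmmintrin.h>   /* SSE: 128-bit vectors of 4 floats */

/* One 128-bit register holds (x, y, z, w); a single addps instruction
   adds all four lanes at once. */
__m128 add_vec4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* Usage: __m128 a = _mm_set_ps(w, z, y, x);
   (note that _mm_set_ps lists lanes high-to-low) */
```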

Why is strcmp not SIMD optimized?

In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte. If you scan for the length of the string you have already accomplished most of the … Read more
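
A hedged sketch of the usual workaround (not the actual strcmp/strlen from any particular libc): aligned 16-byte loads cannot cross a page boundary, so a SIMD scan for the terminating zero byte only touches pages the string already occupies. `__builtin_ctz` here is GCC/Clang-specific.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Find the string length 16 bytes at a time without faulting: once the
   pointer is 16-byte aligned, each load stays within a single page. */
size_t strlen_sse2(const char *s)
{
    const char *p = s;
    /* Scalar until 16-byte aligned (a real implementation would handle
       this first block with a masked compare instead). */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);  /* aligned load */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```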

Difference between MOVDQA and MOVAPS x86 instructions?

In functionality, they are identical. On some (but not all) micro-architectures, there are timing differences due to “domain crossing penalties”. For this reason, one should generally use movdqa when the data is being used with integer SSE instructions, and movaps when the data is being used with floating-point instructions. For more information on this subject, … Read more
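
For illustration (compilers are free to pick either encoding), the intrinsic you use normally determines which instruction you get:

```c
#include <emmintrin.h>   /* SSE2 (also pulls in SSE) */

/* Integer data: _mm_load_si128 typically compiles to movdqa,
   keeping the value in the integer domain for paddd etc. */
__m128i load_int(const __m128i *p) { return _mm_load_si128(p); }

/* Float data: _mm_load_ps typically compiles to movaps,
   keeping the value in the floating-point domain for addps etc. */
__m128 load_float(const float *p)  { return _mm_load_ps(p); }
```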

AVX2 what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.) We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle. We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we … Read more
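
A sketch of the technique described above, adapted from the approach in that answer (assumes AVX2 + BMI2 and a 64-bit build; the name compress256 is illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

/* Left-pack with vpermps + pext: 'mask' has one bit per float lane
   (e.g. from _mm256_movemask_ps); lanes whose bit is set are packed
   toward the low elements of the result. */
__m256 compress256(__m256 src, unsigned int mask)
{
    /* Expand each mask bit to a full byte: bit i -> byte i = 0x00 or 0xFF. */
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;

    /* Identity shuffle indices 0..7, one per byte; pext keeps only the
       bytes of the selected lanes, packing them to the bottom. */
    const uint64_t identity_indices = 0x0706050403020100ULL;
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m128i bytevec  = _mm_cvtsi64_si128((long long)wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);  /* bytes -> dword indices */

    return _mm256_permutevar8x32_ps(src, shufmask);    /* lane-crossing shuffle */
}
```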

How to determine if memory is aligned?

#define is_aligned(POINTER, BYTE_COUNT) \ (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0) The cast to void * (or, equivalently, char *) is necessary because the standard only guarantees an invertible conversion to uintptr_t for void *. If you want type safety, consider using an inline function: static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count) { … Read more
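
The excerpt is cut off mid-function; a plausible completion of that inline form, with a usage note, might look like this (the body simply mirrors the macro above):

```c
#include <stdint.h>
#include <stddef.h>

/* Type-safe inline form of the alignment check from the excerpt. */
static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count)
{
    return (uintptr_t)pointer % byte_count == 0;
}

/* Usage: is_aligned(ptr, 16) is true if ptr is suitable for e.g. movaps. */
```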

Where can I find an official reference listing the operation of SSE intrinsic functions?

As well as Intel’s vol.2 PDF manual, there is also an online intrinsics guide. The Intel® Intrinsics Guide contains reference information for Intel intrinsics, which provide access to Intel instructions such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2). It has a … Read more