Why does vectorizing a loop over 64-bit elements not improve performance on large buffers?

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date. See the end of this answer for the 2017 update. Original Answer (2013): Because you’re bottlenecked by memory bandwidth. While vectorization and other micro-optimizations can improve the speed of computation, … Read more
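
As a minimal sketch of that bottleneck (illustrative names; the buffer is assumed to be much larger than the last-level cache), a scalar and an SSE2 version of a simple pass over 64-bit elements end up limited by the same memory bandwidth:

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Scalar: add a constant to every 64-bit element. */
void add_scalar(uint64_t *buf, size_t n, uint64_t k)
{
    for (size_t i = 0; i < n; i++)
        buf[i] += k;
}

/* SSE2: two 64-bit adds per instruction. For a buffer much larger than
   the last-level cache, both versions are limited by how fast memory can
   stream data in and out, so the SIMD version shows little or no gain. */
void add_sse2(uint64_t *buf, size_t n, uint64_t k)
{
    __m128i vk = _mm_set1_epi64x((long long)k);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_add_epi64(v, vk));
    }
    for (; i < n; i++)   /* scalar tail */
        buf[i] += k;
}
```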

CPU SIMD vs GPU SIMD?

Both CPUs & GPUs provide SIMD, with the most standard conceptual unit being 16 bytes/128 bits; for example, a vector of 4 floats (x,y,z,w). Simplifying: CPUs then parallelize more through pipelining future instructions so they proceed faster through a program. The next step is multiple cores, which run independent programs. GPUs on the other hand … Read more
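
A small illustration of that 128-bit unit using SSE intrinsics (the function name is just for illustration): one instruction operates on all four floats of an (x, y, z, w) vector at once.

```c
#include <xmmintrin.h>   /* SSE: 128-bit vectors of 4 floats */

/* One 128-bit register holds (x, y, z, w); a single addps instruction
   adds all four lanes at once. */
__m128 add_vec4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* Usage: __m128 a = _mm_set_ps(w, z, y, x);
   (note that _mm_set_ps lists lanes high-to-low) */
```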

Why is strcmp not SIMD optimized?

In an SSE2 implementation, how should the compiler make sure that no memory accesses happen past the end of the string? It has to know the length first, and this requires scanning the string for the terminating zero byte. If you scan for the length of the string you have already accomplished most of the … Read more
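
A hedged sketch of the usual workaround (not the actual strcmp/strlen from any particular libc): aligned 16-byte loads cannot cross a page boundary, so a SIMD scan for the terminating zero byte only touches pages the string already occupies. `__builtin_ctz` here is GCC/Clang-specific.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Find the string length 16 bytes at a time without faulting: once the
   pointer is 16-byte aligned, each load stays within a single page. */
size_t strlen_sse2(const char *s)
{
    const char *p = s;
    /* Scalar until 16-byte aligned (a real implementation would handle
       this first block with a masked compare instead). */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);  /* aligned load */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```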

Difference between MOVDQA and MOVAPS x86 instructions?

In functionality, they are identical. On some (but not all) micro-architectures, there are timing differences due to “domain crossing penalties”. For this reason, one should generally use movdqa when the data is being used with integer SSE instructions, and movaps when the data is being used with floating-point instructions. For more information on this subject, … Read more
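
For illustration (compilers are free to pick either encoding), the intrinsic you use normally determines which instruction you get:

```c
#include <emmintrin.h>   /* SSE2 (also pulls in SSE) */

/* Integer data: _mm_load_si128 typically compiles to movdqa,
   keeping the value in the integer domain for paddd etc. */
__m128i load_int(const __m128i *p) { return _mm_load_si128(p); }

/* Float data: _mm_load_ps typically compiles to movaps,
   keeping the value in the floating-point domain for addps etc. */
__m128 load_float(const float *p)  { return _mm_load_ps(p); }
```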

AVX2 what is the most efficient way to pack left based on a mask?

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.) We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle. We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we … Read more
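
A sketch of the technique described above, adapted from the approach in that answer (assumes AVX2 + BMI2 and a 64-bit build; the name compress256 is illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

/* Left-pack with vpermps + pext: 'mask' has one bit per float lane
   (e.g. from _mm256_movemask_ps); lanes whose bit is set are packed
   toward the low elements of the result. */
__m256 compress256(__m256 src, unsigned int mask)
{
    /* Expand each mask bit to a full byte: bit i -> byte i = 0x00 or 0xFF. */
    uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;

    /* Identity shuffle indices 0..7, one per byte; pext keeps only the
       bytes of the selected lanes, packing them to the bottom. */
    const uint64_t identity_indices = 0x0706050403020100ULL;
    uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

    __m128i bytevec  = _mm_cvtsi64_si128((long long)wanted_indices);
    __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);  /* bytes -> dword indices */

    return _mm256_permutevar8x32_ps(src, shufmask);    /* lane-crossing shuffle */
}
```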

How to determine if memory is aligned?

#define is_aligned(POINTER, BYTE_COUNT) \ (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0) The cast to void * (or, equivalently, char *) is necessary because the standard only guarantees an invertible conversion to uintptr_t for void *. If you want type safety, consider using an inline function: static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count) { … Read more
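
The excerpt is cut off mid-function; a plausible completion of that inline form, with a usage note, might look like this (the body simply mirrors the macro above):

```c
#include <stdint.h>
#include <stddef.h>

/* Type-safe inline form of the alignment check from the excerpt. */
static inline _Bool is_aligned(const void *restrict pointer, size_t byte_count)
{
    return (uintptr_t)pointer % byte_count == 0;
}

/* Usage: is_aligned(ptr, 16) is true if ptr is suitable for e.g. movaps. */
```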

Where can I find an official reference listing the operation of SSE intrinsic functions?

As well as Intel’s vol.2 PDF manual, there is also an online intrinsics guide. The Intel® Intrinsics Guide contains reference information for Intel intrinsics, which provide access to Intel instructions such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2). It has a … Read more