Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

L1 is very tightly coupled to the CPU core and is accessed on every memory access, which is very frequent. It therefore needs to return data very quickly (usually within a few clock cycles). Latency and throughput (bandwidth) are both performance-critical for the L1 data cache. (e.g. four-cycle latency, and supporting two reads and one write by …

Why is linear read-shuffled write not faster than shuffled read-linear write?

This is a complex problem closely related to architectural features of modern processors. Your intuition that random reads are slower than random writes, because the CPU has to wait for the read data, does not hold most of the time. There are several reasons for that, which I will detail. Modern processors are very efficient …
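The two access patterns being compared can be sketched as follows. This is a minimal illustration, not a benchmark: the function names (`make_permutation`, `gather`, `scatter`, `roundtrip_ok`) are hypothetical, and `rand()` is used only for brevity, so the permutation is biased for very large arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill perm with a pseudo-random permutation of 0..n-1 (Fisher-Yates). */
void make_permutation(size_t *perm, size_t n, unsigned seed) {
    assert(perm != NULL);
    srand(seed);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* index in [0, i) */
        size_t tmp = perm[i - 1];
        perm[i - 1] = perm[j];
        perm[j] = tmp;
    }
}

/* Shuffled read, linear write: dst[i] = src[perm[i]]. */
void gather(uint32_t *dst, const uint32_t *src, const size_t *perm, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[perm[i]];
}

/* Linear read, shuffled write: dst[perm[i]] = src[i]. */
void scatter(uint32_t *dst, const uint32_t *src, const size_t *perm, size_t n) {
    for (size_t i = 0; i < n; i++) dst[perm[i]] = src[i];
}

/* Sanity check: scattering and then gathering with the same permutation
   must reproduce the source array. Returns 1 on success, 0 otherwise. */
int roundtrip_ok(size_t n, unsigned seed) {
    size_t *perm = malloc(n * sizeof *perm);
    uint32_t *src = malloc(n * sizeof *src);
    uint32_t *tmp = malloc(n * sizeof *tmp);
    uint32_t *out = malloc(n * sizeof *out);
    int ok = perm && src && tmp && out;
    if (ok) {
        make_permutation(perm, n, seed);
        for (size_t i = 0; i < n; i++) src[i] = (uint32_t)(i * 7u + 1u);
        scatter(tmp, src, perm, n);
        gather(out, tmp, perm, n);
        for (size_t i = 0; i < n; i++) {
            if (out[i] != src[i]) { ok = 0; break; }
        }
    }
    free(perm); free(src); free(tmp); free(out);
    return ok;
}
```

To compare the two patterns, time `gather` and `scatter` separately on arrays much larger than the last-level cache. The result depends on the processor and on the array size.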

Do current x86 architectures support non-temporal loads (from “normal” memory)?

To answer specifically the headline question: Yes, recent mainstream Intel CPUs support non-temporal loads on normal memory, but only “indirectly” via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores, where you can just use the corresponding non-temporal store instructions directly. The basic …
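The "indirect" approach can be sketched as a loop that issues a non-temporal prefetch hint ahead of an ordinary load. This sketch assumes GCC or Clang, where `__builtin_prefetch(addr, 0, 0)` requests a read prefetch with no temporal locality (on x86 this is normally emitted as `prefetchnta`). The function name `sum_nta` and the value of `PREFETCH_DISTANCE` are hypothetical choices for illustration, and the prefetch is only a hint that the hardware may ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 64

/* Sum an array with ordinary loads, hinting that upcoming data
   will not be reused (non-temporal prefetch). */
uint64_t sum_nta(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 0 (no temporal locality). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
        }
        sum += data[i];   /* the load itself is a normal load */
    }
    return sum;
}
```

The prefetch changes only cache placement hints, never the computed result, so the function returns the same sum as a plain loop.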