Here are theoretical max FLOP counts (**per core**) for a number of recent processor microarchitectures, along with an explanation of how to achieve them.

In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or from any other microbenchmark result, and multiply:

`(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA)`
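As a rough illustration (not a measurement), the sketch below plugs that formula into a tiny C program. The inputs (2 FMA ports, 256-bit vectors) are example values for a hypothetical AVX2+FMA core; substitute the throughput and vector width for your own CPU from the tables linked above.

```c
/* Minimal sketch of the formula above: peak FLOPs/cycle from FMA throughput.
 * The numbers passed in main() are example inputs for a hypothetical
 * 2-FMA-port core with 256-bit vectors, not measured values. */
#include <stdio.h>

static double peak_flops_per_cycle(double fmas_per_clock,
                                   double elements_per_vector)
{
    return fmas_per_clock * elements_per_vector * 2.0;  /* 2 FLOPs per FMA */
}

int main(void)
{
    printf("SP: %.0f FLOPs/cycle\n", peak_flops_per_cycle(2, 256.0 / 32)); /* 32 */
    printf("DP: %.0f FLOPs/cycle\n", peak_flops_per_cycle(2, 256.0 / 64)); /* 16 */
    return 0;
}
```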

Note that achieving this in real code requires very careful tuning (such as loop unrolling), near-zero cache misses, and no bottlenecks on anything *else*. Modern CPUs have such high FMA throughput that there isn't much room for the other instructions needed to store the results or to feed the FMAs with input. For example, 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product (2 loads per FMA) will bottleneck on loads. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.
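As a sketch of what that tuning looks like for a dot product, the hypothetical kernel below assumes AVX2+FMA and `n` a multiple of 32, and unrolls with independent accumulators; it is not a tuned library kernel, and even so the loop is bound by the 2 loads per FMA rather than by FMA throughput. Fully hiding FMA latency on a 2-FMA-port core typically needs more accumulators than shown here.

```c
/* Sketch: AVX2+FMA dot product with 4 independent accumulators.
 * Assumes n is a multiple of 32 and the CPU supports AVX2/FMA.
 * Compile with e.g. gcc -O3 -mavx2 -mfma. */
#include <immintrin.h>
#include <stddef.h>

float dot(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        /* 2 loads feed each FMA: the loop is load-bound, not FMA-bound */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    /* horizontal reduction of the four accumulators */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 lo  = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```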

If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1-per-clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/FMA all at 4-cycle latency with 2-per-clock throughput, for any vector width.
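As a small illustration of what "contracted into FMAs" means (compiler- and flag-dependent behavior, e.g. `-ffp-contract` and `-mfma` on GCC/Clang, not a guarantee), consider:

```c
/* Sketch: contraction of multiply+add into one FMA is up to the compiler
 * and its flags; fmaf() from <math.h> requests a fused operation explicitly
 * (link with -lm if needed). */
#include <math.h>

float contractable(float a, float b, float c)
{
    return a * b + c;     /* may compile to a single vfmadd instruction */
}

float explicit_fma(float a, float b, float c)
{
    return fmaf(a, b, c); /* fused multiply-add with a single rounding */
}

float plain_add(float a, float b)
{
    return a + b;         /* no multiply to fuse with: stays an FP add */
}
```

Checking the generated assembly is the only reliable way to know whether a given expression actually became an FMA.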

### Intel

Note that Celeron/Pentium versions of recent microarchitectures don’t support AVX or FMA instructions, only SSE4.2.
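If you need to detect this at runtime rather than assume it, one option is the GCC/Clang CPU-feature builtins; the sketch below is compiler-specific, and the accepted feature-name strings vary by compiler version.

```c
/* Sketch: runtime check (GCC/Clang builtins) before assuming FMA peak
 * numbers apply; Pentium/Celeron parts will report AVX/FMA as unsupported. */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    printf("AVX: %d  FMA: %d  AVX-512F: %d\n",
           __builtin_cpu_supports("avx") != 0,
           __builtin_cpu_supports("fma") != 0,
           __builtin_cpu_supports("avx512f") != 0);
    return 0;
}
```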

Intel Core 2 and Nehalem (SSE/SSE2):

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

Intel Sandy Bridge/Ivy Bridge (AVX1):

- 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
- 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication

Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/… (AVX+FMA3):

- 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
- 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- (Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **1 FMA unit**: some Xeon Bronze/Silver

- 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
- 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
- Same computation throughput as with narrower 256-bit instructions, but speedups are still possible with AVX512 thanks to wider loads/stores, a few vector operations that don't run on the FMA units (such as bitwise operations), and wider shuffles.
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1, and also **reduces the max turbo clock speed**, so "cycles" isn't a constant in your performance calculations.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **2 FMA units**: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.

- 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1, and also reduces the max turbo clock speed.)

Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce Brain Float (bfloat16), a 16-bit floating-point format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension, which only supports load/store with conversion to/from float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.

Current Intel chips only have actual computation directly on standard float16 in the iGPU.

### AMD

AMD K10:

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

- 8 DP FLOPs/cycle: 4-wide FMA
- 16 SP FLOPs/cycle: 8-wide FMA

AMD Ryzen:

- 8 DP FLOPs/cycle: 4-wide FMA
- 16 SP FLOPs/cycle: 8-wide FMA

### x86 low power

Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle

AMD Bobcat:

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle

AMD Jaguar:

- 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
- 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle

### ARM

ARM Cortex-A9:

- 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle

ARM Cortex-A15:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

Qualcomm Krait:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

### IBM POWER

IBM PowerPC A2 (Blue Gene/Q), per core:

- 8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
- SP elements are extended to DP and processed on the same units

IBM PowerPC A2 (Blue Gene/Q), per thread:

- 4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
- SP elements are extended to DP and processed on the same units

### Intel MIC / Xeon Phi

Intel Xeon Phi (Knights Corner), per core:

- 16 DP FLOPs/cycle: 8-wide FMA every cycle
- 32 SP FLOPs/cycle: 16-wide FMA every cycle

Intel Xeon Phi (Knights Corner), per thread:

- 8 DP FLOPs/cycle: 8-wide FMA every other cycle
- 16 SP FLOPs/cycle: 16-wide FMA every other cycle

Intel Xeon Phi (Knights Landing), per core:

- 32 DP FLOPs/cycle: two 8-wide FMA every cycle
- 64 SP FLOPs/cycle: two 16-wide FMA every cycle

The reason there are both per-thread and per-core figures for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.