If you compile your app for AVX2 and it runs on Windows ARM under Prism emulation, is it faster or slower than compiling for SSE2-4.x?
I assumed it would be roughly the same: maybe slightly slower due to emulation overhead, but with AVX2's wider operations compensating. The headline gives it away: I was wrong.
'Should I compile for AVX2 if my app might run on Windows ARM?' has a clear answer: No. At least if performance matters.
This post explains how I found out, what I measured and how, the benchmark results, and why.
Curiosity
A few weeks ago, in a Hacker News thread on the emulated performance of WoW (the game) on Windows ARM, I wondered:
I've been testing some math benchmarks on ARM emulating x64, and saw very little performance improvement with the AVX2+FMA builds, compared to the SSE4.x level. (X64 v2 to v3.) ... I've found very little info online about this.
Well, I nerdsniped myself, because those math benchmarks are now complete and so we have the perfect framework for testing AVX2+FMA emulation performance overhead on ARM Windows. I have no technical reason to do so: if you use our compiler we encourage that if you want to run your app on Windows ARM to just compile your app for Windows ARM. It's simply: I want to know.
Thus I spent much of Sunday crunching our data and figuring it out.
ARM emulation of x86
You can skip this bit if you know about Windows ARM's emulation and what various Intel instruction sets like SSE through AVX2 are: go forward to Benchmarks.
Windows 11 lets you run both 32-bit and 64-bit Intel apps on ARM. It does this via emulation: essentially, x86/x64 code is translated on the fly into ARM. Windows 10 supported emulating 32-bit Intel apps, and in 2021 Windows 11 introduced emulation of 64-bit apps.
In 2024 Windows 11 was updated with a new emulation layer, Prism. The main user-facing change seems to have been performance: 'Microsoft told Ars Technica that Prism is as fast as Apple's Rosetta 2' and:
Most x86 apps now run without issues, and in many cases don't even feel like they're being emulated. These days, the majority of users won't notice a difference between using an Intel PC or a Snapdragon one
β Windows Central
Is emulation complete / entire?
x86 and x86_64 have not always remained the same. Over time they add more functionality, which is exposed as instruction sets. These are the base instructions that an app can be compiled to use and are often focused around doing things faster. For example, the x87 floating point math instruction set still exists (it was introduced in the 1980s!) but was succeeded a quarter century ago by SSE2, introduced with the Pentium 4. SSE2 lets you perform floating point math operations much faster. A few years later the SSE 4.x series also improved largely integer-based operations. This is a very handwavy summary: in fact, these are part of a wide series of instructions intended to process data fast, where possible in wider configurations (more data at a time per clock tick) than older instructions, each new improvement introduced one by one over many years. This does not even begin to address the other supplementary extensions: ones for bit manipulation, specific math patterns like multiply-add, and more.
This is important to understand because software does not all use the same sets of instructions. Figuring out baseline standards of which sets it was reasonable for an app to use was a mess, and Linux folk found it annoying enough that, working together with Intel and AMD, Red Hat and SUSE defined standardised versions to allow known safe targets for compiling for specific collections of instruction sets. Thanks to them, x86_64 now has four main versions that modern compilers target, which define generations of new instructions. x64 version 1 is that same old year-2000-ish era; x64 version 2 circa 2008 level, v3 circa 2013 level, and v4 circa 2017 level.
It takes time for instructions to become mainstream: if you are building software for 64bit Intel Windows in general, you likely won't build solely for the v4 2017 level because you may have users who have older computers, or ones newer than 2017 but which had less capable chips that didn't feature all the latest instruction sets. AVX-512 (v4) is a wonderful instruction set for very wide vectorised behaviour but still many computers in practical use today don't have it.
Luckily, most apps are written targeting older x64 versions with broad support.
Some apps actually target multiple versions at once through something called target_clones, where for some critical parts of the app the compiler will generate the same code multiple times, each one optimised for a different generation of CPU, and at runtime it will choose which one to use.
And similarly to how actual hardware may or may not support specific instruction sets, Windows x86 emulation also supported only a subset.

That subset was approximately x64 version 2 (ie with SSE2 and 4.x), and only recently have newer versions of Windows supported v3: to handwave, AVX2 and FMA. This is new, exciting, and, to my knowledge, largely unknown.
We are comparing performance using x86-64-2 level vs v3 level running emulated on ARM.
That brings us to that Hacker News comment and today. What's the emulation performance for those newer instructions?
Our benchmarks
At RemObjects we make a multi-language (6 of them!) compiler toolchain that targets native CPUs via LLVM (as well as .NET, JVM, and WASM backends.) We recently integrated a new vectorised math library, which gave us the perfect benchmark framework for testing this.
We already supported ARM Windows, that is not new, but our cross-platform RTL had our own implementations of common math methods on the 'Island' (native) platform (the normal set: sin/pow/exp/floor and so forth.) These were correct in that they followed known algorithms, but we felt there was room for performance optimisation. We settled on integrating a third party open source math library. At the time I wrote the above HN comment, this integration was still being tested and tweaked; we even changed some of LLVM's internal passes.
Today, we have this new math implemented for macOS ARM (and x64), and Windows i386, x86_64, and ARM64. Our Windows 32-bit (i386) math supports SSE2/4.x, but the math library for Windows x64 supports using either v2 (SSE2-4.x) or v3 (AVX2-targeted) level depending on the x64 revision you target in the compiler options. (You can actually tell Elements to target v1 through v4, but our new math library kicks in at v2, with more performance with v3, and we have not enabled anything extra in math for v4; you can certainly compile allowing AVX-512 etc if you wish for your code in general though.)
As part of checking that our new math code is (a) correct and (b) faster, we have concrete data from running 21 different math operations on both real x64 hardware, and on Windows ARM under Parallels on a Mac M2.
Because these are different machines we cannot compare wall-clock time, but we can compare relative time, using the SSE2-4.x level as a basis: what is the relative performance difference of using AVX2(+FMA) on Intel vs on ARM? Normalising against the earlier, well-emulated instruction set means that the difference between emulated and real hardware gives us the answer of how well AVX2 emulation performs. Both machines run the same x64 v2 (SSE2-4.x) code; emulated v2 may be slower than native, sure, but if we normalise v2's performance to 1.0 on each test platform, we get the v3 (AVX2) performance as a comparable number.

We can also use ARM64-native on ARM, and AVX2-native on x64, as a handwavy comparison for if you need ARM: an indication of whether emulation provides 'good enough' performance or if there's real value in compiling for ARM.
Thus:
- Using SSE2-4.x as a baseline for performance on both ARM and x64 (ie, scaled this to 1), what is the relative performance of AVX2 when emulated on ARM vs running on x64, and thus what overhead does ARM emulation of AVX2 provide compared to native?
- Using two kinda similar-gen machines, and hand-waving that it's nothing more accurate than that, how well does ARM-native vs x64-emulated vs x64-native perform? Do you need to compile for ARM, or can you get away with letting your apps run under Windows emulation?
Details
Performance tests instruct LLVM to build targeting the specific instruction set; are heavily vectorised; and our AVX2 level includes FMA (fused multiply-add). We use only 256-bit wide AVX2 operations, not the 128-bit wide encodings. Specifics are:
- Normalised to 1.0: CPU x86-64-v2 (referred to as 'SSE2-4.x' above, because those are the primary instructions math uses), feature set: +cx16,+popcnt,+sahf,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2
- AVX2+FMA comparison: CPU x86-64-v3, feature set: +cx16,+popcnt,+sahf,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+fma,+bmi,+bmi2,+f16c,+lzcnt,+movbe,+xsave
- ARM64: 'generic' ARM64 CPU, no specific feature set

Vectorisation matters here: for simple operations like ceil() or floor(), for which there are inbuilt single-op instructions, it can make a 20x difference.
Each math operation is run on a randomly initialised array of 64-bit doubles as input, in a loop (ie intended to be vectorised), 10 million times. The output is retained and 'used' in order to prevent the loop being optimised away. In IR, we verified the loop exists, is vectorised, and appears to look as expected. Numbers reported are typical timing runs, with no observable difference in cold vs warm runs. Timing is of course only around the tight loop itself, not the prolog or epilog setting the data up or 'using' the results.
All 21 results are then scaled as a ratio vs the baseline instruction set, x64 v2, and the geometric mean is calculated to give a single representative number. This means we can tell how much faster AVX2-level code is vs SSE2-4.x level, both on actual Intel hardware and under ARM emulation.
Test machines
x64: Tiger Lake i7 (2.80 GHz), mobile-class CPU, circa 2021, on Windows 11 Pro 25H2.
ARM: Apple M2, circa 2022, macOS Tahoe 26.1 with the ARM version of the same version of Windows 11 Pro (25H2), running on Parallels 26.
So: technically comparable? No, definitely not; that's another reason to normalise, in order to get quantitatively comparable results. But qualitatively comparable in the real-world, 'people have a computer they bought: do we get at least the same order of magnitude performance' sense, ie to answer the 'is emulation enough, or is my app losing out by not compiling for ARM?' question? Sure.
Plus, it's what I had available to test on without pestering too many colleagues to try to find something else. Most of us run Macs, not too many Intels left here. 🫣
x86_64 AVX2 emulation on ARM: Results
The following chart scales x64 v2 to 1.0 (grey baseline); x64 v3 (ie AVX2+FMA) is relative to that (Intel in green, ARM emulation in blue). The relative number is calculated as the geometric mean of the ratios of x64 v3 vs v2 for 21 common math functions run on 64-bit doubles, per the detailed description above.
Normalised to 1.0, larger is better (faster).
As expected, using AVX2 on native Intel is significantly faster: 2.7 times faster.
But when the same code is run on ARM, the AVX2 implementations are notably slower than SSE2-4.x. They are almost exactly 2/3 as performant as emulating the older instruction set.
This means that if your app uses x64 v3 with AVX2 and runs emulated on Windows ARM, then, per this data, it will run slower than if you restrict it to the x64 v2 compilation level.
Why?
At the time I wrote the HN comment that started this, I had noticed that we didn't see faster performance using the AVX2 versions of our math function on ARM; what I had not yet measured was such a significant slowness.
One curiosity: exp() ran in 2/3 the time of the Intel one, ie was faster emulated. All other operations were noticeably slower.

Some possible reasons are:
- ARM has 128-bit wide NEON operations; AVX2 uses 256-bit wide operations (our code is not using the 128-bit widths). This means the emulation code has to handle running two halves, and it would be very close to impossible to make that equal in performance. This is, in my estimation, the most likely single reason.
- The Prism AVX2 emulation code is new compared to that emulating older instruction sets, and may not yet be fully optimised.
- It may optimise heavily for 32-bit singles, not 64-bit doubles. Our math library focuses on double precision.
- The ARM emulation documentation notes, 'Prism is optimized and tuned specifically for Qualcomm Snapdragon processors. Some performance features within Prism require hardware features only available in the Snapdragon X series...' These tests were on an Apple M2. While I'm sure Microsoft wants to support, or even prioritise, non-Apple hardware, in my view Windows on Mac (via Parallels) is worth them supporting. But perhaps they don't support it well, yet.
- The emulation code may look for specific patterns that our code does not match: perhaps, and this is entirely speculation, known (eg) VC++ Intel output, such as common RTL methods, maps to specific ARM64 patterns in the emulator. We are using LLVM, and (probably) rarer math implementations, and therefore less common sequences of operations. I cannot speak to how the emulator is implemented, or whether this is even likely. It's a guess.
- Our x64 v3 performance seems, at this stage of testing, to be faster than Visual C++'s. (By a significant factor: in the above chart, where our AVX2 code at our maximum FP precision came in at 2.7x the baseline, a VC++ 2022 build of the same benchmark, compiled in x64 release mode with default FP accuracy and no further changes to default settings, came in at 1.3x running on native Intel hardware. Yes, that is a 2x difference, and we attribute it to the VC runtime likely lacking AVX2-optimised math routines; their SSE2 math is slightly faster than ours, gosh dangnum darn it.) Therefore, for AVX2, using our compiler may be unfair compared to using other tools.
Meaning in practice
It is rare for an app to consist of tight loops of math operations: data processing, scientific, and engineering software are the most likely candidates, and even games may push much of this work to the GPU. Therefore the impact on your AVX2 app is unlikely to be the whole app running at 2/3 the speed of your SSE2-4.x build. Only parts of it will.
If you have apps that do real number-crunching, whether that's native or something like Python, you should have the ARM version installed. (You likely have the ARM64 Python wheels installed anyway. But check.)
Although this tested floating point math, it likely applies to emulation of the newer integer instruction sets too.
Therefore, it's likely worth treating Windows' ARM emulation of AVX2-level support as for support and compatibility, not for equal performance. To get performance, you'll need to compile as ARM.
If performance is key: Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.