Fast Math in Six Languages: What We Did and Why It Works

We recently upgraded Elements' LLVM backend and, while we were there, got curious about our math performance. We had our own implementations of sin, cos, exp, and the rest, which were correct, but modern instruction sets offered an opportunity we hadn't yet taken.

We looked at what other toolchains do to get better math performance than the system libm (the math library that is part of the built-in C RTL). Our toolchain does not use the system C runtime in general, so we don't delegate to it, even for math.

Among other things, we found that Rust developers who want vectorised math, or who don't want to rely on the system libm, can use SLEEF. It's open source, accurate, fast, and supports multiple instruction set levels.

We prototyped it. Got some exciting numbers. We integrated it.

There are not many math libraries of this calibre. The alternatives tend to be GPL-licensed, commercial, Intel-only, or ARM-only. SLEEF is absolutely amazing.

The results were significant enough that we wanted to share what we did and why it works.

What We Changed

Two things, which complement each other:

1. A new math library

SLEEF provides vectorised implementations of common math functions (sin, cos, tan, exp, log, pow, sqrt, and so on).

What does vectorised mean? Rather than processing one value per CPU instruction, a vectorised operation processes multiple values at once. Modern CPUs have wide SIMD registers: 128-bit, or 256-bit with AVX2. A 256-bit register holds four 64-bit doubles, so a single instruction can process four values at once. This is how you get significant speedups on numerical code.
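
To make that concrete, here's a minimal C sketch (not Elements code, and not how SLEEF itself is written) comparing a plain scalar loop with the same loop written using 256-bit AVX intrinsics; the function names and the scaling operation are just illustrative:

```c
#include <immintrin.h>   // Intel intrinsics, including the 256-bit AVX types
                         // (compile with AVX enabled, e.g. -mavx)

// Scalar: one multiply per loop iteration.
void scale_scalar(double *a, double factor, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}

// Vectorised: a 256-bit register holds four 64-bit doubles, so each
// iteration loads, multiplies, and stores four elements at once.
void scale_vector(double *a, double factor, int n) {
    __m256d f = _mm256_set1_pd(factor);               // broadcast factor to all 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(&a[i]);           // load 4 doubles
        _mm256_storeu_pd(&a[i], _mm256_mul_pd(v, f)); // multiply and store 4 doubles
    }
    for (; i < n; i++)                                // scalar tail for the remainder
        a[i] *= factor;
}
```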

πŸ’‘
There are multiple approaches to parallelization. The most common is threading: do multiple streams of instructions in parallel.

Vectorisation focuses on one stream of instructions (one thread) at a time, but uses CPU instructions that can do multiple things at once. Why multiply four numbers one after the other, when you can multiply all four in one go?

A real bonus is when you start using both at once!

Taking advantage of this in general requires an optimising compiler that uses newer instruction sets – see the next section ;)

SLEEF supports both Intel and ARM, and has both scalar (i.e. normal, one value at a time) and vector (several at a time) versions of its math operations. Its scalar functions can dispatch, that is, check the CPU at runtime and call a different internal implementation depending on what it supports.

Its vectorised methods do not: you need to call and link the right ones depending on the target CPU.
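
For illustration, this is roughly what calling SLEEF directly from C looks like (Elements users never write these calls; the compiler and IslandMath handle the selection and linking). The symbol names follow SLEEF's naming scheme and the available variants depend on how the library was built, so treat this as a sketch:

```c
#include <stdio.h>
#include <immintrin.h>
#include <sleef.h>                         // SLEEF's C header

int main(void) {
    // Scalar entry point: one value at a time; as described above, the
    // scalar functions can pick an implementation at runtime.
    double s = Sleef_sin_u10(0.5);         // "u10" = accurate to 1.0 ULP

    // Vectorised entry point: the instruction set is part of the symbol
    // name, so the caller (or the compiler) must pick the variant that
    // matches the target CPU; here, four doubles at once using AVX2.
    __m256d x = _mm256_set_pd(0.4, 0.3, 0.2, 0.1);
    __m256d r = Sleef_sind4_u10avx2(x);

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%f  %f %f %f %f\n", s, out[0], out[1], out[2], out[3]);
    return 0;
}
```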

Rather than building the math library multiple times per target (like x64 with its multiple levels of instruction sets), we use LLVM's support for vectorised math libraries, plus its instruction set handling, so that a single version of the library – IslandMath.lib – contains all the variants, and at compile/link time our compiler tells LLVM which set to use. This requires really tight integration with LLVM, including:

  • LLVM's target CPU and CPU feature set (the instructions LLVM is allowed to use)
  • LLVM intrinsics (an intrinsic such as trunc needs to be mapped onto the corresponding library function, including in vectorised form)
  • and LLVM's inbuilt awareness of math libraries: large tables of operations, widths, instruction sets, and functions to link to.

Thus we have a chain; in reverse, it is: LLVM (target CPU and feature set, intrinsics, math library support) ⬅️ Elements compiler (specifies the target CPU and microarchitecture) ⬅️ EBuild build system (handles the project, passes the compiler its settings, and adds the automatic reference to the math library). Quite a bit to implement.
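
As a point of reference (and only an analogy, not how Elements is wired up internally), stock Clang exposes the same LLVM machinery through command-line flags. Whether a given LLVM version can map a particular libm call to a particular vector-library variant depends on its built-in tables, so take this as a hedged illustration:

```c
// A plain C loop over a libm function. With flags along the lines of
//     clang -O2 -march=x86-64-v3 -fveclib=<vector math library>
// LLVM is told which CPU features it may use (-march) and which vector
// math library it may substitute calls into (-fveclib); its loop
// vectorizer can then replace the scalar sin() below with a vector
// variant, provided its tables have an entry for that library, target,
// and vector width.
#include <math.h>

void waves(double *out, const double *in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = sin(in[i]);     // vectorisable loop over a libm call
}
```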

For your apps, IslandMath is an implicit reference. Our build system, EBuild, links it automatically. It is also visible in the IDE: in Water or Fire you always see what is being referenced and linked, even when it's managed automatically.

2. CPU target selection

You can now tell Elements which CPU level to target:

  • 32-bit Windows: Pentium 4
  • 64-bit Windows: Choose x86-64-v1 through v4 (corresponding roughly to ~2000, ~2008, ~2013, ~2017 eras)
  • ARM64: We handle this internally. Apple Silicon gets M1-specific targeting; Windows/Linux ARM gets generic ARM64.

Our default is x86-64-v2 for 64-bit, and 32-bit uses SSE2: conservative choices that work on any reasonably modern hardware.

These features enhance each other. When you target x86-64-v2, our math library uses SSE2+SSE4.1 implementations. When you target x86-64-v3, it uses AVX2+FMA implementations, which means 256-bit wide operations instead of 128-bit. More data processed per instruction means faster execution.

πŸ’‘
By comparison to other compilers:

Settings to choose instruction set levels are standard across the industry.

Visual C++ has five different options for Intel. They roughly match the four defined x86-64 hardware levels, though oddly (at least in my VS2022) it skips the 2008-era SSE2+SSE4.1 pair (v2) entirely.

Delphi Win32 does not have any options and uses the 80s-era FPU instruction set for math. Delphi Win64 seems to target x86-64-v1, from circa 2000, though I can't find docs explicitly saying so. C++Builder's new toolchain keeps that same level as its default but, in line with the industry norm, adds the ability to target newer ones too; its settings correspond roughly to the standardised x86-64 versions, with a few intermediate ones as well.

Other compilers and languages, Rust for example, have similar options too.

Generally, you pick a setting that is new enough to get good performance, but old enough that all your customers can run it. For example, the minimum hardware for Windows 10 and 11 means our x64 default of x86-64-v2 is safe for most people: if Windows itself won't run on a machine, why build your apps for it? However, we make this available as a setting, so you can choose older or newer as suits you, your app, and your customers.

We also tuned our LLVM passes for better vectorisation. By customising which passes run, and how they work, we were able to ensure that more code patterns that could be vectorised actually were. We believe this is similar to how the Intel C/C++ compiler works: it was famous for performance, and we understand – don't quote us – it made a similar set of analyses. It certainly worked for us.

The Results

We benchmarked 21 math functions (the standard C RTL set: sin, cos, exp, log, pow, sqrt, etc.), each processing an array of 10 million 64-bit doubles. We compared Elements to Visual C++ 2022 and Delphi 13, running on the same hardware.

Hardware: Intel Tiger Lake i7 (2.80 GHz), Windows 11 Pro 25H2. ARM tests on Apple M2 running Windows 11 Pro 25H2 ARM, via Parallels.

Methodology: Each function runs in a tight, vectorizable loop (if the compiler takes advantage of that, which ours does), processing an array of 10 million random double values. Only the loop running the math operations is timed; the initialisation and the use of the results (needed so the loop and its results aren't optimized away) happen outside the timing. Where possible, i.e. for Elements, we inspected the IR to verify the code was behaving as we expected. We report the geometric mean of per-function speedups for each toolchain comparison, which gives a single representative number that isn't skewed by outliers.
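
As a rough illustration of that shape, here is a C sketch of one benchmark. The real benchmarks are written in Oxygene; the array size matches the article, while the names, timer, and checksum here are illustrative, not the actual harness:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000                       /* 10 million 64-bit doubles */

int main(void) {
    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)          /* initialisation: not timed */
        in[i] = (double)rand() / RAND_MAX;

    clock_t start = clock();
    for (int i = 0; i < N; i++)          /* timed: only the math loop */
        out[i] = sin(in[i]);
    clock_t stop = clock();

    double sum = 0;                      /* use the results afterwards so */
    for (int i = 0; i < N; i++)          /* nothing gets optimised away   */
        sum += out[i];

    printf("sin: %.0f ms (checksum %f)\n",
           1000.0 * (double)(stop - start) / CLOCKS_PER_SEC, sum);
    free(in);
    free(out);
    return 0;
}
```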

πŸ’‘
Very importantly, even across this much data, some toolchains were sometimes so fast that a timed loop took 0ms (while functioning correctly). That might look like an infinite speedup, but of course it isn't: it's simply too fast to measure.

Why not measure even more data? Because some of the toolchains we measured against were so slow that it was an agonizing prospect. If something takes 4 seconds on 10 million elements, and we increased the data by two orders of magnitude to get a more precise, multi-digit-millisecond run on our toolchain, that other toolchain could be expected to take over six minutes – for just one function of twenty-one.

For each compiler comparison, rows that contained such 0ms results were removed from the speedup calculation. This biases conservatively towards reporting a slower speedup than there might be in reality. We think this is better methodology: we may be faster than, say, 4x overall, but we'll report the more conservative 4x number.

Win32

What we haven't added here is Visual C++'s SSE2 support (see below, roughly the same as the Win64 chart), where they are a shade faster than us, gosh dangum darn it. ...but wait, Win64 is another story entirely. Oh boy:

Win64

Compared to Delphi Win64:

  • Elements Win64 v2 (default): 2.3x faster
  • Elements Win64 v3 (AVX2): 5.5x faster
  • Visual C++ Win64 AVX2: 2.7x faster

At our default settings (x86-64-v2), we're roughly on par with Visual C++: as noted, their SSE2 support is a shade faster than ours. (Compare the Elements x86-64-v2 bar on the chart above with the VC++ SSE2 bar: 2.27x vs 2.6x.) But if you can target x86-64-v3 (most CPUs from ~2013 onwards) and compare against Visual C++ using AVX2, Elements is significantly faster than Visual C++.

Wow!

πŸ’‘
This means that for native Intel Win64, Elements using AVX2 was just over twice as fast as Visual C++ (5.5x vs 2.7x against the same Delphi baseline).

We are very pleased with this result and invite you to compile your apps with Elements to get the benefit.

You might as well convert your Visual C++ apps to Elements too, while you're at it!

ARM64

Not all toolchains compile for ARM, so their apps run under emulation. This lets us see the difference that native ARM makes, combined with our optimizations and math library:

Delphi doesn't compile for Windows ARM yet. If you're running a Delphi Win32 app on Windows ARM under emulation:

  • Elements ARM64 native: 11x faster than Delphi Win32 under emulation

Why Is This Faster Than VC++?

We were curious about this too.

πŸ’‘
First, not all our results were faster than VC++. Windows 64-bit Intel was the key one. ARM64 was also very fast but harder to compare.

Here are some confusing numbers, and their explanation: Visual C++ Win64 ARM is 15x faster than Delphi Win64, but we list Elements ARM64 as only 11x faster – yet Elements Win64 ARM is 1.1x faster than that same Visual C++ test. How can that make sense?

It's because some results are so fast that they measure 0ms and are excluded from the comparison; each comparison (Delphi vs Elements, VC++ vs Elements, and so on) uses a different subset of functions, depending on which pairs of measurements are both non-zero. This means that when calculating each geometric mean, we remove our fastest results!

In this case, even on our 10-million-element benchmark, Elements measured 0ms to execute ceil, floor, sqrt, and min, while Visual C++ had small values (2, 2, 4, and 57ms respectively). All four of those amazingly fast results are removed when calculating the 'Elements is n times faster' figure. VC++ has no results so fast they measure as 0ms, so no rows are omitted from its comparison.
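
To make the mechanics concrete, here is a small C sketch of how such a figure is computed, with those rows excluded. This is our reading of the methodology described above, not the actual benchmark code:

```c
// Per-function speedup = other toolchain's time / our time; the published
// figure is the geometric mean over the functions where BOTH sides have a
// measurable, non-zero time. Rows like ceil/floor/sqrt/min above, where
// Elements measured 0 ms, are dropped entirely, which lowers the reported
// average; and because each pairing drops a different set of rows, the
// published ratios are not directly comparable with each other.
#include <math.h>

double geomean_speedup(const double *other_ms, const double *ours_ms, int n) {
    double log_sum = 0.0;
    int used = 0;
    for (int i = 0; i < n; i++) {
        if (other_ms[i] <= 0.0 || ours_ms[i] <= 0.0)
            continue;                    /* 0 ms: too fast to measure, skip row */
        log_sum += log(other_ms[i] / ours_ms[i]);
        used++;
    }
    return used > 0 ? exp(log_sum / used) : 1.0;
}
```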

It's safe to say: Elements is fast. Both Win64 Intel and Win64 ARM.

Our best explanation for Intel is that Visual C++'s runtime library likely doesn't have AVX2-optimised math routines, or at least not as aggressively vectorised as SLEEF. VC++'s SSE2 math is actually slightly faster than ours on a few specific functions (sqrt, ceiling, floor), but SLEEF's vectorised implementations pull ahead on the trigonometric and hyperbolic functions, and that's where the overall result (the geometric mean) wins.

This applies to all six Elements languages. We wrote the benchmarks in Oxygene (Object Pascal), but that's the compiler frontend layer: the backend where this takes effect is the same for C#, Swift, Go, Java, and VB.

Practical Impact

These benchmarks are tight loops of floating-point math, i.e. the best case for this kind of optimisation. Your real-world app probably doesn't consist entirely of math, but it will have many parts that benefit from the optimizations we offer outside of math.

That said, if you have code that does real number-crunching (data processing, simulations, engineering calculations, audio/video processing) you'll see meaningful improvements. The difference between "my app processes this in 10 seconds" and "my app processes this in 2 seconds" matters.

Try It

πŸ’‘
Elements has a 30-day free trial.

This – performance, modern features and settings, deep-in-LLVM engineering – is the kind of work we do on Elements. If you try it, I think you'll find it shows.