## Timing square root

The square root is one of those basic mathematical operations that’s totally ubiquitous in any game’s source code, and yet also has many competing implementations and performance superstitions around it. The compiler offers a sqrt() builtin function, and so do some CPUs, but some programmers insist on writing their own routines in software. And often it’s really the reciprocal square root you want, for normalizing a vector, or trigonometry. But I’ve never had a clear answer for which technique is really fastest, or exactly what accuracy-vs-speed tradeoffs we make with “estimating” intrinsics.

What is the fastest way to compute a square root? It would seem that if the CPU has a native square-root opcode, there’s no beating the hardware, but is it really true?

Such questions vex me, so I went and measured all the different means of computing the square root of a scalar single-precision floating point number that I could think of. I ran trials on my Intel Core 2 and on the Xenon, comparing each technique for both speed and accuracy, and some of the results were surprising!

In this article I’ll describe my results for the Intel hardware; next week I’ll turn to the Xenon PPC.

## Experimental setup

I’ll post the whole source code for my tests elsewhere, but basically each of these trials consists of iterating N times over an array of floating point numbers, calling square root upon each of them and writing it to a second output array.

[DDET (see pseudocode)]

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 inline float TestedFunction( float x ) { return sqrt(x); // one of many implementations.. } void TimeSquareRoot() { float numbersIn[ ARRAYSIZE ]; // ARRAYSIZE chosen so that both arrays float numbersOut[ ARRAYSIZE ]; // fit in L1 cache // assume that numbersIn is filled with random positive numbers, and both arrays are // prefetched to cache... StartClockCycleCounter(); for ( int i = 0 ; i < NUMITERATIONS ; ++i ) for ( int j = 0 ; j < ARRAYSIZE ; ++j ) // in some cases I unroll this loop { numbersOut[j] = TestedFunction( numbersIn[j] ); } StopClockCycleCounter(); printf( "%.3f millisec for %d floatsn", ClockCycleCounterInMilliseconds(), ARRAYSIZE * NUMITERATIONS );   // now measure accuracy float error = 0; for ( int i = 0 ; i < ARRAYSIZE ; ++i ) { double knownAccurate = PerfectSquareRoot( numbersIn[i] ); error += fabs( numbersOut[i] - knownAccurate ) / knownAccurate ; } error /= ARRAYSIZE ; printf( "Average error: %.5f%%n", error * 100.0f ); }```

[/DDET]

In each case I verified that the compiler was not eliding any computations (it really was performing ARRAYSIZE × NUMITERATIONS many square roots), that it was properly inlining the tested function, and that all the arrays fit into L1 cache so that memory latency wasn’t affecting the results. I also only tested scalar square root functions — SIMD would clearly be the fastest way of working on large contiguous arrays, but I wanted to measure the different techniques of computing one square root at a time, as is usually necessary in gameplay code.

Because some of the speedup techniques involve trading off accuracy, I compared the resulting numbers against the perfectly-accurate double-precision square root library routine to get an average error for each test run.

And I performed each run multiple times with different data, averaging the final results together.

## x86 results

I ran my tests on a 2.66Ghz Intel Core 2 workstation. An x86 chip actually has two different means of performing scalar floating-point math. By default, the compiler uses the old x87 FPU, which dates back to 1980 with a stack-based instruction set like one of those old RPN calculators. In 1999, Intel introduced SSE, which added a variety of new instructions to the processor. SSE is mostly thought of as a SIMD instruction set — for operating on four 32-bit floats in a single op — but it also includes an entire set of scalar floating point instructions that operate on only one float at a time. It’s faster than the x87 operations and was meant to deprecate the old x87 pathway. However, both the MSVC and GCC compilers default to exclusively using the x87 for scalar math, so unless you edit the “code generation” project properties panel (MSVC) or provide a cryptic obscure command line option (GCC), you’ll be stuck with code that uses the old slow way.

I timed the following techniques for square root:

1. The compiler’s built in `sqrt()` function (which compiled to the x87 FSQRT opcode)
2. The SSE “scalar single square root” opcode `sqrtss`, which MSVC emits if you use the `_mm_sqrt_ss` intrinsic or if you set `/arch:SSE2`
3. The “magic number” approximation technique invented by Greg Walsh at Ardent Computer and made famous by John Carmack in the Quake III source code.
4. Taking the estimated reciprocal square root of a via the SSE opcode `rsqrtss`, and multiplying it against a to get the square root via the identity x / √x = √x.
5. Method (4), with one additional step of Newton-Raphson iteration to improve accuracy.
6. Method (5), with the loop at line 13 of the pseudocode above unrolled to process four floats per iteration.

I also tested three ways of getting the reciprocal square root: Carmack’s technique, the `rsqrtss` SSE op via compiler intrinsic, and `rsqrtss` with one Newton-Raphson step.

The results, for 4096 loops over 4096 single-precision floats, were:

SQUARE ROOT

Method Total time Time per float Avg Error
Compiler `sqrt(x)` /
x87 FPU `FSQRT`
404.029ms 24ns 0.0000%
SSE intrinsic `ssqrts` 200.395ms 11.9ns 0.0000%
Carmack’s Magic Number rsqrt * x 72.682ms 4.33ns 0.0990%
SSE `rsqrtss` * x 20.495ms 1.22ns 0.0094%
SSE `rsqrtss` * x
with one NR step
53.401ms 3.18ns 0.0000%
SSE `rsqrtss` * x
with one NR step, unrolled by four
48.701ms 2.90ns 0.0000%

RECIPROCAL SQRT

Method Total time Time per float Avg Error
Carmack’s Magic Number rsqrt 59.378ms 3.54ns 0.0990%
SSE `rsqrtss` 14.202ms 0.85ns 0.0094%
SSE `rsqrtss`
with one NR step
45.952ms 2.74ns 0.0000%

## Discussion

Looking at these results, it’s clear that there’s a dramatic difference in performance between different approaches to performing square root; which one you choose really can have a significant impact on framerate and accuracy. My conclusions are:

Don’t trust the compiler to do the right thing. The received wisdom on performance in math functions is usually “don’t reinvent the wheel; the library and compiler are smart and optimal.” We see here that this is completely wrong, and in fact calling the library `sqrt(x)` causes the compiler to do exactly the worst possible thing. The compiler’s output for ` y = sqrt(x); ` is worse by orders of magnitude compared to any other approach tested here.

The x87 FPU is really very slow. Intel has been trying to deprecate the old x87 FPU instructions for a decade now, but no compiler in the business defaults to using the new, faster SSE scalar opcodes in place of emulating a thirty-year-old 8087. In the case of ` y = sqrt(x) `, by default MSVC and GCC emit something like

```fld DWORD PTR [ecx] fsqrt ;; slow x87 flop fstp DWORD PTR [eax]```

But if I set the `/arch:SSE2` option flag, telling the compiler “assume this code will run on a machine with SSE2”, it will instead emit the following, which is 2x faster.

```sqrtss xmm0, DWORD PTR [ecx] ;; faster SSE scalar flop movss DWORD PTR [eax], xmm0```

There was a time when not every PC on the market had SSE2, meaning that there was some sense in using the older, more backwards-compatible operations, but that time has long since passed. SSE2 was introduced in 2001 with the Pentium 4. No one is ever going to try to play your game on a machine that doesn’t support it. If your customer’s PC has DirectX 9, it has SSE2.

You can beat the hardware. The most surprising thing about these results for me was that it is faster to take a reciprocal square root and multiply it, than it is to use the native sqrt opcode, by an order of magnitude. Even Carmack’s trick, which I had assumed was obsolete in an age of deep pipelines and load-hit-stores, proved faster than the native SSE scalar op. Part of this is that the reciprocal sqrt opcode `rsqrtss` is an estimate, accurate to twelve bits; but it only takes one step of Newton’s Method to converge that estimate to an accuracy of 24 bits while still being four times faster than the hardware square root opcode.

The question that then bothered me was, why is SSE’s built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations? The first hint came when I tried unrolling the loop so that it performed four ops inside the inner for():

``` for ( int i = 0 ; i < NUMITERATIONS ; ++i ) for ( int j = 0 ; j < ARRAYSIZE ; j += 4 ) // in some cases I unroll this loop { numbersOut[j + 0] = TestedSqrt( numbersIn[j + 0] ); numbersOut[j + 1] = TestedSqrt( numbersIn[j + 1] ); numbersOut[j + 2] = TestedSqrt( numbersIn[j + 2] ); numbersOut[j + 3] = TestedSqrt( numbersIn[j + 3] ); }   // two implementations of```

As you can see from the results above, when TestedSqrt was the `rsqrtss` followed by a multiply and one step of Newton iteration, unrolling the loop this way provided a modest 8.8% improvement in speed. But when I tried the same thing with the “precise square root” op `sqrtss`, the difference was negligible:

```SSE sqrt: 200.395 msec
average error 0.0000%

SSE sqrt, unrolled four: 196.741 msec
average error 0.0000%
```

What this suggests is that unrolling the loop this way allowed the four rsqrt paths to be pipelined, so that while an individual `rsqrtss` might take 6 cycles to execute before its result was ready, other work could proceed during that time so that the four square root operations in the loop overlapped. On the other hand, the non-estimated `sqrtss` op apparently cannot be overlapped; one sqrt must finish before the next can begin. A look at the Intel® 64 and IA-32 Architectures Optimization Reference Manual confirms: `sqrtss` is an unpipelined instruction.

Pipelined operations make a big difference. When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it’s like putting the handbrake on your processor. You can identify nonpipelined operations in appendix C of the Optimization Reference Manual as the ones that have a throughput equal to latency and greater than 4 cycles.

In the case of `ssqrt`, the processor is probably doing the same thing internally that I’m doing in my “fast” function — taking an estimated reciprocal square root, improving it with Newton’s method, and then multiplying it by the input parameter. Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it’s all done. But if you break up the work so that each of those steps is its own instruction, then the CPU can pipeline them all, and get a much higher throughput even if the latency is the same.

Pipeline latency and microcoded instructions are a much bigger deal on the 360 and PS3, whose CPUs don’t reorder operations to hide bubbles; there the benefit from unrolling is much greater, as you’ll see next week.

## Conclusion

Not all square root functions are created equal, and writing your own can have very real performance benefits over trusting the compiler to optimize your code for you (at which it fails miserably). In many cases you can trade off some accuracy for a massive increase in speed, but even in those places where you need full accuracy, writing your own function to leverage the `rsqrtss` op followed by Newton’s method can still give you 32 bits of precision at a 4x-8x improvement over what you will get with the built-in `sqrtf()` function.

And if you have lots of numbers you need to square root, of course SIMD (`rsqrtps`) will be faster still.

1. DFS says:

Excellent excellent post! Looking forward to the 360 results.

2. LeeN says:

Might be worth while to make your simulation closer to the way they would be used in the real world. In my experience, rarely are we lucky enough to only have to do a square root (or any arithmetic operation) on an array of floats. In AI for example, you may only update some information once as a optimization and so in the middle of some code decide to do a distance check. With a circular HUD minimap in a game you might want to place an icon at the edge of the minimap to indicate the direction of a goal, requiring normalizing a direction and scaling it to the size of the minimap. I think the normal set of operations to do would be to do a vector subtraction, a dot product, a square root and then a division (get it, ‘normal’ :)).

3. marco says:

Very interesting post! Looking forward to the 360 results too.

4. Petrocket says:

Fantastic! Can’t wait to try these out – Maybe this is a noob question but what about and processors? Any special sauce there?

5. Elan says:

Petrocket: AMD, you mean? I haven’t got one to measure on, but you could try it. I’ll post my source code for the tests once I’ve cleaned it up. Based on the “Software Optimization Guide for AMD64 Processors” I’d bet the results are similar to those for Intel.

Lee: Testing this in the a vec-norm function is a good idea; I’ll work that up. My goal really was to measure the throughput of sqrt itself, so wrapping it in too much other code starts to make measurement unreliable. What I can tell you for sure though is that unpipelined ops are particularly bad for that case, though, because it prevents the processor from reordering other instructions to run while the sqrt is cooking. rsqrt, on the other hand, is pipelined, so even though it has a six-cycle latency, the fetch stage of the Core Duo can reorganize other work to take place until the results are ready.

DFS, marco: Thanks!

6. Sanjay says:

Nice work. Some minor issues:
1. “no compiler in the business defaults to using the new, faster SSE scalar opcodes” – Apple compilers do (including their version of gcc). OSX has never used x87 by default.

2. “it only takes one step of Newton’s Method” – I think it actually takes two steps. Intel’s RSQRTE is architected to achieve under 12 bits of precision, so in some cases you need 2 steps to get the last bit to match the hardware’s SQRT. You probably still won’t get edge cases and denormals correct with the estimate.

Interesting post, I got inspried by your post and did a comparison between GPU square root routines:

PS: I think Sanjay is correct, the RSQRTE takes two steps (at least it does for AMD)

8. Alex says:

“no compiler in the business defaults to using the new, faster SSE scalar opcodes”

64-bit gcc does by default.

9. rsw says:

By “average error” do you mean you simply took the mean of the signed error? I think RMS error or mean absolute error is a more useful metric here.

Nice writeup either way.

• Elan says:

Mean absolute error, of course. Otherwise I’d just be computing a random walk.

10. S.M says:

Before you bash gcc, did you set the “-mcpu= ” correctly for gcc ??

11. Jason says:

“no compiler in the business defaults to using the new, faster SSE scalar opcodes”

AFAIK all compilers generating x64 code use SSE by default (since all 64 bit x86 CPUs have SSE, this makes sense)

12. fioj says:

“a cryptic obscure command line option (GCC)”

What’s obscure about that command line option? It is well known to all GCC users, as well as the fact that GCC uses SSE floating point maths on amd64, as Alex mentioned.

13. Elan says:

Is someone shipping x64-only games?

14. Elias says:

Here is one technique you apparently haven’t tried: taking the number to the 0.5 power. How does that compare?

15. Elan says:

Elias: 4096 loops over 4096 floats with `x = powf(x, 0.5f)` took 1449.538ms, or 86.4ns/float. This is 3.6 times worse than even the compiler’s naive x87 `sqrt(x)`, and twenty-seven times worse than rsqrtss with one step of Newton-Rhapson iteration (which is equally accurate). Taking an exponent is a function call, and a very slow function call at that.

16. Gregory says:

I’m curious about what’s behind StartClockCycleCounter(); and StopClockCycleCounter();

What’s the best way to benchmark a portion of code on PC? QueryPerformanceCounter (windows only), RTDSC? something else?

And is there a point putting the target code inside a loop?

17. Elan says:

We use rtdsc to implement our cycle counter, since it seems to have the best resolution of any of the timers available.

It’s best to put target code in a loop for the same reason you would weigh rice by the thousand rather than one grain at a time on your kitchen scale: systematic and random error.

First, there is a certain systematic error in querying the cycle counter — the StartCycleCounter() inline function call and the rtdsc op itself have some latency, and the timer is probably only accurate to within a couple of nanoseconds, measuring a single iteration of an 86ns operation would have a large relative error. On the other hand, a relative error of 10-8 in 10-3 seconds is much smaller, and so more accurate.

Also, timings in vivo can be messy: any single iteration of the loop might take a little longer than expected because of other threads, memory bus contention, clock variability, operating system intervention, even CPU temperature. Taking multiple measurements, or a single measurement of multiple iterations, improves statistical significance and narrows the error bars.

18. Gregory says:

Hello Elan,

Thank you for the answer.

Since I posted my comment, I came across http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx so I think I’m going to implement my high precision timer using QPC on Windows and gettimeofday on Linux and Mac.

Or the easy way seems to be Boost.PTime

19. Yuhong Bao says:

“If your customer’s PC has DirectX 9, it has SSE2.”
Not necessarily, as the graphics card choice is independent of the CPU choice.

20. jalf says:

Did you ever get around to cleaning up the source code?

21. Bruce Dawson says:

One part of the reason that the x87 square root is slower is that it is probably calculating square root to double precision. The x87 hardware defaults to calculating square root to 80-bit precision, the C run-time changes the setting so it calculates it to 64-bit precision, but if you change it to calculate it to 32-bit precision (controlfp) then it will be a bit faster. Square root and float-divide timings on x87 are both affected by the selected precision. I don’t think any other operations are.

Of course, changing a global (well, global per-thread) setting in order to get a local optimization is pretty crazy, so your points stand.

22. […] this guy went into it in much more thorough detail if you’re interested!) This entry was posted in […]

23. ns says:

nice article!

sqrt(x) = x*rsqrt(x) is definitely fast but suffers from a NaN problem for x==0.
sqrt(0) = 0*rsqrt(0) = 0*(1/sqrt(0)) = 0*(1/0) = 0*INF = NaN

using the rcp instruction (_mm_rcp_ss) computing 1/x, the square root could be computed as
sqrt(x) = rcp( rsqrt(x) )
This doesn’t produce a NaN
sqrt(0) = rcp(rsqrt(0)) = 1/(1/sqrt(0)) = 1/INF = 0
It’s still very fast (not very accurate though!)

24. CeeG says:

sqrt(y) give this iterate: x += 0.5*(y/x – x)
rsqrt(y) give this iterate: x += 0.5*(x – y*x*x*x)

– No division in the second scheme, maybe this is why rsqrt() is faster both in software and hardware design.

1/y give this iterate: x += x*(1-y*x)

– Maybe this is why sqrt() is an unpipelined instruction (two interleaved iterative loop).

25. Colin Howell says:

I’ve come across this far too late; I found it by chance while looking for other information. But I thought I’d comment anyway, since I saw a couple of misconceptions which I’d like to correct.

As you saw in the Intel optimization manual, the SQRTSS instruction isn’t pipelined. But you interpreted this a little too strongly when you said, “When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it’s like putting the handbrake on your processor.”

The SQRTSS instruction’s effects aren’t quite *that* severe on the Core 2 or its descendants. SQRTSS only uses a single execution unit (the divide/square-root unit) during its processing, and it’s *that unit* which isn’t pipelined. So any further instructions needing that unit (such as additional square root or divide instructions) will be blocked from proceeding until the SQRTSS is finished. And of course, the SQRTSS will delay any instructions which need its result until it is ready. However, other instructions which don’t depend on the SQRTSS result and don’t need the divide unit will still be free to proceed.

You also speculated about the implementation of SQRTSS: “In the case of ssqrt, the processor is probably doing the same thing internally that I’m doing in my “fast” function — taking an estimated reciprocal square root, improving it with Newton’s method, and then multiplying it by the input parameter. Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it’s all done.”

While this isn’t an unreasonable method, it’s not what the processor is actually doing. As I said, SQRTSS uses the same functional unit as the divide instructions do, but it uses it in a different way. It turns out that there are digit-by-digit algorithms for computing square root which are quite different from Newton’s method of successive approximations. They more closely resemble conventional long division, except modified to exploit the fact that the algorithm is trying to find a quotient which is also the divisor. Since these square-root algorithms are similar to division algorithms, they can make use of the same hardware with modest modifications, and control lines (set according to the instruction being executed) determine whether a division or a square root is performed.

Unfortunately, the digit-by-digit algorithms commonly used for hardware division and square root are very difficult to pipeline. Thus even today, the divide and square root instructions on Intel x86 processors will basically block that functional unit until the result is complete. Some later processors do a limited amount of overlapping, but not really true pipelining. The biggest way these operations have been sped up over the years has been to improve the divider to generate more quotient bits per cycle (in effect, making each “digit” larger).