The square root is one of those basic mathematical operations that’s totally ubiquitous in any game’s source code, and yet also has many competing implementations and performance superstitions around it. The compiler offers a sqrt() builtin function, and so do some CPUs, but some programmers insist on writing their own routines in software. And often it’s really the reciprocal square root you want, for normalizing a vector, or trigonometry. But I’ve never had a clear answer for which technique is really fastest, or exactly what accuracy-vs-speed tradeoffs we make with “estimating” intrinsics.

What is the fastest way to compute a square root? It would seem that if the CPU has a native square-root opcode, there’s no beating the hardware, but is it really true?

Such questions vex me, so I went and *measured* all the different means of computing the square root of a scalar single-precision floating point number that I could think of. I ran trials on my Intel Core 2 and on the Xenon, comparing each technique for both speed and accuracy, and some of the results were surprising!

In this article I’ll describe my results for the Intel hardware; next week I’ll turn to the Xenon PPC.

## Experimental setup

I’ll post the whole source code for my tests elsewhere, but basically each of these trials consists of iterating N times over an array of floating point numbers, calling square root upon each of them and writing it to a second output array.

[DDET (see pseudocode)]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
| inline float TestedFunction( float x )
{
return sqrt(x); // one of many implementations..
}
void TimeSquareRoot()
{
float numbersIn[ ARRAYSIZE ]; // ARRAYSIZE chosen so that both arrays
float numbersOut[ ARRAYSIZE ]; // fit in L1 cache
// assume that numbersIn is filled with random positive numbers, and both arrays are
// prefetched to cache...
StartClockCycleCounter();
for ( int i = 0 ; i < NUMITERATIONS ; ++i )
for ( int j = 0 ; j < ARRAYSIZE ; ++j ) // in some cases I unroll this loop
{
numbersOut[j] = TestedFunction( numbersIn[j] );
}
StopClockCycleCounter();
printf( "%.3f millisec for %d floatsn",
ClockCycleCounterInMilliseconds(), ARRAYSIZE * NUMITERATIONS );
// now measure accuracy
float error = 0;
for ( int i = 0 ; i < ARRAYSIZE ; ++i )
{
double knownAccurate = PerfectSquareRoot( numbersIn[i] );
error += fabs( numbersOut[i] - knownAccurate ) / knownAccurate ;
}
error /= ARRAYSIZE ;
printf( "Average error: %.5f%%n", error * 100.0f );
} |

[/DDET]

In each case I verified that the compiler was not eliding any computations (it really was performing ARRAYSIZE × NUMITERATIONS many square roots), that it was properly inlining the tested function, and that all the arrays fit into L1 cache so that memory latency wasn’t affecting the results. I also **only tested ***scalar* square root functions — SIMD would clearly be the fastest way of working on large contiguous arrays, but I wanted to measure the different techniques of computing **one** square root at a time, as is usually necessary in gameplay code.

Because some of the speedup techniques involve trading off accuracy, I compared the resulting numbers against the perfectly-accurate double-precision square root library routine to get an average error for each test run.

And I performed each run multiple times with different data, averaging the final results together.

## x86 results

I ran my tests on a 2.66Ghz Intel Core 2 workstation. An x86 chip actually has two different means of performing scalar floating-point math. By default, the compiler uses the old x87 FPU, which dates back to 1980 with a stack-based instruction set like one of those old RPN calculators. In 1999, Intel introduced SSE, which added a variety of new instructions to the processor. SSE is mostly thought of as a SIMD instruction set — for operating on four 32-bit floats in a single op — but it also includes an entire set of *scalar* floating point instructions that operate on only one float at a time. It’s faster than the x87 operations and was meant to deprecate the old x87 pathway. However, both the MSVC and GCC compilers default to exclusively using the x87 for scalar math, so unless you edit the “code generation” project properties panel (MSVC) or provide a cryptic obscure command line option (GCC), you’ll be stuck with code that uses the old slow way.

I timed the following techniques for square root:

- The compiler’s built in
`sqrt()`

function (which compiled to the x87 FSQRT opcode)
- The SSE “scalar single square root” opcode
`sqrtss`

, which MSVC emits if you use the `_mm_sqrt_ss`

intrinsic or if you set `/arch:SSE2`

- The “magic number” approximation technique invented by Greg Walsh at Ardent Computer and made famous by John Carmack in the Quake III source code.
- Taking the estimated
*reciprocal* square root of *a* via the SSE opcode `rsqrtss`

, and multiplying it against *a* to get the square root via the identity x / √x = √x.
- Method (4), with one additional step of Newton-Raphson iteration to improve accuracy.
- Method (5), with the loop at line 13 of the pseudocode above unrolled to process four floats per iteration.

I also tested three ways of getting the *reciprocal* square root: Carmack’s technique, the `rsqrtss`

SSE op via compiler intrinsic, and `rsqrtss`

with one Newton-Raphson step.

The results, for 4096 loops over 4096 single-precision floats, were:

__SQUARE ROOT__
Method |
Total time |
Time per float |
Avg Error |

Compiler `sqrt(x)` / x87 FPU `FSQRT` |
404.029ms |
24ns |
0.0000% |

SSE intrinsic `ssqrts` |
200.395ms |
11.9ns |
0.0000% |

Carmack’s Magic Number rsqrt * x |
72.682ms |
4.33ns |
0.0990% |

SSE `rsqrtss` * x |
20.495ms |
1.22ns |
0.0094% |

SSE `rsqrtss` * x with one NR step |
53.401ms |
3.18ns |
0.0000% |

SSE `rsqrtss` * x with one NR step, unrolled by four |
48.701ms |
2.90ns |
0.0000% |

__RECIPROCAL SQRT__
Method |
Total time |
Time per float |
Avg Error |

Carmack’s Magic Number rsqrt |
59.378ms |
3.54ns |
0.0990% |

SSE `rsqrtss` |
14.202ms |
0.85ns |
0.0094% |

SSE `rsqrtss` with one NR step |
45.952ms |
2.74ns |
0.0000% |

## Discussion

Looking at these results, it’s clear that there’s a dramatic difference in performance between different approaches to performing square root; which one you choose really can have a significant impact on framerate and accuracy. My conclusions are:

**Don’t trust the compiler to do the right thing.** The received wisdom on performance in math functions is usually “don’t reinvent the wheel; the library and compiler are smart and optimal.” We see here that *this is completely wrong*, and in fact calling the library `sqrt(x)`

causes the compiler to do exactly the worst possible thing. The compiler’s output for ` y = sqrt(x); `

is worse by orders of magnitude compared to any other approach tested here.

**The x87 FPU is really very slow.** Intel has been trying to deprecate the old x87 FPU instructions for a decade now, but no compiler in the business defaults to using the new, faster SSE scalar opcodes in place of emulating a thirty-year-old 8087. In the case of ` y = sqrt(x) `

, by default MSVC and GCC emit something like

fld DWORD PTR [ecx]
fsqrt ;; slow x87 flop
fstp DWORD PTR [eax] |

But if I set the `/arch:SSE2`

option flag, telling the compiler “assume this code will run on a machine with SSE2″, it will instead emit the following, which is 2x faster.

sqrtss xmm0, DWORD PTR [ecx] ;; faster SSE scalar flop
movss DWORD PTR [eax], xmm0 |

There was a time when not every PC on the market had SSE2, meaning that there was some sense in using the older, more backwards-compatible operations, but that time has long since passed. SSE2 was introduced *in 2001 with the Pentium 4*. No one is ever going to try to play your game on a machine that doesn’t support it. If your customer’s PC has DirectX 9, it has SSE2.

**You ***can* beat the hardware. The most surprising thing about these results for me was that it is faster to take a reciprocal square root and multiply it, than it is to use the native sqrt opcode, by an order of magnitude. Even Carmack’s trick, which I had assumed was obsolete in an age of deep pipelines and load-hit-stores, proved faster than the native SSE scalar op. Part of this is that the reciprocal sqrt opcode `rsqrtss`

is an estimate, accurate to twelve bits; but it only takes one step of Newton’s Method to converge that estimate to an accuracy of 24 bits while still being four times faster than the hardware square root opcode.

The question that then bothered me was, *why* is SSE’s built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations? The first hint came when I tried unrolling the loop so that it performed four ops inside the inner for():

for ( int i = 0 ; i < NUMITERATIONS ; ++i )
for ( int j = 0 ; j < ARRAYSIZE ; j += 4 ) // in some cases I unroll this loop
{
numbersOut[j + 0] = TestedSqrt( numbersIn[j + 0] );
numbersOut[j + 1] = TestedSqrt( numbersIn[j + 1] );
numbersOut[j + 2] = TestedSqrt( numbersIn[j + 2] );
numbersOut[j + 3] = TestedSqrt( numbersIn[j + 3] );
}
// two implementations of |

As you can see from the results above, when TestedSqrt was the `rsqrtss`

followed by a multiply and one step of Newton iteration, unrolling the loop this way provided a modest 8.8% improvement in speed. But when I tried the same thing with the “precise square root” op `sqrtss`

, the difference was negligible:

SSE sqrt: 200.395 msec
average error 0.0000%
SSE sqrt, unrolled four: 196.741 msec
average error 0.0000%

What this suggests is that unrolling the loop this way allowed the four rsqrt paths to be pipelined, so that while an individual `rsqrtss`

might take 6 cycles to execute before its result was ready, other work could proceed during that time so that the four square root operations in the loop overlapped. On the other hand, the non-estimated `sqrtss`

op apparently cannot be overlapped; one sqrt must finish before the next can begin. A look at the IntelĀ® 64 and IA-32 Architectures Optimization Reference Manual confirms: `sqrtss`

is an unpipelined instruction.

**Pipelined operations make a big difference.** When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it’s like putting the handbrake on your processor. You can identify nonpipelined operations in appendix C of the Optimization Reference Manual as the ones that have a throughput equal to latency and greater than 4 cycles.

In the case of `ssqrt`

, the processor is probably doing the same thing internally that I’m doing in my “fast” function — taking an estimated reciprocal square root, improving it with Newton’s method, and then multiplying it by the input parameter. Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it’s all done. But if you break up the work so that each of those steps is its own instruction, then the CPU can pipeline them all, and get a much higher *throughput* even if the latency is the same.

Pipeline latency and microcoded instructions are *a much bigger* deal on the 360 and PS3, whose CPUs don’t reorder operations to hide bubbles; there the benefit from unrolling is much greater, as you’ll see next week.

## Conclusion

Not all square root functions are created equal, and writing your own can have very real performance benefits over trusting the compiler to optimize your code for you (at which it fails miserably). In many cases you can trade off some accuracy for a massive increase in speed, but even in those places where you need full accuracy, writing your own function to leverage the `rsqrtss`

op followed by Newton’s method can still give you 32 bits of precision at a 4x-8x improvement over what you will get with the built-in `sqrtf()`

function.

And if you have *lots* of numbers you need to square root, of course SIMD (`rsqrtps`

) will be faster still.