Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.

The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, three out of four vectors **were not aligned** to a 16-byte boundary, which is significant for the SIMD test case, as I had to use the unaligned `movups` load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.

Each normalize function computed the reciprocal of the vector's length, 1/√(x² + y² + z²), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference was in how the reciprocal square root was computed:

- via the x87 FPU, by simply compiling `1.0f/sqrt( x*x + y*y + z*z )`
- via the SSE scalar unit, by compiling `1.0f/sqrt( x*x + y*y + z*z )` with the /arch:SSE2 option set; this causes the compiler to issue a `sqrtss` followed by an `fdiv`, *ie*, it computes the square root and then divides one by it
- via the SSE scalar unit, by using the estimated reciprocal square root intrinsic (`rsqrtss`) and then performing one step of Newton-Raphson iteration
- via the SSE SIMD unit, working on the whole vector at once
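The rsqrt-plus-Newton-Raphson variant can be sketched roughly like this (my own illustrative code, not the actual test function; the name is made up, but the refinement formula is the standard one):

```c
#include <xmmintrin.h>

/* Sketch: normalize one 3-vector using the SSE estimated reciprocal
   square root plus one Newton-Raphson refinement step. rsqrtss alone
   is only good to about 12 bits; one NR step gets it to roughly 22.
   Writes the normalized vector through `out`, returns the length. */
static float normalize_rsqrt_nr(const float *v, float *out)
{
    float d = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];   /* dot(v, v) */

    /* y0 ~= 1/sqrt(d), accurate to roughly 12 bits */
    float y0 = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(d)));

    /* one Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * d * y0 * y0) */
    float y1 = y0 * (1.5f - 0.5f * d * y0 * y0);

    out[0] = v[0] * y1;
    out[1] = v[1] * y1;
    out[2] = v[2] * y1;

    return d * y1;   /* |v| = d / sqrt(d) = d * (1/sqrt(d)) */
}
```

The compiler versions differ only in replacing the two `rsqrtss`/NR lines with `1.0f / sqrtf(d)`.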

In all cases the results were accurate to 22 bits of precision. The results for 1,396,736 vector normalizations were:

| Method | Total time | Time per vector |
|---|---|---|
| Compiler `1.0/sqrt(x)`, x87 FPU `fsqrt` | 52.469ms | 37.6ns |
| Compiler `1.0/sqrt(x)`, SSE scalar `sqrtss` | 27.233ms | 19.5ns |
| SSE scalar ops, `rsqrtss` with one NR step | 21.631ms | 15.5ns |
| SSE SIMD ops, `rsqrtps` with one NR step | 20.034ms | 14.3ns |
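As an aside, the 22-bit figure is what you'd expect from Newton-Raphson's quadratic convergence. The refinement step and its error behavior are:

```latex
% Newton-Raphson step for y \approx 1/\sqrt{d}, with relative error
% \epsilon that roughly squares each iteration:
y_{n+1} = y_n \left( \tfrac{3}{2} - \tfrac{1}{2}\, d\, y_n^2 \right),
\qquad
\epsilon_{n+1} \approx \tfrac{3}{2}\, \epsilon_n^2
```

The hardware estimate is documented as accurate to a relative error of at most 1.5 × 2⁻¹², so one step lands around 2⁻²², *ie* about 22 bits.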

Two things jump out here. First, even when the square root op is surrounded by lots of other math — multiplies, adds, loads, stores — optimizations such as this can make a huge difference. It’s not just the cost of the sqrt itself, but also that it’s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it’s entirely completed.

Second, in this case, SIMD is only a very modest benefit. That’s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what’s meant by “horizontal” SIMD computation: operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V = ⟨x, y, z⟩, the sum x + y + z is *horizontal*; but given two vectors V₁ and V₂, the sum V₃ = ⟨x₁+x₂, y₁+y₂, z₁+z₂⟩ is *vertical*.) So it really doesn’t play to SIMD’s strengths at all.
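To make "horizontal" concrete, here's a small illustration of my own (not from the test code): summing the three components of one register takes a couple of shuffles before you can even add, whereas a vertical add of two registers is a single `addps`.

```c
#include <xmmintrin.h>

/* Horizontal: x + y + z of one register <x,y,z,_>. SSE has no single
   instruction for this, so we shuffle copies into place and add. */
static float hsum3(__m128 v)
{
    __m128 yzxw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 0, 2, 1)); /* y,z,x,_ */
    __m128 zxyw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 1, 0, 2)); /* z,x,y,_ */
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(v, yzxw), zxyw));
}

/* Vertical: element-wise sum of two registers is one addps. */
static __m128 vsum(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```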

On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn’t to test performance in tight loops over packed data — it was to figure out the best way to do something like an angle check in the middle of a character’s AI, where you usually deal with one vector at a time.
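If you did want to batch them, the usual trick is a structure-of-arrays layout: one register of x's, one of y's, one of z's, so the dot products and rsqrt all become vertical ops. A sketch under that assumption (names and layout are mine, not the article's test code):

```c
#include <xmmintrin.h>

/* Sketch: normalize four 3-vectors at once, stored structure-of-arrays
   style -- *x holds four x components, *y four y's, *z four z's.
   Uses the packed rsqrtps estimate plus one Newton-Raphson step. */
static void normalize4_soa(__m128 *x, __m128 *y, __m128 *z)
{
    /* four dot products at once: d = x*x + y*y + z*z */
    __m128 d = _mm_add_ps(_mm_add_ps(_mm_mul_ps(*x, *x),
                                     _mm_mul_ps(*y, *y)),
                          _mm_mul_ps(*z, *z));

    /* four reciprocal square-root estimates, then one NR step:
       y1 = y0 * (1.5 - 0.5 * d * y0 * y0) */
    __m128 y0 = _mm_rsqrt_ps(d);
    __m128 y1 = _mm_mul_ps(y0,
        _mm_sub_ps(_mm_set1_ps(1.5f),
                   _mm_mul_ps(_mm_set1_ps(0.5f),
                              _mm_mul_ps(d, _mm_mul_ps(y0, y0)))));

    *x = _mm_mul_ps(*x, y1);
    *y = _mm_mul_ps(*y, y1);
    *z = _mm_mul_ps(*z, y1);
}
```

The shuffling in and out of this layout isn't free, which is part of why it only pays off in tight loops over packed data rather than in one-off AI checks.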

Source code for my testing functions below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector’s length. The hand-written intrinsic versions probably aren’t totally optimal, but they ought to be good enough to make the point.
