## Square Roots in vivo: normalizing vectors

Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.

The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, this means that three out of four vectors **were not aligned** to a 16-byte boundary, which is significant for the SIMD test case as I had to use the `movups` unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.

Each normalize function computed the reciprocal of the vector’s length, 1/√(x^{2} + y^{2} + z^{2}), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference was in how the reciprocal square root was computed:

- via the x87 FPU, by simply compiling `1.0f/sqrt( x*x + y*y + z*z )`
- via the SSE scalar unit, by compiling `1.0f/sqrt( x*x + y*y + z*z )` with the /arch:SSE2 option set; this causes the compiler to issue a `sqrtss` followed by a `divss`; *ie*, it computes the square root and then divides one by it
- via the SSE scalar unit, by using the estimated reciprocal square root intrinsic and then performing one step of Newton-Raphson iteration
- via the SSE SIMD unit, working on the whole vector at once
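For reference, the Newton-Raphson step used by the last two methods can be written in plain scalar C: given an estimate r ≈ 1/√x, it computes r′ = ½·r·(3 − x·r²), which roughly doubles the approximately 12 bits of precision the `rsqrtss` estimate provides. A minimal sketch (the helper name is mine):

```c
/* One Newton-Raphson refinement of an approximate reciprocal
   square root: given estimate ~= 1/sqrt(x), returns a value
   with roughly twice as many correct bits. This is the same
   step the intrinsic versions below perform with SSE ops. */
static inline float rsqrt_nr(float x, float estimate)
{
    return 0.5f * estimate * (3.0f - x * estimate * estimate);
}
```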

In all cases the results were accurate to 22 bits of precision. The results for 1,396,736 vector normalizations were:

| Method | Total time | Time per vector |
|---|---|---|
| Compiler `1.0/sqrt(x)`, x87 FPU `FSQRT` | 52.469ms | 37.6ns |
| Compiler `1.0/sqrt(x)`, SSE scalar `sqrtss` | 27.233ms | 19.5ns |
| SSE scalar ops, `rsqrtss` with one NR step | 21.631ms | 15.5ns |
| SSE SIMD ops, `rsqrtss` with one NR step | 20.034ms | 14.3ns |

Two things jump out here. First, even when the square root op is surrounded by lots of other math — multiplies, adds, loads, stores — optimizations such as this can make a huge difference. It’s not just the cost of the sqrt itself, but also that it’s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it’s entirely completed.

Second, in this case, SIMD is only a very modest benefit. That’s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what’s meant by “horizontal” SIMD computation: operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V = <x,y,z>, the sum x + y + z is *horizontal*; but given two vectors V_{1} and V_{2}, the sum V_{3} = <x_{1}+x_{2}, y_{1}+y_{2}, z_{1}+z_{2}> is *vertical*.) So it really doesn’t play to SIMD’s strengths at all.
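To make the horizontal/vertical distinction concrete, here is a small intrinsics sketch (helper names are mine): the vertical add is a single `addps`, while the horizontal x + y + z sum needs the same rotate-and-add shuffle dance used in the SIMD normalize.

```c
#include <xmmintrin.h>

/* Vertical: corresponding lanes of two vectors, one op. */
static __m128 vertical_add(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* Horizontal: sum x + y + z within one vector via rotate-and-add,
   leaving the result in the low word. */
static float horizontal_add3(__m128 v)
{
    __m128 rot = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));   /* YZWX */
    __m128 sum = _mm_add_ss(v, rot);                              /* x+y in low word */
    rot = _mm_shuffle_ps(rot, rot, _MM_SHUFFLE(0, 3, 2, 1));      /* ZWXY */
    sum = _mm_add_ss(sum, rot);                                   /* x+y+z in low word */
    return _mm_cvtss_f32(sum);
}
```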

On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn’t to test performance in tight loops over packed data — it was to figure out the best way to do something like an angle check in the middle of a character’s AI, where you usually deal with one vector at a time.

Source code for my testing functions below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector’s length. The hand-written intrinsic versions probably aren’t totally optimal, but they ought to be good enough to make the point.

[DDET Naive vector normalize, x87 FPU or SSE scalar]

__Source__

```cpp
// Normalizes an assumed 3-element vector starting
// at pointer V, and returns the length of the original
// vector.
inline float NaiveTestNormalize( float * RESTRICT vOut, const float * RESTRICT vIn )
{
	const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
	const float rsqt = 1.0f / sqrt(l);
	vOut[0] = vIn[0] * rsqt;
	vOut[1] = vIn[1] * rsqt;
	vOut[2] = vIn[2] * rsqt;
	return rsqt * l;
}
```

__Assembly (x87 FPU)__

```asm
_TEXT	SEGMENT
_vOut$ = 8	; size = 4
_vIn$ = 12	; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC	; TestNormalize, COMDAT
; 396 : const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
	mov	eax, DWORD PTR _vIn$[esp-4]
	fld	DWORD PTR [eax+8]
; 397 : const float rsqt = 1.0f / sqrt(l);
; 398 : vOut[0] = vIn[0] * rsqt;
	mov	ecx, DWORD PTR _vOut$[esp-4]
	fld	DWORD PTR [eax+4]
	fld	DWORD PTR [eax]
	fmul	ST(0), ST(0)
	fld	ST(1)
	fmulp	ST(2), ST(0)
	faddp	ST(1), ST(0)
	fld	ST(1)
	fmulp	ST(2), ST(0)
	faddp	ST(1), ST(0)
	fld	ST(0)
	fsqrt
	fld1
	fdivrp	ST(1), ST(0)
	fld	DWORD PTR [eax]
	fmul	ST(0), ST(1)
	fstp	DWORD PTR [ecx]
; 399 : vOut[1] = vIn[1] * rsqt;
	fld	ST(0)
	fmul	DWORD PTR [eax+4]
	fstp	DWORD PTR [ecx+4]
; 400 : vOut[2] = vIn[2] * rsqt;
	fld	ST(0)
	fmul	DWORD PTR [eax+8]
	fstp	DWORD PTR [ecx+8]
; 401 : return rsqt * l;
	fmulp	ST(1), ST(0)
; 402 : }
	ret	0
?TestNormalize@@YAMPIAMPIBM@Z ENDP	; TestNormalize
_TEXT	ENDS
```

__Assembly (compiler-issued SSE scalar)__

```asm
_TEXT	SEGMENT
_l$ = -4	; size = 4
_vOut$ = 8	; size = 4
_rsqt$ = 12	; size = 4
_vIn$ = 12	; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC	; TestNormalize, COMDAT
; 392 : {
	push	ecx
; 393 : const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
	mov	eax, DWORD PTR _vIn$[esp]
	movss	xmm1, DWORD PTR [eax+4]
	movss	xmm2, DWORD PTR [eax]
	movss	xmm0, DWORD PTR [eax+8]
; 394 : const float rsqt = 1.0f / sqrt(l);
; 395 : vOut[0] = vIn[0] * rsqt;
	mov	eax, DWORD PTR _vOut$[esp]
	movaps	xmm3, xmm2
	mulss	xmm3, xmm2
	movaps	xmm4, xmm1
	mulss	xmm4, xmm1
	addss	xmm3, xmm4
	movaps	xmm4, xmm0
	mulss	xmm4, xmm0
	addss	xmm3, xmm4
	movss	DWORD PTR _l$[esp+4], xmm3
	sqrtss	xmm4, xmm3			;; slow full-precision square root gets stored in xmm4
	movss	xmm3, DWORD PTR __real@3f800000	;; store 1.0 in xmm3
	divss	xmm3, xmm4			;; divide 1.0 / xmm4 to get the reciprocal square root !?!
	movss	DWORD PTR _rsqt$[esp], xmm3
; 396 : vOut[1] = vIn[1] * rsqt;
; 397 : vOut[2] = vIn[2] * rsqt;
; 398 : return rsqt * l;
	fld	DWORD PTR _rsqt$[esp]
	mulss	xmm2, xmm3
	fmul	DWORD PTR _l$[esp+4]
	mulss	xmm1, xmm3
	mulss	xmm0, xmm3
	movss	DWORD PTR [eax], xmm2
	movss	DWORD PTR [eax+4], xmm1
	movss	DWORD PTR [eax+8], xmm0
; 399 : }
	pop	ecx
	ret	0
?TestNormalize@@YAMPIAMPIBM@Z ENDP	; TestNormalize
_TEXT	ENDS
```

[/DDET]

[DDET Vector normalize, hand-written SSE scalar by intrinsics]

__Source__

```cpp
// SSE scalar reciprocal sqrt using rsqrt op, plus one Newton-Raphson iteration
inline __m128 SSERSqrtNR( const __m128 x )
{
	__m128 recip = _mm_rsqrt_ss( x );  // "estimate" opcode
	const static __m128 three = { 3, 3, 3, 3 }; // aligned consts for fast load
	const static __m128 half = { 0.5, 0.5, 0.5, 0.5 };
	__m128 halfrecip = _mm_mul_ss( half, recip );
	__m128 threeminus_xrr = _mm_sub_ss( three, _mm_mul_ss( x, _mm_mul_ss( recip, recip ) ) );
	return _mm_mul_ss( halfrecip, threeminus_xrr );
}

inline __m128 SSE_ScalarTestNormalizeFast( float * RESTRICT vOut, float * RESTRICT vIn )
{
	__m128 x = _mm_load_ss(&vIn[0]);
	__m128 y = _mm_load_ss(&vIn[1]);
	__m128 z = _mm_load_ss(&vIn[2]);
	const __m128 l =  // compute x*x + y*y + z*z
		_mm_add_ss( _mm_add_ss( _mm_mul_ss(x,x), _mm_mul_ss(y,y) ),
		            _mm_mul_ss( z, z ) );
	const __m128 rsqt = SSERSqrtNR( l );
	_mm_store_ss( &vOut[0], _mm_mul_ss( rsqt, x ) );
	_mm_store_ss( &vOut[1], _mm_mul_ss( rsqt, y ) );
	_mm_store_ss( &vOut[2], _mm_mul_ss( rsqt, z ) );
	return _mm_mul_ss( l, rsqt );
}
```

__Assembly__

```asm
_TEXT	SEGMENT
_vOut$ = 8	; size = 4
_vIn$ = 12	; size = 4
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC	; SSE_ScalarTestNormalizeFast, COMDAT
	push	ebp
	mov	ebp, esp
	and	esp, -16	; fffffff0H
	mov	eax, DWORD PTR _vIn$[ebp]
	movss	xmm0, DWORD PTR [eax]
	movss	xmm3, DWORD PTR [eax+4]
	movaps	xmm7, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
	movaps	xmm2, xmm0
	movss	xmm0, DWORD PTR [eax+8]
	mov	eax, DWORD PTR _vOut$[ebp]
	movaps	xmm4, xmm0
	movaps	xmm0, xmm2
	mulss	xmm0, xmm2
	movaps	xmm1, xmm3
	mulss	xmm1, xmm3
	addss	xmm0, xmm1
	movaps	xmm1, xmm4
	mulss	xmm1, xmm4
	addss	xmm0, xmm1
	movaps	xmm1, xmm0
	rsqrtss	xmm1, xmm1
	movaps	xmm5, xmm1
	mulss	xmm1, xmm5
	movaps	xmm6, xmm0
	mulss	xmm6, xmm1
	movaps	xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
	mulss	xmm1, xmm5
	subss	xmm7, xmm6
	mulss	xmm1, xmm7
	movaps	xmm5, xmm1
	mulss	xmm5, xmm2
	movss	XMMWORD PTR [eax], xmm5
	movaps	xmm2, xmm1
	mulss	xmm2, xmm3
	movss	XMMWORD PTR [eax+4], xmm2
	movaps	xmm2, xmm1
	mulss	xmm2, xmm4
	movss	XMMWORD PTR [eax+8], xmm2
	mulss	xmm0, xmm1
	mov	esp, ebp
	pop	ebp
	ret	0
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP	; SSE_ScalarTestNormalizeFast
_TEXT	ENDS
```

[/DDET]

[DDET Vector normalize, hand-written SSE SIMD by intrinsics]

__Source__

```cpp
inline __m128 SSE_SIMDTestNormalizeFast( float * RESTRICT vOut, float * RESTRICT vIn )
{
	// load as a SIMD vector
	const __m128 vec = _mm_loadu_ps(vIn);

	// compute a dot product by computing the square, and
	// then rotating the vector and adding, so that the
	// dot ends up in the low term (used by the scalar ops)
	__m128 dot = _mm_mul_ps( vec, vec );
	// rotate x under y and add together
	__m128 rotated = _mm_shuffle_ps( dot, dot, _MM_SHUFFLE( 0, 3, 2, 1 ) ); // YZWX ( shuffle macro is high to low word )
	dot = _mm_add_ss( dot, rotated );  // x^2 + y^2 in the low word
	rotated = _mm_shuffle_ps( rotated, rotated, _MM_SHUFFLE( 0, 3, 2, 1 ) ); // ZWXY
	dot = _mm_add_ss( dot, rotated );  // x^2 + y^2 + z^2 in the low word

	__m128 recipsqrt = SSERSqrtNR( dot ); // contains reciprocal square root in low term
	recipsqrt = _mm_shuffle_ps( recipsqrt, recipsqrt, _MM_SHUFFLE( 0, 0, 0, 0 ) ); // broadcast low term to all words

	// multiply 1/sqrt(dotproduct) against all vector components, and write back
	const __m128 normalized = _mm_mul_ps( vec, recipsqrt );
	_mm_storeu_ps( vOut, normalized );

	return _mm_mul_ss( dot, recipsqrt );
}
```

__Assembly__

```asm
_TEXT	SEGMENT
_vOut$ = 8	; size = 4
_vIn$ = 12	; size = 4
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC	; SSE_SIMDTestNormalizeFast, COMDAT
	push	ebp
	mov	ebp, esp
	and	esp, -16	; fffffff0H
	mov	eax, DWORD PTR _vIn$[ebp]
	movups	xmm2, XMMWORD PTR [eax]		;; load the input vector
	movaps	xmm5, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B	;; load the constant "3"
	mov	ecx, DWORD PTR _vOut$[ebp]
	movaps	xmm0, xmm2
	mulps	xmm0, xmm2
	movaps	xmm1, xmm0
	shufps	xmm1, xmm0, 57		; shuffle to YZWX
	addss	xmm0, xmm1		; add Y to low word of xmm0
	shufps	xmm1, xmm1, 57		; shuffle to ZWXY
	addss	xmm0, xmm1		; add Z to low word of xmm0
	movaps	xmm1, xmm0
	rsqrtss	xmm1, xmm1		; reciprocal square root estimate
	movaps	xmm3, xmm1
	mulss	xmm1, xmm3
	movaps	xmm4, xmm0
	mulss	xmm4, xmm1
	movaps	xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
	mulss	xmm1, xmm3
	subss	xmm5, xmm4
	mulss	xmm1, xmm5		; Newton-Raphson finishes here; 1/sqrt(dot) is in xmm1's low word
	shufps	xmm1, xmm1, 0		; broadcast so that xmm1 has 1/sqrt(dot) in all words
	movaps	xmm3, xmm1
	mulps	xmm3, xmm2		; multiply all words of original vector by 1/sqrt(dot)
	movups	XMMWORD PTR [ecx], xmm3	; unaligned save to memory
	; return dot * 1/sqrt(dot) == sqrt(dot) == length of vector
	mulss	xmm0, xmm1
	mov	esp, ebp
	pop	ebp
	ret	0
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP	; SSE_SIMDTestNormalizeFast
_TEXT	ENDS
```

[/DDET]

Nice work. I just have a dumb question (and maybe you answered it somewhere): if the main reason for square root is proximity detection, why bother, since sqrt(a^{2} + b^{2}) < r iff (a^{2} + b^{2}) < r^{2}?
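In code, the commenter's point is simply this (a hypothetical helper, not from the post):

```c
#include <stdbool.h>

/* Radius test without a square root: compare the squared
   distance against the squared radius instead. */
static bool within_range(const float a[3], const float b[3], float r)
{
    const float dx = a[0] - b[0];
    const float dy = a[1] - b[1];
    const float dz = a[2] - b[2];
    return dx*dx + dy*dy + dz*dz < r*r;
}
```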

Oh, distance checking was just the simple example I used for clarity. In practice the most common use of sqrt by far is to normalize vectors, which we do for a hundred different things: computing lighting, constructing a coordinate basis from local objects, extracting an absolute angle from a dot product, making rotation quaternions, turning distance vectors into a unit-length direction, etc.

I also occasionally need the sqrt itself to solve a quadratic, like computing an apsis or a ballistic intercept.

So you dropped Q3’s fast inverse square root from this benchmark compared to the previous article because it should be around 3.5 times slower compared to SSE rsqrtss (according to the previous benchmark results)?

I’m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.

I wouldn’t have expected you to structure your data in that way. I would have expected either homogeneous vectors { x, y, z, w }, which allow a 4×4 matrix to represent an affine transformation (e.g. projection, translations), or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.

If you use the SoA approach you would do aligned loads from the packed x-, y- and z-arrays and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. `rsqrtss` -> `rsqrtps`). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn’t evenly divisible by 8 you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of floats that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than to pollute the instruction cache with extra code).
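The SoA inner loop described above might look like this with intrinsics: a sketch assuming `count` is a multiple of four and the three arrays are 16-byte aligned (the function name is mine).

```c
#include <xmmintrin.h>

/* Normalize `count` vectors stored SoA-style: x, y, z in separate
   16-byte-aligned arrays. Four vectors per iteration, using aligned
   loads and the packed rsqrtps estimate plus one Newton-Raphson step. */
static void normalize_soa(float *px, float *py, float *pz, int count)
{
    const __m128 three = _mm_set1_ps(3.0f);
    const __m128 half  = _mm_set1_ps(0.5f);
    for (int i = 0; i < count; i += 4)  /* count assumed divisible by 4 */
    {
        __m128 x = _mm_load_ps(px + i);
        __m128 y = _mm_load_ps(py + i);
        __m128 z = _mm_load_ps(pz + i);
        __m128 dot = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                           _mm_mul_ps(y, y)),
                                _mm_mul_ps(z, z));
        __m128 r = _mm_rsqrt_ps(dot);  /* four estimates at once */
        r = _mm_mul_ps(_mm_mul_ps(half, r),
                       _mm_sub_ps(three, _mm_mul_ps(dot, _mm_mul_ps(r, r))));
        _mm_store_ps(px + i, _mm_mul_ps(x, r));
        _mm_store_ps(py + i, _mm_mul_ps(y, r));
        _mm_store_ps(pz + i, _mm_mul_ps(z, r));
    }
}
```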

If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers (one register containing only x-values, one containing only y-values and one containing only z-values), do the math, and then convert back to AoS and store.

If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:

```asm
movlps   xmm0, [eax]      // xmm0: { -, -, y0, x0 }
movlps   xmm1, [eax+24]   // xmm1: { -, -, y2, x2 }
unpcklps xmm0, [eax+12]   // xmm0: { y1, y0, x1, x0 }
unpcklps xmm1, [eax+36]   // xmm1: { y3, y2, x3, x2 }
movss    xmm2, [eax+8]    // xmm2: { 0, 0, 0, z0 }
movss    xmm3, [eax+32]   // xmm3: { 0, 0, 0, z2 }
unpcklps xmm2, [eax+20]   // xmm2: { x2, 0, z1, z0 }
unpcklps xmm3, [eax+44]   // xmm3: { x4, 0, z3, z2 } <- note x4
movaps   xmm4, xmm0
movlhps  xmm0, xmm1       // xmm0: { x3, x2, x1, x0 }
movhlps  xmm1, xmm4       // xmm1: { y3, y2, y1, y0 }
movlhps  xmm2, xmm3       // xmm2: { z3, z2, z1, z0 }
```

You don't want to go off the end of the array (also note that x4 was read above, which can be off the end of the array even if the number of vectors is evenly divisible by 4), so you have to stop and switch over to a scalar routine to pick up the last few elements.
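For reference, the same load swizzle can be expressed with intrinsics. Like the assembly above this is an untested sketch (the function name is mine), and the unaligned loads read three floats past the fourth vector (through `v[14]`), so the caller must guarantee that memory is readable:

```c
#include <xmmintrin.h>

/* Gather four packed xyz vectors (12 bytes apart, starting at v)
   into x/y/z registers, mirroring the movlps/unpcklps sequence.
   Reads through v[14], i.e. past the 12 floats of the four vectors. */
static void load_aos_to_soa(const float *v, __m128 *x, __m128 *y, __m128 *z)
{
    __m128 a = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)(v));      /* { x0, y0, -, - } */
    __m128 b = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)(v + 6));  /* { x2, y2, -, - } */
    a = _mm_unpacklo_ps(a, _mm_loadu_ps(v + 3));   /* { x0, x1, y0, y1 } */
    b = _mm_unpacklo_ps(b, _mm_loadu_ps(v + 9));   /* { x2, x3, y2, y3 } */
    __m128 c = _mm_load_ss(v + 2);                 /* { z0, 0, 0, 0 } */
    __m128 d = _mm_load_ss(v + 8);                 /* { z2, 0, 0, 0 } */
    c = _mm_unpacklo_ps(c, _mm_loadu_ps(v + 5));   /* { z0, z1, 0, x2 } */
    d = _mm_unpacklo_ps(d, _mm_loadu_ps(v + 11));  /* { z2, z3, 0, x4 } */
    *x = _mm_movelh_ps(a, b);                      /* { x0, x1, x2, x3 } */
    *y = _mm_movehl_ps(b, a);                      /* { y0, y1, y2, y3 } */
    *z = _mm_movelh_ps(c, d);                      /* { z0, z1, z2, z3 } */
}
```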

Do vector normalization in this packed format. When you are done do something like this to store again:

```asm
// xmm0: { x3', x2', x1', x0' }
// xmm1: { y3', y2', y1', y0' }
// xmm2: { z3', z2', z1', z0' }
movaps   xmm3, xmm0
movhlps  xmm4, xmm2            // xmm4: { -, -, z3', z2' }
unpcklps xmm0, xmm1            // xmm0: { y1', x1', y0', x0' }
unpckhps xmm3, xmm1            // xmm3: { y3', x3', y2', x2' }
movlps   [eax], xmm0           // Store { y0', x0' }
movss    [eax+8], xmm2         // Store z0'
movhps   [eax+12], xmm0        // Store { y1', x1' }
unpcklps xmm2, xmm2            // xmm2: { z1', z1', z0', z0' }
movhps   [eax+20], xmm2        // Store z1'
movlps   [eax+24], xmm3        // Store { y2', x2' }
movss    [eax+32], xmm4        // Store z2'
movhps   [eax+36], xmm3        // Store { y3', x3' }
shufps   xmm4, xmm4, 01010101b // xmm4: { z3', z3', z3', z3' }
movss    [eax+44], xmm4        // Store z3'
```

Of course, caveat emptor. I haven't tested this code at all; I've just jotted a bit on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).

Gregory: Yes, the magic number sqrt wasn’t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.

Soylent: What you’re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that’s not what I’m timing here. I’m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function — like performing an angle comparison when an AI selects which enemy to target next. Mike Acton might say that even game logic should be packed structures of arrays so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren’t built like that yet. We’re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.

I see.

Assume the following simplified scenario:

You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.

From this class you derive a bunch of CRockets, CPlayers, CPhysicsObject etc. To improve locality of reference you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate a big enough array that it’s not going to reach capacity under any but pathological cases or you might use some dynamic array that you reallocate to double the size if it’s filled to capacity or reallocate to half the size if emptied to one quarter or something like that.

In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.

Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern and correctly identify that there’s a 224 (or whatever) byte stride and that it is supposed to cache this data ahead of time? If the data is not prefetched the processor is going to sit idle and wait on RAM for at least several hundred clock cycles, which totally swamps the performance of your arithmetic to the point that it barely matters.

Hi,

My first posting here, so I’ll just start off by saying what an excellent blog site this is.

It would be interesting to compare when the SSE versions are not doing unaligned loads. I suspect these days it is worth taking the effort to align vectors properly, though I have not challenged this assumption.


Just a point about your dot product calculation, instead of using shuffle and add, you can sort of combine them all together if you’re a bit clever. Here’s some code from my raytracer:

This is the dot product so obviously for sqrt v1a and v2a would both be the same thing.

```cpp
__m128 res = _mm_mul_ps(v1a, v2a);
res = _mm_hadd_ps(res, res);
res = _mm_hadd_ps(res, res);
return res.m128_f32[0];
```
