Square Roots in vivo: normalizing vectors

Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.

The testbed was, as last time, an array of 2048 single-precision floating-point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, three out of four vectors were not aligned to a 16-byte boundary (vector n starts at byte offset 12n, which is a multiple of 16 only when n is a multiple of four). That matters for the SIMD test case, where I had to use the movups unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.

Each normalize function computed the reciprocal of the vector's length, 1/√(x² + y² + z²), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference between them was in how the reciprocal square root was computed:

  • via the x87 FPU, by simply compiling 1.0f/sqrt( x*x + y*y + z*z )
  • via the SSE scalar unit, by compiling 1.0f/sqrt( x*x + y*y + z*z ) with the /arch:SSE2 option set; this causes the compiler to issue a sqrtss followed by a divss, that is, it computes the square root and then divides one by it
  • via the SSE scalar unit, by using the estimated reciprocal square root intrinsic (rsqrtss) and then performing one step of Newton-Raphson iteration to refine it (see the refinement formula after this list)
  • via the SSE SIMD unit, working on the whole vector at once
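
The refinement step in the last two cases comes from applying Newton's method to f(y) = 1/y² - x, whose positive root is y = 1/√x; one iteration of y' = y - f(y)/f'(y) simplifies to y' = ½·y·(3 - x·y²), where y is the rough rsqrtss estimate and y' the refined value. That is exactly the half * recip * (three - x·recip·recip) expression in the SSERSqrtNR function below the jump.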

In all cases the results were accurate to 22 bits of precision (rsqrtss alone is only good to about 12 bits; the Newton-Raphson step roughly doubles that). The results for 682 vectors × 2048 iterations = 1,396,736 vector normalizations were:

Method                                        Total time    Time per vector
Compiler 1.0/sqrt(x), x87 FPU FSQRT           52.469ms      37.6ns
Compiler 1.0/sqrt(x), SSE scalar sqrtss       27.233ms      19.5ns
SSE scalar ops, rsqrtss with one NR step      21.631ms      15.5ns
SSE SIMD ops, rsqrtss with one NR step        20.034ms      14.3ns

Two things jump out here. First, even when the square root op is surrounded by lots of other math — multiplies, adds, loads, stores — optimizations such as this can make a huge difference. It’s not just the cost of the sqrt itself, but also that it’s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it’s entirely completed.

Second, in this case, SIMD is only a very modest benefit. That’s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what’s meant by “horizontal” SIMD computation — operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V = <x, y, z>, the sum x + y + z is horizontal, but with two vectors V₁ and V₂, V₃ = <x₁+x₂, y₁+y₂, z₁+z₂> is vertical.) So it really doesn’t play to SIMD’s strengths at all.
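
To make that concrete, here is a quick illustrative sketch (not part of the timed code; VerticalAdd and HorizontalSum3 are just names I made up for this example). The vertical add is a single op across two registers, while the horizontal sum has to rotate words within one register and chain dependent adds:

#include <xmmintrin.h>

// Vertical: corresponding words of two vectors, one instruction.
inline __m128 VerticalAdd( __m128 v1, __m128 v2 )
{
        return _mm_add_ps( v1, v2 );    // <x1+x2, y1+y2, z1+z2, w1+w2>
}

// Horizontal: x + y + z of a single vector ends up in the low word.
inline float HorizontalSum3( __m128 v ) // v = <x, y, z, ->
{
        __m128 yzwx = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 0, 3, 2, 1 ) ); // rotate to YZWX
        __m128 zwxy = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 1, 0, 3, 2 ) ); // rotate to ZWXY
        __m128 sum  = _mm_add_ss( _mm_add_ss( v, yzwx ), zwxy );         // low word = x + y + z
        float result;
        _mm_store_ss( &result, sum );
        return result;
}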

On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn’t to test performance in tight loops over packed data — it was to figure out the best way to do something like an angle check in the middle of a character’s AI, where you usually deal with one vector at a time.
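
For the curious, here is a rough, untested sketch of what that four-wide version might look like. It is not something I timed; it assumes the components have been rearranged into separate 16-byte-aligned x, y, and z arrays (structure-of-arrays), and NormalizeFourSoA is just an illustrative name:

#include <xmmintrin.h>

// Normalizes four vectors at once, given structure-of-arrays inputs:
// xIn[0..3], yIn[0..3], zIn[0..3] hold the components of four vectors.
inline void NormalizeFourSoA( float * xOut, float * yOut, float * zOut,
                              const float * xIn, const float * yIn, const float * zIn )
{
        const __m128 x = _mm_load_ps( xIn );    // aligned load of four x components
        const __m128 y = _mm_load_ps( yIn );
        const __m128 z = _mm_load_ps( zIn );

        // four dot products at once, one per word (all vertical math)
        const __m128 dot = _mm_add_ps( _mm_add_ps( _mm_mul_ps( x, x ),
                                                   _mm_mul_ps( y, y ) ),
                                       _mm_mul_ps( z, z ) );

        // estimated reciprocal square roots plus one Newton-Raphson step,
        // the same refinement as SSERSqrtNR but on all four words (rsqrtps)
        const __m128 half  = _mm_set1_ps( 0.5f );
        const __m128 three = _mm_set1_ps( 3.0f );
        const __m128 est   = _mm_rsqrt_ps( dot );
        const __m128 recip = _mm_mul_ps( _mm_mul_ps( half, est ),
                                         _mm_sub_ps( three, _mm_mul_ps( dot, _mm_mul_ps( est, est ) ) ) );

        _mm_store_ps( xOut, _mm_mul_ps( x, recip ) );
        _mm_store_ps( yOut, _mm_mul_ps( y, recip ) );
        _mm_store_ps( zOut, _mm_mul_ps( z, recip ) );
}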

Source code for my testing functions below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector’s length. The hand-written intrinsic versions probably aren’t totally optimal, but they ought to be good enough to make the point.
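
For instance, the naive version below would be called like this (just an illustration, not the actual timing harness):

    float vec[3] = { 1.0f, 2.0f, 3.0f };
    float unit[3];
    const float length = NaiveTestNormalize( unit, vec ); // unit = vec/|vec|, length = |vec|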

[DDET Naive vector normalize, x87 FPU or SSE scalar]
Source

// Normalizes an assumed 3-element vector starting
// at pointer vIn, writes it through vOut, and returns
// the length of the original vector.
inline float NaiveTestNormalize( float * RESTRICT vOut, const float * RESTRICT vIn )
{
        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
        const float rsqt = 1.0f / sqrt(l);
        vOut[0] = vIn[0] * rsqt;
        vOut[1] = vIn[1] * rsqt;
        vOut[2] = vIn[2] * rsqt;
        return rsqt * l;   // l / sqrt(l) == sqrt(l) == length of the original vector
}

Assembly (x87 FPU)

_TEXT   SEGMENT
_vOut$ = 8                                              ; size = 4
_vIn$ = 12                                              ; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC                      ; TestNormalize, COMDAT
 
; 396  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
 
        mov     eax, DWORD PTR _vIn$[esp-4]
        fld     DWORD PTR [eax+8]
 
; 397  :        const float rsqt = 1.0f / sqrt(l);
; 398  :        vOut[0] = vIn[0] * rsqt;
 
        mov     ecx, DWORD PTR _vOut$[esp-4]
        fld     DWORD PTR [eax+4]
        fld     DWORD PTR [eax]
        fmul    ST(0), ST(0)
        fld     ST(1)
        fmulp   ST(2), ST(0)
        faddp   ST(1), ST(0)
        fld     ST(1)
        fmulp   ST(2), ST(0)
        faddp   ST(1), ST(0)
        fld     ST(0)           ;; duplicate l = x*x + y*y + z*z
        fsqrt                   ;; slow, unpipelined full-precision square root
        fld1                    ;; push 1.0
        fdivrp  ST(1), ST(0)    ;; 1.0 / sqrt(l) = rsqt
        fld     DWORD PTR [eax]
        fmul    ST(0), ST(1)
        fstp    DWORD PTR [ecx]
 
; 399  :        vOut[1] = vIn[1] * rsqt;
 
        fld     ST(0)
        fmul    DWORD PTR [eax+4]
        fstp    DWORD PTR [ecx+4]
 
; 400  :        vOut[2] = vIn[2] * rsqt;
 
        fld     ST(0)
        fmul    DWORD PTR [eax+8]
        fstp    DWORD PTR [ecx+8]
 
; 401  :        return rsqt * l;
 
        fmulp   ST(1), ST(0)
 
; 402  : }
 
        ret     0
?TestNormalize@@YAMPIAMPIBM@Z ENDP                      ; TestNormalize
_TEXT   ENDS

Assembly (compiler-issued SSE scalar)

_TEXT   SEGMENT
_l$ = -4                                                ; size = 4
_vOut$ = 8                                              ; size = 4
_rsqt$ = 12                                             ; size = 4
_vIn$ = 12                                              ; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC                      ; TestNormalize, COMDAT
 
; 392  : {
 
        push    ecx
 
; 393  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
 
        mov     eax, DWORD PTR _vIn$[esp]
        movss   xmm1, DWORD PTR [eax+4]
        movss   xmm2, DWORD PTR [eax]
        movss   xmm0, DWORD PTR [eax+8]
 
; 394  :        const float rsqt = 1.0f / sqrt(l);
; 395  :        vOut[0] = vIn[0] * rsqt;
 
        mov     eax, DWORD PTR _vOut$[esp]
        movaps  xmm3, xmm2
        mulss   xmm3, xmm2
        movaps  xmm4, xmm1
        mulss   xmm4, xmm1
        addss   xmm3, xmm4
        movaps  xmm4, xmm0
        mulss   xmm4, xmm0
        addss   xmm3, xmm4
        movss   DWORD PTR _l$[esp+4], xmm3
        sqrtss  xmm4, xmm3   ;; slow full-precision square root gets stored in xmm4
        movss   xmm3, DWORD PTR __real@3f800000  ;; store 1.0 in xmm3
        divss   xmm3, xmm4  ;; divide 1.0 / xmm4 to get the reciprocal square root !?!
        movss   DWORD PTR _rsqt$[esp], xmm3
 
; 396  :        vOut[1] = vIn[1] * rsqt;
; 397  :        vOut[2] = vIn[2] * rsqt;
; 398  :        return rsqt * l;
 
        fld     DWORD PTR _rsqt$[esp]
        mulss   xmm2, xmm3
        fmul    DWORD PTR _l$[esp+4]
        mulss   xmm1, xmm3
        mulss   xmm0, xmm3
        movss   DWORD PTR [eax], xmm2
        movss   DWORD PTR [eax+4], xmm1
        movss   DWORD PTR [eax+8], xmm0
 
; 399  : }
 
        pop     ecx
        ret     0
?TestNormalize@@YAMPIAMPIBM@Z ENDP                      ; TestNormalize
_TEXT   ENDS

[/DDET]

[DDET Vector normalize, hand-written SSE scalar by intrinsics]
Source

// SSE scalar reciprocal sqrt using rsqrt op, plus one Newton-Raphson iteration
inline __m128 SSERSqrtNR( const __m128 x )
{
	__m128 recip = _mm_rsqrt_ss( x );  // "estimate" opcode
	const static __m128 three = { 3, 3, 3, 3 }; // aligned consts for fast load
	const static __m128 half = { 0.5,0.5,0.5,0.5 };
	__m128 halfrecip = _mm_mul_ss( half, recip );
	__m128 threeminus_xrr = _mm_sub_ss( three, _mm_mul_ss( x, _mm_mul_ss ( recip, recip ) ) );
	return _mm_mul_ss( halfrecip, threeminus_xrr );
}
 
 
inline __m128 SSE_ScalarTestNormalizeFast(  float * RESTRICT vOut, float * RESTRICT vIn )
{
        __m128 x = _mm_load_ss(&vIn[0]);
        __m128 y = _mm_load_ss(&vIn[1]);
        __m128 z = _mm_load_ss(&vIn[2]);
 
        const __m128 l =  // compute x*x + y*y + z*z
                _mm_add_ss(
                 _mm_add_ss( _mm_mul_ss(x,x),
                             _mm_mul_ss(y,y)
                            ),
                 _mm_mul_ss( z, z )
                );
 
 
        const __m128 rsqt = SSERSqrtNR( l );
        _mm_store_ss( &vOut[0] , _mm_mul_ss( rsqt, x ) );
        _mm_store_ss( &vOut[1] , _mm_mul_ss( rsqt, y ) );
        _mm_store_ss( &vOut[2] , _mm_mul_ss( rsqt, z ) );
 
        return _mm_mul_ss( l , rsqt );  // l * 1/sqrt(l) == sqrt(l) == length of original vector
}

Assembly

_TEXT   SEGMENT
_vOut$ = 8                                              ; size = 4
_vIn$ = 12                                              ; size = 4
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC ; SSE_ScalarTestNormalizeFast, COMDAT
 
    push    ebp
    mov     ebp, esp
    and     esp, -16                                ; fffffff0H
 
    mov     eax, DWORD PTR _vIn$[ebp]
    movss   xmm0, DWORD PTR [eax]
 
    movss   xmm3, DWORD PTR [eax+4]
 
    movaps  xmm7, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    movaps  xmm2, xmm0
    movss   xmm0, DWORD PTR [eax+8]
 
    mov     eax, DWORD PTR _vOut$[ebp]
    movaps  xmm4, xmm0
    movaps  xmm0, xmm2
    mulss   xmm0, xmm2
    movaps  xmm1, xmm3
    mulss   xmm1, xmm3
    addss   xmm0, xmm1
    movaps  xmm1, xmm4
    mulss   xmm1, xmm4
    addss   xmm0, xmm1
    movaps  xmm1, xmm0
    rsqrtss xmm1, xmm1      ; reciprocal square root estimate
    movaps  xmm5, xmm1
    mulss   xmm1, xmm5
    movaps  xmm6, xmm0
    mulss   xmm6, xmm1
    movaps  xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1, xmm5
    subss   xmm7, xmm6
    mulss   xmm1, xmm7
    movaps  xmm5, xmm1
    mulss   xmm5, xmm2
    movss   XMMWORD PTR [eax], xmm5
    movaps  xmm2, xmm1
    mulss   xmm2, xmm3
 
    movss   XMMWORD PTR [eax+4], xmm2
    movaps  xmm2, xmm1
    mulss   xmm2, xmm4
 
    movss   XMMWORD PTR [eax+8], xmm2
 
    mulss   xmm0, xmm1
 
    mov     esp, ebp
    pop     ebp
    ret     0
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP ; SSE_ScalarTestNormalizeFast
_TEXT   ENDS

[/DDET]

[DDET Vector normalize, hand-written SSE SIMD by intrinsics]
Source

inline __m128 SSE_SIMDTestNormalizeFast( float * RESTRICT vOut, float * RESTRICT vIn  )
{
        // load as a SIMD vector
        const __m128 vec = _mm_loadu_ps(vIn);
        // compute a dot product by computing the square, and
        // then rotating the vector and adding, so that the
        // dot ends up in the low term (used by the scalar ops)
        __m128 dot = _mm_mul_ps( vec, vec );
        // rotate x under y and add together   
        __m128 rotated = _mm_shuffle_ps( dot, dot, _MM_SHUFFLE( 0,3,2,1 ) ); // YZWX ( shuffle macro is high to low word )
        dot = _mm_add_ss( dot, rotated ); // x^2 + y^2 in the low word
        rotated = _mm_shuffle_ps( rotated, rotated, _MM_SHUFFLE( 0,3,2,1 ) ); // ZWXY
        dot = _mm_add_ss( dot, rotated ); // x^2 + y^2 + z^2 in the low word
 
        __m128 recipsqrt = SSERSqrtNR( dot ); // contains reciprocal square root in low term
        recipsqrt = _mm_shuffle_ps( recipsqrt, recipsqrt, _MM_SHUFFLE( 0, 0, 0, 0 ) ); // broadcast low term to all words
 
        // multiply 1/sqrt(dotproduct) against all vector components, and write back
        const __m128 normalized = _mm_mul_ps( vec, recipsqrt );
        _mm_storeu_ps(vOut, normalized);
        return _mm_mul_ss( dot , recipsqrt );
}

Assembly

_TEXT   SEGMENT
_vOut$ = 8                                              ; size = 4
_vIn$ = 12                                              ; size = 4
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC   ; SSE_SIMDTestNormalizeFast, COMDAT
 
 
    push    ebp
    mov     ebp, esp
    and     esp, -16                                ; fffffff0H
 
    mov     eax, DWORD PTR _vIn$[ebp]
    movups  xmm2, XMMWORD PTR [eax] ;; load the input vector
    movaps  xmm5, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B ;; load the constant "3"
    mov     ecx, DWORD PTR _vOut$[ebp]
    movaps  xmm0, xmm2
    mulps   xmm0, xmm2
    movaps  xmm1, xmm0
    shufps  xmm1, xmm0, 57	; shuffle to YZWX
    addss   xmm0, xmm1      ; add Y to low word of xmm0
    shufps  xmm1, xmm1, 57	; shuffle to ZWXY
    addss   xmm0, xmm1      ; add Z to low word of xmm0
 
    movaps  xmm1, xmm0        
    rsqrtss xmm1, xmm1      ; reciprocal square root estimate
    movaps  xmm3, xmm1
    mulss   xmm1, xmm3
    movaps  xmm4, xmm0
    mulss   xmm4, xmm1
    movaps  xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1, xmm3
    subss   xmm5, xmm4
    mulss   xmm1, xmm5      ; Newton-Raphson finishes here; 1/sqrt(dot) is in xmm1's low word
 
    shufps  xmm1, xmm1, 0   ; broadcast so that xmm1 has 1/sqrt(dot) in all words
    movaps  xmm3, xmm1
    mulps   xmm3, xmm2      ; multiply all words of original vector by 1/sqrt(dot)
    movups  XMMWORD PTR [ecx], xmm3   ; unaligned save to memory
 
	; return dot * 1 / sqrt(dot) == sqrt(dot) == length of vector
    mulss   xmm0, xmm1
 
    mov     esp, ebp
    pop     ebp
    ret     0
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP   ; SSE_SIMDTestNormalizeFast
_TEXT   ENDS

[/DDET]

10 Comments

  1. Nice work. I just have a dumb question (and maybe you answered it somewhere): If the main reason for square root is proximity detection, why bother, because sqrt(a^2+b^2) < r
    iff (a^2+b^2) < r^2 ?

  2. Elan says:

    Oh, distance checking was just the simple example I used for clarity. In practice the most common use of sqrt by far is to normalize vectors, which we do for a hundred different things: computing lighting, constructing a coordinate basis from local objects, extracting an absolute angle from a dot product, making rotation quaternions, turning distance vectors into a unit-length direction, etc.

    I also occasionally need the sqrt itself to solve a quadratic, like computing an apsis or a ballistic intercept.

  3. Gregory says:

    So you dropped Q3’s fast inverse square root from this benchmark, compared to the previous article, because it should be around 3.5 times slower than SSE rsqrtss (according to the previous benchmark results)?

  4. Soylent says:

    I’m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.

    I wouldn’t have expected you to structure your data in that way. I would have expected either homogeneous vectors { x, y, z, w }, which allow a 4×4 matrix to represent an affine transformation (e.g. projections, translations), or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.

    If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. rsqrtss -> rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn’t evenly divisible by 8 you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of floats that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than to pollute the instruction cache with extra code).

    If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement, then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers (one register containing only x-values, one containing only y-values and one containing only z-values), do the math, and then convert back to AoS and store.

    If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:

    movlps xmm0, [eax] // xmm0: { -, -, y0, x0 }.
    movlps xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.
    unpcklps xmm0, [eax+12] // xmm0: { y1, y0, x1, x0}.
    unpcklps xmm1, [eax+36] // xmm1: { y3, y2, x3, x2}.
    movss xmm2, [eax+8] // xmm2: { 0, 0, 0, z0 }.
    movss xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.
    unpcklps xmm2, [eax+20] // xmm2: { x2, 0, z1, z0 }.
    unpcklps xmm3, [eax+44] // xmm3: { x4, 0, z3, z2 }. <- note x4.
    movaps xmm4, xmm0;
    movlhps xmm0, xmm1; // xmm0: { x3, x2, x1, x0 }.
    movhlps xmm1, xmm4; // xmm1: { y3, y2, y1, y0 }.
    movlhps xmm2, xmm3; // xmm2: { z3, z2, z1, z0 }.

    You don't want to go off the end of the array (also note that x4 was read above, which can be past the end of the array even if the number of vertices is evenly divisible by 4), so you have to stop and switch over to a scalar routine to pick up the last few elements.

    Do vector normalization in this packed format. When you are done do something like this to store again:

    // xmm0: { x3', x2', x1', x0' }
    // xmm1: { y3', y2', y1', y0' }
    // xmm2: { z3', z2', z1', z0' }
    movaps xmm3, xmm0
    movhlps xmm4, xmm1 // { -, -, z3', z2' }
    unpcklps xmm0, xmm1 // { y1', x1', y0', x0' }
    unpckhps xmm3, xmm1 // { y3', x3', y2', x2' }
    movlps [eax], xmm0 // Store { y0', x0' }
    movss [eax+8], xmm2 // Store z0'
    movhps [eax+12], xmm0 // Store {y1', x1'}
    unpcklps xmm2, xmm2 // { z1', z1', z0', z0' }
    movhps [eax+20], xmm2 // Store z1'.
    movlps [eax+24], xmm3 // Store {y2', x2' }
    movss [eax+32], xmm4 // Store z2'
    movhps [eax+36], xmm3 // Store {y3', x3' }
    shufps xmm4, xmm4, 01010101b // { z3', z3', z3', z3' }
    movss [eax+44], xmm4 // Store z3'.

    Of course, caveat emptor. I haven't tested this code at all; I've just jotted it down on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).

  5. Elan says:

    Gregory: Yes, the magic number sqrt wasn’t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.

    Soylent: What you’re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that’s not what I’m timing here. I’m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function — like performing an angle comparison when an AI selects which enemy to target next. Mike Acton might say that even game logic should be packed structures of arrays so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren’t built like that yet. We’re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.

  6. Soylent says:

    I see.

    Assume the following simplified scenario:

    You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.

    From this class you derive a bunch of CRockets, CPlayers, CPhysicsObject etc. To improve locality of reference you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate a big enough array that it’s not going to reach capacity under any but pathological cases or you might use some dynamic array that you reallocate to double the size if it’s filled to capacity or reallocate to half the size if emptied to one quarter or something like that.

    In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.

    Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern and correctly identify that there’s a 224-byte (or whatever) stride and that it is supposed to cache this data ahead of time? If the data is not precached the processor is going to sit idle and wait on RAM for at least several hundred clock cycles, in which case that totally swamps the cost of your arithmetic to the point that it barely matters.

  7. Martin says:

    Hi,

    My first posting here so I’ll just start of by saying what an excellent blog site.

    It would be interesting to compare when the SSE versions are not doing unaligned loads. I suspect these days it is worth taking the effort to align vectors properly, but I have not challenged this assumption.

  8. […] Square Roots in vivo: normalizing vectors […]

  9. Guillaume says:

    Just a point about your dot product calculation, instead of using shuffle and add, you can sort of combine them all together if you’re a bit clever. Here’s some code from my raytracer:

    This is the dot product, so obviously for the sqrt case v1a and v2a would both be the same vector.

    __m128 res = _mm_mul_ps(v1a, v2a); // componentwise products
    res = _mm_hadd_ps(res, res);       // horizontal add (SSE3)
    res = _mm_hadd_ps(res, res);       // every word now holds the full sum
    return res.m128_f32[0];            // pull the dot product out of the low word

  10. […] Fast Cross-Platform SIMD Vector Libraries Becoming a console programmer : Math Libraries Part 2 Square Roots in vivo: normalizing vectors Software optimization […]
