Following my earlier article on timing various square-root functions on the x86 Clomid For Sale, , commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, Cheap Clomid no rx, and timed them.

The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, is Clomid addictive, this means that three out of four vectors were not aligned to a 16-byte boundary, which is significant for the SIMD test case as I had to use the movups unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory, Clomid For Sale.

Each normalize function computed the length of the vector 1/√(x2 + y2 + z2), multiplied each component by the reciprocal, and then wrote it back through an output pointer. Get Clomid, The main difference was in how the reciprocal square root was computed:

  • via the x87 FPU, by simply compiling 1.0f/sqrt( x*x + y*y + z*z )

  • via the SSE scalar unit, by compiling 1.0f/sqrt( x*x + y*y + z*z ) with the /arch:SSE2 option set; this causes the compiler to issue a sqrtss followed by an fdivie, it computes the square root and then divides one by it
  • via the SSE scalar unit, by using the estimated reciprocal square root intrinsic and then performing one step of Newton-Raphson iteration

  • via the SSE SIMD unit, working on the whole vector at once

In all cases the results were accurate to 22 bits of precision. The results for 1, doses Clomid work,396,736 vector normalizations were:

MethodTotal timeTime per vector
Compiler 1.0/sqrt(x)
Compiler 1.0/sqrt(x)
SSE scalar sqrtss
SSE scalar ops
rsqrtss with one NR step
rsqrtss with one NR step

Two things jump out here. First, even when the square root op is surrounded by lots of other math — multiplies, adds, Clomid use, loads, stores — optimizations such as this can make a huge difference. Clomid For Sale, It's not just the cost of the sqrt itself, but also that it's unpipelined, which means it ties up an execution unit and prevents any other work from being done until it's entirely completed.

Second, in this case, SIMD is only a very modest benefit. That's because the input vectors are unaligned, and the two key steps of this operation, online buying Clomid, the dot product and the square root, are scalar in nature. (This is what's meant by "horizontal" SIMD computation — operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V ∋ <x, Comprar en línea Clomid, comprar Clomid baratos, y,z>, the sum x + y + z is horizontal, but with two vectors V1 and V2, V3 = <x1+x2, y1+y2, z1+z2> is vertical.) So it really doesn't play to SIMD's strengths at all, Clomid price.

On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater, Clomid For Sale. But, again, my goal wasn't to test performance in tight loops over packed data — it was to figure out the best way to do something like an angle check in the middle of a character's AI, where you usually deal with one vector at a time.

Source code for my testing functions below the jump. Order Clomid from United States pharmacy, Note that each function writes the normalized vector through an out pointer, but also returns the original vector's length. The hand-written intrinsic versions probably aren't totally optimal, but they ought to be good enough to make the point.

[DDET Naive vector normalize, x87 FPU or SSE scalar]

// Normalizes an assumed 3-element vector starting
// at pointer V, and returns the length of the original
// vector, Clomid used for.
inline float NaiveTestNormalize( float * RESTRICT vOut, const float * RESTRICT vIn )
const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];
const float rsqt = 1.0f / sqrt(l);
vOut[0] = vIn[0] * rsqt;
vOut[1] = vIn[1] * rsqt;
vOut[2] = vIn[2] * rsqt;
return rsqt * l;

Assembly (x87 FPU)

_vOut$ = 8 ; size = 4
_vIn$ = 12 ; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC ; TestNormalize, COMDAT

; 396 : const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];

mov eax, DWORD PTR _vIn$[esp-4]
fld DWORD PTR [eax+8]

; 397 : const float rsqt = 1.0f / sqrt(l);
; 398 : vOut[0] = vIn[0] * rsqt;

mov ecx, DWORD PTR _vOut$[esp-4]
fld DWORD PTR [eax+4]
fld DWORD PTR [eax]
fmul ST(0), Order Clomid online overnight delivery no prescription, ST(0)
fld ST(1)
fmulp ST(2), ST(0)
faddp ST(1), ST(0)
fld ST(1)
fmulp ST(2), ST(0)
faddp ST(1), ST(0)
fld ST(0)
fdivrp ST(1), ST(0)
fld DWORD PTR [eax]
fmul ST(0), ST(1)
fstp DWORD PTR [ecx]

; 399 : vOut[1] = vIn[1] * rsqt;

fld ST(0)
fmul DWORD PTR [eax+4]
fstp DWORD PTR [ecx+4]

; 400 : vOut[2] = vIn[2] * rsqt;

fld ST(0)
fmul DWORD PTR [eax+8]
fstp DWORD PTR [ecx+8]

; 401 : return rsqt * l;

fmulp ST(1), Clomid mg, ST(0)

; 402 : }

ret 0
?TestNormalize@@YAMPIAMPIBM@Z ENDP ; TestNormalize

Assembly (compiler-issued SSE scalar)

_l$ = -4 ; size = 4
_vOut$ = 8 ; size = 4
_rsqt$ = 12 ; size = 4
_vIn$ = 12 ; size = 4
?TestNormalize@@YAMPIAMPIBM@Z PROC ; TestNormalize, COMDAT

; 392 : {

push ecx

; 393 : const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];

mov eax, DWORD PTR _vIn$[esp]
movss xmm1, DWORD PTR [eax+4]
movss xmm2, DWORD PTR [eax]
movss xmm0, Buying Clomid online over the counter, DWORD PTR [eax+8]

; 394 : const float rsqt = 1.0f / sqrt(l);
; 395 : vOut[0] = vIn[0] * rsqt;

mov eax, DWORD PTR _vOut$[esp]
movaps xmm3, xmm2
mulss xmm3, xmm2
movaps xmm4, xmm1
mulss xmm4, xmm1
addss xmm3, xmm4
movaps xmm4, low dose Clomid, xmm0
mulss xmm4, xmm0
addss xmm3, xmm4
movss DWORD PTR _l$[esp+4], xmm3
sqrtss xmm4, xmm3 ;; slow full-precision square root gets stored in xmm4
movss xmm3, Purchase Clomid online no prescription, DWORD PTR __real@3f800000 ;; store 1.0 in xmm3
divss xmm3, xmm4 ;; divide 1.0 / xmm4 to get the reciprocal square root !?.
movss DWORD PTR _rsqt$[esp], xmm3

; 396 : vOut[1] = vIn[1] * rsqt;
; 397 : vOut[2] = vIn[2] * rsqt;
; 398 : return rsqt * l;

fld DWORD PTR _rsqt$[esp]
mulss xmm2, xmm3
fmul DWORD PTR _l$[esp+4]
mulss xmm1, xmm3
mulss xmm0, xmm3
movss DWORD PTR [eax], where can i order Clomid without prescription, xmm2
movss DWORD PTR [eax+4], xmm1
movss DWORD PTR [eax+8], xmm0

; 399 : }

pop ecx
ret 0
?TestNormalize@@YAMPIAMPIBM@Z ENDP ; TestNormalize


[DDET Vector normalize, hand-written SSE scalar by intrinsics]

// SSE scalar reciprocal sqrt using rsqrt op, plus one Newton-Rhaphson iteration
inline __m128 SSERSqrtNR( const __m128 x )
__m128 recip = _mm_rsqrt_ss( x ); // "estimate" opcode
const static __m128 three = { 3, Generic Clomid, 3, 3, 3 }; // aligned consts for fast load
const static __m128 half = { 0.5,0.5,0.5,0.5 };
__m128 halfrecip = _mm_mul_ss( half, recip );
__m128 threeminus_xrr = _mm_sub_ss( three, Clomid samples, _mm_mul_ss( x, _mm_mul_ss ( recip, recip ) ) );
return _mm_mul_ss( halfrecip, threeminus_xrr );

inline __m128 SSE_ScalarTestNormalizeFast( float * RESTRICT vOut, float * RESTRICT vIn )
__m128 x = _mm_load_ss(&vIn[0]);
__m128 y = _mm_load_ss(&vIn[1]);
__m128 z = _mm_load_ss(&vIn[2]);

const __m128 l = // compute x*x + y*y + z*z
_mm_add_ss( _mm_mul_ss(x, Clomid online cod, x),
_mm_mul_ss( z, z )

const __m128 rsqt = SSERSqrtNR( l );
_mm_store_ss( &vOut[0] , _mm_mul_ss( rsqt, x ) );
_mm_store_ss( &vOut[1], Clomid no rx, _mm_mul_ss( rsqt, y ) );
_mm_store_ss( &vOut[2] , _mm_mul_ss( rsqt, z ) );

return _mm_mul_ss( l , rsqt );


_vOut$ = 8 ; size = 4
_vIn$ = 12 ; size = 4
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC ; SSE_ScalarTestNormalizeFast, Clomid over the counter, COMDAT

push ebp
mov ebp, esp
and esp, -16 ; fffffff0H

mov eax, DWORD PTR _vIn$[ebp]
movss xmm0, DWORD PTR [eax]

movss xmm3, DWORD PTR [eax+4]

movaps xmm7, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
movaps xmm2, ordering Clomid online, xmm0
movss xmm0, DWORD PTR [eax+8]

mov eax, DWORD PTR _vOut$[ebp]
movaps xmm4, xmm0
movaps xmm0, xmm2
mulss xmm0, Australia, uk, us, usa, xmm2
movaps xmm1, xmm3
mulss xmm1, xmm3
addss xmm0, xmm1
movaps xmm1, xmm4
mulss xmm1, xmm4
addss xmm0, xmm1
movaps xmm1, Clomid dosage, xmm0
rsqrtss xmm1, xmm1
movaps xmm5, xmm1
mulss xmm1, xmm5
movaps xmm6, xmm0
mulss xmm6, Buy Clomid from mexico, xmm1
movaps xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
mulss xmm1, xmm5
subss xmm7, xmm6
mulss xmm1, xmm7
movaps xmm5, xmm1
mulss xmm5, xmm2
movss XMMWORD PTR [eax], Clomid cost, xmm5
movaps xmm2, xmm1
mulss xmm2, xmm3

movss XMMWORD PTR [eax+4], xmm2
movaps xmm2, xmm1
mulss xmm2, Clomid forum, xmm4

movss XMMWORD PTR [eax+8], xmm2

mulss xmm0, xmm1

mov esp, ebp
pop ebp
ret 0
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP ; SSE_ScalarTestNormalizeFast


[DDET Vector normalize, hand-written SSE SIMD by intrinsics]

inline __m128 SSE_SIMDTestNormalizeFast( float * RESTRICT vOut, float * RESTRICT vIn )
// load as a SIMD vector
const __m128 vec = _mm_loadu_ps(vIn);
// compute a dot product by computing the square, and
// then rotating the vector and adding, Clomid street price, so that the
// dot ends up in the low term (used by the scalar ops)
__m128 dot = _mm_mul_ps( vec, vec );
// rotate x under y and add together
__m128 rotated = _mm_shuffle_ps( dot, dot, _MM_SHUFFLE( 0,3, Clomid overnight, 2,1 ) ); // YZWX ( shuffle macro is high to low word )
dot = _mm_add_ss( dot, rotated ); // x^2 + y^2 in the low word
rotated = _mm_shuffle_ps( rotated, rotated, _MM_SHUFFLE( 0,3,2, purchase Clomid,1 ) ); // ZWXY
dot = _mm_add_ss( dot, rotated ); // x^2 + y^2 + z^2 in the low word

__m128 recipsqrt = SSERSqrtNR( dot ); // contains reciprocal square root in low term
recipsqrt = _mm_shuffle_ps( recipsqrt, recipsqrt, _MM_SHUFFLE( 0, 0, Clomid wiki, 0, 0 ) ); // broadcast low term to all words

// multiply 1/sqrt(dotproduct) against all vector components, and write back
const __m128 normalized = _mm_mul_ps( vec, recipsqrt );
_mm_storeu_ps(vOut, normalized);
return _mm_mul_ss( dot , recipsqrt );


_vOut$ = 8 ; size = 4
_vIn$ = 12 ; size = 4
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z PROC ; SSE_SIMDTestNormalizeFast, COMDAT

push ebp
mov ebp, order Clomid from mexican pharmacy, esp
and esp, -16 ; fffffff0H

mov eax, DWORD PTR _vIn$[ebp]
movups xmm2, XMMWORD PTR [eax] ;; load the input vector
movaps xmm5, XMMWORD PTR ?three@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B ;; load the constant "3"
mov ecx, Clomid pharmacy, DWORD PTR _vOut$[ebp]
movaps xmm0, xmm2
mulps xmm0, xmm2
movaps xmm1, xmm0
shufps xmm1, xmm0, 57 ; shuffle to YZWX
addss xmm0, xmm1 ; add Y to low word of xmm0
shufps xmm1, buy Clomid online cod, xmm1, 57 ; shuffle to ZWXY
addss xmm0, xmm1 ; add Z to low word of xmm0

movaps xmm1, xmm0
rsqrtss xmm1, xmm1 ; reciprocal square root estimate
movaps xmm3, Real brand Clomid online, xmm1
mulss xmm1, xmm3
movaps xmm4, xmm0
mulss xmm4, xmm1
movaps xmm1, XMMWORD PTR ?half@?1??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
mulss xmm1, xmm3
subss xmm5, xmm4
mulss xmm1, xmm5 ; Newton-Raphson finishes here; 1/sqrt(dot) is in xmm1's low word

shufps xmm1, xmm1, 0 ; broadcast so that xmm1 has 1/sqrt(dot) in all words
movaps xmm3, xmm1
mulps xmm3, xmm2 ; multiply all words of original vector by 1/sqrt(dot)
movups XMMWORD PTR [ecx], xmm3 ; unaligned save to memory

; return dot * 1 / sqrt(dot) == sqrt(dot) == length of vector
mulss xmm0, xmm1

mov esp, ebp
pop ebp
ret 0
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z ENDP ; SSE_SIMDTestNormalizeFast


Similar posts: Flagyl For Sale. Amoxicillin For Sale. Buy Amoxicillin Without Prescription. Order Proscar online overnight delivery no prescription. Zithromax no rx. Buy cheap Zithromax no rx.
Trackbacks from: Clomid For Sale. Clomid For Sale. Clomid For Sale. Clomid steet value. Online buying Clomid. Clomid long term.


  1. Mike Dunlavey says:

    Nice work. I just have a dumb question (and maybe you answered it somewhere): If the main reason for square root is proximity detection, why bother, because sqrt(a^2+b^2) < r
    iff (a^2+b^2) < r^2 ?

  2. Elan says:

    Oh, distance checking was just the simple example I used for clarity. In practice the most common use of sqrt by far is to normalize vectors, which we do for a hundred different things: computing lighting, constructing a coordinate basis from local objects, extracting an absolute angle from a dot product, making rotation quaternions, turning distance vectors into a unit-length direction, etc.

    I also occasionally need the sqrt itself to solve a quadratic, like computing an apsis or a ballistic intercept.

  3. Gregory says:

    So you dropped Q3′s fast inverse square root in this benchmark compared to the previous article because it should be around 3.5 times slower compard to SSE rsqrtss (according to the previous benchmark results)?

  4. Soylent says:

    I’m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.

    I wouldn’t have expected you to structure your data in that way. I would have expected either uniform vectors { x, y, z ,w } which allow a 4×4 matrix to represent an affine transformation(e.g. projection, translations) or a structure of arrays(SoA), where the x coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.

    If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions, you just use the corresponding packed scalar instruction(e.g. rsqrtss -> rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn’t evenly divisible by 8 you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of floats that is evenly divisible by 8 so that you never have any stragglers(that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than it is to polluting the instructure cache with extra code).

    If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers(one register containing only x-values, one register containing only y-values and one containing only z-values), do the math and then convert back to AoS and store.

    If eax is a pointer to the current position in the array you could use a sequence like he following to load the vectors:

    movlps xmm0, [eax] // xmm0: { -, -, y0, x0 }.
    movlps xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.
    unpcklps xmm0, [eax+12] // xmm0: { y1, y0, x1, x0}.
    unpcklps xmm1, [eax+36] // xmm1: { y3, y2, x3, x2}.
    movss xmm2, [eax+8] // xmm2: { 0, 0, 0, z0 }.
    movss xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.
    unpcklps xmm2, [eax+20] // xmm2: { x2, 0, z1, z0 }.
    unpcklps xmm3, [eax+44] // xmm3: { x4, 0, z3, z1 }. <- note x4.
    movaps xmm4, xmm0;
    movlhps xmm0, xmm1; // xmm0: { x3, x2, x1, x0 }.
    movhlps xmm1, xmm4; // xmm1: { y3, y2, y1, y0 }.
    movlhps xmm2, xmm3; // xmm2: { z3, z2, z1, z0 }.

    You don't want to go off the end of the array(also note that x4 was read above, which can be off the array even if the number of vertices is evenly divisible by 4), so you have to stop and switch over to a scalar routine to pick up the last few elements.

    Do vector normalization in this packed format. When you are done do something like this to store again:

    // xmm0: { x3', x2', x1', x0' }
    // xmm1: { y3', y2', y1', y0' }
    // xmm2: { z3', z2', z1', z0' }
    movaps xmm3, xmm0
    movhlps xmm4, xmm1 // { -, -, z3', z2' }
    unpcklps xmm0, xmm1 // { y1', x1', y0', x0' }
    unpckhps xmm3, xmm1 // { y3', x3', y2', x2' }
    movlps [eax], xmm0 // Store { y0', x0' }
    movss [eax+8], xmm2 // Store z0'
    movhps [eax+12], xmm0 // Store {y1', x1'}
    unpcklps xmm2, xmm2 // { z1', z1', z0', z0' }
    movhps [eax+20], xmm2 // Store z1'.
    movlps [eax+24], xmm3 // Store {y2', x2' }
    movss [eax+32], xmm4 // Store z2'
    movhps [eax+36], xmm3 // Store {y3', x3' }
    shufps xmm4, xmm4, 01010101b // { z3', z3', z3', z3' }
    movss [eax+44], xmm4 // Store z3'.

    Of course, caveat emptor. I haven't tested this code at all. I've just drawn jotted a bit on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode(8 extra xmm registers).

  5. Elan says:

    Gregory: Yes, the magic number sqrt wasn’t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.

    Soylent: What you’re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that’s not what I’m timing here. I’m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function — like performing an angle comparison when an AI selects which enemy to target next. Mike Acton might say that even game logic should be packed structures of arrays so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren’t built like that yet. We’re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.

  6. Soylent says:

    I see.

    Assume the following simplified scenario:

    You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.

    From this class you derive a bunch of CRockets, CPlayers, CPhysicsObject etc. To improve locality of reference you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate a big enough array that it’s not going to reach capacity under any but pathological cases or you might use some dynamic array that you reallocate to double the size if it’s filled to capacity or reallocate to half the size if emptied to one quarter or something like that.

    In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.

    Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern and correctly identify that there’s a 224(or whatever) byte stride and that it is supposed to cache this data ahead of time? If the data is not precached the processor is going to sit idle and wait on RAM for at least several hundred clock cycles in which case it totally swamps the performance of your arithmetic to the point that it barely matters.

  7. Martin says:


    My first posting here so I’ll just start of by saying what an excellent blog site.

    It would be interesting to compare when the SSE versions are not doing non aligned loads. I suspect these days it is worth taking the effort to align vectors properly however I have not challenged this assumption.

  8. Defining an SIMD Interface » #AltDevBlogADay says:

    [...] Square Roots in vivo: normalizing vectors [...]

  9. Guillaume says:

    Just a point about your dot product calculation, instead of using shuffle and add, you can sort of combine them all together if you’re a bit clever. Here’s some code from my raytracer:

    This is the dot product so obviously for sqrt v1a and v2a would both be the same thing.

    __m128 res = _mm_mul_ps(v1a, v2a);
    res = _mm_hadd_ps(res, res);
    res = _mm_hadd_ps(res, res);
    return res.m128_f32[0];

  10. Defining a SIMD Interface « Don Olmstead says:

    [...] Fast Cross-Platform SIMD Vector Libraries Becoming a console programmer : Math Libraries Part 2 Square Roots in vivo: normalizing vectors Software optimization [...]

Leave a Reply