<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Square Roots in vivo: normalizing vectors</title>
	<atom:link href="http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Wed, 21 Jul 2010 18:22:16 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Soylent</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5171</link>
		<dc:creator>Soylent</dc:creator>
		<pubDate>Wed, 03 Mar 2010 08:24:27 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5171</guid>
		<description>I see. 

Assume the following simplified scenario:

You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.

From this class you derive a bunch of CRockets, CPlayers, CPhysicsObjects, etc. To improve locality of reference, you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate an array big enough that it&#039;s not going to reach capacity in any but pathological cases, or you might use a dynamic array that you reallocate to double the size when it fills up and to half the size when it drops to one quarter full, or something like that.

In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.

Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern, correctly identify that there&#039;s a 224-byte (or whatever) stride, and cache this data ahead of time? If the data is not precached, the processor is going to sit idle and wait on RAM for at least several hundred clock cycles, in which case the stall totally swamps the cost of your arithmetic to the point that the arithmetic barely matters.</description>
		<content:encoded><![CDATA[<p>I see. </p>
<p>Assume the following simplified scenario:</p>
<p>You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.</p>
<p>From this class you derive a bunch of CRockets, CPlayers, CPhysicsObjects, etc. To improve locality of reference, you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate an array big enough that it&#8217;s not going to reach capacity in any but pathological cases, or you might use a dynamic array that you reallocate to double the size when it fills up and to half the size when it drops to one quarter full, or something like that.</p>
<p>In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.</p>
<p>Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern, correctly identify that there&#8217;s a 224-byte (or whatever) stride, and cache this data ahead of time? If the data is not precached, the processor is going to sit idle and wait on RAM for at least several hundred clock cycles, in which case the stall totally swamps the cost of your arithmetic to the point that the arithmetic barely matters.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5166</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Thu, 25 Feb 2010 01:22:32 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5166</guid>
		<description>Gregory: Yes, the magic number sqrt wasn&#039;t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.

Soylent: What you&#039;re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that&#039;s not what I&#039;m timing here. I&#039;m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function -- like performing an angle comparison when an AI selects which enemy to target next. &lt;a href=&quot;http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html&quot; rel=&quot;nofollow&quot;&gt;Mike Acton might say that even game logic should be packed structures of arrays&lt;/a&gt; so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren&#039;t built like that yet. We&#039;re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.</description>
		<content:encoded><![CDATA[<p>Gregory: Yes, the magic number sqrt wasn&#8217;t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.</p>
<p>Soylent: What you&#8217;re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that&#8217;s not what I&#8217;m timing here. I&#8217;m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function &#8212; like performing an angle comparison when an AI selects which enemy to target next. <a href="http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html" rel="nofollow">Mike Acton might say that even game logic should be packed structures of arrays</a> so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren&#8217;t built like that yet. We&#8217;re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Soylent</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5161</link>
		<dc:creator>Soylent</dc:creator>
		<pubDate>Mon, 22 Feb 2010 01:48:11 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5161</guid>
		<description>I&#039;m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.

I wouldn&#039;t have expected you to structure your data in that way. I would have expected either homogeneous vectors { x, y, z, w }, which allow a 4x4 matrix to represent affine and projective transformations (e.g. translations, projection), or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.

If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. rsqrtss -&gt; rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn&#039;t evenly divisible by 8, you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of vectors that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than it is to pollute the instruction cache with extra code).
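
To make the padding idea concrete, here is a minimal sketch in Python rather than SSE. The names LANES, padded_count, make_soa and normalize_soa are made up for the example, and padding with unit vectors is an assumption of mine, chosen so that normalizing the unused slots is harmless.

```python
import math

LANES = 8  # vectors handled per unrolled loop iteration

def padded_count(n):
    """Round n up to the next multiple of LANES."""
    return (n + LANES - 1) // LANES * LANES

def make_soa(vectors):
    """Split (x, y, z) tuples into three padded coordinate arrays."""
    pad = padded_count(len(vectors)) - len(vectors)
    # Assumption: pad with unit vectors so normalizing the padding is harmless.
    padded = list(vectors) + [(1.0, 0.0, 0.0)] * pad
    xs = [v[0] for v in padded]
    ys = [v[1] for v in padded]
    zs = [v[2] for v in padded]
    return xs, ys, zs

def normalize_soa(xs, ys, zs):
    """Normalize every slot in place, padding included, with no scalar tail."""
    for base in range(0, len(xs), LANES):
        # With SSE, each statement in the inner loop would be one packed
        # instruction covering four of these lanes at once.
        for i in range(base, base + LANES):
            len_sq = xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i]
            # rsqrtps would return an approximation of this value.
            inv_len = 1.0 / math.sqrt(len_sq)
            xs[i] *= inv_len
            ys[i] *= inv_len
            zs[i] *= inv_len

xs, ys, zs = make_soa([(3.0, 0.0, 4.0), (0.0, 2.0, 0.0), (1.0, 2.0, 2.0)])
normalize_soa(xs, ys, zs)
print(len(xs), xs[0], zs[0])
```

Three input vectors are padded out to eight slots, so the loop body runs exactly once and never needs a straggler path.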

If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement, then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers (one register containing only x-values, one register containing only y-values and one containing only z-values), do the math, and then convert back to AoS and store.

If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:

movlps   xmm0, [eax]    // xmm0: { -, -, y0, x0 }.
movlps   xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.
movups   xmm5, [eax+12] // unpcklps needs a 16-byte aligned memory operand, so load unaligned first.
unpcklps xmm0, xmm5     // xmm0: { y1, y0, x1, x0 }.
movups   xmm5, [eax+36]
unpcklps xmm1, xmm5     // xmm1: { y3, y2, x3, x2 }.
movss    xmm2, [eax+8]  // xmm2: { 0, 0, 0, z0 }.
movss    xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.
movups   xmm5, [eax+20]
unpcklps xmm2, xmm5     // xmm2: { x2, 0, z1, z0 }.
movups   xmm5, [eax+44]
unpcklps xmm3, xmm5     // xmm3: { x4, 0, z3, z2 }. &lt;- note x4.
movaps   xmm4, xmm0
movlhps  xmm0, xmm1     // xmm0: { x3, x2, x1, x0 }.
movhlps  xmm1, xmm4     // xmm1: { y3, y2, y1, y0 }.
movlhps  xmm2, xmm3     // xmm2: { z3, z2, z1, z0 }.
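
One way to check the lane comments in the load sequence above is to replay it on labelled data. The sketch below is a Python model, not part of the routine itself: each xmm register is a 4-element list with the lowest lane first, and the helper names (window, load_soa and the per-instruction functions) are hypothetical and mimic only the lane movement of the corresponding SSE instructions.

```python
# Model each xmm register as a 4-element list, lowest lane first.
# The comments in the listing above print lanes highest first.

def window(mem, byte_offset):
    """Four consecutive floats starting at byte_offset (4 bytes per float)."""
    start = byte_offset // 4
    return mem[start:start + 4]

def movlps(dst, mem, byte_offset):
    """Load two floats into the low lanes; the high lanes are unchanged."""
    return window(mem, byte_offset)[:2] + dst[2:]

def movss(mem, byte_offset):
    """Load one float into lane 0 and zero the other lanes."""
    return [window(mem, byte_offset)[0], 0, 0, 0]

def unpcklps(dst, src):
    """Interleave the low two lanes of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]

def movlhps(dst, src):
    """Copy the low half of src into the high half of dst."""
    return dst[:2] + src[:2]

def movhlps(dst, src):
    """Copy the high half of src into the low half of dst."""
    return src[2:] + dst[2:]

def load_soa(mem):
    """Replay the load sequence on labelled memory."""
    undefined = ["-", "-", "-", "-"]
    xmm0 = movlps(undefined, mem, 0)
    xmm1 = movlps(undefined, mem, 24)
    xmm0 = unpcklps(xmm0, window(mem, 12))
    xmm1 = unpcklps(xmm1, window(mem, 36))
    xmm2 = movss(mem, 8)
    xmm3 = movss(mem, 32)
    xmm2 = unpcklps(xmm2, window(mem, 20))
    xmm3 = unpcklps(xmm3, window(mem, 44))
    xmm4 = list(xmm0)
    xmm0 = movlhps(xmm0, xmm1)
    xmm1 = movhlps(xmm1, xmm4)
    xmm2 = movlhps(xmm2, xmm3)
    return xmm0, xmm1, xmm2

# Five labelled vectors, so the read of x4 at byte 48 stays in bounds.
memory = [axis + str(i) for i in range(5) for axis in "xyz"]
xs, ys, zs = load_soa(memory)
print(xs, ys, zs)
```

The model needs a fifth vector in memory only because the last 16-byte read starts at byte 44 and runs past z3.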

You don&#039;t want to go off the end of the array (also note that x4 was read above, which can be past the end of the array even if the number of vertices is evenly divisible by 4), so you have to stop early and switch over to a scalar routine to pick up the last few elements.

Do vector normalization in this packed format. When you are done do something like this to store again:

// xmm0: { x3&#039;, x2&#039;, x1&#039;, x0&#039; }
// xmm1: { y3&#039;, y2&#039;, y1&#039;, y0&#039; }
// xmm2: { z3&#039;, z2&#039;, z1&#039;, z0&#039; }
movaps xmm3, xmm0
movhlps xmm4, xmm2     // { -, -, z3&#039;, z2&#039; }
unpcklps xmm0, xmm1    // { y1&#039;, x1&#039;, y0&#039;, x0&#039; }
unpckhps xmm3, xmm1    // { y3&#039;, x3&#039;, y2&#039;, x2&#039; }
movlps [eax], xmm0     // Store { y0&#039;, x0&#039; }
movss [eax+8], xmm2    // Store z0&#039;
movhps [eax+12], xmm0  // Store {y1&#039;, x1&#039;}
unpcklps xmm2, xmm2    // { z1&#039;, z1&#039;, z0&#039;, z0&#039; }
movhps [eax+20], xmm2  // Store z1&#039; (also writes [eax+24], overwritten next).
movlps [eax+24], xmm3  // Store {y2&#039;, x2&#039; }
movss  [eax+32], xmm4  // Store z2&#039;
movhps [eax+36], xmm3  // Store {y3&#039;, x3&#039; }
shufps xmm4, xmm4, 01010101b // { z3&#039;, z3&#039;, z3&#039;, z3&#039; }
movss [eax+44], xmm4  // Store z3&#039;.

Of course, caveat emptor. I haven&#039;t tested this code at all; I&#039;ve just jotted a bit on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).</description>
		<content:encoded><![CDATA[<p>I&#8217;m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.</p>
<p>I wouldn&#8217;t have expected you to structure your data in that way. I would have expected either homogeneous vectors { x, y, z, w }, which allow a 4&#215;4 matrix to represent affine and projective transformations (e.g. translations, projection), or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.</p>
<p>If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. rsqrtss -&gt; rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn&#8217;t evenly divisible by 8, you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of vectors that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than it is to pollute the instruction cache with extra code).</p>
<p>If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement, then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers (one register containing only x-values, one register containing only y-values and one containing only z-values), do the math, and then convert back to AoS and store.</p>
<p>If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:</p>
<p>movlps   xmm0, [eax]    // xmm0: { -, -, y0, x0 }.<br />
movlps   xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.<br />
movups   xmm5, [eax+12] // unpcklps needs a 16-byte aligned memory operand, so load unaligned first.<br />
unpcklps xmm0, xmm5     // xmm0: { y1, y0, x1, x0 }.<br />
movups   xmm5, [eax+36]<br />
unpcklps xmm1, xmm5     // xmm1: { y3, y2, x3, x2 }.<br />
movss    xmm2, [eax+8]  // xmm2: { 0, 0, 0, z0 }.<br />
movss    xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.<br />
movups   xmm5, [eax+20]<br />
unpcklps xmm2, xmm5     // xmm2: { x2, 0, z1, z0 }.<br />
movups   xmm5, [eax+44]<br />
unpcklps xmm3, xmm5     // xmm3: { x4, 0, z3, z2 }. &lt;- note x4.<br />
movaps   xmm4, xmm0<br />
movlhps  xmm0, xmm1     // xmm0: { x3, x2, x1, x0 }.<br />
movhlps  xmm1, xmm4     // xmm1: { y3, y2, y1, y0 }.<br />
movlhps  xmm2, xmm3     // xmm2: { z3, z2, z1, z0 }.</p>
<p>You don&#039;t want to go off the end of the array (also note that x4 was read above, which can be past the end of the array even if the number of vertices is evenly divisible by 4), so you have to stop early and switch over to a scalar routine to pick up the last few elements.</p>
<p>Do vector normalization in this packed format. When you are done do something like this to store again:</p>
<p>// xmm0: { x3&#039;, x2&#039;, x1&#039;, x0&#039; }<br />
// xmm1: { y3&#039;, y2&#039;, y1&#039;, y0&#039; }<br />
// xmm2: { z3&#039;, z2&#039;, z1&#039;, z0&#039; }<br />
movaps xmm3, xmm0<br />
movhlps xmm4, xmm2     // { -, -, z3&#039;, z2&#039; }<br />
unpcklps xmm0, xmm1    // { y1&#039;, x1&#039;, y0&#039;, x0&#039; }<br />
unpckhps xmm3, xmm1    // { y3&#039;, x3&#039;, y2&#039;, x2&#039; }<br />
movlps [eax], xmm0     // Store { y0&#039;, x0&#039; }<br />
movss [eax+8], xmm2    // Store z0&#039;<br />
movhps [eax+12], xmm0  // Store {y1&#039;, x1&#039;}<br />
unpcklps xmm2, xmm2    // { z1&#039;, z1&#039;, z0&#039;, z0&#039; }<br />
movhps [eax+20], xmm2  // Store z1&#039; (also writes [eax+24], overwritten next).<br />
movlps [eax+24], xmm3  // Store {y2&#039;, x2&#039; }<br />
movss  [eax+32], xmm4  // Store z2&#039;<br />
movhps [eax+36], xmm3  // Store {y3&#039;, x3&#039; }<br />
shufps xmm4, xmm4, 01010101b // { z3&#039;, z3&#039;, z3&#039;, z3&#039; }<br />
movss [eax+44], xmm4  // Store z3&#039;.</p>
<p>Of course, caveat emptor. I haven&#039;t tested this code at all; I&#039;ve just jotted a bit on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5153</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Mon, 15 Feb 2010 18:04:07 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5153</guid>
		<description>So you dropped Q3&#039;s fast inverse square root from this benchmark, compared to the previous article, because it should be around 3.5 times slower than SSE rsqrtss (according to the previous benchmark results)?</description>
		<content:encoded><![CDATA[<p>So you dropped Q3&#8217;s fast inverse square root from this benchmark, compared to the previous article, because it should be around 3.5 times slower than SSE rsqrtss (according to the previous benchmark results)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5102</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Wed, 04 Nov 2009 09:43:06 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5102</guid>
		<description>Oh, distance checking was just the simple example I used for clarity. In practice the most common use of sqrt by far is to normalize vectors, which we do for a hundred different things: computing lighting, constructing a coordinate basis from local objects, extracting an absolute angle from a dot product, making rotation quaternions, turning distance vectors into a unit-length direction, etc.  
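
As a rough illustration of two of those uses (unit-length directions and angles from a dot product), here is a minimal Python sketch. The function names and the sample vectors are made up for the example; a real engine would do this in C++ with its own vector type.

```python
import math

def normalize(v):
    """Scale v to unit length; this is where the square root is spent."""
    inv_len = 1.0 / math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] * inv_len, v[1] * inv_len, v[2] * inv_len)

def angle_between(a, b):
    """Angle in radians between two nonzero vectors."""
    ua, ub = normalize(a), normalize(b)
    dot = ua[0] * ub[0] + ua[1] * ub[1] + ua[2] * ub[2]
    # Clamp so rounding error cannot push the cosine outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))

to_enemy = (10.0, 0.0, 10.0)  # offset from the AI to a candidate target
facing = (0.0, 0.0, 2.0)      # direction the AI is looking, not unit length
print(math.degrees(angle_between(facing, to_enemy)))
```

Neither input has to be unit length, because both are normalized before the dot product is taken.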

I also occasionally need the sqrt itself to solve a quadratic, like computing an apsis or a ballistic intercept.</description>
		<content:encoded><![CDATA[<p>Oh, distance checking was just the simple example I used for clarity. In practice the most common use of sqrt by far is to normalize vectors, which we do for a hundred different things: computing lighting, constructing a coordinate basis from local objects, extracting an absolute angle from a dot product, making rotation quaternions, turning distance vectors into a unit-length direction, etc.  </p>
<p>I also occasionally need the sqrt itself to solve a quadratic, like computing an apsis or a ballistic intercept.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike Dunlavey</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5101</link>
		<dc:creator>Mike Dunlavey</dc:creator>
		<pubDate>Tue, 03 Nov 2009 15:27:54 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5101</guid>
		<description>Nice work. I just have a dumb question (and maybe you answered it somewhere): If the main reason for square root is proximity detection, why bother, because sqrt(a^2+b^2) &lt; r
iff (a^2+b^2) &lt; r^2 ?</description>
		<content:encoded><![CDATA[<p>Nice work. I just have a dumb question (and maybe you answered it somewhere): If the main reason for square root is proximity detection, why bother, because sqrt(a^2+b^2) &lt; r<br />
iff (a^2+b^2) &lt; r^2 ?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
