<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for Some Assembly Required</title>
	<atom:link href="http://assemblyrequired.crashworks.org/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Thu, 04 Mar 2010 18:34:17 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Down With fcmp: Conditional Moves For Branchless Math by &#187; Stupid C++ vs C# performance comparison Florent Clairambault</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/04/fcmp-conditional-moves-for-branchless-math/comment-page-1/#comment-5174</link>
		<dc:creator>&#187; Stupid C++ vs C# performance comparison Florent Clairambault</dc:creator>
		<pubDate>Thu, 04 Mar 2010 18:34:17 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=50#comment-5174</guid>
		<description>[...] Conditional moves [...]</description>
		<content:encoded><![CDATA[<p>[...] Conditional moves [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Square Roots in vivo: normalizing vectors by Soylent</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5171</link>
		<dc:creator>Soylent</dc:creator>
		<pubDate>Wed, 03 Mar 2010 08:24:27 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5171</guid>
		<description>I see. 

Assume the following simplified scenario:

You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.

From this class you derive a bunch of CRockets, CPlayers, CPhysicsObject etc. To improve locality of reference you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate a big enough array that it&#039;s not going to reach capacity under any but pathological cases or you might use some dynamic array that you reallocate to double the size if it&#039;s filled to capacity or reallocate to half the size if emptied to one quarter or something like that.

In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.

Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern and correctly identify that there&#039;s a 224(or whatever) byte stride and that it is supposed to cache this data ahead of time? If the data is not precached the processor is going to sit idle and wait on RAM for at least several hundred clock cycles in which case it totally swamps the performance of your arithmetic to the point that it barely matters.</description>
		<content:encoded><![CDATA[<p>I see. </p>
<p>Assume the following simplified scenario:</p>
<p>You have a base class for objects that exist in the game world called CEntity that is maybe a couple of hundred bytes or so of data members that describe all its necessary state data, the position, a pair of angles or maybe a transformation matrix, the velocity, whether it has a collision mesh, whether it is static and so on.</p>
<p>From this class you derive a bunch of CRockets, CPlayers, CPhysicsObject etc. To improve locality of reference you store all your CEntities in a simple array rather than spreading them out in memory. You might just allocate a big enough array that it&#8217;s not going to reach capacity under any but pathological cases or you might use some dynamic array that you reallocate to double the size if it&#8217;s filled to capacity or reallocate to half the size if emptied to one quarter or something like that.</p>
<p>In order to give the processor the best possible chance to precache necessary data your update loop calls the Update() method on the CEntity at index zero, then index one and so on in the order they are stored in memory. Each Update() does one normalization among various other stuff.</p>
<p>Under these conditions, which are about as predictable as possible, will the processor look at the memory access pattern and correctly identify that there&#8217;s a 224(or whatever) byte stride and that it is supposed to cache this data ahead of time? If the data is not precached the processor is going to sit idle and wait on RAM for at least several hundred clock cycles in which case it totally swamps the performance of your arithmetic to the point that it barely matters.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Square Roots in vivo: normalizing vectors by Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5166</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Thu, 25 Feb 2010 01:22:32 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5166</guid>
		<description>Gregory: Yes, the magic number sqrt wasn&#039;t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.

Soylent: What you&#039;re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that&#039;s not what I&#039;m timing here. I&#039;m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function -- like performing an angle comparison when an AI selects which enemy to target next. &lt;a href=&quot;http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html&quot; rel=&quot;nofollow&quot;&gt;Mike Acton might say that even game logic should be packed structures of arrays&lt;/a&gt; so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren&#039;t built like that yet. We&#039;re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.</description>
		<content:encoded><![CDATA[<p>Gregory: Yes, the magic number sqrt wasn&#8217;t worth trying given its performance previously. It also induces a certain pathological behavior on the Xenon core that makes it a nonstarter.</p>
<p>Soylent: What you&#8217;re suggesting indeed makes sense when one has long lists of packed vectors to transform, but that&#8217;s not what I&#8217;m timing here. I&#8217;m looking at cases where you have to perform a single vector normalization as part of the logic in some larger function &#8212; like performing an angle comparison when an AI selects which enemy to target next. <a href="http://cellperformance.beyond3d.com/articles/2008/03/three-big-lies.html" rel="nofollow">Mike Acton might say that even game logic should be packed structures of arrays</a> so that instead of having one entity per AI, you have a big structure of all the positions for all AIs, all velocities for all AIs, all nav state for all AIs, and so on; but games just aren&#8217;t built like that yet. We&#8217;re still in a world where each rocket is represented by an instance of a CRocket class, and each CRocket has its own Update() function, and each CRocket does its own logic one at a time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Timing square root by Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5165</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Wed, 24 Feb 2010 13:38:22 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5165</guid>
		<description>Hello Elan,

Thank you for the answer.

Since I posted my comment, I came across http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx so I think I&#039;m going to implement my high precision timer using QPC on Windows and gettimeofday on Linux and Mac.

Or the easy way seems to be Boost.PTime</description>
		<content:encoded><![CDATA[<p>Hello Elan,</p>
<p>Thank you for the answer.</p>
<p>Since I posted my comment, I came across <a href="http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx" rel="nofollow">http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx</a> so I think I&#8217;m going to implement my high precision timer using QPC on Windows and gettimeofday on Linux and Mac.</p>
<p>Or the easy way seems to be Boost.PTime</p>
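<p>A minimal sketch of the timer described above, assuming a hypothetical helper named NowSeconds() that wraps QueryPerformanceCounter on Windows and gettimeofday elsewhere. The Windows branch is untested here, so treat this as an illustration rather than a finished implementation.</p>

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif

// Hypothetical helper (not from the article): current time in seconds.
double NowSeconds()
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);  // ticks per second
    QueryPerformanceCounter(&count);   // current tick count
    return static_cast<double>(count.QuadPart) /
           static_cast<double>(freq.QuadPart);
#else
    timeval tv;
    gettimeofday(&tv, nullptr);        // microsecond resolution at best
    return static_cast<double>(tv.tv_sec) +
           static_cast<double>(tv.tv_usec) * 1e-6;
#endif
}
```

<p>Calling NowSeconds() before and after the code under test and subtracting gives the elapsed time. Note that gettimeofday reports wall-clock time, so a system clock adjustment during the measurement would skew the result.</p>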
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Timing square root by Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5163</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Tue, 23 Feb 2010 04:28:21 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5163</guid>
		<description>We use rdtsc to implement our cycle counter, since it seems to have the best resolution of any of the timers available.

It&#039;s best to put target code in a loop for the same reason you would weigh rice by the thousand rather than one grain at a time on your kitchen scale: systematic and random error.

First, there is a certain &lt;a href=&quot;http://en.wikipedia.org/wiki/Systematic_error&quot; rel=&quot;nofollow&quot;&gt;systematic error&lt;/a&gt; in querying the cycle counter: the StartCycleCounter() inline function call and the rdtsc op itself have some latency, and the timer is probably only accurate to within a couple of nanoseconds, so measuring a single iteration of an 86ns operation would have a large relative error. On the other hand, an error of 10&lt;sup&gt;-8&lt;/sup&gt; seconds in a measurement of 10&lt;sup&gt;-3&lt;/sup&gt; seconds is relatively much smaller, and so the result is more accurate.

Also, timings &lt;em&gt;in vivo&lt;/em&gt; can be &lt;a href=&quot;http://en.wikipedia.org/wiki/Random_error&quot; rel=&quot;nofollow&quot;&gt;messy&lt;/a&gt;: any single iteration of the loop might take a little longer than expected because of other threads, memory bus contention, clock variability, operating system intervention, even CPU temperature. Taking multiple measurements, or a single measurement of multiple iterations, improves statistical significance and narrows the &lt;a href=&quot;http://en.wikipedia.org/wiki/Confidence_interval&quot; rel=&quot;nofollow&quot;&gt;error bars&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>We use rdtsc to implement our cycle counter, since it seems to have the best resolution of any of the timers available.</p>
<p>It&#8217;s best to put target code in a loop for the same reason you would weigh rice by the thousand rather than one grain at a time on your kitchen scale: systematic and random error.</p>
<p>First, there is a certain <a href="http://en.wikipedia.org/wiki/Systematic_error" rel="nofollow">systematic error</a> in querying the cycle counter: the StartCycleCounter() inline function call and the rdtsc op itself have some latency, and the timer is probably only accurate to within a couple of nanoseconds, so measuring a single iteration of an 86ns operation would have a large relative error. On the other hand, an error of 10<sup>-8</sup> seconds in a measurement of 10<sup>-3</sup> seconds is relatively much smaller, and so the result is more accurate.</p>
<p>Also, timings <em>in vivo</em> can be <a href="http://en.wikipedia.org/wiki/Random_error" rel="nofollow">messy</a>: any single iteration of the loop might take a little longer than expected because of other threads, memory bus contention, clock variability, operating system intervention, even CPU temperature. Taking multiple measurements, or a single measurement of multiple iterations, improves statistical significance and narrows the <a href="http://en.wikipedia.org/wiki/Confidence_interval" rel="nofollow">error bars</a>.</p>
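<p>The rice-weighing idea can be sketched as follows. This is an illustration only: std::chrono::steady_clock stands in for the rdtsc-based cycle counter, and TimePerIteration and TimeSqrt are hypothetical names, not the StartCycleCounter() code from the article.</p>

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical helper: average seconds per call of `op` over `iterations`
// calls. The two clock reads are paid once for the whole batch, so their
// fixed overhead is divided by the iteration count.
template <typename Op>
double TimePerIteration(Op op, std::size_t iterations)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        op(i);
    const Clock::time_point stop = Clock::now();
    const std::chrono::duration<double> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}

// Example workload: square roots accumulated into a volatile sink so the
// compiler cannot discard the loop body.
double TimeSqrt(std::size_t iterations)
{
    volatile float sink = 0.0f;
    return TimePerIteration(
        [&sink](std::size_t i) { sink = sink + std::sqrt(static_cast<float>(i)); },
        iterations);
}
```

<p>Running TimeSqrt several times and comparing the results gives a rough feel for the random error on top of the amortized systematic error.</p>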
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Square Roots in vivo: normalizing vectors by Soylent</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5161</link>
		<dc:creator>Soylent</dc:creator>
		<pubDate>Mon, 22 Feb 2010 01:48:11 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5161</guid>
		<description>I&#039;m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.

I wouldn&#039;t have expected you to structure your data in that way. I would have expected either uniform vectors { x, y, z, w } which allow a 4x4 matrix to represent an affine transformation (e.g. projection, translations) or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.

If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. rsqrtss -&gt; rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn&#039;t evenly divisible by 8 you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of floats that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than it is to pollute the instruction cache with extra code).

If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers(one register containing only x-values, one register containing only y-values and one containing only z-values), do the math and then convert back to AoS and store.

If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:

movlps   xmm0, [eax]    // xmm0: { -, -, y0, x0 }.
movlps   xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.
unpcklps xmm0, [eax+12] // xmm0: { y1, y0, x1, x0}.
unpcklps xmm1, [eax+36] // xmm1: { y3, y2, x3, x2}.
movss    xmm2, [eax+8]  // xmm2: { 0, 0, 0, z0 }.
movss    xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.
unpcklps xmm2, [eax+20] // xmm2: { x2, 0, z1, z0 }.
unpcklps xmm3, [eax+44] // xmm3: { x4, 0, z3, z2 }. &lt;- note x4.
movaps   xmm4, xmm0;
movlhps  xmm0, xmm1;    // xmm0: { x3, x2, x1, x0 }.
movhlps  xmm1, xmm4;    // xmm1: { y3, y2, y1, y0 }.
movlhps  xmm2, xmm3;    // xmm2: { z3, z2, z1, z0 }.

You don&#039;t want to go off the end of the array(also note that x4 was read above, which can be off the array even if the number of vertices is evenly divisible by 4), so you have to stop and switch over to a scalar routine to pick up the last few elements.

Do vector normalization in this packed format. When you are done do something like this to store again:

// xmm0: { x3&#039;, x2&#039;, x1&#039;, x0&#039; }
// xmm1: { y3&#039;, y2&#039;, y1&#039;, y0&#039; }
// xmm2: { z3&#039;, z2&#039;, z1&#039;, z0&#039; }
movaps xmm3, xmm0
movhlps xmm4, xmm2     // { -, -, z3&#039;, z2&#039; }
unpcklps xmm0, xmm1    // { y1&#039;, x1&#039;, y0&#039;, x0&#039; }
unpckhps xmm3, xmm1    // { y3&#039;, x3&#039;, y2&#039;, x2&#039; }
movlps [eax], xmm0     // Store { y0&#039;, x0&#039; }
movss [eax+8], xmm2    // Store z0&#039;
movhps [eax+12], xmm0  // Store {y1&#039;, x1&#039;}
unpcklps xmm2, xmm2    // { z1&#039;, z1&#039;, z0&#039;, z0&#039; }
movhps [eax+20], xmm2  // Store z1&#039;.
movlps [eax+24], xmm3  // Store {y2&#039;, x2&#039; }
movss  [eax+32], xmm4  // Store z2&#039;
movhps [eax+36], xmm3  // Store {y3&#039;, x3&#039; }
shufps xmm4, xmm4, 01010101b // { z3&#039;, z3&#039;, z3&#039;, z3&#039; }
movss [eax+44], xmm4  // Store z3&#039;.

Of course, caveat emptor. I haven&#039;t tested this code at all. I&#039;ve just jotted a bit down on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).</description>
		<content:encoded><![CDATA[<p>I&#8217;m not a game developer, nor do I play one on the internets. I am however a hobbyist SSEx user.</p>
<p>I wouldn&#8217;t have expected you to structure your data in that way. I would have expected either uniform vectors { x, y, z, w } which allow a 4&#215;4 matrix to represent an affine transformation (e.g. projection, translations) or a structure of arrays (SoA), where the x-coordinates, y-coordinates and z-coordinates are kept in separate 16-byte aligned arrays.</p>
<p>If you use the SoA approach you would do aligned loads from packed x-vectors, packed y-vectors and packed z-vectors and operate on them in exactly the same way you do with the scalar instructions; you just use the corresponding packed instruction (e.g. rsqrtss -&gt; rsqrtps). You probably want to unroll the loop so that you are normalizing 8 vectors at a time. If the number of vectors to normalize isn&#8217;t evenly divisible by 8 you can use some scalar code to pick up any stragglers. You can also force the user to allocate a number of floats that is evenly divisible by 8 so that you never have any stragglers (that leaves between 0 and 7 unused vectors at the end of the array; it is probably cheaper to normalize these unused vectors than it is to pollute the instruction cache with extra code).</p>
<p>If you are forced for some reason to work on 3-component vectors in an AoS memory arrangement then I believe it may still be more efficient to convert from AoS to SoA as you load into the xmm registers(one register containing only x-values, one register containing only y-values and one containing only z-values), do the math and then convert back to AoS and store.</p>
<p>If eax is a pointer to the current position in the array you could use a sequence like the following to load the vectors:</p>
<p>movlps   xmm0, [eax]    // xmm0: { -, -, y0, x0 }.<br />
movlps   xmm1, [eax+24] // xmm1: { -, -, y2, x2 }.<br />
unpcklps xmm0, [eax+12] // xmm0: { y1, y0, x1, x0}.<br />
unpcklps xmm1, [eax+36] // xmm1: { y3, y2, x3, x2}.<br />
movss    xmm2, [eax+8]  // xmm2: { 0, 0, 0, z0 }.<br />
movss    xmm3, [eax+32] // xmm3: { 0, 0, 0, z2 }.<br />
unpcklps xmm2, [eax+20] // xmm2: { x2, 0, z1, z0 }.<br />
unpcklps xmm3, [eax+44] // xmm3: { x4, 0, z3, z2 }. &lt;- note x4.<br />
movaps   xmm4, xmm0;<br />
movlhps  xmm0, xmm1;    // xmm0: { x3, x2, x1, x0 }.<br />
movhlps  xmm1, xmm4;    // xmm1: { y3, y2, y1, y0 }.<br />
movlhps  xmm2, xmm3;    // xmm2: { z3, z2, z1, z0 }.</p>
<p>You don&#039;t want to go off the end of the array(also note that x4 was read above, which can be off the array even if the number of vertices is evenly divisible by 4), so you have to stop and switch over to a scalar routine to pick up the last few elements.</p>
<p>Do vector normalization in this packed format. When you are done do something like this to store again:</p>
<p>// xmm0: { x3&#039;, x2&#039;, x1&#039;, x0&#039; }<br />
// xmm1: { y3&#039;, y2&#039;, y1&#039;, y0&#039; }<br />
// xmm2: { z3&#039;, z2&#039;, z1&#039;, z0&#039; }<br />
movaps xmm3, xmm0<br />
movhlps xmm4, xmm2     // { -, -, z3&#039;, z2&#039; }<br />
unpcklps xmm0, xmm1    // { y1&#039;, x1&#039;, y0&#039;, x0&#039; }<br />
unpckhps xmm3, xmm1    // { y3&#039;, x3&#039;, y2&#039;, x2&#039; }<br />
movlps [eax], xmm0     // Store { y0&#039;, x0&#039; }<br />
movss [eax+8], xmm2    // Store z0&#039;<br />
movhps [eax+12], xmm0  // Store {y1&#039;, x1&#039;}<br />
unpcklps xmm2, xmm2    // { z1&#039;, z1&#039;, z0&#039;, z0&#039; }<br />
movhps [eax+20], xmm2  // Store z1&#039;.<br />
movlps [eax+24], xmm3  // Store {y2&#039;, x2&#039; }<br />
movss  [eax+32], xmm4  // Store z2&#039;<br />
movhps [eax+36], xmm3  // Store {y3&#039;, x3&#039; }<br />
shufps xmm4, xmm4, 01010101b // { z3&#039;, z3&#039;, z3&#039;, z3&#039; }<br />
movss [eax+44], xmm4  // Store z3&#039;.</p>
<p>Of course, caveat emptor. I haven&#039;t tested this code at all. I&#039;ve just jotted a bit down on a piece of paper. It is probably not worth trying to process 8 vectors at a time unless you are in 64-bit mode (8 extra xmm registers).</p>
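<p>For comparison, a scalar C++ sketch of the SoA layout with a padded element count. Vec3SoA and NormalizeAll are hypothetical names. It uses an exact 1/sqrt rather than the rsqrtps estimate and leaves any vectorization to the compiler, so it shows the data layout rather than the hand-written SSE.</p>

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container: one array per component.
struct Vec3SoA
{
    std::vector<float> x, y, z;

    // Round the element count up to a multiple of 8 so an unrolled loop
    // never needs straggler code. Padding entries are (1, 0, 0), which
    // keeps the normalization below free of division by zero.
    explicit Vec3SoA(std::size_t count)
        : x((count + 7) / 8 * 8, 1.0f),
          y(x.size(), 0.0f),
          z(x.size(), 0.0f)
    {
    }
};

// Normalize every vector, padding included. Each component array is read
// and written with unit stride. Zero-length input vectors are not handled.
void NormalizeAll(Vec3SoA& v)
{
    const std::size_t n = v.x.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        const float lenSq = v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i];
        const float invLen = 1.0f / std::sqrt(lenSq);
        v.x[i] *= invLen;
        v.y[i] *= invLen;
        v.z[i] *= invLen;
    }
}
```

<p>Normalizing the padding along with the real data is the trade described above: a few wasted vectors in exchange for no clean-up loop.</p>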
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Timing square root by Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5155</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Wed, 17 Feb 2010 19:14:26 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5155</guid>
		<description>I&#039;m curious about what&#039;s behind StartClockCycleCounter(); and StopClockCycleCounter();

What&#039;s the best way to benchmark a portion of code on PC? QueryPerformanceCounter (Windows only), RDTSC? Something else?

And is there a point putting the target code inside a loop?</description>
		<content:encoded><![CDATA[<p>I&#8217;m curious about what&#8217;s behind StartClockCycleCounter(); and StopClockCycleCounter();</p>
<p>What&#8217;s the best way to benchmark a portion of code on PC? QueryPerformanceCounter (Windows only), RDTSC? Something else?</p>
<p>And is there a point putting the target code inside a loop?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on More on __restrict by Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2008/09/06/more-on-__restrict/comment-page-1/#comment-5154</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Wed, 17 Feb 2010 10:36:28 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.wordpress.com/?p=14#comment-5154</guid>
		<description>Here is the working URL for Mike Acton&#039;s article on understanding strict aliasing:

http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html</description>
		<content:encoded><![CDATA[<p>Here is the working URL for Mike Acton&#8217;s article on understanding strict aliasing:</p>
<p><a href="http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html" rel="nofollow">http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Square Roots in vivo: normalizing vectors by Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/comment-page-1/#comment-5153</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Mon, 15 Feb 2010 18:04:07 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329#comment-5153</guid>
		<description>So you dropped Q3&#039;s fast inverse square root in this benchmark compared to the previous article because it should be around 3.5 times slower compared to SSE rsqrtss (according to the previous benchmark results)?</description>
		<content:encoded><![CDATA[<p>So you dropped Q3&#8217;s fast inverse square root in this benchmark compared to the previous article because it should be around 3.5 times slower compared to SSE rsqrtss (according to the previous benchmark results)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Load-hit-stores and the __restrict keyword by Robert</title>
		<link>http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/comment-page-1/#comment-5148</link>
		<dc:creator>Robert</dc:creator>
		<pubDate>Wed, 06 Jan 2010 09:00:47 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.wordpress.com/?p=8#comment-5148</guid>
		<description>Actually, it&#039;s assumed that any pointer can alias a variable of any type unless you compile with a strict-aliasing flag.  Look up type-punning.

Here&#039;s a good article:  http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html</description>
		<content:encoded><![CDATA[<p>Actually, it&#8217;s assumed that any pointer can alias a variable of any type unless you compile with a strict-aliasing flag.  Look up type-punning.</p>
<p>Here&#8217;s a good article:  <a href="http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html" rel="nofollow">http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>
