<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Some Assembly Required</title>
	<atom:link href="http://assemblyrequired.crashworks.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Wed, 04 Nov 2009 09:52:17 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Square Roots in vivo: normalizing vectors</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 15:55:59 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329</guid>
		<description><![CDATA[Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector [...]]]></description>
			<content:encoded><![CDATA[<p>Following <a href="http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/">my earlier article on timing various square-root functions on the x86</a>, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.</p>
<p>The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, this means that three out of four vectors <b>were not aligned</b> to a 16-byte boundary, which is significant for the SIMD test case as I had to use the <code>movups</code> unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.</p>
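<p>To make the alignment arithmetic concrete, here is a minimal sketch, not part of the timed code, that counts how many packed vectors start on a 16-byte boundary. The name <code>CountAlignedVectors</code> is hypothetical, and the sketch assumes that the array itself begins on a 16-byte boundary and that a float is four bytes.</p>

```cpp
// Counts how many vectors in a tightly packed array of 3-float vectors
// begin on a 16-byte boundary. Assumes (for illustration) that the array
// itself starts on a 16-byte boundary.
inline int CountAlignedVectors( int numVectors )
{
    const int stride = 3 * int( sizeof(float) ); // 12 bytes per packed vector
    int aligned = 0;
    for ( int i = 0; i < numVectors; ++i )
    {
        if ( ( i * stride ) % 16 == 0 ) // byte offset of vector i from the base
            ++aligned;
    }
    return aligned;
}
```

<p>With a 12-byte stride only every fourth vector lands on a boundary, so <code>CountAlignedVectors( 682 )</code> gives 171, leaving 511 of the 682 vectors unaligned.</p>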
<p>Each normalize function computed the reciprocal of the vector&#8217;s length, 1/&radic;(x<sup>2</sup> + y<sup>2</sup> + z<sup>2</sup>), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference was in how the reciprocal square root was computed:</p>
<ul>
<li>via the x87 FPU, by simply compiling <code>1.0f/sqrt( x*x + y*y + z*z )</code></li>
<li>via the SSE scalar unit, by compiling <code>1.0f/sqrt( x*x + y*y + z*z )</code> with the <a href="http://msdn.microsoft.com/en-us/library/7t5yh4fd(VS.80).aspx">/arch:SSE2</a> option set; this causes the compiler to issue a <code>sqrtss</code> followed by a <code>divss</code>; <i>i.e.</i>, it computes the square root and then divides one by it</li>
<li>via the SSE scalar unit, by using the estimated reciprocal square root intrinsic and then performing one step of Newton-Raphson iteration</li>
<li>via the SSE SIMD unit,  working on the whole vector at once</li>
</ul>
<p>In all cases the results were accurate to 22 bits of precision. The results for 1,396,736 vector normalizations were:</p>
<div align="center" >
<table border class="padded">
<tr>
<th>Method</th>
<th>Total time</th>
<th>Time per vector</th>
</tr>
<tr>
<td>Compiler <code>1.0/sqrt(x)</code> <br />x87 FPU <code>FSQRT</code></td>
<td>52.469ms</td>
<td>37.6ns</td>
</tr>
<tr>
<td>Compiler <code>1.0/sqrt(x)</code> <br />SSE scalar <code>sqrtss</code></td>
<td>27.233ms</td>
<td>19.5ns</td>
</tr>
<tr>
<td>SSE <b>scalar</b> ops<br /><code>rsqrtss</code> with one NR step</td>
<td>21.631ms</td>
<td>15.5ns</td>
</tr>
<tr>
<td>SSE SIMD ops <br /><code>rsqrtss</code> with one NR step</td>
<td>20.034ms</td>
<td>14.3ns</td>
</tr>
</table>
</div>
<p>Two things jump out here. First, even when the square root op is surrounded by lots of other math &mdash; multiplies, adds, loads, stores &mdash; optimizations such as this can make a huge difference. It&#8217;s not just the cost of the sqrt itself, but also that it&#8217;s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it&#8217;s entirely completed. </p>
<p>Second, in this case, SIMD is only a very modest benefit. That&#8217;s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what&#8217;s meant by &#8220;horizontal&#8221; SIMD computation &mdash; operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V &ni; &lt;x,y,z&gt;, the sum x + y + z is <i>horizontal</i>, but with two vectors V<sub>1</sub> and V<sub>2</sub>, V<sub>3</sub> = &lt;x<sub>1</sub>+x<sub>2</sub>, y<sub>1</sub>+y<sub>2</sub>, z<sub>1</sub>+z<sub>2</sub>&gt; is <i>vertical</i>.) So it really doesn&#8217;t play to SIMD&#8217;s strengths at all.</p>
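<p>The horizontal/vertical distinction can be sketched in plain scalar code. The type <code>Vec3</code> and both function names are hypothetical and exist only for this illustration:</p>

```cpp
// A minimal 3D vector type, assumed here purely for illustration.
struct Vec3 { float x, y, z; };

// Horizontal: combines the components of a single vector into one scalar.
// The last step of a dot product has this shape.
inline float HorizontalSum( Vec3 v )
{
    return v.x + v.y + v.z;
}

// Vertical: combines corresponding components of two vectors. This is the
// shape of work a SIMD add performs in a single instruction.
inline Vec3 VerticalAdd( Vec3 a, Vec3 b )
{
    Vec3 r = { a.x + b.x, a.y + b.y, a.z + b.z };
    return r;
}
```

<p>A single normalize needs the first kind of operation to finish the dot product, which is why one vector at a time leaves most of the SIMD register idle.</p>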
<p>On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn&#8217;t to test performance in tight loops over packed data &mdash; it was to figure out the best way to do something like an angle check in the middle of a character&#8217;s AI, where you usually deal with one vector at a time.</p>
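<p>As a rough sketch of what &#8220;four vectors at a time&#8221; would look like, here is a scalar mock-up using a structure-of-arrays layout. The type <code>Vec3x4</code> and the function <code>NormalizeFour</code> are hypothetical, this was not one of the timed cases, and the plain <code>std::sqrt</code> stands in for a four-wide reciprocal square root estimate plus a Newton-Raphson step.</p>

```cpp
#include <cmath>

// Four 3D vectors in structure-of-arrays form: all x's together, then all
// y's, then all z's. (Hypothetical layout; the testbed uses packed xyz.)
struct Vec3x4
{
    float x[4];
    float y[4];
    float z[4];
};

// Normalizes four vectors at once. Every step is vertical: lane i of one
// array only ever combines with lane i of the others, so each line of the
// loop body corresponds to one four-wide SIMD operation.
inline void NormalizeFour( Vec3x4 *out, const Vec3x4 *in )
{
    for ( int i = 0; i < 4; ++i )
    {
        const float lenSq = in->x[i] * in->x[i]
                          + in->y[i] * in->y[i]
                          + in->z[i] * in->z[i];
        const float rlen = 1.0f / std::sqrt( lenSq ); // stand-in for a SIMD rsqrt + NR step
        out->x[i] = in->x[i] * rlen;
        out->y[i] = in->y[i] * rlen;
        out->z[i] = in->z[i] * rlen;
    }
}
```

<p>The catch is that gameplay code rarely has its vectors laid out this way, so the data would first have to be shuffled into this form, and zero-length inputs would still need separate handling.</p>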
<p>Source code for my testing functions is below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector&#8217;s length. The hand-written intrinsic versions probably aren&#8217;t totally optimal, but they ought to be good enough to make the point.<br />
<span id="more-329"></span></p>
<p><a style="display:none;" id="ddetlink1998205190" href="javascript:expand(document.getElementById('ddet1998205190'))">Naive vector normalize, x87 FPU or SSE scalar</a>
<div class="ddet_div" id="ddet1998205190"><script language="JavaScript" type="text/javascript">expand(document.getElementById('ddet1998205190'));expand(document.getElementById('ddetlink1998205190'))</script>
<u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// Normalizes an assumed 3-element vector starting</span>
<span style="color: #666666;">// at pointer V, and returns the length of the original</span>
<span style="color: #666666;">// vector.</span>
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> NaiveTestNormalize<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> l <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">+</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">+</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> rsqt <span style="color: #000080;">=</span> <span style="color:#800080;">1.0f</span> <span style="color: #000040;">/</span> <span style="color: #0000dd;">sqrt</span><span style="color: #008000;">&#40;</span>l<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">return</span> rsqt <span style="color: #000040;">*</span> l<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p><u>Assembly (x87 FPU)</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">PROC</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize, COMDAT</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 396  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 397  :        const float rsqt = 1.0f / sqrt(l);</span>
<span style="color: #666666; font-style: italic;">; 398  :        vOut[0] = vIn[0] * rsqt;</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ecx</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">faddp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">faddp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fsqrt</span>
        <span style="color: #0000ff; font-weight: bold;">fld1</span>
        <span style="color: #0000ff; font-weight: bold;">fdivrp</span>  <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 399  :        vOut[1] = vIn[1] * rsqt;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 400  :        vOut[2] = vIn[2] * rsqt;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 401  :        return rsqt * l;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 402  : }</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">ENDP</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p><u>Assembly (compiler-issued SSE scalar)</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_l$ = <span style="color: #339933;">-</span><span style="color: #0000ff;">4</span>                                                <span style="color: #666666; font-style: italic;">; size = 4</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_rsqt$ = <span style="color: #0000ff;">12</span>                                             <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">PROC</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize, COMDAT</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 392  : {</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ecx</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 393  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm1<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm2<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 394  :        const float rsqt = 1.0f / sqrt(l);</span>
<span style="color: #666666; font-style: italic;">; 395  :        vOut[0] = vIn[0] * rsqt;</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movaps  xmm3<span style="color: #339933;">,</span> xmm2
        mulss   xmm3<span style="color: #339933;">,</span> xmm2
        movaps  xmm4<span style="color: #339933;">,</span> xmm1
        mulss   xmm4<span style="color: #339933;">,</span> xmm1
        addss   xmm3<span style="color: #339933;">,</span> xmm4
        movaps  xmm4<span style="color: #339933;">,</span> xmm0
        mulss   xmm4<span style="color: #339933;">,</span> xmm0
        addss   xmm3<span style="color: #339933;">,</span> xmm4
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _l$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3
        sqrtss  xmm4<span style="color: #339933;">,</span> xmm3   <span style="color: #666666; font-style: italic;">;; slow full-precision square root gets stored in xmm4</span>
        movss   xmm3<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> __real@3f800000  <span style="color: #666666; font-style: italic;">;; store 1.0 in xmm3</span>
        divss   xmm3<span style="color: #339933;">,</span> xmm4  <span style="color: #666666; font-style: italic;">;; divide 1.0 / xmm4 to get the reciprocal square root !?!</span>
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _rsqt$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3
&nbsp;
<span style="color: #666666; font-style: italic;">; 396  :        vOut[1] = vIn[1] * rsqt;</span>
<span style="color: #666666; font-style: italic;">; 397  :        vOut[2] = vIn[2] * rsqt;</span>
<span style="color: #666666; font-style: italic;">; 398  :        return rsqt * l;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _rsqt$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        mulss   xmm2<span style="color: #339933;">,</span> xmm3
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _l$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        mulss   xmm1<span style="color: #339933;">,</span> xmm3
        mulss   xmm0<span style="color: #339933;">,</span> xmm3
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm1
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm0
&nbsp;
<span style="color: #666666; font-style: italic;">; 399  : }</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ecx</span>
        <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">ENDP</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p></div></p>
<p>Vector normalize, hand-written SSE scalar by intrinsics</p>
<div class="ddet_div" id="ddet224551765">
<p><u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// SSE scalar reciprocal sqrt using rsqrt op, plus one Newton-Raphson iteration</span>
<span style="color: #0000ff;">inline</span> __m128 SSERSqrtNR<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">const</span> __m128 x <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	__m128 recip <span style="color: #000080;">=</span> _mm_rsqrt_ss<span style="color: #008000;">&#40;</span> x <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>  <span style="color: #666666;">// &quot;estimate&quot; opcode</span>
	<span style="color: #0000ff;">const</span> <span style="color: #0000ff;">static</span> __m128 three <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// aligned consts for fast load</span>
	<span style="color: #0000ff;">const</span> <span style="color: #0000ff;">static</span> __m128 half <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
	__m128 halfrecip <span style="color: #000080;">=</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> half, recip <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	__m128 threeminus_xrr <span style="color: #000080;">=</span> _mm_sub_ss<span style="color: #008000;">&#40;</span> three, _mm_mul_ss<span style="color: #008000;">&#40;</span> x, _mm_mul_ss <span style="color: #008000;">&#40;</span> recip, recip <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> halfrecip, threeminus_xrr <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
&nbsp;
<span style="color: #0000ff;">inline</span> __m128 SSE_ScalarTestNormalizeFast<span style="color: #008000;">&#40;</span>  <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        __m128 x <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        __m128 y <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        __m128 z <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #0000ff;">const</span> __m128 l <span style="color: #000080;">=</span>  <span style="color: #666666;">// compute x*x + y*y + z*z</span>
                _mm_add_ss<span style="color: #008000;">&#40;</span>
                 _mm_add_ss<span style="color: #008000;">&#40;</span> _mm_mul_ss<span style="color: #008000;">&#40;</span>x,x<span style="color: #008000;">&#41;</span>,
                             _mm_mul_ss<span style="color: #008000;">&#40;</span>y,y<span style="color: #008000;">&#41;</span>
                            <span style="color: #008000;">&#41;</span>,
                 _mm_mul_ss<span style="color: #008000;">&#40;</span> z, z <span style="color: #008000;">&#41;</span>
                <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
&nbsp;
        <span style="color: #0000ff;">const</span> __m128 rsqt <span style="color: #000080;">=</span> SSERSqrtNR<span style="color: #008000;">&#40;</span> l <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, x <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, y <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, z <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> l , rsqt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
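
<p>To see what the single refinement step in SSERSqrtNR buys, here is a minimal portable sketch of the same arithmetic in plain scalar C++, without intrinsics. The helper names RefineRSqrt and RSqrtRelativeError are hypothetical and not part of the timed code, and the caller-supplied estimate merely stands in for whatever the rsqrtss opcode would return.</p>

```cpp
#include <cmath>

// One Newton-Raphson step for y ~= 1/sqrt(x), the same arithmetic as SSERSqrtNR:
//   y' = 0.5 * y * (3 - x * y * y)
// "estimate" is an assumed stand-in for the hardware rsqrt estimate.
inline float RefineRSqrt(float x, float estimate)
{
    return 0.5f * estimate * (3.0f - x * estimate * estimate);
}

// Relative error of an approximation to 1/sqrt(x), measured against
// a double-precision reference value.
inline double RSqrtRelativeError(float x, float approx)
{
    const double exact = 1.0 / std::sqrt(static_cast<double>(x));
    return std::fabs(static_cast<double>(approx) - exact) / exact;
}
```

<p>For example, with x = 9 and a deliberately rough starting estimate of 0.33, one call to RefineRSqrt lands much closer to 1/3 than the estimate did. Multiplying the refined value by x then gives roughly 3, which is why the normalize functions can return l * rsqt as the vector length.</p>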

<p><u>Assembly</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">PROC</span> <span style="color: #666666; font-style: italic;">; SSE_ScalarTestNormalizeFast, COMDAT</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ebp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">esp</span>
    <span style="color: #00007f; font-weight: bold;">and</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000ff;">16</span>                                <span style="color: #666666; font-style: italic;">; fffffff0H</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    movss   xmm3<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    movaps  xmm7<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?three@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    movaps  xmm2<span style="color: #339933;">,</span> xmm0
    movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movaps  xmm4<span style="color: #339933;">,</span> xmm0
    movaps  xmm0<span style="color: #339933;">,</span> xmm2
    mulss   xmm0<span style="color: #339933;">,</span> xmm2
    movaps  xmm1<span style="color: #339933;">,</span> xmm3
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    addss   xmm0<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> xmm4
    mulss   xmm1<span style="color: #339933;">,</span> xmm4
    addss   xmm0<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> xmm0
    rsqrtss xmm1<span style="color: #339933;">,</span> xmm1
    movaps  xmm5<span style="color: #339933;">,</span> xmm1
    mulss   xmm1<span style="color: #339933;">,</span> xmm5
    movaps  xmm6<span style="color: #339933;">,</span> xmm0
    mulss   xmm6<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?half@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1<span style="color: #339933;">,</span> xmm5
    subss   xmm7<span style="color: #339933;">,</span> xmm6
    mulss   xmm1<span style="color: #339933;">,</span> xmm7
    movaps  xmm5<span style="color: #339933;">,</span> xmm1
    mulss   xmm5<span style="color: #339933;">,</span> xmm2
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm5
    movaps  xmm2<span style="color: #339933;">,</span> xmm1
    mulss   xmm2<span style="color: #339933;">,</span> xmm3
&nbsp;
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
    movaps  xmm2<span style="color: #339933;">,</span> xmm1
    mulss   xmm2<span style="color: #339933;">,</span> xmm4
&nbsp;
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
&nbsp;
    mulss   xmm0<span style="color: #339933;">,</span> xmm1
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">ENDP</span> <span style="color: #666666; font-style: italic;">; SSE_ScalarTestNormalizeFast</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p></div></p>
<p>Vector normalize, hand-written SSE SIMD by intrinsics</p>
<div class="ddet_div" id="ddet1689219768">
<p><u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">inline</span> __m128 SSE_SIMDTestNormalizeFast<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn  <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        <span style="color: #666666;">// load as a SIMD vector; this reads four floats, so vIn&#91;3&#93; must be readable</span>
        <span style="color: #0000ff;">const</span> __m128 vec <span style="color: #000080;">=</span> _mm_loadu_ps<span style="color: #008000;">&#40;</span>vIn<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #666666;">// compute a dot product by computing the square, and</span>
        <span style="color: #666666;">// then rotating the vector and adding, so that the</span>
        <span style="color: #666666;">// dot ends up in the low term (used by the scalar ops)</span>
        __m128 dot <span style="color: #000080;">=</span> _mm_mul_ps<span style="color: #008000;">&#40;</span> vec, vec <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #666666;">// rotate so that y^2 lands in the low word, then add it to x^2</span>
        __m128 rotated <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> dot, dot, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>,<span style="color: #0000dd;">3</span>,<span style="color: #0000dd;">2</span>,<span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// YZWX ( shuffle macro is high to low word )</span>
        dot <span style="color: #000080;">=</span> _mm_add_ss<span style="color: #008000;">&#40;</span> dot, rotated <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// x^2 + y^2 in the low word</span>
        rotated <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> rotated, rotated, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>,<span style="color: #0000dd;">3</span>,<span style="color: #0000dd;">2</span>,<span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// ZWXY</span>
        dot <span style="color: #000080;">=</span> _mm_add_ss<span style="color: #008000;">&#40;</span> dot, rotated <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// x^2 + y^2 + z^2 in the low word</span>
&nbsp;
        __m128 recipsqrt <span style="color: #000080;">=</span> SSERSqrtNR<span style="color: #008000;">&#40;</span> dot <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// contains reciprocal square root in low term</span>
        recipsqrt <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> recipsqrt, recipsqrt, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// broadcast low term to all words</span>
&nbsp;
        <span style="color: #666666;">// multiply 1/sqrt(dotproduct) against all vector components, and write back</span>
        <span style="color: #0000ff;">const</span> __m128 normalized <span style="color: #000080;">=</span> _mm_mul_ps<span style="color: #008000;">&#40;</span> vec, recipsqrt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_storeu_ps<span style="color: #008000;">&#40;</span>vOut, normalized<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> dot , recipsqrt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
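
<p>The shuffle immediate is easier to follow with a scalar model. The sketch below is an illustration only, assuming the usual _MM_SHUFFLE bit packing and shufps word selection; Vec4, ShuffleImm, Shuffle and DotLowWord are hypothetical names, not part of the timed code.</p>

```cpp
#include <array>

using Vec4 = std::array<float, 4>;   // index 0 is the low word (x)

// Same bit packing as the _MM_SHUFFLE macro: arguments run from high word to low word.
constexpr int ShuffleImm(int w3, int w2, int w1, int w0)
{
    return (w3 << 6) | (w2 << 4) | (w1 << 2) | w0;
}

// Scalar model of shufps: the two low result words are picked from a,
// the two high result words from b, each selected by a 2-bit field of imm.
inline Vec4 Shuffle(const Vec4& a, const Vec4& b, int imm)
{
    return Vec4{ a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3] };
}

// Model of the rotate-and-add in SSE_SIMDTestNormalizeFast.
// Only the low word of the running sum is tracked, as with _mm_add_ss.
inline float DotLowWord(const Vec4& v)
{
    const int rot = ShuffleImm(0, 3, 2, 1);
    const Vec4 sq{ v[0] * v[0], v[1] * v[1], v[2] * v[2], v[3] * v[3] };
    Vec4 rotated = Shuffle(sq, sq, rot);        // YZWX
    const float low = sq[0] + rotated[0];       // x^2 + y^2
    rotated = Shuffle(rotated, rotated, rot);   // ZWXY
    return low + rotated[0];                    // x^2 + y^2 + z^2
}
```

<p>ShuffleImm(0, 3, 2, 1) evaluates to 57, the constant that appears in the shufps instructions of the compiled output. Note that the fourth input word never reaches the low-word sum, so whatever sits after the three vector components does not affect the computed length.</p>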

<p><u>Assembly</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">PROC</span>   <span style="color: #666666; font-style: italic;">; SSE_SIMDTestNormalizeFast, COMDAT</span>
&nbsp;
&nbsp;
    <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ebp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">esp</span>
    <span style="color: #00007f; font-weight: bold;">and</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000ff;">16</span>                                <span style="color: #666666; font-style: italic;">; fffffff0H</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movups  xmm2<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #666666; font-style: italic;">;; load the input vector</span>
    movaps  xmm5<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?three@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B <span style="color: #666666; font-style: italic;">;; load the constant &quot;3&quot;</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ecx</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movaps  xmm0<span style="color: #339933;">,</span> xmm2
    mulps   xmm0<span style="color: #339933;">,</span> xmm2
    movaps  xmm1<span style="color: #339933;">,</span> xmm0
    shufps  xmm1<span style="color: #339933;">,</span> xmm0<span style="color: #339933;">,</span> <span style="color: #0000ff;">57</span>	<span style="color: #666666; font-style: italic;">; shuffle to YZWX</span>
    addss   xmm0<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; add Y to low word of xmm0</span>
    shufps  xmm1<span style="color: #339933;">,</span> xmm1<span style="color: #339933;">,</span> <span style="color: #0000ff;">57</span>	<span style="color: #666666; font-style: italic;">; shuffle to ZWXY</span>
    addss   xmm0<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; add Z to low word of xmm0</span>
&nbsp;
    movaps  xmm1<span style="color: #339933;">,</span> xmm0        
    rsqrtss xmm1<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; reciprocal square root estimate</span>
    movaps  xmm3<span style="color: #339933;">,</span> xmm1
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    movaps  xmm4<span style="color: #339933;">,</span> xmm0
    mulss   xmm4<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?half@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    subss   xmm5<span style="color: #339933;">,</span> xmm4
    mulss   xmm1<span style="color: #339933;">,</span> xmm5      <span style="color: #666666; font-style: italic;">; Newton-Raphson finishes here; 1/sqrt(dot) is in xmm1's low word</span>
&nbsp;
    shufps  xmm1<span style="color: #339933;">,</span> xmm1<span style="color: #339933;">,</span> <span style="color: #0000ff;">0</span>   <span style="color: #666666; font-style: italic;">; broadcast so that xmm1 has 1/sqrt(dot) in all words</span>
    movaps  xmm3<span style="color: #339933;">,</span> xmm1
    mulps   xmm3<span style="color: #339933;">,</span> xmm2      <span style="color: #666666; font-style: italic;">; multiply all words of original vector by 1/sqrt(dot)</span>
    movups  XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3   <span style="color: #666666; font-style: italic;">; unaligned save to memory</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">; return dot * 1 / sqrt(dot) == sqrt(dot) == length of vector</span>
    mulss   xmm0<span style="color: #339933;">,</span> xmm1
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">ENDP</span>   <span style="color: #666666; font-style: italic;">; SSE_SIMDTestNormalizeFast</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p></div></p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Timing square root</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/#comments</comments>
		<pubDate>Fri, 16 Oct 2009 17:05:14 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234</guid>
		<description><![CDATA[The square root is one of those basic mathematical operations that&#8217;s totally ubiquitous in any game&#8217;s source code, and yet also has many competing implementations and performance superstitions around it. The compiler offers a sqrt() builtin function, and so do some CPUs, but some programmers insist on writing their own routines in software. And often [...]]]></description>
			<content:encoded><![CDATA[<p>The square root is one of those basic mathematical operations that&#8217;s totally ubiquitous in any game&#8217;s source code, and yet also has many competing implementations and performance superstitions around it. The compiler offers a sqrt() builtin function, and so do some CPUs, but some programmers insist on writing their own routines in software. And often it&#8217;s really the reciprocal square root you want, for normalizing a vector, or trigonometry. But I&#8217;ve never had a clear answer for which technique is really fastest, or exactly what accuracy-vs-speed tradeoffs we make with &#8220;estimating&#8221; intrinsics.</p>
<p>What is the fastest way to compute a square root? It would seem that if the CPU has <a href="http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc117.htm">a native square-root opcode</a>, there&#8217;s no beating the hardware, but is that really true?</p>
<p>Such questions vex me, so I went and <em>measured</em> all the different means of computing the square root of a scalar single-precision floating point number that I could think of. I ran trials on my Intel Core 2 and on the Xenon, comparing each technique for both speed and accuracy, and some of the results were surprising!</p>
<p>In this article I&#8217;ll describe my results for the Intel hardware; next week I&#8217;ll turn to the Xenon PPC.</p>
<h2>Experimental setup</h2>
<p>I&#8217;ll post the whole source code for my tests elsewhere, but basically each of these trials consists of iterating N times over an array of floating point numbers, computing the square root of each one and writing the result to a second output array.</p>
<p>In pseudocode:</p>
<div class="ddet_div" id="ddet1824883505">

<div class="wp_syntax"><table><tr><td class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> TestedFunction<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> x <span style="color: #008000;">&#41;</span> 
<span style="color: #008000;">&#123;</span>
   <span style="color: #0000ff;">return</span> <span style="color: #0000dd;">sqrt</span><span style="color: #008000;">&#40;</span>x<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// one of many implementations..</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #0000ff;">void</span> TimeSquareRoot<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
   <span style="color: #0000ff;">float</span> numbersIn<span style="color: #008000;">&#91;</span> ARRAYSIZE <span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>    <span style="color: #666666;">// ARRAYSIZE chosen so that both arrays </span>
   <span style="color: #0000ff;">float</span> numbersOut<span style="color: #008000;">&#91;</span> ARRAYSIZE <span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>  <span style="color: #666666;">// fit in L1 cache</span>
   <span style="color: #666666;">// assume that numbersIn is filled with random positive numbers, and both arrays are </span>
   <span style="color: #666666;">// prefetched to cache...</span>
   StartClockCycleCounter<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
   <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> NUMITERATIONS <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span> 
      <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> j <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> j <span style="color: #000080;">&lt;</span> ARRAYSIZE <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>j <span style="color: #008000;">&#41;</span> <span style="color: #666666;">// in some cases I unroll this loop</span>
      <span style="color: #008000;">&#123;</span>
         numbersOut<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> TestedFunction<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>j<span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
      <span style="color: #008000;">&#125;</span>
   StopClockCycleCounter<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
   <span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;%.3f millisec for %d floats<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>,  
             ClockCycleCounterInMilliseconds<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>, ARRAYSIZE <span style="color: #000040;">*</span> NUMITERATIONS <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
&nbsp;
   <span style="color: #666666;">// now measure accuracy</span>
   <span style="color: #0000ff;">float</span> error <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
   <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> ARRAYSIZE <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span>
   <span style="color: #008000;">&#123;</span>
       <span style="color: #0000ff;">double</span> knownAccurate <span style="color: #000080;">=</span> PerfectSquareRoot<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
       error <span style="color: #000040;">+</span><span style="color: #000080;">=</span> <span style="color: #0000dd;">fabs</span><span style="color: #008000;">&#40;</span> numbersOut<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span> <span style="color: #000040;">-</span> knownAccurate  <span style="color: #008000;">&#41;</span> <span style="color: #000040;">/</span> knownAccurate <span style="color: #008080;">;</span>
   <span style="color: #008000;">&#125;</span>
   error <span style="color: #000040;">/</span><span style="color: #000080;">=</span> ARRAYSIZE <span style="color: #008080;">;</span>
   <span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;Average error: %.5f%%<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, error <span style="color: #000040;">*</span> <span style="color:#800080;">100.0f</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

<p></div></p>
<p>In each case I verified that the compiler was not eliding any computations (it really was performing ARRAYSIZE &times; NUMITERATIONS many square roots), that it was properly inlining the tested function, and that all the arrays fit into L1 cache so that memory latency wasn&#8217;t affecting the results. I also <strong>only tested <em>scalar</em> square root functions</strong> &mdash; SIMD would clearly be the fastest way of working on large contiguous arrays, but I wanted to measure the different techniques of computing <b>one</b> square root at a time, as is usually necessary in gameplay code. </p>
<p>Because some of the speedup techniques involve trading off accuracy, I compared the resulting numbers against the perfectly-accurate double-precision square root library routine to get an average error for each test run. </p>
<p>And I performed each run multiple times with different data, averaging the final results together.</p>
<h2>x86 results</h2>
<p>I ran my tests on a 2.66GHz Intel Core 2 workstation. An x86 chip actually has two different means of performing scalar floating-point math. By default, the compiler uses the old <a href="http://en.wikipedia.org/wiki/X87">x87 FPU</a>, which dates back to 1980 with a stack-based instruction set like one of those old <a href="http://en.wikipedia.org/wiki/Reverse_Polish_notation">RPN calculators</a>. In 1999, Intel introduced <a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE</a>, which added a variety of new instructions to the processor. SSE is mostly thought of as a SIMD instruction set (for operating on four 32-bit floats in a single op), but it also includes an entire set of <em>scalar</em> floating point instructions that operate on only one float at a time. These scalar instructions are faster than their x87 equivalents and were meant to replace the old x87 pathway. However, both the <a href="http://msdn.microsoft.com/en-us/library/7t5yh4fd(VS.80).aspx">MSVC</a> and <a href="http://gcc.gnu.org/onlinedocs/gcc-4.0.0/gcc/i386-and-x86_002d64-Options.html">GCC</a> compilers default to exclusively using the x87 for scalar math, so unless you edit the &#8220;code generation&#8221; project properties panel (MSVC) or provide <a href="http://gcc.gnu.org/onlinedocs/gcc-4.0.0/gcc/i386-and-x86_002d64-Options.html">a cryptic command-line option (<code>-mfpmath=sse</code> on GCC)</a>, you&#8217;ll be stuck with code that uses the old slow way.</p>
<p>I timed the following techniques for square root:  </p>
<ol>
<li>The compiler&#8217;s built in <code>sqrt()</code> function (which compiled to the x87 FSQRT opcode)</li>
<li><a href="http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc300.htm">The SSE &#8220;scalar single square root&#8221; opcode <code>sqrtss</code></a>, which <a href="http://msdn.microsoft.com/en-us/vstudio/default.aspx">MSVC</a> emits if you use the <code>_mm_sqrt_ss</code> intrinsic or if you set <code>/arch:SSE2</code></li>
<li><a href="http://www.beyond3d.com/content/articles/8/">The &#8220;magic number&#8221; approximation technique</a> <a href="http://www.beyond3d.com/content/articles/15/">invented by Greg Walsh</a> at <a href="http://en.wikipedia.org/wiki/Ardent_Computer">Ardent Computer</a> and made famous by John Carmack in the Quake III source code.</li>
<li>Taking the estimated <em>reciprocal</em> square root of <i>a</i> via the SSE opcode <a href="http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc281.htm"><code>rsqrtss</code></a>, and multiplying it against <i>a</i> to get the square root via the identity x / &radic;<span style="text-decoration:overline">x</span> = &radic;<span style="text-decoration:overline">x</span>.</li>
<li>Method (4), with one additional step of <a href="http://en.wikipedia.org/wiki/Newtons_method">Newton-Raphson iteration</a> to improve accuracy.</li>
<li>Method (5), with the loop at line 13 of the pseudocode above unrolled to process four floats per iteration.</li>
</ol>
<p>I also tested three ways of getting the <em>reciprocal</em> square root: <a href="http://www.beyond3d.com/content/articles/8/">Carmack&#8217;s technique</a>, the <code>rsqrtss</code> SSE op via compiler intrinsic, and <code>rsqrtss</code> with one Newton-Raphson step.</p>
<p>The results, for 4096 loops over 4096 single-precision floats, were:</p>
<div align="center" >
<b><u>SQUARE ROOT</u></b></p>
<table border class="padded">
<tr>
<th>Method</th>
<th>Total time</th>
<th>Time per float</th>
<th>Avg Error</th>
</tr>
<tr>
<td>Compiler <code>sqrt(x)</code> /<br />x87 FPU <code>FSQRT</code></td>
<td>404.029ms</td>
<td>24ns</td>
<td>0.0000%</td>
</tr>
<tr>
<td>SSE intrinsic <code>sqrtss</code></td>
<td>200.395ms</td>
<td>11.9ns</td>
<td>0.0000%</td>
</tr>
<tr>
<td>Carmack&#8217;s Magic Number rsqrt * x</td>
<td>72.682ms</td>
<td>4.33ns</td>
<td>0.0990%</td>
</tr>
<tr>
<td>SSE <code>rsqrtss</code> * x</td>
<td>20.495ms</td>
<td>1.22ns</td>
<td>0.0094%</td>
</tr>
<tr>
<td>SSE <code>rsqrtss</code> * x<br />with one NR step</td>
<td>53.401ms</td>
<td>3.18ns</td>
<td>0.0000%</td>
</tr>
<tr>
<td>SSE <code>rsqrtss</code> * x<br />with one NR step, unrolled by four</td>
<td>48.701ms</td>
<td>2.90ns</td>
<td>0.0000%</td>
</tr>
</table>
</div>
<p></p>
<div align="center">
<b><u>RECIPROCAL SQRT</u></b></p>
<table border class="padded">
<tr>
<th>Method</th>
<th>Total time</th>
<th>Time per float</th>
<th>Avg Error</th>
</tr>
<tr>
<td>Carmack&#8217;s Magic Number rsqrt </td>
<td>59.378ms</td>
<td>3.54ns</td>
<td>0.0990%</td>
</tr>
<tr>
<td>SSE <code>rsqrtss</code></td>
<td>14.202ms</td>
<td>0.85ns</td>
<td>0.0094%</td>
</tr>
<tr>
<td>SSE <code>rsqrtss</code> <br />with one NR step</td>
<td>45.952ms</td>
<td>2.74ns</td>
<td>0.0000%</td>
</tr>
</table>
</div>
<h2>Discussion</h2>
<p>Looking at these results, it&#8217;s clear that there&#8217;s a dramatic difference in performance between different approaches to performing square root; which one you choose really can have a significant impact on framerate and accuracy. My conclusions are:</p>
<p><b>Don&#8217;t trust the compiler to do the right thing.</b> The received wisdom on performance in math functions is usually &#8220;don&#8217;t reinvent the wheel; the library and compiler are smart and optimal.&#8221; We see here that <em>this is completely wrong</em>, and in fact calling the library <code>sqrt(x)</code> causes the compiler to do exactly the worst possible thing. The compiler&#8217;s output for <code> y = sqrt(x); </code> is slower than every other approach tested here: half the speed of the SSE <code>sqrtss</code> opcode, and roughly twenty times slower than the raw <code>rsqrtss</code> estimate.</p>
<p><b>The x87 FPU is really very slow.</b>  Intel has been trying to deprecate the old x87 FPU instructions for a decade now, but no compiler in the business defaults to using the new, faster SSE scalar opcodes in place of emulating a thirty-year-old <a href="http://en.wikipedia.org/wiki/Intel_8087">8087</a>. In the case of  <code> y = sqrt(x) </code>, by default MSVC and GCC emit something like</p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #0000ff; font-weight: bold;">fld</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span>
<span style="color: #0000ff; font-weight: bold;">fsqrt</span>  <span style="color: #666666; font-style: italic;">;; slow x87 flop</span>
<span style="color: #0000ff; font-weight: bold;">fstp</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span></pre></div></div>

<p>But if I set the <code>/arch:SSE2</code> option flag, telling the compiler &#8220;assume this code will run on a machine with SSE2&#8221;, it will instead emit the following, which is about twice as fast.</p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">sqrtss xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span>  <span style="color: #666666; font-style: italic;">;; faster SSE scalar flop</span>
movss <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm0</pre></div></div>

<p>There was a time when not every PC on the market had SSE2, meaning that there was some sense in using the older, more backwards-compatible operations, but that time has long since passed. SSE2 was introduced <em>in 2001 with the Pentium 4</em>. No one is ever going to try to play your game on a machine that doesn&#8217;t support it. If your customer&#8217;s PC has DirectX 9, it has SSE2.</p>
<p><b>You <em>can</em> beat the hardware.</b> The most surprising thing about these results for me was that taking a reciprocal square root and multiplying it by the input is faster than using the native sqrt opcode, by an order of magnitude. Even Carmack&#8217;s trick, which I had assumed was obsolete in an age of deep pipelines and load-hit-stores, proved faster than the native SSE scalar op. Part of this is that the reciprocal sqrt opcode <code>rsqrtss</code> is an estimate, accurate to twelve bits; but <a href="http://mathworld.wolfram.com/NewtonsMethod.html">it only takes one step</a> of Newton&#8217;s Method to converge that estimate to an accuracy of roughly 24 bits while still being nearly four times faster than the hardware square root opcode.  </p>
<p>The question that then bothered me was, <em>why</em> is SSE&#8217;s built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations? The first hint came when I tried unrolling the loop so that it performed four ops inside the inner for():</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">   <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> NUMITERATIONS <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span> 
      <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> j <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> j <span style="color: #000080;">&lt;</span> ARRAYSIZE <span style="color: #008080;">;</span> j <span style="color: #000040;">+</span><span style="color: #000080;">=</span> <span style="color: #0000dd;">4</span>  <span style="color: #008000;">&#41;</span> <span style="color: #666666;">// unrolled: four square roots per iteration</span>
      <span style="color: #008000;">&#123;</span>
         numbersOut<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> TestedSqrt<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
         numbersOut<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> TestedSqrt<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
         numbersOut<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> TestedSqrt<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
         numbersOut<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">3</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> TestedSqrt<span style="color: #008000;">&#40;</span> numbersIn<span style="color: #008000;">&#91;</span>j <span style="color: #000040;">+</span> <span style="color: #0000dd;">3</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>  
      <span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #666666;">// two implementations of</span></pre></div></div>

<p>As you can see from the results above, when TestedSqrt was the <code>rsqrtss</code> followed by a multiply and one step of Newton iteration, unrolling the loop this way provided a modest 8.8% improvement in speed. But when I tried the same thing with the &#8220;precise square root&#8221; op <code>sqrtss</code>, the difference was negligible:</p>
<blockquote><pre>
SSE sqrt: 200.395 msec
average error 0.0000%

SSE sqrt, unrolled four: 196.741 msec
average error 0.0000%
</pre>
</blockquote>
<p>What this suggests is that unrolling the loop this way allowed the four rsqrt paths to be <a href="http://en.wikipedia.org/wiki/Instruction_pipeline">pipelined</a>, so that while an individual <code>rsqrtss</code> might take 6 cycles to execute before its result was ready, other work could proceed during that time so that the four square root operations in the loop overlapped. On the other hand, the non-estimated <code>sqrtss</code> op apparently cannot be overlapped; one sqrt must finish before the next can begin. A look at the <a href="http://www.intel.com/products/processor/manuals/">Intel® 64 and IA-32 Architectures Optimization Reference Manual</a> confirms: <code>sqrtss</code> is an unpipelined instruction. </p>
<p><b>Pipelined operations make a big difference.</b> When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it&#8217;s like putting the handbrake on your processor. You can identify nonpipelined operations in appendix C of the Optimization Reference Manual as the ones that have a throughput equal to latency and greater than 4 cycles.<br />
<!--<br />
Unpipelined operations on the x86 tend to be those that do too much work to fit inside about six cycles. The FSIN opcode, for example, has to compute a Taylor series, which takes many individual microoperations inside the CPU. When such an instruction is encountered, the CPU has to fetch a tiny subroutine from ROM and fully execute it before allowing any subsequent instructions to enter the pipeline.<br />
--><br />
In the case of <code>sqrtss</code>, the processor is probably doing the same thing internally that I&#8217;m doing in my &#8220;fast&#8221; function: taking an estimated reciprocal square root, improving it with Newton&#8217;s method, and then multiplying it by the input parameter.  Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it&#8217;s all done. But if you break up the work so that each of those steps is its own instruction, then the CPU can pipeline them all, and get a much higher <em>throughput</em> even if the latency is the same.<br />
<!--<br />
It's like the difference between getting a separate washing machine and dryer so you can run two loads at once, versus one combination machine that has to wash and dry each load before starting the next one. The total time to wash any one garment from start to finish is the same, but the total volume you can process in the space of three hours is different.<br />
--><br />
Pipeline latency and microcoded instructions are <em>a much bigger</em> deal on the 360 and PS3, whose CPUs don&#8217;t reorder operations to hide bubbles; there the benefit from unrolling is much greater, as you&#8217;ll see next week. </p>
<h2>Conclusion</h2>
<p>Not all square root functions are created equal, and writing your own can have very real performance benefits over trusting the compiler to optimize your code for you (at which it fails miserably). In many cases you can trade off some accuracy for a massive increase in speed, but even in those places where you need full accuracy, writing your own function that uses the <code>rsqrtss</code> op followed by one step of Newton&#8217;s method can still give you essentially full single-precision accuracy (a 24-bit significand) at a 4x-8x improvement over what you will get with the built-in <code>sqrtf()</code> function. </p>
<p>And if you have <em>lots</em> of numbers you need to square root, of course SIMD (<code>rsqrtps</code>) will be faster still. </p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Code For Testing Virtual Function Speed</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/code-for-testing-virtual-function-speed/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/01/19/code-for-testing-virtual-function-speed/#comments</comments>
		<pubDate>Tue, 20 Jan 2009 05:26:30 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=213</guid>
		<description><![CDATA[I&#8217;ve just updated my prior article on virtual function overhead with corrected timing numbers &#8212; I hadn&#8217;t noticed that my CPU cycle counts were only 32 bits wide so timings of more than 2secs would wrap back around to zero. 
If you want to run this test on your own hardware, I&#8217;ve put my code [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just updated my <a href="http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/">prior article on virtual function overhead</a> with corrected timing numbers. I hadn&#8217;t noticed that my CPU cycle counts were only 32 bits wide, so timings of more than two seconds would wrap back around to zero. </p>
<p>If you want to run this test on your own hardware, I&#8217;ve put my code below the jump. You&#8217;ll have to build your own <code>CFastTimer</code> class, but it should be pretty clear what it does &#8212; it simply reads out of the CPU clock-cycle counter and computes a difference.</p>
<p><span id="more-213"></span></p>
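<p>If you need a starting point for the timer, something along these lines should do. This is a hypothetical stand-in, not my actual <code>CFastTimer</code>: the class and method names are invented for illustration, and it assumes an x86 compiler that exposes the <code>__rdtsc</code> intrinsic. Note that it keeps the full 64-bit counter value, which avoids the 32-bit wraparound mentioned above.</p>

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // __rdtsc on MSVC
#else
#include <x86intrin.h>  // __rdtsc on GCC and Clang
#endif

// Hypothetical cycle-counting timer; the names are illustrative only.
class CycleTimer
{
public:
    CycleTimer() : m_start( 0 ), m_stop( 0 ) {}

    void Start() { m_start = __rdtsc(); }  // read the CPU time-stamp counter
    void Stop()  { m_stop  = __rdtsc(); }

    // Full 64-bit difference, so long runs do not wrap around to zero.
    std::uint64_t ElapsedCycles() const { return m_stop - m_start; }

    // Converts cycles to milliseconds; cpuHz is the clock rate, supplied by the caller.
    double ElapsedMilliseconds( double cpuHz ) const
    {
        return 1000.0 * static_cast<double>( ElapsedCycles() ) / cpuHz;
    }

private:
    std::uint64_t m_start;
    std::uint64_t m_stop;
};
```

<p>Call <code>Start()</code> before the loop under test and <code>Stop()</code> after it, then read the difference.</p>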
<h2>file 1: class definitions header</h2>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">class</span> TestVector4_Virtual
<span style="color: #008000;">&#123;</span>
<span style="color: #0000ff;">public</span><span style="color: #008080;">:</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> GetX<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> SetX<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> GetY<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> SetY<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> GetZ<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> SetZ<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> GetW<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">virtual</span> <span style="color: #0000ff;">float</span> SetW<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">private</span><span style="color: #008080;">:</span>
	<span style="color: #0000ff;">float</span> x,y,z,w<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">class</span> TestVector4_Direct
<span style="color: #008000;">&#123;</span>
<span style="color: #0000ff;">public</span><span style="color: #008080;">:</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> GetX<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> SetX<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> GetY<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> SetY<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> GetZ<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> SetZ<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> GetW<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	__declspec<span style="color: #008000;">&#40;</span>noinline<span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">float</span> SetW<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">private</span><span style="color: #008080;">:</span>
	<span style="color: #0000ff;">float</span> x,y,z,w<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">class</span> TestVector4_Inline
<span style="color: #008000;">&#123;</span>
<span style="color: #0000ff;">public</span><span style="color: #008080;">:</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> GetX<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> SetX<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> GetY<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> SetY<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> GetZ<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> SetZ<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> GetW<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> SetW<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">private</span><span style="color: #008080;">:</span>
	<span style="color: #0000ff;">float</span> x,y,z,w<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> TestVector4_Inline<span style="color: #008080;">::</span><span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> TestVector4_Inline<span style="color: #008080;">::</span><span style="color: #007788;">SetX</span><span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x <span style="color: #000080;">=</span> in<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #ff0000; font-style: italic;">/* and so on for GetY, Z, W... */</span></pre></div></div>

<h2>file 2: class definitions cpp</h2>
<p>These functions are defined here to prevent the compiler from inlining them when they&#8217;re used.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">float</span> TestVector4_Virtual<span style="color: #008080;">::</span><span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #0000ff;">float</span> TestVector4_Virtual<span style="color: #008080;">::</span><span style="color: #007788;">SetX</span><span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x <span style="color: #000080;">=</span> in<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #ff0000; font-style: italic;">/* and so on for y,z,w... */</span>
&nbsp;
<span style="color: #0000ff;">float</span> TestVector4_Direct<span style="color: #008080;">::</span><span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #0000ff;">const</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #0000ff;">float</span> TestVector4_Direct<span style="color: #008080;">::</span><span style="color: #007788;">SetX</span><span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> in <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">return</span> x <span style="color: #000080;">=</span> in<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
<span style="color: #ff0000; font-style: italic;">/* and so on for y,z,w... */</span></pre></div></div>

<h2>file 3: test loop</h2>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #339900;">#define ARRAY_SIZE 1024</span>
<span style="color: #339900;">#define TEST_ITERATIONS 10000</span>
&nbsp;
<span style="color: #0000ff;">template</span> <span style="color: #000080;">&lt;</span><span style="color: #0000ff;">class</span> T<span style="color: #000080;">&gt;</span>
<span style="color: #0000ff;">void</span> InitWithRandom<span style="color: #008000;">&#40;</span> T <span style="color: #000040;">*</span>ptr, <span style="color: #0000ff;">int</span> num <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">while</span><span style="color: #008000;">&#40;</span> num <span style="color: #000080;">&gt;</span> <span style="color: #0000dd;">0</span> <span style="color: #008000;">&#41;</span>
	<span style="color: #008000;">&#123;</span>
		ptr<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>SetX<span style="color: #008000;">&#40;</span> RandomFloat<span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color:#800080;">1024.<span style="color: #007788;">f</span></span>, <span style="color:#800080;">1024.0f</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ptr<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>SetY<span style="color: #008000;">&#40;</span> RandomFloat<span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color:#800080;">1024.<span style="color: #007788;">f</span></span>, <span style="color:#800080;">1024.0f</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ptr<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>SetZ<span style="color: #008000;">&#40;</span> RandomFloat<span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color:#800080;">1024.<span style="color: #007788;">f</span></span>, <span style="color:#800080;">1024.0f</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ptr<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>SetW<span style="color: #008000;">&#40;</span> RandomFloat<span style="color: #008000;">&#40;</span><span style="color: #000040;">-</span><span style="color:#800080;">1024.<span style="color: #007788;">f</span></span>, <span style="color:#800080;">1024.0f</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		<span style="color: #000040;">++</span>ptr<span style="color: #008080;">;</span>
		<span style="color: #000040;">--</span>num<span style="color: #008080;">;</span>
	<span style="color: #008000;">&#125;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #0000ff;">template</span> <span style="color: #000080;">&lt;</span><span style="color: #0000ff;">class</span> T<span style="color: #000080;">&gt;</span>
<span style="color: #0000ff;">void</span> SumTest<span style="color: #008000;">&#40;</span> T <span style="color: #000040;">*</span> RESTRICT in1, T <span style="color: #000040;">*</span> RESTRICT in2, T <span style="color: #000040;">*</span> RESTRICT out, <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">int</span> num <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> num <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span>
	<span style="color: #008000;">&#123;</span>
		out<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">SetX</span><span style="color: #008000;">&#40;</span> in1<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">+</span> in2<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		out<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">SetY</span><span style="color: #008000;">&#40;</span> in1<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetY</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">+</span> in2<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetY</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		out<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">SetZ</span><span style="color: #008000;">&#40;</span> in1<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetZ</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">+</span> in2<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetZ</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		out<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">SetW</span><span style="color: #008000;">&#40;</span> in1<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetW</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">+</span> in2<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetW</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #008000;">&#125;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #0000ff;">template</span> <span style="color: #000080;">&lt;</span><span style="color: #0000ff;">class</span> T<span style="color: #000080;">&gt;</span>
<span style="color: #0000ff;">float</span> TestTimings<span style="color: #008000;">&#40;</span> <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #666666;">// set up input and output and preheat the cache</span>
	T A<span style="color: #008000;">&#91;</span> ARRAY_SIZE <span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
	T B<span style="color: #008000;">&#91;</span> ARRAY_SIZE <span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
	T C<span style="color: #008000;">&#91;</span> ARRAY_SIZE <span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
&nbsp;
	InitWithRandom<span style="color: #008000;">&#40;</span> A , ARRAY_SIZE <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	InitWithRandom<span style="color: #008000;">&#40;</span> B , ARRAY_SIZE <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	InitWithRandom<span style="color: #008000;">&#40;</span> C , ARRAY_SIZE <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
	uint64 retval <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
	CFastTimer t1<span style="color: #008080;">;</span>
	<span style="color: #0000ff;">int</span> dontOptimizeThisLoopToNothing <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> TEST_ITERATIONS <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span>
	<span style="color: #008000;">&#123;</span>
		t1.<span style="color: #007788;">Start</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		SumTest<span style="color: #008000;">&#40;</span> A, B, C, ARRAY_SIZE <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		t1.<span style="color: #007788;">End</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		dontOptimizeThisLoopToNothing  <span style="color: #000040;">+</span><span style="color: #000080;">=</span> i<span style="color: #008080;">;</span>
		retval <span style="color: #000040;">+</span><span style="color: #000080;">=</span> t1.<span style="color: #007788;">GetClockCycleDelta</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #008000;">&#125;</span>
	<span style="color: #666666;">// force the compiler to actually use the data so it doesn't optimize away the loop above</span>
	<span style="color: #0000ff;">float</span> ac <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">int</span> i <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> ARRAY_SIZE <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i <span style="color: #008000;">&#41;</span>
	<span style="color: #008000;">&#123;</span>
		ac <span style="color: #000040;">+</span><span style="color: #000080;">=</span> C<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ac <span style="color: #000040;">+</span><span style="color: #000080;">=</span> C<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetY</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ac <span style="color: #000040;">+</span><span style="color: #000080;">=</span> C<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetZ</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		ac <span style="color: #000040;">+</span><span style="color: #000080;">=</span> C<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetW</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #008000;">&#125;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;%f %d<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, ac, dontOptimizeThisLoopToNothing  <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// just ignore these</span>
	<span style="color: #0000ff;">return</span> CyclesToMilliseconds<span style="color: #008000;">&#40;</span>retval<span style="color: #008000;">&#41;</span> <span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #0000ff;">void</span> RunTest<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	<span style="color: #666666;">// get timings for each type</span>
	<span style="color: #0000ff;">float</span> tVirt, tDirect, tInline<span style="color: #008080;">;</span>
	tVirt <span style="color: #000080;">=</span> TestTimings<span style="color: #000080;">&lt;</span> TestVector4_Virtual <span style="color: #000080;">&gt;</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	tDirect <span style="color: #000080;">=</span> TestTimings<span style="color: #000080;">&lt;</span> TestVector4_Direct <span style="color: #000080;">&gt;</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	tInline <span style="color: #000080;">=</span> TestTimings<span style="color: #000080;">&lt;</span> TestVector4_Inline <span style="color: #000080;">&gt;</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>%d iterations over %d vectors<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, TEST_ITERATIONS , ARRAY_SIZE <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;virtual: %.3f ms<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, tVirt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;direct: %.3f ms<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, tDirect <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span> <span style="color: #FF0000;">&quot;inline: %.3f ms<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, tInline <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
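
<p>The test loop leans on a few helpers that aren&#8217;t shown in the listing: <code>RESTRICT</code>, <code>uint64</code>, <code>RandomFloat</code>, <code>CFastTimer</code> and <code>CyclesToMilliseconds</code>. If you want to try the loop on a desktop compiler, a minimal set of stand-ins might look like the sketch below. These are assumptions, not the originals: in particular, the timer here counts <code>std::chrono::steady_clock</code> nanoseconds rather than CPU clock cycles, so each &#8220;cycle&#8221; is really a nanosecond.</p>

```cpp
// Hypothetical stand-ins for the helpers the test loop uses but does not show.
// None of these are the original implementations.
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>

// RESTRICT: assumed to map to the compiler's no-alias pointer qualifier.
#if defined(_MSC_VER)
#define RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#define RESTRICT __restrict__
#else
#define RESTRICT
#endif

typedef std::uint64_t uint64;

// Uniform random float between lo and hi.
inline float RandomFloat( float lo, float hi )
{
	assert( lo < hi );
	static std::mt19937 rng( 12345u ); // fixed seed so runs are repeatable
	std::uniform_real_distribution<float> dist( lo, hi );
	return dist( rng );
}

// Stand-in timer: reports elapsed steady_clock nanoseconds, not CPU cycles.
class CFastTimer
{
public:
	void Start() { m_start = std::chrono::steady_clock::now(); }
	void End()   { m_end = std::chrono::steady_clock::now(); }
	uint64 GetClockCycleDelta() const
	{
		return (uint64) std::chrono::duration_cast<std::chrono::nanoseconds>( m_end - m_start ).count();
	}
private:
	std::chrono::steady_clock::time_point m_start, m_end;
};

// With the timer above, one tick is one nanosecond, so divide by 1e6.
inline float CyclesToMilliseconds( uint64 ticks )
{
	return (float)( (double) ticks / 1.0e6 );
}
```

<p>A wall-clock timer is noisier than a cycle counter, so with these stand-ins it makes more sense to compare the three variants against each other than to read much into the absolute numbers.</p>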

<h2>Assembly output</h2>
<p>And, just in case you&#8217;re curious, here&#8217;s the assembly the compiler generates for the different versions of <code>SumTest</code>:</p>
<h4>Direct Function</h4>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">; Begin code for function: ??$SumTest@VTestVector4_Direct@@@@YAXPIAVTestVector4_Direct@@00H@Z</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 58   : {</span>
&nbsp;
	mflr         r12
	<span style="color: #00007f;">bl</span>           __savegprlr_26
	stfd         fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">40h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	stwu         r1<span style="color: #339933;">,-</span><span style="color: #0000ff;">90h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
<span style="color: #339933;">.</span>endprolog
$M89780<span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 59   : 	for ( int i = 0; i &lt; num ; ++i )</span>
&nbsp;
	cmpwi        cr6<span style="color: #339933;">,</span>r6<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span>
	ble          cr6<span style="color: #339933;">,</span>$LN1@SumTest@<span style="color: #0000ff;">2</span>
	mr           r31<span style="color: #339933;">,</span>r4
	subf         r27<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r3
	subf         r26<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r5
	mr           r28<span style="color: #339933;">,</span>r6
$LL3@SumTest@<span style="color: #0000ff;">2</span><span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 60   : 	{</span>
<span style="color: #666666; font-style: italic;">; 61   : 		out[i].SetX( in1[i].GetX() + in2[i].GetX() );</span>
&nbsp;
	<span style="color: #00007f; font-weight: bold;">add</span>          r30<span style="color: #339933;">,</span>r27<span style="color: #339933;">,</span>r31
	<span style="color: #00007f; font-weight: bold;">add</span>          r29<span style="color: #339933;">,</span>r26<span style="color: #339933;">,</span>r31
	mr           r3<span style="color: #339933;">,</span>r30
	<span style="color: #00007f;">bl</span>           ?GetX@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r31
	fmr          fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?GetX@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r29
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?SetX@TestVector4_Direct@@QAAMM@Z
&nbsp;
<span style="color: #666666; font-style: italic;">; 62   : 		out[i].SetY( in1[i].GetY() + in2[i].GetY() );</span>
&nbsp;
	mr           r3<span style="color: #339933;">,</span>r30
	<span style="color: #00007f;">bl</span>           ?GetY@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r31
	fmr          fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?GetY@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r29
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?SetY@TestVector4_Direct@@QAAMM@Z
&nbsp;
<span style="color: #666666; font-style: italic;">; 63   : 		out[i].SetZ( in1[i].GetZ() + in2[i].GetZ() );</span>
&nbsp;
	mr           r3<span style="color: #339933;">,</span>r30
	<span style="color: #00007f;">bl</span>           ?GetZ@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r31
	fmr          fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?GetZ@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r29
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?<span style="color: #00007f; font-weight: bold;">SetZ</span>@TestVector4_Direct@@QAAMM@Z
&nbsp;
<span style="color: #666666; font-style: italic;">; 64   : 		out[i].SetW( in1[i].GetW() + in2[i].GetW() );</span>
&nbsp;
	mr           r3<span style="color: #339933;">,</span>r30
	<span style="color: #00007f;">bl</span>           ?GetW@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r31
	fmr          fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?GetW@TestVector4_Direct@@QBAMXZ
	mr           r3<span style="color: #339933;">,</span>r29
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f;">bl</span>           ?SetW@TestVector4_Direct@@QAAMM@Z
	addic<span style="color: #339933;">.</span>       r28<span style="color: #339933;">,</span>r28<span style="color: #339933;">,-</span><span style="color: #0000ff;">1</span>		<span style="color: #666666; font-style: italic;">; 0FFFFh</span>
	addi         r31<span style="color: #339933;">,</span>r31<span style="color: #339933;">,</span><span style="color: #0000ff;">16</span>		<span style="color: #666666; font-style: italic;">; 10h</span>
	bne          $LL3@SumTest@<span style="color: #0000ff;">2</span>
$LN1@SumTest@<span style="color: #0000ff;">2</span><span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 65   : 	}</span>
<span style="color: #666666; font-style: italic;">; 66   : }</span>
&nbsp;
	addi         r1<span style="color: #339933;">,</span>r1<span style="color: #339933;">,</span><span style="color: #0000ff;">144</span>		<span style="color: #666666; font-style: italic;">; 90h</span>
	lfd          fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">40h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	b            __restgprlr_26
$M89781<span style="color: #339933;">:</span>
<span style="color: #666666; font-style: italic;">; End code for function: ??$SumTest@VTestVector4_Direct@@@@YAXPIAVTestVector4_Direct@@00H@Z</span></pre></div></div>

<h4>Virtual Function</h4>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">??$SumTest@VTestVector4_Virtual@@@@YAXPIAVTestVector4_Virtual@@<span style="color: #0000ff;">00H</span>@Z <span style="color: #000000; font-weight: bold;">PROC</span> <span style="color: #000000; font-weight: bold;">NEAR</span> <span style="color: #666666; font-style: italic;">; SumTest&lt;TestVector4_Virtual&gt;, COMDAT</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; Begin code for function: ??$SumTest@VTestVector4_Virtual@@@@YAXPIAVTestVector4_Virtual@@00H@Z</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 58   : {</span>
&nbsp;
	mflr         r12
	<span style="color: #00007f;">bl</span>           __savegprlr_25
	stfd         fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">48h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	stwu         r1<span style="color: #339933;">,-</span><span style="color: #0000ff;">0A0h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
<span style="color: #339933;">.</span>endprolog
$M89754<span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 59   : 	for ( int i = 0; i &lt; num ; ++i )</span>
&nbsp;
	cmpwi        cr6<span style="color: #339933;">,</span>r6<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span>
	ble          cr6<span style="color: #339933;">,</span>$LN1@SumTest
	mr           r31<span style="color: #339933;">,</span>r4
	subf         r30<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r3
	subf         r29<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r5
	mr           r26<span style="color: #339933;">,</span>r6
$LL3@SumTest<span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 60   : 	{</span>
<span style="color: #666666; font-style: italic;">; 61   : 		out[i].SetX( in1[i].GetX() + in2[i].GetX() );</span>
&nbsp;
	lwz          r11<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r31<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">add</span>          r28<span style="color: #339933;">,</span>r29<span style="color: #339933;">,</span>r31
	lwzx         r25<span style="color: #339933;">,</span>r29<span style="color: #339933;">,</span>r31
	<span style="color: #00007f; font-weight: bold;">add</span>          r27<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r31
	mr           r3<span style="color: #339933;">,</span>r31
	lwz          r10<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	mtctr        r10
	bctrl
	lwzx         r9<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r31
	mr           r3<span style="color: #339933;">,</span>r27
	fmr          fr31<span style="color: #339933;">,</span>fr1
	lwz          r8<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	mtctr        r8
	bctrl
	lwz          r7<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r25<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r28
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	mtctr        r7
	bctrl
&nbsp;
<span style="color: #666666; font-style: italic;">; 62   : 		out[i].SetY( in1[i].GetY() + in2[i].GetY() );</span>
&nbsp;
	lwz          r6<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r31<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r31
	lwz          r5<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r6<span style="color: #009900; font-weight: bold;">&#41;</span>
	lwzx         r25<span style="color: #339933;">,</span>r29<span style="color: #339933;">,</span>r31
	mtctr        r5
	bctrl
	lwzx         r4<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r31
	mr           r3<span style="color: #339933;">,</span>r27
	fmr          fr31<span style="color: #339933;">,</span>fr1
	lwz          r11<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4<span style="color: #009900; font-weight: bold;">&#41;</span>
	mtctr        r11
	bctrl
	lwz          r10<span style="color: #339933;">,</span><span style="color: #0000ff;">0Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r25<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r28
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	mtctr        r10
	bctrl
&nbsp;
<span style="color: #666666; font-style: italic;">; 63   : 		out[i].SetZ( in1[i].GetZ() + in2[i].GetZ() );</span>
&nbsp;
	lwz          r9<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r31<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r31
	lwz          r8<span style="color: #339933;">,</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	lwzx         r25<span style="color: #339933;">,</span>r29<span style="color: #339933;">,</span>r31
	mtctr        r8
	bctrl
	lwzx         r7<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r31
	mr           r3<span style="color: #339933;">,</span>r27
	fmr          fr31<span style="color: #339933;">,</span>fr1
	lwz          r6<span style="color: #339933;">,</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r7<span style="color: #009900; font-weight: bold;">&#41;</span>
	mtctr        r6
	bctrl
	lwz          r5<span style="color: #339933;">,</span><span style="color: #0000ff;">14h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r25<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r28
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	mtctr        r5
	bctrl
&nbsp;
<span style="color: #666666; font-style: italic;">; 64   : 		out[i].SetW( in1[i].GetW() + in2[i].GetW() );</span>
&nbsp;
	lwz          r4<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r31<span style="color: #009900; font-weight: bold;">&#41;</span>
	mr           r3<span style="color: #339933;">,</span>r31
	lwz          r11<span style="color: #339933;">,</span><span style="color: #0000ff;">18h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r4<span style="color: #009900; font-weight: bold;">&#41;</span>
	lwzx         r25<span style="color: #339933;">,</span>r29<span style="color: #339933;">,</span>r31
	mtctr        r11
	bctrl
	lwzx         r10<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r31
	fmr          fr31<span style="color: #339933;">,</span>fr1
	mr           r3<span style="color: #339933;">,</span>r27
	lwz          r9<span style="color: #339933;">,</span><span style="color: #0000ff;">18h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	mtctr        r9
	bctrl
	lwz          r8<span style="color: #339933;">,</span><span style="color: #0000ff;">1Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r25<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr1<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr1
	mr           r3<span style="color: #339933;">,</span>r28
	mtctr        r8
	bctrl
	addic<span style="color: #339933;">.</span>       r26<span style="color: #339933;">,</span>r26<span style="color: #339933;">,-</span><span style="color: #0000ff;">1</span>		<span style="color: #666666; font-style: italic;">; 0FFFFh</span>
	addi         r31<span style="color: #339933;">,</span>r31<span style="color: #339933;">,</span><span style="color: #0000ff;">20</span>		<span style="color: #666666; font-style: italic;">; 14h</span>
	bne          $LL3@SumTest
$LN1@SumTest<span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 65   : 	}</span>
<span style="color: #666666; font-style: italic;">; 66   : }</span>
&nbsp;
	addi         r1<span style="color: #339933;">,</span>r1<span style="color: #339933;">,</span><span style="color: #0000ff;">160</span>		<span style="color: #666666; font-style: italic;">; 0A0h</span>
	lfd          fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">48h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	b            __restgprlr_25
$M89755<span style="color: #339933;">:</span>
<span style="color: #666666; font-style: italic;">; End code for function: ??$SumTest@VTestVector4_Virtual@@@@YAXPIAVTestVector4_Virtual@@00H@Z</span></pre></div></div>

<h4>Inlined Function</h4>
<p> (notice the use of software pipelining to reduce <a href="http://en.wikipedia.org/wiki/Hazard_(computer_architecture)">hazards</a>)</p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">; Begin code for function: ??$SumTest@VTestVector4_Inline@@@@YAXPIAVTestVector4_Inline@@00H@Z</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 58   : {</span>
&nbsp;
	mflr         r12
	<span style="color: #00007f;">bl</span>           __savegprlr_29
	stfd         fr29<span style="color: #339933;">,-</span><span style="color: #0000ff;">38h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfd         fr30<span style="color: #339933;">,-</span><span style="color: #0000ff;">30h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfd         fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">28h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
<span style="color: #339933;">.</span>endprolog
$M89879<span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 59   : 	for ( int i = 0; i &lt; num ; ++i )</span>
&nbsp;
	li           r7<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span>
	cmpwi        cr6<span style="color: #339933;">,</span>r6<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span>
	blt          cr6<span style="color: #339933;">,</span>$LC33@SumTest@<span style="color: #0000ff;">3</span>
	addi         r11<span style="color: #339933;">,</span>r6<span style="color: #339933;">,-</span><span style="color: #0000ff;">4</span>		<span style="color: #666666; font-style: italic;">; 0FFFCh</span>
	addi         r9<span style="color: #339933;">,</span>r3<span style="color: #339933;">,</span><span style="color: #0000ff;">16</span>		<span style="color: #666666; font-style: italic;">; 10h</span>
	srwi         r11<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span><span style="color: #0000ff;">2</span>
	addi         r10<span style="color: #339933;">,</span>r5<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span>
	addi         r8<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span><span style="color: #0000ff;">1</span>
	addi         r11<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 64   : 		out[i].SetW( in1[i].GetW() + in2[i].GetW() );</span>
&nbsp;
	subf         r31<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r3
	subf         r30<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r5
	subf         r29<span style="color: #339933;">,</span>r5<span style="color: #339933;">,</span>r3
	slwi         r7<span style="color: #339933;">,</span>r8<span style="color: #339933;">,</span><span style="color: #0000ff;">2</span>
$LL34@SumTest@<span style="color: #0000ff;">3</span><span style="color: #339933;">:</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr0<span style="color: #339933;">,-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	addic<span style="color: #339933;">.</span>       r8<span style="color: #339933;">,</span>r8<span style="color: #339933;">,-</span><span style="color: #0000ff;">1</span>		<span style="color: #666666; font-style: italic;">; 0FFFFh</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr13<span style="color: #339933;">,-</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	lfsx         fr12<span style="color: #339933;">,</span>r31<span style="color: #339933;">,</span>r11
	fadds        fr11<span style="color: #339933;">,</span>fr0<span style="color: #339933;">,</span>fr13
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr8<span style="color: #339933;">,</span>fr12<span style="color: #339933;">,</span>fr10
	lfsx         fr9<span style="color: #339933;">,</span>r10<span style="color: #339933;">,</span>r29
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr7<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr6<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr5<span style="color: #339933;">,</span>fr9<span style="color: #339933;">,</span>fr7
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr4<span style="color: #339933;">,-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">0Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr2<span style="color: #339933;">,</span>fr6<span style="color: #339933;">,</span>fr4
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr1<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr0<span style="color: #339933;">,</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr13<span style="color: #339933;">,</span>fr3<span style="color: #339933;">,</span>fr1
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr12<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">14h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr9<span style="color: #339933;">,</span>fr0<span style="color: #339933;">,</span>fr12
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr7<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr6<span style="color: #339933;">,</span><span style="color: #0000ff;">18h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr4<span style="color: #339933;">,</span>fr10<span style="color: #339933;">,</span>fr7
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">0Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr1<span style="color: #339933;">,</span><span style="color: #0000ff;">1Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr0<span style="color: #339933;">,</span>fr6<span style="color: #339933;">,</span>fr3
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr12<span style="color: #339933;">,</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">20h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr7<span style="color: #339933;">,</span>fr1<span style="color: #339933;">,</span>fr12
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr6<span style="color: #339933;">,</span><span style="color: #0000ff;">14h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">24h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr1<span style="color: #339933;">,</span>fr10<span style="color: #339933;">,</span>fr6
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr12<span style="color: #339933;">,</span><span style="color: #0000ff;">18h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr6<span style="color: #339933;">,</span>fr3<span style="color: #339933;">,</span>fr12
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">1Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">28h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr10<span style="color: #339933;">,</span>fr10<span style="color: #339933;">,</span>fr3
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">20h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr12<span style="color: #339933;">,</span><span style="color: #0000ff;">2Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr12<span style="color: #339933;">,</span>fr12<span style="color: #339933;">,</span>fr3
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr31<span style="color: #339933;">,</span><span style="color: #0000ff;">30h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">24h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr3<span style="color: #339933;">,</span>fr31<span style="color: #339933;">,</span>fr3
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr30<span style="color: #339933;">,</span><span style="color: #0000ff;">34h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr31<span style="color: #339933;">,</span><span style="color: #0000ff;">28h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr31<span style="color: #339933;">,</span>fr30<span style="color: #339933;">,</span>fr31
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr29<span style="color: #339933;">,</span><span style="color: #0000ff;">38h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr30<span style="color: #339933;">,</span><span style="color: #0000ff;">2Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	addi         r9<span style="color: #339933;">,</span>r9<span style="color: #339933;">,</span><span style="color: #0000ff;">64</span>		<span style="color: #666666; font-style: italic;">; 40h</span>
	fadds        fr30<span style="color: #339933;">,</span>fr29<span style="color: #339933;">,</span>fr30
	stfs         fr11<span style="color: #339933;">,-</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfsx        fr8<span style="color: #339933;">,</span>r30<span style="color: #339933;">,</span>r11
	addi         r11<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span><span style="color: #0000ff;">64</span>		<span style="color: #666666; font-style: italic;">; 40h</span>
	stfs         fr5<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr2<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr13<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr9<span style="color: #339933;">,</span><span style="color: #0000ff;">0Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr4<span style="color: #339933;">,</span><span style="color: #0000ff;">10h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr0<span style="color: #339933;">,</span><span style="color: #0000ff;">14h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr7<span style="color: #339933;">,</span><span style="color: #0000ff;">18h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr1<span style="color: #339933;">,</span><span style="color: #0000ff;">1Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr6<span style="color: #339933;">,</span><span style="color: #0000ff;">20h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">24h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr12<span style="color: #339933;">,</span><span style="color: #0000ff;">28h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">2Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr31<span style="color: #339933;">,</span><span style="color: #0000ff;">30h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr30<span style="color: #339933;">,</span><span style="color: #0000ff;">34h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	addi         r10<span style="color: #339933;">,</span>r10<span style="color: #339933;">,</span><span style="color: #0000ff;">64</span>		<span style="color: #666666; font-style: italic;">; 40h</span>
	bne          $LL34@SumTest@<span style="color: #0000ff;">3</span>
$LC33@SumTest@<span style="color: #0000ff;">3</span><span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 59   : 	for ( int i = 0; i &lt; num ; ++i )</span>
&nbsp;
	cmpw         cr6<span style="color: #339933;">,</span>r7<span style="color: #339933;">,</span>r6
	bge          cr6<span style="color: #339933;">,</span>$LN32@SumTest@<span style="color: #0000ff;">3</span>
	slwi         r11<span style="color: #339933;">,</span>r7<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span>
	subf         r31<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r3
	<span style="color: #00007f; font-weight: bold;">add</span>          r8<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span>r4
	<span style="color: #00007f; font-weight: bold;">add</span>          r10<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span>r5
	<span style="color: #00007f; font-weight: bold;">add</span>          r9<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span>r3
	addi         r11<span style="color: #339933;">,</span>r8<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span>
	subf         r4<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r5
	addi         r10<span style="color: #339933;">,</span>r10<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span>
	subf         r5<span style="color: #339933;">,</span>r5<span style="color: #339933;">,</span>r3
	subf         r8<span style="color: #339933;">,</span>r7<span style="color: #339933;">,</span>r6
$LC3@SumTest@<span style="color: #0000ff;">3</span><span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 60   : 	{</span>
<span style="color: #666666; font-style: italic;">; 61   : 		out[i].SetX( in1[i].GetX() + in2[i].GetX() );</span>
&nbsp;
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr0<span style="color: #339933;">,-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	addic<span style="color: #339933;">.</span>       r8<span style="color: #339933;">,</span>r8<span style="color: #339933;">,-</span><span style="color: #0000ff;">1</span>		<span style="color: #666666; font-style: italic;">; 0FFFFh</span>
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr13<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 62   : 		out[i].SetY( in1[i].GetY() + in2[i].GetY() );</span>
&nbsp;
	lfsx         fr12<span style="color: #339933;">,</span>r31<span style="color: #339933;">,</span>r11
	fadds        fr11<span style="color: #339933;">,</span>fr0<span style="color: #339933;">,</span>fr13
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr10<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 63   : 		out[i].SetZ( in1[i].GetZ() + in2[i].GetZ() );</span>
&nbsp;
	lfsx         fr9<span style="color: #339933;">,</span>r10<span style="color: #339933;">,</span>r5
	fadds        fr8<span style="color: #339933;">,</span>fr12<span style="color: #339933;">,</span>fr10
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr7<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 64   : 		out[i].SetW( in1[i].GetW() + in2[i].GetW() );</span>
&nbsp;
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr6<span style="color: #339933;">,</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r11<span style="color: #009900; font-weight: bold;">&#41;</span>
	fadds        fr5<span style="color: #339933;">,</span>fr9<span style="color: #339933;">,</span>fr7
	<span style="color: #00007f; font-weight: bold;">lfs</span>          fr4<span style="color: #339933;">,</span><span style="color: #0000ff;">0Ch</span><span style="color: #009900; font-weight: bold;">&#40;</span>r9<span style="color: #009900; font-weight: bold;">&#41;</span>
	addi         r9<span style="color: #339933;">,</span>r9<span style="color: #339933;">,</span><span style="color: #0000ff;">16</span>		<span style="color: #666666; font-style: italic;">; 10h</span>
	fadds        fr3<span style="color: #339933;">,</span>fr6<span style="color: #339933;">,</span>fr4
	stfs         fr11<span style="color: #339933;">,-</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfsx        fr8<span style="color: #339933;">,</span>r4<span style="color: #339933;">,</span>r11
	addi         r11<span style="color: #339933;">,</span>r11<span style="color: #339933;">,</span><span style="color: #0000ff;">16</span>		<span style="color: #666666; font-style: italic;">; 10h</span>
	stfs         fr5<span style="color: #339933;">,</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	stfs         fr3<span style="color: #339933;">,</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#40;</span>r10<span style="color: #009900; font-weight: bold;">&#41;</span>
	addi         r10<span style="color: #339933;">,</span>r10<span style="color: #339933;">,</span><span style="color: #0000ff;">16</span>		<span style="color: #666666; font-style: italic;">; 10h</span>
	bne          $LC3@SumTest@<span style="color: #0000ff;">3</span>
$LN32@SumTest@<span style="color: #0000ff;">3</span><span style="color: #339933;">:</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 65   : 	}</span>
<span style="color: #666666; font-style: italic;">; 66   : }</span>
&nbsp;
	lfd          fr29<span style="color: #339933;">,-</span><span style="color: #0000ff;">38h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	lfd          fr30<span style="color: #339933;">,-</span><span style="color: #0000ff;">30h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	lfd          fr31<span style="color: #339933;">,-</span><span style="color: #0000ff;">28h</span><span style="color: #009900; font-weight: bold;">&#40;</span>r1<span style="color: #009900; font-weight: bold;">&#41;</span>
	b            __restgprlr_29
$M89880<span style="color: #339933;">:</span>
<span style="color: #666666; font-style: italic;">; End code for function: ??$SumTest@VTestVector4_Inline@@@@YAXPIAVTestVector4_Inline@@00H@Z</span></pre></div></div>
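<p>Read back into C++, the inlined listing above has roughly the following shape: a main loop that handles four vectors per trip, doing its loads and adds before its stores, followed by a remainder loop for whatever is left over. This is only a hand-written sketch of that structure, not the compiler&#8217;s actual intermediate form. The names <code>Vec4</code>, <code>Add</code>, and <code>SumUnrolled</code> are made up for illustration, and the sketch assumes the output array does not overlap either input.</p>

```cpp
#include <cassert>

// Plain four-float vector; a stand-in for the test class with inlined accessors.
struct Vec4 { float x, y, z, w; };

// Component-wise sum of two vectors.
static inline Vec4 Add(const Vec4& a, const Vec4& b) {
    Vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

// Hypothetical hand-unrolled equivalent of the inlined loop.
// Assumes out does not overlap in1 or in2.
void SumUnrolled(Vec4* out, const Vec4* in1, const Vec4* in2, int num) {
    assert(num >= 0);
    int i = 0;

    // Main loop: four vectors per trip. All sums are computed before any
    // store, which leaves the scheduler free to interleave loads and adds.
    for (; i + 4 <= num; i += 4) {
        const Vec4 s0 = Add(in1[i],     in2[i]);
        const Vec4 s1 = Add(in1[i + 1], in2[i + 1]);
        const Vec4 s2 = Add(in1[i + 2], in2[i + 2]);
        const Vec4 s3 = Add(in1[i + 3], in2[i + 3]);
        out[i]     = s0;
        out[i + 1] = s1;
        out[i + 2] = s2;
        out[i + 3] = s3;
    }

    // Remainder loop: one vector per trip for the last 0 to 3 elements.
    for (; i < num; ++i) {
        out[i] = Add(in1[i], in2[i]);
    }
}
```

<p>With 1024 elements the remainder loop never runs, so in the timed test all the work happens in the four-at-a-time loop.</p>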

]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/01/19/code-for-testing-virtual-function-speed/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How Slow Are Virtual Functions Really?</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/#comments</comments>
		<pubDate>Mon, 19 Jan 2009 17:30:41 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181</guid>
		<description><![CDATA[Whenever I work with virtual functions I find myself wondering: how much is it costing me to perform all these vtable lookups and indirect calls? The usual truism is that computers are so fast now that it doesn&#8217;t matter and that the idea of virtuals being a problem is just another myth.
Our beloved Xenon CPU [...]]]></description>
			<content:encoded><![CDATA[<p>Whenever I work with <a href="http://www.parashift.com/c++-faq-lite/virtual-functions.html">virtual functions</a> I find myself wondering: how much is it costing me to perform all these <a href="http://en.wikipedia.org/wiki/Virtual_table">vtable</a> lookups and indirect calls? <a href="http://hbfs.wordpress.com/2008/12/30/the-true-cost-of-calls/">The usual truism is that computers are so fast now that it doesn&#8217;t matter</a> and that the idea of virtuals being a problem is just another myth.</p>
<p>Our beloved Xenon CPU is in-order, however, so I got curious whether that myth is truly busted for us, and as any Mythbuster can tell you, the only way to know is to build it and test! </p>
<p>I&#8217;ll talk about the test results first and then try to explain them in a later article. I built a simple 4-dimensional vector class with accessor functions for x, y, z, and w. Then I set up three arrays (A, B, C), each containing 1024 of these objects (so everything fits into the L1 cache), and ran a loop that simply added them together one component at a time.</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">class</span> Vector4Test <span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">float</span> x,y,z,w<span style="color: #008080;">;</span>
<span style="color: #0000ff;">public</span><span style="color: #008080;">:</span>
  <span style="color: #0000ff;">float</span> GetX<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span> <span style="color: #008000;">&#125;</span>
  <span style="color: #0000ff;">float</span> SetX<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> x_ <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000ff;">return</span> x<span style="color: #000080;">=</span>x_<span style="color: #008080;">;</span> <span style="color: #008000;">&#125;</span>
  <span style="color: #666666;">// and so on</span>
<span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
Vector4Test A<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1024</span><span style="color: #008000;">&#93;</span>, B<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1024</span><span style="color: #008000;">&#93;</span>, C<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1024</span><span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
&nbsp;
<span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> n <span style="color: #000080;">=</span> <span style="color: #0000dd;">0</span> <span style="color: #008080;">;</span> n <span style="color: #000080;">&lt;</span> NUM_TESTS <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>n<span style="color: #008000;">&#41;</span>
<span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> <span style="color: #0000dd;">1024</span> <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#123;</span>
   C<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">SetX</span><span style="color: #008000;">&#40;</span> A<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">+</span> B<span style="color: #008000;">&#91;</span>i<span style="color: #008000;">&#93;</span>.<span style="color: #007788;">GetX</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
   <span style="color: #666666;">// and so on for y, z, and w</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>By specifying whether the <code>Get</code> and <code>Set</code> functions are inline, direct, or virtual, it&#8217;s easy to compare the overhead of one kind of function call versus another. Each run through the loop makes three function calls per component, times four components, times 1024 elements in the array, for a total of 12,288 function calls. The inline function is essentially the control group, since it measures just the cost of the memory accesses, loop conditionals, and floating-point math without any function call overhead at all. Here are the results:</p>
<div style="width: 40em; margin-left: auto; margin-right: auto; background-color: #FFFFE0; padding:0.5em; border:medium solid black"><b>NOTE:</b> The values below have been corrected from the first version of this post. See <a href="http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/?preview=true&#038;preview_id=181&#038;preview_nonce=568a554f6d#comment-74">this comment</a> for details.</div>
<p><center><br />
<table >
<tr>
<th colspan="2" style="font-weight:bold; white-space: nowrap;">1000 iterations over 1024 vectors<br />12,288,000 function calls</th>
</tr>
<tr>
<td>virtual:</td>
<td>159.856 ms</td>
</tr>
<tr>
<td>direct:</td>
<td>67.962 ms</td>
</tr>
<tr>
<td>inline:</td>
<td>8.040 ms</td>
</tr>
</table>
<p>&nbsp;</p>
<table >
<tr>
<th colspan="2" style="font-weight:bold; white-space: nowrap;">50000 iterations over 1024 vectors<br />614,400,000 function calls</th>
</tr>
<tr>
<td>virtual:</td>
<td>8080.708 ms</td>
</tr>
<tr>
<td>direct:</td>
<td>3406.297 ms</td>
</tr>
<tr>
<td>inline:</td>
<td>411.924 ms</td>
</tr>
</table>
<p></center></p>
<p>A couple of things are immediately obvious. First, <strong>virtual functions <em>are</em> slower than direct function calls</strong>. But by how much? In the upper trial, the virtual-function test took 91.894ms longer than the direct functions; divided by the 12.288&times;10<sup>6</sup> function calls, that works out to a differential overhead of about 7 <i>nano</i>seconds per call. So, there is a definite cost there, but probably not something to worry about unless it&#8217;s a function that gets called thousands of times per frame.</p>
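<p>The per-call figure is easy to reproduce from the tables. The helper below is only a sketch of that arithmetic; the function names are made up for illustration, and the inputs are the timings and call counts quoted above.</p>

```cpp
#include <cstdio>

// Differential cost per call, in nanoseconds, between two timed runs
// that made the same number of function calls.
double PerCallOverheadNs(double slowMs, double fastMs, double numCalls)
{
    const double nsPerMs = 1.0e6;
    return (slowMs - fastMs) * nsPerMs / numCalls;
}

// Prints the virtual-versus-direct overhead for both trials in the tables.
void PrintOverheads()
{
    std::printf("1000 iterations:  %.2f ns/call\n",
                PerCallOverheadNs(159.856, 67.962, 12288000.0));
    std::printf("50000 iterations: %.2f ns/call\n",
                PerCallOverheadNs(8080.708, 3406.297, 614400000.0));
}
```

<p>Both trials come out between 7 and 8 ns per call, which suggests the overhead scales linearly with the number of calls rather than being a one-time cost.</p>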
<p>Later I&#8217;ll get further into the causes of these disparities, why virtual functions are slower than direct calls, and when inlining is advantageous. In the meantime I can tell you for sure that the problem is <em>not</em> the cost of looking up the indirect function pointer from the vtable &mdash; that&#8217;s only a single unproblematic load operation. Rather the issues lie in <a href="http://users.cs.fiu.edu/~downeyt/cop3402/prediction.html">branch prediction</a> and the way that marshalling parameters for the <a href="http://developer.apple.com/documentation/developertools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20">calling convention</a> can get in the way of good instruction scheduling.</p>
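<p>To make the call flavors concrete, here is a rough sketch of how such accessors might be declared. The type names are hypothetical stand-ins, not the actual <code>Vector4Test</code> class from the test. A non-inlined direct call would additionally need the accessor defined in another translation unit or marked with a compiler-specific no-inline attribute, which is left out here.</p>

```cpp
// Hypothetical stand-ins for the article's test class; names are illustrative.
struct Vector4Inline {
    float x = 0.0f;
    float GetX() const { return x; }   // defined in-class: eligible for inlining
    void  SetX(float v) { x = v; }
};

struct Vector4Base {
    virtual ~Vector4Base() = default;
    virtual float GetX() const = 0;    // dispatched through the vtable
    virtual void  SetX(float v) = 0;
};

struct Vector4Virtual : Vector4Base {
    float x = 0.0f;
    float GetX() const override { return x; }
    void  SetX(float v) override { x = v; }
};

// Same arithmetic either way; only the call mechanism differs.
// Three accessor calls per component, as in the timing loop.
template <typename Vec>
void AddX(Vec& c, const Vec& a, const Vec& b)
{
    c.SetX(a.GetX() + b.GetX());
}
```

<p>Calling <code>AddX</code> on <code>Vector4Inline</code> objects lets the compiler fold the accessors away, while <code>AddX&lt;Vector4Base&gt;</code> on <code>Vector4Virtual</code> objects goes through the vtable unless the optimizer manages to devirtualize the calls.</p>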
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>More About Rounding On MSVC-x86</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/13/sse2-fastest-rounding-on-x86/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/01/13/sse2-fastest-rounding-on-x86/#comments</comments>
		<pubDate>Tue, 13 Jan 2009 21:32:45 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=166</guid>
		<description><![CDATA[For best float-to-int conversion speed, specify the compiler flag /arch:SSE2 to assume the availability of SSE2 opcodes.]]></description>
			<content:encoded><![CDATA[<p>The feedback I got on <a href="http://assemblyrequired.crashworks.org/2009/01/12/why-you-should-never-cast-floats-to-ints/">yesterday&#8217;s article on float-to-int conversion</a> prompted me to look more closely into all the different options MSVC actually gives you for rounding on the x86 architecture. It turns out that with <code><a href="http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx">/fp:fast</a></code> set it can do one of three things (in addition to the <a href="http://www.stereopsis.com/sree/fpu2006.html">magic-number rounding</a> you can write yourself):</p>
<ul>
<li>By default it will call a function <code>_ftol2_sse</code>, which tests the CPU to see if it has <a href="http://softpixel.com/~cwright/programming/simd/sse2.php">SSE2</a> functionality. If so, it uses the native SSE2 instruction <tt>cvttsd2si</tt>. If not, it calls _ftol(). This is quite slow because it has to perform that CPU test for every single conversion, and because there is that overhead of a function call.</li>
<li>With <code><a href="http://msdn.microsoft.com/en-us/library/6d9xx1d2(VS.80).aspx">/QIfist</a></code> specified, the compiler simply emits a <code>fistp</code> opcode to convert the x87 floating point register to an integer in memory directly. It uses whatever rounding mode happens to be set in the CPU at the moment.</li>
<li>With <code><a href="http://msdn.microsoft.com/en-us/library/7t5yh4fd(VS.80).aspx">/arch:SSE2</a></code> specified, the compiler assumes that the program will only run on CPUs with SSE2, so it emits the <tt>cvttsd2si</tt> opcode directly instead of calling <code>_ftol2_sse</code>. Like /QIfist, this replaces a function call with a single instruction, but it&#8217;s even faster and not deprecated. As commenter <strong>cb</strong> points out, the intrinsics also let you specify truncation or rounding without having to fool around with CPU modes.</li>
</ul>
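<p>As a sketch of what the intrinsics mentioned in the last item look like, the two wrappers below use the SSE2 scalar conversion intrinsics from <code>&lt;emmintrin.h&gt;</code>. The wrapper names are invented for this example, and the code assumes an x86 or x64 compiler with SSE2 available.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; x86/x64 compilers only

// Truncates toward zero (cvttsd2si), matching the C cast (int)x.
inline int TruncateToInt(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}

// Rounds according to the current SSE rounding mode in MXCSR (cvtsd2si);
// the default mode is round-to-nearest-even.
inline int RoundToInt(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}
```

<p>Choosing between the two is a matter of picking an instruction, so no x87 control word has to be touched to get truncation.</p>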
<p>I raced the different techniques against each other and the clear winner was the function compiled with <code>/arch:SSE2</code> set. Thus, if you can assume that your customer will have a CPU with SSE2 enabled, setting that simple compiler switch will provide you with superior performance for basically no work. The only caveat is that the SSE scalar operations operate at a maximum of double-precision floats, whereas the old x87 FPU instructions are internally 80-bit &mdash; but I&#8217;ve never seen a game application where that level of precision makes a difference.</p>
<p><a href="http://store.steampowered.com/hwsurvey/">According to the Steam Hardware Survey</a>, 95% of our customers have SSE2-capable CPUs. The rest are probably not playing your most recent releases anyway.</p>
<p><center><br />
<table border>
<caption>Comparison of rounding speeds<br/>8 trials of 1.024*10<sup>8</sup> floats on a Core2</caption>
<tr>
<th>/fp:fast</th>
<th>magic number</th>
<th>/arch:sse2</th>
<th>/QIfist</th>
</tr>
<tr>
<td>312.944ms</td>
<td>184.534ms</td>
<td>96.978ms</td>
<td>178.732ms</td>
</tr>
<tr>
<td>314.255ms</td>
<td>182.105ms</td>
<td>91.390ms</td>
<td>178.363ms</td>
</tr>
<tr>
<td>311.359ms</td>
<td>181.397ms</td>
<td>89.606ms</td>
<td>182.709ms</td>
</tr>
<tr>
<td>309.149ms</td>
<td>181.023ms</td>
<td>87.732ms</td>
<td>180.485ms</td>
</tr>
<tr>
<td>309.828ms</td>
<td>181.405ms</td>
<td>91.891ms</td>
<td>184.785ms</td>
</tr>
<tr>
<td>309.595ms</td>
<td>176.970ms</td>
<td>86.886ms</td>
<td>178.501ms</td>
</tr>
<tr>
<td>309.081ms</td>
<td>179.109ms</td>
<td>86.885ms</td>
<td>177.811ms</td>
</tr>
<tr>
<td>308.208ms</td>
<td>176.873ms</td>
<td>86.796ms</td>
<td>178.051ms</td>
</tr>
</table>
<p></center></p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/01/13/sse2-fastest-rounding-on-x86/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why You Should Never Cast Floats To Ints</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/12/why-you-should-never-cast-floats-to-ints/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/01/12/why-you-should-never-cast-floats-to-ints/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 17:00:10 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[LHS]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=124</guid>
		<description><![CDATA[The seemingly simple act of assigning an int to a float variable, or vice versa, can be astonishingly expensive. In some architectures it may cost 40 cycles or even more! The reasons for this have to do with rounding and the load-hit-store issue, but the outcome is simple: never cast a float to an int [...]]]></description>
			<content:encoded><![CDATA[<p>The seemingly simple act of assigning an <code>int</code> to a <code>float</code> variable, or vice versa, can be astonishingly expensive. In some architectures it may cost 40 cycles or even more! The reasons for this have to do with rounding and the <a href="http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/">load-hit-store</a> issue, but the outcome is simple: never cast a float to an int inside a calculation &#8220;because the math will be faster&#8221; (it isn&#8217;t), and if you must convert a float to an int for storage, do it only once, at the end of the function, when you write the final outcome to memory.</p>
<p>This particular performance suck has two major sources: <b>register shuffling</b> and <b>rounding</b>. The first one has to do with hardware and affects all modern CPUs; the second is older and more specific to x86 compilers. </p>
<p>In both cases, one simple rule holds true: whenever you find yourself typing <code>*((int *)(&#038;anything))</code>, you&#8217;re in for some pain.</p>
<h2>Register Sets: The Multiple-Personality CPU</h2>
<p>Casting floats to ints and back is something that can happen in routine tasks, like</p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">int</span> a <span style="color: #000080;">=</span> GetSystemTimeAsInt<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
<span style="color: #0000ff;">float</span> seconds <span style="color: #000080;">=</span> a<span style="color: #008080;">;</span></pre></div></div>

</blockquote>
<p>Or occasionally you may be tempted to perform bit-twiddling operations on a floating point number; for example, this might seem like a fast way to determine whether a float is positive or negative:</p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">bool</span> IsFloatPositive<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">float</span> f<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #0000ff;">int</span> signBit <span style="color: #000080;">=</span> <span style="color: #000040;">*</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> <span style="color: #000040;">*</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>f<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">&amp;</span> <span style="color: #208080;">0x80000000</span><span style="color: #008080;">;</span>
    <span style="color: #0000ff;">return</span> signBit <span style="color: #000080;">==</span> <span style="color: #0000dd;">0</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

</blockquote>
<p>The problem here is that casting from one register type to another like this is an almost sure way to induce a load-hit-store. In the x86 and the PPC, integers, floating-point numbers, and SIMD vectors are kept in three separate register sets. As <a href="http://www.gamasutra.com/view/feature/3687/sponsored_feature_common_.php">Becky Heineman wrote in Gamasutra</a>, you can &#8220;think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data.&#8221; </p>
<p>Integer operations (like add, sub, and bitwise ops) work only on integer registers, floating-point operations (like fadd, fmul, fsqrt) only on the FPU registers, and SIMD ops only touch vector registers. This is true of both the PowerPC and the x86 and nearly every modern CPU (with one exception, see below). </p>
<p>This makes the CPU designer&#8217;s job much easier (because each of these units can have a pipeline of different depth), but it means there is no way to move data directly from one kind of register to another. There is no &#8220;move&#8221; operation that has one integer and one float operand. Basically, there are simply no wires that run directly between the int, float, and vector registers.</p>
<p>So, whenever you move data from an int to a float, the CPU first stores the integer from the int register to memory, and then in the next instruction reads from that memory address into the float register. This is the very definition of a <a href="http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/">load-hit-store</a> stall, because that first store may take as many as 40 cycles to make it all the way out to the L1 cache, and the subsequent load can&#8217;t proceed until it has finished. On an in-order processor like the 360 or the PS3&#8217;s PPC, that means everything stops dead for between 40 and 80 cycles; on an out-of-order x86, the CPU will try to skip ahead to some of the subsequent instructions, but can usually only hide a little bit of that latency.</p>
<p>It&#8217;s not the actual conversion of ints to floats that is slow: <code>const int *pA; float f = *pA;</code> can happen in two cycles if the contents of pA are already in memory. What is slow is moving data between the different kinds of registers, because the data has to make a round trip through memory first.  </p>
<p>What this all boils down to is that you should simply avoid mixing ints, floats, and vectors in the same calculation. So, for example, instead of</p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">struct</span> Foo <span style="color: #008000;">&#123;</span> <span style="color: #0000ff;">int</span> a, b<span style="color: #008080;">;</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">float</span> func<span style="color: #008000;">&#40;</span> Foo <span style="color: #000040;">*</span>data <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">float</span> x <span style="color: #000080;">=</span> fsqrt<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#40;</span> data<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>a <span style="color: #000080;">&lt;&lt;</span> <span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> data<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>b<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
  <span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

</blockquote>
<p>you are really better off with</p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">float</span> func<span style="color: #008000;">&#40;</span> Foo <span style="color: #000040;">*</span>data <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">float</span> fA <span style="color: #000080;">=</span> data<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>a<span style="color: #008080;">;</span>
  <span style="color: #0000ff;">float</span> fB <span style="color: #000080;">=</span> data<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>b<span style="color: #008080;">;</span>
  <span style="color: #0000ff;">float</span> x <span style="color: #000080;">=</span> fsqrt<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#40;</span> fA <span style="color: #000040;">*</span> <span style="color:#800080;">2.0f</span> <span style="color: #008000;">&#41;</span> <span style="color: #000040;">-</span> fB <span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
  <span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

</blockquote>
<p>More importantly, if you have wrapped your native SIMD type in a union like</p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">union</span> vec4 <span style="color: #008000;">&#123;</span>
  __vector V<span style="color: #008080;">;</span>  <span style="color: #666666;">// native 128-bit VMX register</span>
  <span style="color: #0000ff;">struct</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000ff;">float</span> x,y,z,w<span style="color: #008080;">;</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>  <span style="color: #666666;">// anonymous struct aliasing the four lanes</span>
<span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span></pre></div></div>

<p>then you really need to avoid accessing the individual float members after working with it as a vector. Never do this:</p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;">vec4 A,B,C<span style="color: #008080;">;</span>
C <span style="color: #000080;">=</span> VectorCrossProductInVMX<span style="color: #008000;">&#40;</span>A, B<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">float</span> x <span style="color: #000080;">=</span> C.<span style="color: #007788;">y</span> <span style="color: #000040;">*</span> A.<span style="color: #007788;">x</span> <span style="color: #000040;">*</span> <span style="color:#800080;">2.0f</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">return</span> x<span style="color: #008080;">;</span></pre></div></div>

</blockquote>
<p>There is one notable exception: the Cell processor SPU as found in the PS3. In the SPUs, <a href="http://www.research.ibm.com/cell/SPU.html">all operations &mdash; integers, floats, and vectors &mdash; operate on the same set of registers</a> and you can mix them as much as you like. </p>
<p>Now, to put this in context, 80 cycles isn&#8217;t the end of the world. If you&#8217;re performing a hundred float-to-int casts per frame in high level functions, it&#8217;ll amount to only a few microseconds. On the other hand it only takes 20,000 such casts to eat up 5% of your frame on the 360, so if this is the sort of thing you&#8217;re doing in, say, AccumulateSoundBuffersByAddingTogetherEverySample(), it may be something to look at.</p>
<h2>Rounding: Say Hello To My Little <tt>fist</tt></h2>
<p>In times gone by one of the easiest performance speedups to be had on the PC and its x86-based architecture was usually found in the way your program rounded floats.</p>
<p>The most obvious and least hardware-specific cost of <code>int a = float b</code> is that you somehow have to get rid of the fractional part of the number to turn it into a whole integer; in other words, the CPU has to turn 3.14159 into 3 without <a href="http://www.straightdope.com/columns/read/805/did-a-state-legislature-once-pass-a-law-saying-pi-equals-3">involving the Indiana State Legislature</a>. That&#8217;s simple enough, but what if the number is 3.5 &mdash; do you round it up, or down? How about -3.5 &mdash; up or down? And how about on x86-based chips, where floating point numbers are calculated inside the CPU as 80 bits but an <b>int</b> is 32 bits?</p>
<p>At the time the Intel x87 floating-point coprocessor was invented, <a href="http://en.wikipedia.org/wiki/IEEE_754-1985#Rounding_floating-point_numbers">the IEEE 754 floating point standard specified that rounding could happen in one of four ways</a>:</p>
<ol>
<li>Round to nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest even number<br/>( 3.3 &rarr; 3 , 3.5 &rarr; 4, -2.5 &rarr; -2  )</li>
<li>Round to zero – also known as <b>truncate</b>, simply throws away everything after the decimal point<br/>( 3.3 &rarr; 3 , 3.5 &rarr; 3, -2.5 &rarr; -2  )</li>
<li>Round up – rounds toward positive infinity<br/>( 3.3 &rarr; 4 , 3.5 &rarr; 4, -2.5 &rarr; -2  )</li>
<li>Round down – rounds toward negative infinity<br/>( 3.3 &rarr; 3 , 3.5 &rarr; 3, -2.5 &rarr; -3  )</li>
</ol>
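<p>The four modes listed above can be tried out from portable C++ through <code>&lt;cfenv&gt;</code>. This is only a sketch: the helper name is made up, and strictly conforming code would also want <code>#pragma STDC FENV_ACCESS ON</code>, which not every compiler honors.</p>

```cpp
#include <cfenv>
#include <cmath>

// Rounds value to an integer under the given mode (FE_TONEAREST,
// FE_TOWARDZERO, FE_UPWARD or FE_DOWNWARD), then restores the old mode.
double RoundWithMode(int mode, double value)
{
    const int savedMode = std::fegetround();
    std::fesetround(mode);
    // volatile keeps the compiler from folding the rounding at compile time
    // or moving it across the mode changes.
    volatile double input = value;
    volatile double result = std::nearbyint(input);
    std::fesetround(savedMode);
    return result;
}
```

<p>For example, <code>RoundWithMode(FE_UPWARD, 3.3)</code> and <code>RoundWithMode(FE_DOWNWARD, -2.5)</code> correspond to the &#8220;round up&#8221; and &#8220;round down&#8221; rows of the list.</p>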
<p>The x87 allows you to select any of these modes by setting or clearing a couple of bits in a special control register. Reading and writing that register is a very slow operation, because it means the processor has to totally throw away anything that came behind it in the pipeline and start over, so it&#8217;s best to change modes as little as possible, or not at all. </p>
<p>The actual rounding operation can be done in one instruction (the amusingly named <code>fist</code> op, which means &#8220;float-to-int store&#8221;), but there&#8217;s a snag. The ANSI C standard decrees that one and only one of these modes may ever be used for <code>int&nbsp;a&nbsp;=&nbsp;float&nbsp;b</code>: <b>truncate</b>. But, because the compiler can never be sure what rounding mode is set when you enter any particular function (you might have called into some library that set it differently), it would call a function called _ftol(), which set this mode each and every time a number was rounded. In fact, what it actually did <em>for every cast</em> was:</p>
<ol>
<li>Call into _ftol()</li>
<li>Check the old rounding mode and save it to memory</li>
<li>Set the rounding mode to &#8220;truncate&#8221; (this causes a pipeline clear)</li>
<li>Round the number</li>
<li>Set the rounding mode back to the one it saved in step 2 (another pipeline clear)</li>
<li>Return from _ftol()</li>
</ol>
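<p>In portable C++ the sequence of steps above looks roughly like the function below. To be clear, this is an illustration of the save/set/convert/restore pattern written with <code>&lt;cfenv&gt;</code>, not the actual CRT routine, and <code>FtolSketch</code> is an invented name.</p>

```cpp
#include <cfenv>
#include <cmath>

// Illustrative only: mimics the save / set / convert / restore sequence,
// not the real CRT implementation.
long FtolSketch(double value)
{
    const int savedMode = std::fegetround();   // step 2: remember the old mode
    std::fesetround(FE_TOWARDZERO);            // step 3: force truncation
    volatile double input = value;             // volatile pins the ordering
    volatile long result = std::lrint(input);  // step 4: convert in that mode
    std::fesetround(savedMode);                // step 5: put the old mode back
    return result;
}
```

<p>Two mode changes bracket every single conversion, which is where the cost comes from when this happens inside an inner loop.</p>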
<p>Because of this it wasn&#8217;t unusual to see a game spending over 6% of its time inside _ftol() alone. (In fact I can say with a straight face that I once saw a profile where a game spent fully 8% of each frame on <code>fist</code>ing.) This is an extreme case of the compiler choosing correctness over speed. </p>
<p>You&#8217;re thinking the answer is &#8220;well, how about I just set the rounding mode to start with and tell the compiler not to obsess so much about the exact correctness?&#8221; and you&#8217;re right. The solution in MSVC is to supply the <a href="http://msdn.microsoft.com/en-us/library/6d9xx1d2(VS.80).aspx">/QIfist compiler option</a>, which tells the compiler to assume the current rounding mode is correct and simply issue the hardware float-to-int op directly. This saves you the function call and two pipeline clears. If your rounding mode gets changed elsewhere in the program you might get unexpected results, but&#8230; you know.. don&#8217;t do that.</p>
<p>Microsoft&#8217;s documentation claims that /QIfist is &#8220;deprecated&#8221; due to their floating-point code being much faster now, but if you try it out you&#8217;ll see they&#8217;re fibbing. What happens now is that the compiler calls _ftol2_sse(), which uses the modeless SSE conversion op <code>cvttsd2si</code> instead of the old _ftol(). This has some advantages (you can pick between truncation and rounding for each operation without having to change the CPU&#8217;s rounding mode), but it&#8217;s still a needless function call where an opcode would do, and it shuffles data between the FPU and SSE registers, which brings us back to the LHS issue mentioned above. On my Intel Core2 PC, a simple test of calling the function below is twice as fast with compiler options <code>/fp:fast /QIfist</code> specified compared with only <code>/fp:fast</code>. </p>
<blockquote>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> test1<span style="color: #008000;">&#40;</span><span style="color: #0000ff;">volatile</span> <span style="color: #0000ff;">int</span> <span style="color: #000040;">*</span>a, <span style="color: #0000ff;">volatile</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span>f<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
  <span style="color: #0000ff;">for</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span> i<span style="color: #000080;">=</span><span style="color: #0000dd;">0</span><span style="color: #008080;">;</span> i <span style="color: #000080;">&lt;</span> <span style="color: #0000dd;">1000000</span> <span style="color: #008080;">;</span> <span style="color: #000040;">++</span>i<span style="color: #008000;">&#41;</span>
    <span style="color: #000040;">*</span>a <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">int</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">*</span>f<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

</blockquote>
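<p>A comparison like this needs some kind of stopwatch around the call. The helper below is one possible sketch using <code>std::chrono</code>; it is an assumption for illustration, not the timer actually used for the numbers in this article.</p>

```cpp
#include <chrono>

// Returns the wall-clock time, in milliseconds, taken by one call to work().
template <typename Work>
double TimeMs(Work&& work)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    work();
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

<p>With the function above in scope, something like <code>TimeMs([&amp;]{ test1(&amp;a, &amp;f); })</code> would time one run; building the same source with each set of compiler options and repeating the measurement several times gives comparable figures.</p>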
<p>On the other hand, in an absolute sense _ftol2_sse() is pretty fast so it may be good enough.</p>
<p>It&#8217;s also possible to <a href="http://www.stereopsis.com/sree/fpu2006.html">convert floats to ints by adding them to a certain magic number</a>, but this isn&#8217;t always a benefit. <a href="http://chrishecker.com/Miscellaneous_Technical_Articles#Floating_Point">In times of yore the <code>fistp</code> op was slow</a>, so there was an advantage to replacing the <code>fist</code> with a <code>fadd</code>, but this doesn&#8217;t seem to be the case any more. It <i>is</i> faster than an implicit call to _ftol2_sse, and it has the advantage of not depending on the CPU&#8217;s current rounding mode (since you can pick your magic number to choose between rounding and truncation). On the other hand if you&#8217;ve specified <code>/arch:sse2</code> and the compiler is using SSE scalar operations instead of the x87 FPU generally, then it&#8217;s faster to let it use the native <tt>cvttss2si</tt> op.</p>
<p>On the 360/PS3 CPU, the magic number technique is usually a performance hit, because most of the magic-number tricks involve an integer step on the floating-point number and run into the register-partitioning issue mentioned above. </p>
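<p>For reference, one common form of the magic-number trick looks like the sketch below. It assumes IEEE 754 doubles, arithmetic carried out in double precision (as with SSE2 code generation), and inputs that fit in a 32-bit integer; the function name is made up for this example.</p>

```cpp
#include <cstdint>
#include <cstring>

// Rounds to the nearest integer (ties to even, under the default rounding
// mode) without an explicit float-to-int conversion instruction.
// Assumes IEEE 754 doubles and |value| < 2^31.
std::int32_t MagicRound(double value)
{
    const double magic = 6755399441055744.0;    // 2^52 + 2^51
    // Adding the magic number pushes the integer part of value into the
    // low bits of the mantissa. volatile forces the sum out to a 64-bit
    // double instead of leaving it in a wider x87 register.
    volatile double shifted = value + magic;
    const double stored = shifted;
    std::uint64_t bits;
    std::memcpy(&bits, &stored, sizeof bits);   // well-defined type pun
    // The low 32 bits now hold the rounded value in two's complement.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}
```

<p>Note that the last step reads a float result back as integer bits, which is exactly the float-to-integer register hop that makes the trick unattractive on the console CPUs.</p>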
<h2>Further Reading</h2>
<ul>
<li>My primary source for the register-type load-hit-store issue was the XDK documentation; consult your own for more details.</li>
<li>David Goldberg&#8217;s <a href="http://www.physics.ohio-state.edu/~dws/grouplinks/floating_point_math.pdf">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a> is the definitive survival guide for the IEEE754 wilds.</li>
<li>You can learn more about the &#8220;magic integer&#8221; technique for converting between floats and ints <a href="http://www.garagegames.com/index.php?sec=mg&#038;mod=resource&#038;page=view&#038;qid=961">here</a> and <a href="http://www.stereopsis.com/sree/fpu2006.html">here</a>, and I also found <a href="http://www.stereopsis.com/FPU.html">an interesting article about the state of affairs circa 2000</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/01/12/why-you-should-never-cast-floats-to-ints/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Down With fcmp: Conditional Moves For Branchless Math</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/04/fcmp-conditional-moves-for-branchless-math/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/01/04/fcmp-conditional-moves-for-branchless-math/#comments</comments>
		<pubDate>Sun, 04 Jan 2009 18:00:45 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[branchless]]></category>
		<category><![CDATA[intrinsics]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=50</guid>
		<description><![CDATA[Branching on floating-point values can be a significant performance penalty. In some cases, using the ternary operator ?: or substituting a conditional-move operator like fsel allows you to make your function branchless and run much faster.]]></description>
			<content:encoded><![CDATA[<p>Although it&#8217;s often considered one of the dustier branches of C++, the <code>?:</code> ternary operator can sometimes compile to code that actually runs much faster than the equivalent <code>if/else</code>. Some compilers translate it directly to a hardware operation called a <b>conditional move</b> which allows the compiler to select between two values without having to perform any comparison or branch operations. As I&#8217;ll show below, in some cases, particularly where floating point numbers are involved, this can result in really significant performance gains.</p>
<h2>The Slow Tree Is Full Of Branches</h2>
<p>In opcode parlance, any instruction that causes the CPU to skip from one instruction to another is called a <i>branch</i>. If the jump occurs based on some condition being met, like in an <code>if ( x == y ) { doSomething() ; }</code> block, it&#8217;s called a <i>conditional</i> branch: in this case, the program <i>branches</i> to either run doSomething() or not depending on the result of a &#8220;compare x to y&#8221; operation. </p>
<p>For various reasons, a branch instruction can be pretty slow to execute. The important one here is that modern CPUs process their instructions in many steps along a <a href="http://users.cs.fiu.edu/~downeyt/cop3402/pipeline.html">pipeline</a>. Like parts moving along a conveyor belt in a factory, a CPU can have as many as eighty program instructions in flight at once, each of them being processed a little bit more at each of the eighty steps along the assembly line until it is finished and the final results are written back to memory. </p>
<p>When a conditional branch instruction executes, the CPU can&#8217;t know which instructions are supposed to come after the branch on the assembly line. Usually instructions are executed in sequence, so that opcode #1000 is followed by #1001 and that by #1002 and so on, but if #1002 turns out to be a branch to instruction #2000, then any work the CPU might have done on instructions #1003-#1010 at earlier stages in the pipeline has to be thrown away. Modern CPUs and compilers have gotten quite good at mitigating this effect through <i><a href="http://users.cs.fiu.edu/~downeyt/cop3402/prediction.html">branch prediction</a></i> but in some cases, particularly involving floating-point numbers, branch prediction may be impossible in a way that causes the <i>entire</i> assembly line to be cleared out and restarted.</p>
<p>To understand this, first consider this pseudoassembly for how we could compile <code> x = (a &gt;= 0 ? b : c)</code> using a compare followed by a branch instruction:</p>
<p><code>// assume four registers named ra, rb, rc, rx<br />
// where rx is the output<br />
// and r0 contains zero<br />
COMPARE ra, r0  // sets "ge" condition bit to 1 if ra &gt;= 0<br />
JUMPGE <strong>bbranch</strong>  // executes a GOTO bbranch: if the "ge" bit is 1<br />
MOVE rx, rc&nbsp;&nbsp;&nbsp;&nbsp;// a &lt; 0: sets rx = rc<br />
JUMP finish<br />
<strong> bbranch:</strong><br />
MOVE rx, rb&nbsp;&nbsp;&nbsp;&nbsp;// a &gt;= 0: sets rx = rb<br />
<strong> finish:</strong><br />
RETURN rx </code></p>
<p>There are a couple of ways that this can slow down the instruction pipeline. Firstly, it always executes at least one branch operation, which is expensive: on the PowerPC such a branch can stall execution for between five and twenty cycles depending on whether it is predicted correctly. Also, the conditional jump occurs immediately after the COMPARE operation, which can lead to another stall if the result of the COMPARE isn&#8217;t ready within a single cycle.</p>
<p>But with floating-point compares the situation can be much worse because float pipelines are often much longer than the integer/branch pipeline. This means that the result of the COMPARE may not be available to the branch unit for many cycles after it dispatches. When the CPU tries to branch on a result that isn&#8217;t available yet, it has no choice but to flush the entire pipeline, wait for the compare to finish and start over. For a modern 40-cycle long pipeline this can be quite costly indeed!  For example, consider this simplified pipeline of a hypothetical machine.</p>
<p>The fetch stage takes six cycles, then instructions are dispatched after a three cycle delay to either an integer, branch, or float pipeline.</p>
<div id="attachment_65" class="wp-caption aligncenter" style="width: 610px"><img class="size-full wp-image-65" title="Step 1" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/fcmp_ipipeline_step1.png" alt="a float-compare enters the fetch stage, immediately followed by a branch-on-greatereq" width="600" height="400" /><p class="wp-caption-text">a float-compare enters the fetch stage, immediately followed by a branch-on-greater-or-equal</p></div>
<p>The fcmp instruction begins to execute, but its results won&#8217;t be ready until it leaves the eight-stage float pipeline. So, the branch instruction immediately following it can&#8217;t execute yet.</p>
<div id="attachment_66" class="wp-caption aligncenter" style="width: 610px"><img class="size-full wp-image-66" title="Step 2" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/fcmp_ipipeline_step2.png" alt="The fcmp begins to execute, but the branch pipeline can't evaluate until the results are ready." width="600" height="400" /><p class="wp-caption-text">The fcmp begins to execute, but the branch pipeline can&#39;t evaluate until the results are ready.</p></div>
<p>It (and all the instructions behind it) gets flushed from the pipeline and has to start over from the beginning.</p>
<div id="attachment_67" class="wp-caption aligncenter" style="width: 610px"><img class="size-full wp-image-67" title="fcmp_ipipeline_step3" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/fcmp_ipipeline_step3.png" alt="The fcmp continues to execute, while everything behind it in the pipeline is flushed altogether." width="600" height="400" /><p class="wp-caption-text">The fcmp continues to execute, while everything behind it in the pipeline is flushed altogether.</p></div>
<p>Once the fcmp is done, the branch is allowed to enter the pipe again,</p>
<div id="attachment_68" class="wp-caption aligncenter" style="width: 610px"><img class="size-full wp-image-68" title="fcmp_ipipeline_step4" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/fcmp_ipipeline_step4.png" alt="Branch instruction reenters pipeline" width="600" height="400" /><p class="wp-caption-text">Branch instruction starts over while fcmp executes.</p></div>
<p>and then finally executes some time later. Even though the branch immediately follows the fcmp in the program, it doesn&#8217;t actually get executed until over twenty cycles later!</p>
<div id="attachment_64" class="wp-caption aligncenter" style="width: 610px"><img class="size-full wp-image-64" title="fcmp_ipipeline_step5" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/fcmp_ipipeline_step5.png" alt="Branch instruction finally begins to execute" width="600" height="400" /><p class="wp-caption-text">Branch instruction finally executes after fcmp completes.</p></div>
<p>To solve all of these issues and others, hardware designers have invented instructions that can perform all this work in a single branchless operation. Unfortunately, compilers aren&#8217;t always smart enough to use them, so sometimes you need to do a little work yourself.</p>
<h2>The conditional move</h2>
<p>Because branches can be slow, many architectures implement a means of selecting between two different values based on a third without having to actually execute a comparison and jump operation. This is usually called a &#8220;conditional move&#8221; or &#8220;branchless select&#8221; and expresses the operation: <em>if <strong>a</strong> &ge; 0 then <strong>b</strong> else <strong>c</strong>;</em> in C++ that is</p>
<p><code>// int a, b, c;<br />
int x = a &gt;= 0 ? b : c ;</code></p>
<div>or, more LISPishly,</div>
<p><code>( if (&gt;= a 0) b c ) </code></p>
<p>There are numerous reasons why this is useful, but from my point of view the best is improving performance by eliminating pipeline bubbles. Unlike an if/else implemented as a branch op, a conditional move doesn&#8217;t change the flow of the program or cause it to jump from one instruction to another; it simply assigns either value <i>b</i> or <i>c</i> to the contents of register <i>x</i> in a single instruction. Thus, it can never cause the pipeline to clear or invalidate any of the instructions that come after it. </p>
<p>The floating-point conditional move operator on the PPC is called fsel, which works like:</p>
<p><code>fsel f0, f1, f2, f3 // f0 = ( f1 &gt;= 0 ? f2 : f3 )</code></p>
<p>The most recent version of the compiler I use is pretty good at compiling this automatically for cases like this:</p>
<table border="0" width="100%">
<tbody>
<tr>
<td><code>return a &gt;= 0 ? b : c;</code></td>
<td><code>fsel fr1,fr1,fr2,fr3</code></td>
</tr>
</tbody>
</table>
<p>and it even does the right thing with this:</p>
<table border="0" width="100%">
<tbody>
<tr style="vertical-align: top">
<td>
<pre>// float a, b, c, d, e, f;
return ( a &gt;= 0 ? b + 1 : c + 2 ) +
   ( d &gt;= 0 ? e + 1 : f + 2 ) ;</pre>
</td>
<td>
<pre>; fr1 = a, fr2 = b, fr3 = c,
; fr4 = d, fr5 = e, fr6 = f
; fr0 = 1.0, fr13 = 2.0f
 fadds   fr12,fr2,fr0      ; fr12 = b + 1
 fadds   fr11,fr3,fr13     ; fr11 = c + 2
 fadds   fr10,fr5,fr0      ; fr10 = e + 1
 fadds   fr9,fr6,fr13      ; fr9  = f + 2
 fsel    fr8,fr1,fr12,fr11 ; fr8  = a &gt;= 0 ? fr12 : fr11
 fsel    fr7,fr4,fr10,fr9  ; fr7  = d &gt;= 0 ? fr10 : fr9
 fadds   fr1,fr8,fr7       ; return = fr8 + fr7
</pre>
</td>
</tr>
</tbody>
</table>
<p>but it has trouble with more complicated scenarios like this:</p>
<table border="0" width="100%">
<tbody>
<tr style="vertical-align: top">
<td><code>// float a, b, c, d;<br />
return a &gt;= b ? c : d;</code></td>
<td>
<pre>; fr1 = a, fr2 = b, fr3 = c, fr4 = d
fcmpu        cr6,fr1,fr2  ; compare fr1 and fr2
blt          cr6,$LN3     ; if compare result was
                          ; "less than", GOTO $LN3
fmr          fr1,fr3      ; move fr3 to fr1
blr                       ; return
$LN3:
fmr          fr1,fr4      ; move fr4 to fr1
blr                       ; return</pre>
</td>
</tr>
</tbody>
</table>
<p>and if you use an if/else block instead of the ternary operator, all bets are off.</p>
<p>So, for these more complex cases, the compiler exposes fsel as an <i>intrinsic</i> with a prototype something like:</p>
<p><code>float fsel( float a, float x, float y );<br />
// equivalent to { return a &gt;= 0 ? x : y ; }</code></p>
<p>A <b>compiler intrinsic</b> looks like an ordinary C++ function but compiles directly to the native hardware instruction, providing you with direct access to the opcode without having to actually write assembly. The caveat is that the hardware fsel can only perform a &ge; 0 comparison (on the PPC anyway), so for complex conditionals like</p>
<p><code>if ( a / 2 &lt; b + 3 ) { return c;  } else { return d; }</code></p>
<p>you need to do a little algebra to transform the inequality <code>a / 2 &lt; b + 3</code> into <code>a - 2b - 6 &lt; 0</code> and then, because <em>x</em> &lt; 0 &equiv; ¬(<em>x</em> &ge; 0), flip the order of the operands to get <code>return fsel( a - 2*b - 6, d, c );</code>. Once you&#8217;ve got everything fed into the intrinsic, the compiler can do a good job of scheduling your other instructions.</p>
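<p>To make that rearrangement concrete, here is a minimal sketch in portable C++. It assumes a stand-in function, <code>fsel_ref</code>, that only mimics the select semantics with an ordinary ternary; that name and the two <code>pick_*</code> helpers are made up for illustration, and a real build would call the compiler&#8217;s intrinsic instead.</p>

```cpp
// Hypothetical stand-in for the hardware select: a >= 0 ? x : y.
// It exists only so the sketch compiles anywhere.
constexpr float fsel_ref(float a, float x, float y)
{
    return a >= 0.0f ? x : y;
}

// The original conditional: if ( a / 2 < b + 3 ) return c; else return d;
constexpr float pick_branchy(float a, float b, float c, float d)
{
    return (a / 2 < b + 3) ? c : d;
}

// Rearranged form: a / 2 < b + 3 is the same as a - 2b - 6 < 0,
// so the ">= 0" arm of the select must be d and the other arm c.
constexpr float pick_select(float a, float b, float c, float d)
{
    return fsel_ref(a - 2 * b - 6, d, c);
}
```

<p>The two versions agree for well-behaved inputs. They can disagree when an input is NaN, or when rounding in <code>a - 2*b - 6</code> lands on the other side of zero, so it is worth checking that the rearranged form is acceptable for your data.</p>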
<h2>Life beyond if</h2>
<p>The advantage of conditional moves comes when you use them to eliminate branching from a complex mathematical operation altogether. The most obvious case is something like the good old <code>min( float a, float b ) { return a &lt;= b ? a : b; } </code> which becomes <code>return fsel( b - a, a, b );</code>, which means you can get through something wacky like <code>x += min(a,b) * max(c,d)</code> without any branches at all.</p>
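<p>As a rough sketch of the min/max case, assuming a portable stand-in for the select (the names <code>fsel_ref</code>, <code>fmin_sel</code>, <code>fmax_sel</code> and <code>accumulate</code> are all hypothetical, and the stand-in is just a ternary so the code compiles anywhere):</p>

```cpp
// Hypothetical stand-in for the hardware select: a >= 0 ? x : y.
constexpr float fsel_ref(float a, float x, float y)
{
    return a >= 0.0f ? x : y;
}

// b - a >= 0 means a <= b, so a is the smaller value.
constexpr float fmin_sel(float a, float b)
{
    return fsel_ref(b - a, a, b);
}

// a - b >= 0 means a >= b, so a is the larger value.
constexpr float fmax_sel(float a, float b)
{
    return fsel_ref(a - b, a, b);
}

// x += min(a,b) * max(c,d), written as a pure function of its inputs.
constexpr float accumulate(float x, float a, float b, float c, float d)
{
    return x + fmin_sel(a, b) * fmax_sel(c, d);
}
```

<p>With the real intrinsic in place of <code>fsel_ref</code>, each helper should reduce to a subtract followed by a select, though that depends on the compiler.</p>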
<p>Because math is costly, it may seem like a good idea to use an <code>if</code> statement to prevent unnecessary calculations from being performed, but this can be the wrong thing to do. <b>Sometimes using a floating-point branch to early-out of some calculations may actually <em>hurt</em> performance</b>. In some cases, the cost of the branch may be much greater than the cost of the calculation itself, and so if you&#8217;re using if/else to choose between performing one of two different calculations, it may be better to do <em>both</em> calculations and use fsel to choose between them afterwards. </p>
<p>Often the compiler can interleave the calculations so the cost of both is the same as doing only one, and you avoid the pipeline clear caused by the fcmp. For example, this code:</p>
<pre>float out[N], inA[N], inB[N], cond[N];
for (int i = 0 ; i &lt; N ; ++i )
{
  if ( cond[i] &gt;= 0 )
    {     out[i] = cond[i] + inA[i];   }
  else
    {     out[i] = cond[i] + inB[i];   }
}</pre>
<p>can be turned into:</p>
<pre>for (int i = 0 ; i &lt; N ; ++i )
{
    out[i] = fsel( cond[i], cond[i] + inA[i], cond[i] + inB[i] );
}</pre>
<p>In the top version, we choose to do one of two additions based on the sign of cond[i]. In the bottom version, we perform two additions and throw away one result, but even so it is much faster! When I tested 200,000,000 iterations of the above code, the if/else version took 5.243 seconds compared with 0.724 seconds for the fsel version: a <em><strong>7x </strong></em>speedup for <em>not</em> avoiding an addition!</p>
<p>Another example is if you have a large quantity of such assignments to do one after another, like<br />
<code>struct CFoo { float x, y, z, w; };<br />
// CFoo out, A, B<br />
// float cond[4]; (some number that acts as a selector)<br />
out.x += fsel( cond[0], A.x, B.x );<br />
out.y += fsel( cond[1], A.y, B.y );<br />
out.z += fsel( cond[2], A.z, B.z );<br />
out.w += fsel( cond[3], A.w, B.w );</code></p>
<p>In this case, because all of the fsel operations are independent (the result of one line does not depend on the result of another), they can all be scheduled to occur immediately after one another for a throughput of one per cycle. Had if/else been used instead (<em>e.g.</em> <code>if ( cond[0] &gt;= 0 ) out.x += A.x; else out.x += B.x;</code>), then each branch would have depended on the result of the preceding comparison, meaning that we would need to pay the full 60-cycle fcmp/branch penalty for each line before starting the next one.</p>
<p>By now you&#8217;re probably looking at the example above and thinking, &#8220;for four floats packed together like that, I could use a SIMD instruction to compare and assign them all at once,&#8221; and if so you&#8217;re absolutely right. Both VMX (on the PowerPC) and SSE (on the x86) also have conditional move instructions that work very similarly to the scalar float operations I&#8217;ve described above. Consult your documentation for the exact semantics, but in general they work by performing a comparison operation to create a mask which is then fed into a subsequent &#8220;select&#8221; operation.</p>
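<p>Here is a minimal sketch of that compare-then-select pattern using the SSE intrinsics from <code>&lt;xmmintrin.h&gt;</code>. The helper name <code>select4_ge_zero</code> and the <code>HAVE_SSE_SKETCH</code> macro are made up for this example, and the scalar fallback is only there so the code still builds on a machine without SSE.</p>

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

// For four packed floats: out[i] = cond[i] >= 0 ? a[i] : b[i].
// select4_ge_zero is a hypothetical helper name.
inline void select4_ge_zero(const float* cond, const float* a,
                            const float* b, float* out)
{
#ifdef HAVE_SSE_SKETCH
    const __m128 vc = _mm_loadu_ps(cond);
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    // Compare: each lane becomes all ones where cond >= 0, all zeros otherwise.
    const __m128 mask = _mm_cmpge_ps(vc, _mm_setzero_ps());
    // Select: (mask & a) | (~mask & b), with no branch anywhere.
    const __m128 picked = _mm_or_ps(_mm_and_ps(mask, va),
                                    _mm_andnot_ps(mask, vb));
    _mm_storeu_ps(out, picked);
#else
    // Scalar fallback with the same result, lane by lane.
    for (int i = 0; i < 4; ++i)
    {
        out[i] = cond[i] >= 0.0f ? a[i] : b[i];
    }
#endif
}
```

<p>The unaligned load and store are used here only to keep the sketch free of alignment requirements; real code would normally keep its data 16-byte aligned.</p>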
<p>The general principle is that it can be better to perform both arms of a calculation and then select between them afterwards than to use an if/else to perform only one arm of the calculation. This is even more true for SIMD units since the pipelines are often deeper.</p>
<h2>Integer Conditional Move</h2>
<p>All the discussion above has focused on floating-point conditional move operations. The x86 platform offers an analogous integer conditional move, called <code>CMOV</code>, but some PowerPC implementations lack this native opcode. Instead, you can manufacture one yourself by creating a mask and then using a bitwise-and to select between two values.</p>
<p>The key is to remember that for any signed int <em>a</em> &lt; 0, the leftmost (most significant) bit of <em>a</em> will be 1, and that the right-shift operator <code>&gt;&gt;</code> applied to a signed int is an arithmetic shift that preserves the sign of the word you are shifting. (Strictly speaking, C++ left this implementation-defined until C++20, but every mainstream compiler does it this way.) So, for a 32-bit int, <code>int mask = a &gt;&gt; 31;</code> means the same as <code>int mask = (a &lt; 0 ? -1 : 0)</code>.</p>
<p>Once you have your mask, you can combine it with a bitwise-and and an add to make your conditional move. Given that <em>x</em> + ( <em>y</em> &#8211; <em>x</em> ) = <em>y</em>, you can generate your integer select function like so:</p>
<pre>// if a &gt;= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a &gt;&gt; 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a &lt; 0) and 0x00000000 otherwise.
    return x + ((y - x) &amp; mask);
}</pre>
<p>The assembly for this works out to about four instructions. The performance gain usually isn&#8217;t as great as with fsel, because the integer pipeline tends to be much shorter and so there&#8217;s no pipeline clear on an integer compare followed by a branch; however, this construction does give the compiler more latitude in scheduling instructions, and lets you avoid the innate prediction penalty that always occurs with any branch.</p>
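<p>As with fsel, the payoff comes from building other operations on top of the select. A small sketch, assuming a 32-bit int, an arithmetic right shift, and subtractions that don&#8217;t overflow (the helpers <code>imax_sel</code> and <code>clamp_to_zero</code> are hypothetical names, not part of any library):</p>

```cpp
// if a >= 0, return x, else y (same construction as isel above).
// Assumes a 32-bit int and an arithmetic right shift on negative values.
constexpr int isel(int a, int x, int y)
{
    return x + ((y - x) & (a >> 31));
}

// Larger of a and b, assuming a - b does not overflow.
constexpr int imax_sel(int a, int b)
{
    return isel(a - b, a, b);
}

// v if v >= 0, else 0, assuming 0 - v does not overflow.
constexpr int clamp_to_zero(int v)
{
    return isel(v, v, 0);
}
```

<p>The overflow caveat matters: for operands near the limits of int, <code>a - b</code> or <code>y - x</code> can wrap, and then the sign bit no longer tells you what you wanted to know.</p>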
<p>In addition to this, most PowerPCs have a &#8220;count leading zeroes&#8221; instruction that lets a smart compiler do something really cool with <code>return a + ( (b - a) &amp; ((Q == W) ? 0 : -1) );</code>. Try it out in your disassembler and see!</p>
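<p>If you don&#8217;t have a disassembler handy, here is one way the trick can work, written out as a sketch in plain C++. This is an assumption about the general technique, not a dump of what any particular compiler emits: <code>clz32</code> is a hypothetical software stand-in for the hardware instruction, <code>Q</code> and <code>W</code> are assumed to be 32-bit unsigned values, and the other helper names are made up too.</p>

```cpp
// Software stand-in for a count-leading-zeroes instruction on a 32-bit word.
// Returns 32 for an input of zero; hardware does this in a single op.
constexpr int clz32(unsigned v)
{
    return v == 0u ? 32
                   : ((v & 0x80000000u) != 0u ? 0 : 1 + clz32(v << 1));
}

// 0 if q == w, -1 (all ones) otherwise, with no compare-and-branch.
// q ^ w is zero only when q == w, so clz32 returns 32 only in that case;
// shifting right by 5 turns 32 into 1 and any smaller count into 0.
constexpr int not_equal_mask(unsigned q, unsigned w)
{
    return (clz32(q ^ w) >> 5) - 1;
}

// Equivalent to: return a + ( (b - a) & ((q == w) ? 0 : -1) );
constexpr int select_on_not_equal(int a, int b, unsigned q, unsigned w)
{
    return a + ((b - a) & not_equal_mask(q, w));
}
```

<p>The recursive <code>clz32</code> is obviously not fast; it only models what the instruction returns so that the mask arithmetic can be checked.</p>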
<h2>Further Reading</h2>
<p><a href="http://www.cellperformance.com/articles/2006/07/tutorial_branch_elimination_pa.html">Mike Acton writes extensively about branch elimination at his CellPerformance blog</a>. In addition to <a href="http://www.cellperformance.com/articles/2006/04/more_techniques_for_eliminatin_1.html">fsel</a> specifically, he describes many other techniques of branchless programming.</p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/01/04/fcmp-conditional-moves-for-branchless-math/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Happy Holidays! (no post today)</title>
		<link>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/</link>
		<comments>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 17:00:43 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=99</guid>
		<description><![CDATA[This is the week when all of Santa&#8217;s Elves who work in game-making-places all over the world take a hopefully well deserved break, so no game development post today. Instead, here&#8217;s some nice audio books and podcasts that I&#8217;ve enjoyed on my commutes in the past year. Happy Holidays!

Escape Pod&#8217;s short-fiction podcast has not one [...]]]></description>
			<content:encoded><![CDATA[<p>This is the week when all of Santa&#8217;s Elves who work in game-making-places all over the world take a hopefully well deserved break, so no game development post today. Instead, here&#8217;s some nice audio books and podcasts that I&#8217;ve enjoyed on my commutes in the past year. Happy Holidays!</p>
<ul>
<li><a href="http://escapepod.org/">Escape Pod</a>&#8217;s short-fiction podcast has not <a href="http://escapepod.org/2007/12/25/ep138-in-the-late-december/">one</a> but <a href="http://escapepod.org/2008/12/25/ep184-as-dry-leaves-that-before-the-wild-hurricane-fly/">two</a> Christmas specials.</li>
<li>X Minus One, <a href="http://www.podango.com/podcast_episode/2067/91797/X_Minus_One_Podcast/X_Minus_One_67_Lifeboat_Mutiny"><em>The Lifeboat Mutiny</em></a>. 50s-era science fiction radio play: two prospectors find themselves in possession of a lifeboat that&#8217;s more than it seems. Worth listening to for the Drone National Anthem alone!</li>
<li>G.K. Chesterton&#8217;s  <a href="http://librivox.org/the-wisdom-of-father-brown-by-g-k-chesterton/"><em>The Wisdom Of Father Brown</em></a> is available as a public-domain audio book from LibriVox, read by <a href="http://web.mac.com/martin.clifton/iWeb/Martin%20Clifton/Home.html">Martin Clifton</a>. I found this collection of short stories about the eponymous priest-turned-private-detective to be fine treadmill listening.</li>
<li>WNYC&#8217;s science documentary special <em><a href="http://www.wnyc.org/shows/radiolab/">Radio Lab</a> </em>is always a singular pleasure.</li>
<li>And <a href="http://www.subterraneanpress.com/">Subterranean Press</a> was kind enough to produce a free audio book of Charlie Stross&#8217; <a href="http://subterraneanpress.com/index.php/magazine/winter-2008/audio-trunk-and-disorderly-by-charles-stross/"><em>Trunk and Disorderly</em></a>, a humorous adventure in a style that defies description (the closest anyone&#8217;s come is &#8220;Heinlein meets Wodehouse in space with a woolly mammoth&#8221;).</li>
</ul>
<p>I&#8217;ll be back next week with some thoughts on replacing floating-point conditionals with branchless math.</p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sentences That Should Be Carved Into Foreheads</title>
		<link>http://assemblyrequired.crashworks.org/2008/12/22/ea-stl-prevents-memory-leaks/</link>
		<comments>http://assemblyrequired.crashworks.org/2008/12/22/ea-stl-prevents-memory-leaks/#comments</comments>
		<pubDate>Mon, 22 Dec 2008 18:00:18 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[The War On malloc]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=92</guid>
		<description><![CDATA[Non-desktop platforms don't have paged memory. If the application exhausts memory, it dies.
The lack of paged memory means that memory fragmentation can kill an application. 
Therefore, game applications cannot leak memory. Ever. Period.]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html">Paul Pedriana&#8217;s article on Electronic Arts&#8217; custom</a> <a href="http://www.sgi.com/tech/stl/">STL</a> I came across this <a href="http://www.google.com/search?hl=en&amp;q=very+good+sentences+site%3Amarginalrevolution.com&amp;btnG=Search">Very Good Sentence</a> (well, okay, pair of sentences):</p>
<blockquote><p>Game applications cannot leak memory. If an application leaks even a small amount of memory, it eventually dies.</p></blockquote>
<p>I say <em>preach it</em>, brother. I can think of a couple of titles where memory leaks actually hurt a game&#8217;s sales. In fact I&#8217;ve been known to blame reckless use of <code>new</code> for the death of entire studios. <code>malloc()</code> — there&#8217;s no evil too great to lay at its feet.</p>
<p>Paul&#8217;s sentence comes in the context of a group of justifications for EA building its own Standard Template Library, which are so concisely spot-on that I&#8217;ll reproduce them here. Call them Paul&#8217;s Commandments:</p>
<p><span id="more-92"></span></p>
<blockquote>
<ul>
<li>No matter how powerful any game computer ever gets, it will never have any free memory or CPU cycles.</li>
<li>Game developers are very concerned about software performance and software development practices.</li>
<li>Game software often doesn&#8217;t use conventional synchronous disk IO such as &lt;stdio.h&gt; or &lt;fstream&gt; but uses asynchronous IO.</li>
<li>Game applications cannot leak memory. If an application leaks even a small amount of memory, it eventually dies.</li>
<li>Every byte of allocated memory must be accounted for and trackable. This is partly to assist in leak detection but is also to enforce budgeting.</li>
<li>Game software rarely uses system-provided heaps but uses custom heaps instead.</li>
<li> A lot of effort is expended in reducing memory fragmentation.</li>
<li>A lot of effort is expended in creating memory analysis tools and debugging heaps.</li>
<li>A lot of effort is expended in improving source and data build times.</li>
<li>Application code and libraries cannot be very slow in debug builds.</li>
<li>Memory allocation of any type is avoided to the extent possible.</li>
<li>Operator new overrides (class and global) are the rule and not the exception.</li>
<li>Use  of built-in global operator new is verboten, at least with shareable libraries.</li>
<li>Any memory a library allocates must be controllable by the user.</li>
<li>Game software must be savvy to non-default memory alignment requirements.</li>
<li>Memory pools are sometimes used in order to avoid fragmentation, even though they necessarily waste some memory themselves.</li>
<li>Branching (if/else/while/for/do) is avoided to the extent possible, especially mispredicted branches.</li>
<li>Virtual functions are avoided to the extent possible, especially in bottleneck code.</li>
<li>Exception handling is usually disabled.</li>
<li>RTTI is usually disabled or at least unused in shipping code.</li>
</ul>
</blockquote>
<p>You can find <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html">the whole article here</a>; it&#8217;s a worthwhile read. If you&#8217;re in a hurry, you can safely skip the large tables; the meat of the article is in &#8220;<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html#Motivation">Motivation</a>&#8221; and the sections following &#8220;<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html#game_software_issues">Game Software Issues</a>&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2008/12/22/ea-stl-prevents-memory-leaks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A better Windows environment variable editor</title>
		<link>http://assemblyrequired.crashworks.org/2008/12/17/rapid-environment-editor-better-than-windows-dialog-for-editing-path/</link>
		<comments>http://assemblyrequired.crashworks.org/2008/12/17/rapid-environment-editor-better-than-windows-dialog-for-editing-path/#comments</comments>
		<pubDate>Thu, 18 Dec 2008 04:13:04 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Productivity]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=89</guid>
		<description><![CDATA[Rapid Environment Editor is a free tool that lets you easily edit Windows environment variables such as PATH in a much better interface than Windows' built-in dialog.]]></description>
			<content:encoded><![CDATA[<p>Doesn&#8217;t Windows&#8217; built-in dialog for editing environment variables just <em>suck</em>? The one you get when you right-click &#8220;My Computer&#8221; and go to properties&rarr;advanced, I mean. It looks like a holdover from Windows 3.1 — a tiny, unresizable window that shows you at most <em>part</em> of five environment variables, and an &#8220;edit&#8221; button that leads to a similarly cramped and unresizable pair of forms which are never big enough.</p>
<p><img class="size-full wp-image-88 alignnone" title="windows_environment_dialog1" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/windows_environment_dialog1.png" alt="windows_environment_dialog1" width="384" height="430" /> <img class="alignnone size-full wp-image-86" title="windows_environment_dialog2" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/windows_environment_dialog2.png" alt="windows_environment_dialog2" width="347" height="147" /></p>
<p>Whenever I need to edit my <code>%PATH%</code> (which is often) I find myself copying the whole thing out of the Windows dialog, pasting it into Notepad, editing it there, and then copying-and-pasting back into the Windows dialog.</p>
<p>Enter <strong>the freeware </strong><a href="http://www.rapidee.com"><strong>Rapid Environment Editor</strong></a>, a sweet little tool that is infinitely better than the Windows built-in for editing <code>%PATH%</code> and all your other envvars without pain. Long semicolon-separated environment variables are displayed as drop-down trees (like my <code>PATH</code> below), and it even highlights broken folders with red text!</p>
<p style="text-align: center;"><a href="http://www.rapidee.com"><img class="size-full wp-image-87 aligncenter" title="rapid_environment_editor_dialog" src="http://assemblyrequired.crashworks.org/wp-content/uploads/2008/12/rapid_environment_editor_dialog.png" alt="rapid_environment_editor_dialog" width="564" height="824" /></a></p>
<p>It&#8217;s a quick 800 KB download and the best thing to happen to me today.</p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2008/12/17/rapid-environment-editor-better-than-windows-dialog-for-editing-path/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
