<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Some Assembly Required &#187; Uncategorized</title>
	<atom:link href="http://assemblyrequired.crashworks.org/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Wed, 04 Nov 2009 09:52:17 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Square Roots in vivo: normalizing vectors</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/</link>
		<comments>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 15:55:59 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=329</guid>
		<description><![CDATA[Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector [...]]]></description>
			<content:encoded><![CDATA[<p>Following <a href="http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/">my earlier article on timing various square-root functions on the x86</a>, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.</p>
<p>The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, this means that three out of four vectors <b>were not aligned</b> to a 16-byte boundary, which is significant for the SIMD test case as I had to use the <code>movups</code> unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.</p>
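<p>The alignment claim can be checked with a little arithmetic. The following is a minimal sketch, not part of the original test code: the helper names are hypothetical, and it assumes a 4-byte <code>float</code> and an array base that itself sits on a 16-byte boundary.</p>

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of the i-th packed 3-float vector from the start of the array.
inline std::size_t VectorOffsetBytes(std::size_t i)
{
    return i * 3 * sizeof(float);
}

// True if the i-th vector starts on a 16-byte boundary.
// Assumption: the array base address is itself 16-byte aligned.
inline bool IsVector16ByteAligned(std::size_t i)
{
    return VectorOffsetBytes(i) % 16 == 0;
}

// Counts how many of the first `count` vectors start on a 16-byte boundary.
inline std::size_t CountAlignedVectors(std::size_t count)
{
    std::size_t aligned = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (IsVector16ByteAligned(i))
            ++aligned;
    }
    return aligned;
}
```

<p>Under those assumptions only every fourth vector (indices 0, 4, 8, and so on) is aligned, which is why an unaligned load is needed for the general case.</p>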
<p>Each normalize function computed the reciprocal of the vector&#8217;s length, 1/&radic;(x<sup>2</sup> + y<sup>2</sup> + z<sup>2</sup>), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference was in how the reciprocal square root was computed:</p>
<ul>
<li>via the x87 FPU, by simply compiling <code>1.0f/sqrt( x*x + y*y + z*z )</code></li>
<li>via the SSE scalar unit, by compiling <code>1.0f/sqrt( x*x + y*y + z*z )</code> with the <a href="http://msdn.microsoft.com/en-us/library/7t5yh4fd(VS.80).aspx">/arch:SSE2</a> option set; this causes the compiler to issue a <code>sqrtss</code> followed by a <code>divss</code>; that is, it computes the square root and then divides one by it</li>
<li>via the SSE scalar unit, by using the estimated reciprocal square root intrinsic and then performing one step of Newton-Raphson iteration</li>
<li>via the SSE SIMD unit,  working on the whole vector at once</li>
</ul>
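<p>To illustrate what the Newton-Raphson step buys, here is a portable sketch that does not use the SSE intrinsic at all. <code>CrudeRSqrtEstimate</code> is a hypothetical stand-in for the hardware estimate: it simply perturbs the exact answer by a chosen relative error. <code>RefineRSqrt</code> applies the standard update y' = 0.5 &middot; y &middot; (3 &minus; x &middot; y<sup>2</sup>), which roughly squares the relative error of the estimate.</p>

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson refinement of an estimate y ~= 1/sqrt(x):
//   y' = 0.5 * y * (3 - x * y * y)
inline float RefineRSqrt(float x, float y)
{
    return 0.5f * y * (3.0f - x * y * y);
}

// Hypothetical stand-in for a low-precision hardware estimate: the exact
// reciprocal square root, deliberately off by the given relative error.
inline float CrudeRSqrtEstimate(float x, float relError)
{
    assert(x > 0.0f);
    return (1.0f / std::sqrt(x)) * (1.0f + relError);
}

// Relative error of `approx` against the library's 1/sqrt(x).
inline float RelativeError(float approx, float x)
{
    const float exact = 1.0f / std::sqrt(x);
    return std::fabs(approx - exact) / exact;
}
```

<p>Starting from an estimate that is off by about one part in a thousand, a single step should bring the error down to around a part in a million, which is the same effect the iteration has on the result of the estimate opcode.</p>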
<p>In all cases the results were accurate to 22 bits of precision. The results for 1,396,736 vector normalizations were:</p>
<div align="center" >
<table border class="padded">
<tr>
<th>Method</th>
<th>Total time</th>
<th>Time per vector</th>
</tr>
<tr>
<td>Compiler <code>1.0/sqrt(x)</code> <br />x87 FPU <code>FSQRT</code></td>
<td>52.469ms</td>
<td>37.6ns</td>
</tr>
<tr>
<td>Compiler <code>1.0/sqrt(x)</code> <br />SSE scalar <code>sqrtss</code></td>
<td>27.233ms</td>
<td>19.5ns</td>
</tr>
<tr>
<td>SSE <b>scalar</b> ops<br /><code>rsqrtss</code> with one NR step</td>
<td>21.631ms</td>
<td>15.5ns</td>
</tr>
<tr>
<td>SSE SIMD ops <br /><code>rsqrtss</code> with one NR step</td>
<td>20.034ms</td>
<td>14.3ns</td>
</tr>
</table>
</div>
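<p>For reference, the per-vector column is just the total time divided by the number of normalizations (682 vectors per pass, 2048 passes). A small sketch of that arithmetic, with hypothetical constant and function names:</p>

```cpp
#include <cassert>

// 2048 floats interpreted as packed 3-float vectors gives 682 whole vectors.
const long long kVectorsPerPass = 2048 / 3;
// The array is traversed 2048 times per timing case.
const long long kPasses = 2048;
const long long kTotalNormalizations = kVectorsPerPass * kPasses;

// Converts a total time in milliseconds into nanoseconds per vector.
inline double NanosecondsPerVector(double totalMs)
{
    assert(totalMs >= 0.0);
    return totalMs * 1.0e6 / static_cast<double>(kTotalNormalizations);
}
```

<p>Feeding the four totals from the table through <code>NanosecondsPerVector</code> reproduces the right-hand column after rounding to one decimal place.</p>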
<p>Two things jump out here. First, even when the square root op is surrounded by lots of other math &mdash; multiplies, adds, loads, stores &mdash; optimizations such as this can make a huge difference. It&#8217;s not just the cost of the sqrt itself, but also that it&#8217;s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it&#8217;s entirely completed. </p>
<p>Second, in this case, SIMD offers only a very modest benefit. That&#8217;s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what&#8217;s meant by &#8220;horizontal&#8221; SIMD computation: operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V = &lt;x,y,z&gt;, the sum x + y + z is <i>horizontal</i>, but with two vectors V<sub>1</sub> and V<sub>2</sub>, V<sub>3</sub> = &lt;x<sub>1</sub>+x<sub>2</sub>, y<sub>1</sub>+y<sub>2</sub>, z<sub>1</sub>+z<sub>2</sub>&gt; is <i>vertical</i>.) So it really doesn&#8217;t play to SIMD&#8217;s strengths at all.</p>
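<p>The horizontal/vertical distinction can be sketched in plain C++. <code>Vec4</code> below is a hypothetical stand-in for one four-lane SIMD register, not a real intrinsic type; the loops and sums only model which lanes talk to which.</p>

```cpp
#include <cassert>

// Hypothetical model of a four-lane SIMD register.
struct Vec4
{
    float lane[4];
};

// Vertical: lane i of a combines only with lane i of b, so all four
// additions are independent and map onto a single SIMD add.
inline Vec4 VerticalAdd(const Vec4 &a, const Vec4 &b)
{
    Vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Horizontal: lanes of the same register are combined with each other,
// so each addition depends on the result of the previous one.
inline float HorizontalSum3(const Vec4 &v)
{
    return v.lane[0] + v.lane[1] + v.lane[2];
}

// A 3D dot product is a vertical multiply followed by a horizontal sum.
inline float Dot3(const Vec4 &a, const Vec4 &b)
{
    Vec4 prod;
    for (int i = 0; i < 4; ++i)
        prod.lane[i] = a.lane[i] * b.lane[i];
    return HorizontalSum3(prod);
}
```

<p>Normalizing a single vector needs <code>Dot3</code> of the vector with itself, so the horizontal part cannot be avoided when working on one vector at a time.</p>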
<p>On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn&#8217;t to test performance in tight loops over packed data &mdash; it was to figure out the best way to do something like an angle check in the middle of a character&#8217;s AI, where you usually deal with one vector at a time.</p>
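<p>As a rough sketch of the four-at-a-time idea (this was not one of the timed cases), the data can be laid out so that each register would hold the same component of four different vectors. The <code>Vec3x4</code> type and <code>Normalize4</code> function are hypothetical, and the loop over lanes stands in for what would be single SIMD instructions.</p>

```cpp
#include <cassert>
#include <cmath>

// Four 3D vectors in "structure of arrays" form: lane i of x, y and z
// together make up vector i.
struct Vec3x4
{
    float x[4];
    float y[4];
    float z[4];
};

// Normalizes four vectors at once. Every step is vertical: lane i never
// reads another lane, so each line of the loop body corresponds to one
// four-wide multiply, add, or reciprocal square root.
inline void Normalize4(Vec3x4 &out, const Vec3x4 &in)
{
    for (int i = 0; i < 4; ++i)
    {
        const float lenSq = in.x[i] * in.x[i] + in.y[i] * in.y[i] + in.z[i] * in.z[i];
        assert(lenSq > 0.0f);  // zero-length vectors are not handled in this sketch
        const float rlen = 1.0f / std::sqrt(lenSq);
        out.x[i] = in.x[i] * rlen;
        out.y[i] = in.y[i] * rlen;
        out.z[i] = in.z[i] * rlen;
    }
}
```

<p>The catch is that gameplay code rarely has its vectors in this layout, which is the reason the tests here stick to one vector at a time.</p>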
<p>Source code for my testing functions is below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector&#8217;s length. The hand-written intrinsic versions probably aren&#8217;t totally optimal, but they ought to be good enough to make the point.<br />
<span id="more-329"></span></p>
<p>Naive vector normalize, x87 FPU or SSE scalar</p>
<div class="ddet_div" id="ddet1131721140">
<p><u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// Normalizes an assumed 3-element vector starting</span>
<span style="color: #666666;">// at pointer V, and returns the length of the original</span>
<span style="color: #666666;">// vector.</span>
<span style="color: #0000ff;">inline</span> <span style="color: #0000ff;">float</span> NaiveTestNormalize<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> l <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">+</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">+</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #000040;">*</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">const</span> <span style="color: #0000ff;">float</span> rsqt <span style="color: #000080;">=</span> <span style="color:#800080;">1.0f</span> <span style="color: #000040;">/</span> <span style="color: #0000dd;">sqrt</span><span style="color: #008000;">&#40;</span>l<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #000080;">=</span> vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #000040;">*</span> rsqt<span style="color: #008080;">;</span>
        <span style="color: #0000ff;">return</span> rsqt <span style="color: #000040;">*</span> l<span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p><u>Assembly (x87 FPU)</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">PROC</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize, COMDAT</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 396  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 397  :        const float rsqt = 1.0f / sqrt(l);</span>
<span style="color: #666666; font-style: italic;">; 398  :        vOut[0] = vIn[0] * rsqt;</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ecx</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">-</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">faddp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">2</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">faddp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fsqrt</span>
        <span style="color: #0000ff; font-weight: bold;">fld1</span>
        <span style="color: #0000ff; font-weight: bold;">fdivrp</span>  <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 399  :        vOut[1] = vIn[1] * rsqt;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 400  :        vOut[2] = vIn[2] * rsqt;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        <span style="color: #0000ff; font-weight: bold;">fstp</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 401  :        return rsqt * l;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fmulp</span>   <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">1</span><span style="color: #009900; font-weight: bold;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ST</span><span style="color: #009900; font-weight: bold;">&#40;</span><span style="color: #0000ff;">0</span><span style="color: #009900; font-weight: bold;">&#41;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 402  : }</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">ENDP</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p><u>Assembly (compiler-issued SSE scalar)</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_l$ = <span style="color: #339933;">-</span><span style="color: #0000ff;">4</span>                                                <span style="color: #666666; font-style: italic;">; size = 4</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_rsqt$ = <span style="color: #0000ff;">12</span>                                             <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">PROC</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize, COMDAT</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 392  : {</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ecx</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 393  :        const float l = vIn[0]*vIn[0] + vIn[1]*vIn[1] + vIn[2]*vIn[2];</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm1<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm2<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
<span style="color: #666666; font-style: italic;">; 394  :        const float rsqt = 1.0f / sqrt(l);</span>
<span style="color: #666666; font-style: italic;">; 395  :        vOut[0] = vIn[0] * rsqt;</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        movaps  xmm3<span style="color: #339933;">,</span> xmm2
        mulss   xmm3<span style="color: #339933;">,</span> xmm2
        movaps  xmm4<span style="color: #339933;">,</span> xmm1
        mulss   xmm4<span style="color: #339933;">,</span> xmm1
        addss   xmm3<span style="color: #339933;">,</span> xmm4
        movaps  xmm4<span style="color: #339933;">,</span> xmm0
        mulss   xmm4<span style="color: #339933;">,</span> xmm0
        addss   xmm3<span style="color: #339933;">,</span> xmm4
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _l$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3
        sqrtss  xmm4<span style="color: #339933;">,</span> xmm3   <span style="color: #666666; font-style: italic;">;; slow full-precision square root gets stored in xmm4</span>
        movss   xmm3<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> __real@3f800000  <span style="color: #666666; font-style: italic;">;; store 1.0 in xmm3</span>
        divss   xmm3<span style="color: #339933;">,</span> xmm4  <span style="color: #666666; font-style: italic;">;; divide 1.0 / xmm4 to get the reciprocal square root !?!</span>
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _rsqt$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3
&nbsp;
<span style="color: #666666; font-style: italic;">; 396  :        vOut[1] = vIn[1] * rsqt;</span>
<span style="color: #666666; font-style: italic;">; 397  :        vOut[2] = vIn[2] * rsqt;</span>
<span style="color: #666666; font-style: italic;">; 398  :        return rsqt * l;</span>
&nbsp;
        <span style="color: #0000ff; font-weight: bold;">fld</span>     <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _rsqt$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        mulss   xmm2<span style="color: #339933;">,</span> xmm3
        <span style="color: #0000ff; font-weight: bold;">fmul</span>    <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _l$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">esp</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
        mulss   xmm1<span style="color: #339933;">,</span> xmm3
        mulss   xmm0<span style="color: #339933;">,</span> xmm3
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm1
        movss   <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm0
&nbsp;
<span style="color: #666666; font-style: italic;">; 399  : }</span>
&nbsp;
        <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ecx</span>
        <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?TestNormalize@@YAMPIAMPIBM@Z <span style="color: #000000; font-weight: bold;">ENDP</span>                      <span style="color: #666666; font-style: italic;">; TestNormalize</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p></div></p>
<p>Vector normalize, hand-written SSE scalar by intrinsics</p>
<div class="ddet_div" id="ddet808415792">
<p><u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #666666;">// SSE scalar reciprocal sqrt using rsqrt op, plus one Newton-Raphson iteration</span>
<span style="color: #0000ff;">inline</span> __m128 SSERSqrtNR<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">const</span> __m128 x <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	__m128 recip <span style="color: #000080;">=</span> _mm_rsqrt_ss<span style="color: #008000;">&#40;</span> x <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>  <span style="color: #666666;">// &quot;estimate&quot; opcode</span>
	<span style="color: #0000ff;">const</span> <span style="color: #0000ff;">static</span> __m128 three <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span>, <span style="color: #0000dd;">3</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// aligned consts for fast load</span>
	<span style="color: #0000ff;">const</span> <span style="color: #0000ff;">static</span> __m128 half <span style="color: #000080;">=</span> <span style="color: #008000;">&#123;</span> <span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span>,<span style="color:#800080;">0.5</span> <span style="color: #008000;">&#125;</span><span style="color: #008080;">;</span>
	__m128 halfrecip <span style="color: #000080;">=</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> half, recip <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	__m128 threeminus_xrr <span style="color: #000080;">=</span> _mm_sub_ss<span style="color: #008000;">&#40;</span> three, _mm_mul_ss<span style="color: #008000;">&#40;</span> x, _mm_mul_ss <span style="color: #008000;">&#40;</span> recip, recip <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> halfrecip, threeminus_xrr <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
&nbsp;
<span style="color: #0000ff;">inline</span> __m128 SSE_ScalarTestNormalizeFast<span style="color: #008000;">&#40;</span>  <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        __m128 x <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        __m128 y <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        __m128 z <span style="color: #000080;">=</span> _mm_load_ss<span style="color: #008000;">&#40;</span><span style="color: #000040;">&amp;</span>vIn<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #0000ff;">const</span> __m128 l <span style="color: #000080;">=</span>  <span style="color: #666666;">// compute x*x + y*y + z*z</span>
                _mm_add_ss<span style="color: #008000;">&#40;</span>
                 _mm_add_ss<span style="color: #008000;">&#40;</span> _mm_mul_ss<span style="color: #008000;">&#40;</span>x,x<span style="color: #008000;">&#41;</span>,
                             _mm_mul_ss<span style="color: #008000;">&#40;</span>y,y<span style="color: #008000;">&#41;</span>
                            <span style="color: #008000;">&#41;</span>,
                 _mm_mul_ss<span style="color: #008000;">&#40;</span> z, z <span style="color: #008000;">&#41;</span>
                <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
&nbsp;
        <span style="color: #0000ff;">const</span> __m128 rsqt <span style="color: #000080;">=</span> SSERSqrtNR<span style="color: #008000;">&#40;</span> l <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, x <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, y <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_store_ss<span style="color: #008000;">&#40;</span> <span style="color: #000040;">&amp;</span>vOut<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span> , _mm_mul_ss<span style="color: #008000;">&#40;</span> rsqt, z <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
        <span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> l , rsqt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
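<p>To make the refinement step in SSERSqrtNR concrete, here is a minimal portable sketch of the same arithmetic in plain scalar C++. The names CrudeRsqrt, RefineRsqrt, and RelativeError are invented for this illustration. CrudeRsqrt is only a stand-in for the rsqrtss estimate: it takes the exact value and injects a 0.1% error on purpose, which is an assumption chosen to make the effect visible, not a model of the hardware.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the rsqrtss estimate: the exact reciprocal
// square root with a deliberate 0.1% relative error injected.
inline float CrudeRsqrt(float x)
{
    return (1.0f / std::sqrt(x)) * 1.001f;
}

// One Newton-Raphson iteration on f(r) = 1/(r*r) - x.
// Same arithmetic as SSERSqrtNR: 0.5 * r * (3 - x*r*r).
inline float RefineRsqrt(float x, float r)
{
    assert(x > 0.0f && r > 0.0f); // only meaningful for positive inputs
    return 0.5f * r * (3.0f - x * r * r);
}

// Relative error of an approximation to 1/sqrt(x), measured against
// a double-precision reference.
inline float RelativeError(float approx, float x)
{
    const double exact = 1.0 / std::sqrt(static_cast<double>(x));
    return static_cast<float>(std::fabs(approx - exact) / exact);
}
```

<p>For an input such as 14 (the squared length of the vector 1, 2, 3), RelativeError( RefineRsqrt( 14.0f, CrudeRsqrt( 14.0f ) ), 14.0f ) should come out far smaller than the error of the crude estimate, because each Newton-Raphson step roughly squares the relative error.</p>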

<p><u>Assembly</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">PROC</span> <span style="color: #666666; font-style: italic;">; SSE_ScalarTestNormalizeFast, COMDAT</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ebp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">esp</span>
    <span style="color: #00007f; font-weight: bold;">and</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000ff;">16</span>                                <span style="color: #666666; font-style: italic;">; fffffff0H</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    movss   xmm3<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    movaps  xmm7<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?three@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    movaps  xmm2<span style="color: #339933;">,</span> xmm0
    movss   xmm0<span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movaps  xmm4<span style="color: #339933;">,</span> xmm0
    movaps  xmm0<span style="color: #339933;">,</span> xmm2
    mulss   xmm0<span style="color: #339933;">,</span> xmm2
    movaps  xmm1<span style="color: #339933;">,</span> xmm3
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    addss   xmm0<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> xmm4
    mulss   xmm1<span style="color: #339933;">,</span> xmm4
    addss   xmm0<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> xmm0
    rsqrtss xmm1<span style="color: #339933;">,</span> xmm1
    movaps  xmm5<span style="color: #339933;">,</span> xmm1
    mulss   xmm1<span style="color: #339933;">,</span> xmm5
    movaps  xmm6<span style="color: #339933;">,</span> xmm0
    mulss   xmm6<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?half@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1<span style="color: #339933;">,</span> xmm5
    subss   xmm7<span style="color: #339933;">,</span> xmm6
    mulss   xmm1<span style="color: #339933;">,</span> xmm7
    movaps  xmm5<span style="color: #339933;">,</span> xmm1
    mulss   xmm5<span style="color: #339933;">,</span> xmm2
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm5
    movaps  xmm2<span style="color: #339933;">,</span> xmm1
    mulss   xmm2<span style="color: #339933;">,</span> xmm3
&nbsp;
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">4</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
    movaps  xmm2<span style="color: #339933;">,</span> xmm1
    mulss   xmm2<span style="color: #339933;">,</span> xmm4
&nbsp;
    movss   XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #339933;">+</span><span style="color: #0000ff;">8</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm2
&nbsp;
    mulss   xmm0<span style="color: #339933;">,</span> xmm1
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?SSE_ScalarTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">ENDP</span> <span style="color: #666666; font-style: italic;">; SSE_ScalarTestNormalizeFast</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>

<p></div></p>
<p>Vector normalize, hand-written SSE SIMD by intrinsics</p>
<div class="ddet_div" id="ddet1323975038">
<p><u>Source</u></p>

<div class="wp_syntax"><div class="code"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">inline</span> __m128 SSE_SIMDTestNormalizeFast<span style="color: #008000;">&#40;</span> <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vOut, <span style="color: #0000ff;">float</span> <span style="color: #000040;">*</span> RESTRICT vIn  <span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
        <span style="color: #666666;">// load as a SIMD vector (reads four floats, so vIn&#91;3&#93; must be readable memory)</span>
        <span style="color: #0000ff;">const</span> __m128 vec <span style="color: #000080;">=</span> _mm_loadu_ps<span style="color: #008000;">&#40;</span>vIn<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #666666;">// compute a dot product by computing the square, and</span>
        <span style="color: #666666;">// then rotating the vector and adding, so that the</span>
        <span style="color: #666666;">// dot ends up in the low term (used by the scalar ops)</span>
        __m128 dot <span style="color: #000080;">=</span> _mm_mul_ps<span style="color: #008000;">&#40;</span> vec, vec <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #666666;">// rotate x under y and add together   </span>
        __m128 rotated <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> dot, dot, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>,<span style="color: #0000dd;">3</span>,<span style="color: #0000dd;">2</span>,<span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// YZWX ( shuffle macro is high to low word )</span>
        dot <span style="color: #000080;">=</span> _mm_add_ss<span style="color: #008000;">&#40;</span> dot, rotated <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// x^2 + y^2 in the low word</span>
        rotated <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> rotated, rotated, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>,<span style="color: #0000dd;">3</span>,<span style="color: #0000dd;">2</span>,<span style="color: #0000dd;">1</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// ZWXY</span>
        dot <span style="color: #000080;">=</span> _mm_add_ss<span style="color: #008000;">&#40;</span> dot, rotated <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// x^2 + y^2 + z^2 in the low word</span>
&nbsp;
        __m128 recipsqrt <span style="color: #000080;">=</span> SSERSqrtNR<span style="color: #008000;">&#40;</span> dot <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// contains reciprocal square root in low term</span>
        recipsqrt <span style="color: #000080;">=</span> _mm_shuffle_ps<span style="color: #008000;">&#40;</span> recipsqrt, recipsqrt, _MM_SHUFFLE<span style="color: #008000;">&#40;</span> <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span>, <span style="color: #0000dd;">0</span> <span style="color: #008000;">&#41;</span> <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> <span style="color: #666666;">// broadcast low term to all words</span>
&nbsp;
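        <span style="color: #666666;">// multiply 1/sqrt(dotproduct) against all vector components, and write back</span>
        <span style="color: #666666;">// (the store writes four floats, so vOut needs room for a fourth)</span>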
        <span style="color: #0000ff;">const</span> __m128 normalized <span style="color: #000080;">=</span> _mm_mul_ps<span style="color: #008000;">&#40;</span> vec, recipsqrt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        _mm_storeu_ps<span style="color: #008000;">&#40;</span>vOut, normalized<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
        <span style="color: #0000ff;">return</span> _mm_mul_ss<span style="color: #008000;">&#40;</span> dot , recipsqrt <span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>
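<p>The rotate-and-add trick in the source above can be followed lane by lane with a plain C++ emulation. This is a sketch only: Lanes, RotateDown, AddLow, DotLow, and NormalizeLanes are hypothetical names, a std::array stands in for an __m128 register, and an exact square root replaces SSERSqrtNR.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Lanes = std::array<float, 4>; // lane 0 is the "low word"

// Emulates _mm_shuffle_ps(v, v, _MM_SHUFFLE(0,3,2,1)): each lane takes the
// value of the lane above it, and lane 0 wraps around to the top.
inline Lanes RotateDown(const Lanes& v)
{
    return Lanes{ v[1], v[2], v[3], v[0] };
}

// Emulates _mm_add_ss(a, b): only lane 0 is summed; lanes 1..3 come from a.
inline Lanes AddLow(const Lanes& a, const Lanes& b)
{
    return Lanes{ a[0] + b[0], a[1], a[2], a[3] };
}

// Squared length of (x, y, z), accumulated in lane 0.
// The fourth lane is never added in, so its contents do not affect the result.
inline float DotLow(const Lanes& vec)
{
    Lanes dot{ vec[0] * vec[0], vec[1] * vec[1], vec[2] * vec[2], vec[3] * vec[3] };
    Lanes rotated = RotateDown(dot); // y^2, z^2, w^2, x^2
    dot = AddLow(dot, rotated);      // lane 0: x^2 + y^2
    rotated = RotateDown(rotated);   // z^2, w^2, x^2, y^2
    dot = AddLow(dot, rotated);      // lane 0: x^2 + y^2 + z^2
    return dot[0];
}

// Scales every lane by 1/sqrt(dot); lane 3 is scaled too, as in the SIMD version.
inline Lanes NormalizeLanes(const Lanes& vec)
{
    const float lengthSquared = DotLow(vec);
    assert(lengthSquared > 0.0f); // a zero vector has no direction
    const float recip = 1.0f / std::sqrt(lengthSquared);
    return Lanes{ vec[0] * recip, vec[1] * recip, vec[2] * recip, vec[3] * recip };
}
```

<p>Because only the low lane is ever summed, whatever happens to sit in the fourth lane is carried along and scaled but never contributes to the length. That is why the SIMD version can load a three-component vector with a four-float load, provided the fourth float is safe to read and write.</p>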

<p><u>Assembly</u></p>

<div class="wp_syntax"><div class="code"><pre class="asm" style="font-family:monospace;">_TEXT   <span style="color: #000000; font-weight: bold;">SEGMENT</span>
_vOut$ = <span style="color: #0000ff;">8</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
_vIn$ = <span style="color: #0000ff;">12</span>                                              <span style="color: #666666; font-style: italic;">; size = 4</span>
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">PROC</span>   <span style="color: #666666; font-style: italic;">; SSE_SIMDTestNormalizeFast, COMDAT</span>
&nbsp;
&nbsp;
    <span style="color: #00007f; font-weight: bold;">push</span>    <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ebp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">esp</span>
    <span style="color: #00007f; font-weight: bold;">and</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #0000ff;">16</span>                                <span style="color: #666666; font-style: italic;">; fffffff0H</span>
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">eax</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vIn$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movups  xmm2<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">eax</span><span style="color: #009900; font-weight: bold;">&#93;</span> <span style="color: #666666; font-style: italic;">;; load the input vector</span>
    movaps  xmm5<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?three@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B <span style="color: #666666; font-style: italic;">;; load the constant &quot;3&quot;</span>
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">ecx</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">DWORD</span> <span style="color: #000000; font-weight: bold;">PTR</span> _vOut$<span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ebp</span><span style="color: #009900; font-weight: bold;">&#93;</span>
    movaps  xmm0<span style="color: #339933;">,</span> xmm2
    mulps   xmm0<span style="color: #339933;">,</span> xmm2
    movaps  xmm1<span style="color: #339933;">,</span> xmm0
    shufps  xmm1<span style="color: #339933;">,</span> xmm0<span style="color: #339933;">,</span> <span style="color: #0000ff;">57</span>	<span style="color: #666666; font-style: italic;">; shuffle to YZWX</span>
    addss   xmm0<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; add y^2 to low word of xmm0</span>
    shufps  xmm1<span style="color: #339933;">,</span> xmm1<span style="color: #339933;">,</span> <span style="color: #0000ff;">57</span>	<span style="color: #666666; font-style: italic;">; shuffle to ZWXY</span>
    addss   xmm0<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; add z^2 to low word of xmm0</span>
&nbsp;
    movaps  xmm1<span style="color: #339933;">,</span> xmm0        
    rsqrtss xmm1<span style="color: #339933;">,</span> xmm1      <span style="color: #666666; font-style: italic;">; reciprocal square root estimate</span>
    movaps  xmm3<span style="color: #339933;">,</span> xmm1
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    movaps  xmm4<span style="color: #339933;">,</span> xmm0
    mulss   xmm4<span style="color: #339933;">,</span> xmm1
    movaps  xmm1<span style="color: #339933;">,</span> XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> ?half@?<span style="color: #0000ff;">1</span>??SSERSqrtNR@@YA?AT__m128@@T2@@Z@4T2@B
    mulss   xmm1<span style="color: #339933;">,</span> xmm3
    subss   xmm5<span style="color: #339933;">,</span> xmm4
    mulss   xmm1<span style="color: #339933;">,</span> xmm5      <span style="color: #666666; font-style: italic;">; Newton-Raphson finishes here; 1/sqrt(dot) is in xmm1's low word</span>
&nbsp;
    shufps  xmm1<span style="color: #339933;">,</span> xmm1<span style="color: #339933;">,</span> <span style="color: #0000ff;">0</span>   <span style="color: #666666; font-style: italic;">; broadcast so that xmm1 has 1/sqrt(dot) in all words</span>
    movaps  xmm3<span style="color: #339933;">,</span> xmm1
    mulps   xmm3<span style="color: #339933;">,</span> xmm2      <span style="color: #666666; font-style: italic;">; multiply all words of original vector by 1/sqrt(dot)</span>
    movups  XMMWORD <span style="color: #000000; font-weight: bold;">PTR</span> <span style="color: #009900; font-weight: bold;">&#91;</span><span style="color: #00007f;">ecx</span><span style="color: #009900; font-weight: bold;">&#93;</span><span style="color: #339933;">,</span> xmm3   <span style="color: #666666; font-style: italic;">; unaligned save to memory</span>
&nbsp;
	<span style="color: #666666; font-style: italic;">; return dot * 1 / sqrt(dot) == sqrt(dot) == length of vector</span>
    mulss   xmm0<span style="color: #339933;">,</span> xmm1
&nbsp;
    <span style="color: #00007f; font-weight: bold;">mov</span>     <span style="color: #00007f;">esp</span><span style="color: #339933;">,</span> <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">pop</span>     <span style="color: #00007f;">ebp</span>
    <span style="color: #00007f; font-weight: bold;">ret</span>     <span style="color: #0000ff;">0</span>
?SSE_SIMDTestNormalizeFast@@YA?AT__m128@@PIAM0@Z <span style="color: #000000; font-weight: bold;">ENDP</span>   <span style="color: #666666; font-style: italic;">; SSE_SIMDTestNormalizeFast</span>
_TEXT   <span style="color: #000000; font-weight: bold;">ENDS</span></pre></div></div>
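<p>The immediates 57 and 0 passed to shufps in the listing above are what the _MM_SHUFFLE arguments in the source evaluate to. A small sketch of that encoding follows; ShuffleImmediate and SourceLane are hypothetical helper names written for this note (the real macro lives in xmmintrin.h).</p>

```cpp
// Same bit layout as the _MM_SHUFFLE macro: four 2-bit lane selectors,
// listed from the highest destination lane down to the lowest.
constexpr int ShuffleImmediate(int lane3, int lane2, int lane1, int lane0)
{
    return (lane3 << 6) | (lane2 << 4) | (lane1 << 2) | lane0;
}

// Decodes which source lane feeds a given destination lane (0 = low word).
constexpr int SourceLane(int immediate, int destLane)
{
    return (immediate >> (2 * destLane)) & 3;
}

// The rotate used twice in the dot product: low word takes y, top word takes x.
static_assert(ShuffleImmediate(0, 3, 2, 1) == 57, "rotate encodes as 57");
static_assert(SourceLane(57, 0) == 1, "low word is fed from lane 1 (y)");
static_assert(SourceLane(57, 3) == 0, "top word is fed from lane 0 (x)");

// The broadcast of the low word to all four lanes.
static_assert(ShuffleImmediate(0, 0, 0, 0) == 0, "broadcast encodes as 0");
```

<p>Reading the selectors from low lane to high lane gives 1, 2, 3, 0, which is the YZWX order noted in the comments.</p>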

<p></div></p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Happy Holidays! (no post today)</title>
		<link>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/</link>
		<comments>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 17:00:43 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=99</guid>
		<description><![CDATA[This is the week when all of Santa&#8217;s Elves who work in game-making-places all over the world take a hopefully well-deserved break, so no game development post today. Instead, here are some nice audio books and podcasts that I&#8217;ve enjoyed on my commutes in the past year. Happy Holidays!

Escape Pod&#8217;s short-fiction podcast has not one [...]]]></description>
			<content:encoded><![CDATA[<p>This is the week when all of Santa&#8217;s Elves who work in game-making-places all over the world take a hopefully well-deserved break, so no game development post today. Instead, here are some nice audio books and podcasts that I&#8217;ve enjoyed on my commutes in the past year. Happy Holidays!</p>
<ul>
<li><a href="http://escapepod.org/">Escape Pod</a>&#8217;s short-fiction podcast has not <a href="http://escapepod.org/2007/12/25/ep138-in-the-late-december/">one</a> but <a href="http://escapepod.org/2008/12/25/ep184-as-dry-leaves-that-before-the-wild-hurricane-fly/">two</a> Christmas specials.</li>
<li>X Minus One, <a href="http://www.podango.com/podcast_episode/2067/91797/X_Minus_One_Podcast/X_Minus_One_67_Lifeboat_Mutiny"><em>The Lifeboat Mutiny</em></a>. 50s-era science fiction radio play: two prospectors find themselves in possession of a lifeboat that&#8217;s more than it seems. Worth listening to for the Drone National Anthem alone!</li>
<li>G.K. Chesterton&#8217;s <a href="http://librivox.org/the-wisdom-of-father-brown-by-g-k-chesterton/"><em>The Wisdom Of Father Brown</em></a> is available as a public-domain audio book from LibriVox, read by <a href="http://web.mac.com/martin.clifton/iWeb/Martin%20Clifton/Home.html">Martin Clifton</a>. I found this collection of short stories about the eponymous priest-turned-private-detective to be fine treadmill listening.</li>
<li>WNYC&#8217;s science documentary special <em><a href="http://www.wnyc.org/shows/radiolab/">Radio Lab</a></em> is always a singular pleasure.</li>
<li>And <a href="http://www.subterraneanpress.com/">Subterranean Press</a> was kind enough to produce a free audio book of Charlie Stross&#8217; <a href="http://subterraneanpress.com/index.php/magazine/winter-2008/audio-trunk-and-disorderly-by-charles-stross/"><em>Trunk and Disorderly</em></a>, a humorous adventure in a style that defies description (the closest anyone&#8217;s come is &#8220;Heinlein meets Wodehouse in space with a woolly mammoth&#8221;).</li>
</ul>
<p>I&#8217;ll be back next week with some thoughts on replacing floating-point conditionals with branchless math.</p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2008/12/29/happy-holidays-08/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A really specific Facebook phishing virus?</title>
		<link>http://assemblyrequired.crashworks.org/2008/07/27/a-really-specific-facebook-phishing-virus/</link>
		<comments>http://assemblyrequired.crashworks.org/2008/07/27/a-really-specific-facebook-phishing-virus/#comments</comments>
		<pubDate>Sun, 27 Jul 2008 10:33:15 +0000</pubDate>
		<dc:creator>Elan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://assemblyrequired.wordpress.com/?p=10</guid>
		<description><![CDATA[This message I received on Facebook shows how phishers are using social networks to make their messages appear to come from a legitimate, trusted source.]]></description>
			<content:encoded><![CDATA[<p>I just received a message through Facebook that shows how malware authors can be really, really specific with their attack vectors, and how they exploit social networks to make their messages appear to come from a legitimate, trusted source.</p>
<p>We&#8217;ve all received links to malware websites in email, of course, and usually we can reject them out of hand because the sender is obviously fake. If the sender&#8217;s name is someone you actually know and trust, you&#8217;re more likely to open the email, but, knowing how easily email headers can be forged, you might still be a little suspicious. But Facebook messages require someone to have logged in and authenticated: if I get a message from my friend Tim, it means that Tim has actually gone to facebook.com and opened up his little message thingy and typed something out to me.</p>
<p>Or, at least, Tim&#8217;s browser has.</p>
<p>I got this Wall post on Facebook earlier today from an old friend I&#8217;ve had in my list since forever.</p>
<p style="text-align:center;"><img class="aligncenter" src="http://www.collabi.net/journalpix/facebook_hack_1_m.jpg" alt="I got a free iPhone! Go to (URL) to get yours!" width="593" height="168" /></p>
<p>At first I was a little confused, because I couldn&#8217;t see how a phisher could forge a Facebook message, but it was still suspicious enough that I accessed the linked website with <a href="http://www.python.org/doc/lib/module-urllib.html">kid gloves</a>, and indeed, it&#8217;s a webpage that hosts the JavaScript malware shown below the jump (posted as an image for your safety). I didn&#8217;t bother to decode exactly what it does, but it clearly decrypts some kind of encrypted exploit in the browser and executes it.</p>
<p>There are a couple of interesting things about this: first, the incredible specificity of this virus, built to maximize the chances that its message would appear to come to me from a trusted source. Unless the virus author has hacked into Facebook&#8217;s back end, the only way this could work is if the virus snagged my friend Tim&#8217;s Facebook password, then logged into his account on its own, accessed his friends list, then mechanically transmitted that message to all those friends. This is a lot of specific code to write, stuff that reads out of Facebook&#8217;s HTML and knows how to find the friends list there, and then knows how to navigate the website to send out messages. It&#8217;s a lot of work to take advantage of <em>one</em> social networking site, which goes to show how valuable it is to take advantage of our own assumptions of trust in our friends (or how little virus writers value their time).</p>
<p>Second: even though the actual malware is hosted on the Google-owned <a href="http://en.wikipedia.org/wiki/Blogspot">Blogspot</a> service, <a href="http://www.google.com/safebrowsing/diagnostic?site=http://shelleyhipysi.blogspot.com/">Google&#8217;s own malware-detecting tools don&#8217;t list it as malicious</a>. In fact, when I tested the site with <a href="http://www.google.com/safebrowsing">Google Safe Browsing</a>, it told me &#8220;This site is not listed as suspicious&#8221; and &#8220;Google has not visited this site within the past 90 days&#8221;, which is to say that Google can&#8217;t even patrol <em>its own webhosting service</em> for exploits. Maybe that explains why <a href="http://www.pcpro.co.uk/news/214371/google-blogger-hosts-2-of-worlds-malware.html">Google hosts 2% of the world&#8217;s malware</a>.</p>
<p>Here&#8217;s the source code for the exploiting Blogspot page that the URL goes to, in case you feel like figuring out what it does. I&#8217;m posting it as an image to prevent any chance of that JavaScript actually being parsed by your browser.</p>
<p><span id="more-20"></span></p>
<p><img class="alignnone" src="http://www.collabi.net/journalpix/facebook_hack_2.png" alt="HTML and JavaScript source code." width="442" height="326" /></p>
]]></content:encoded>
			<wfw:commentRss>http://assemblyrequired.crashworks.org/2008/07/27/a-really-specific-facebook-phishing-virus/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
