<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Timing square root</title>
	<atom:link href="http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Wed, 28 Dec 2011 10:00:34 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Wrong On The Internet! &#124; Rambling Llamas</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-6173</link>
		<dc:creator>Wrong On The Internet! &#124; Rambling Llamas</dc:creator>
		<pubDate>Thu, 03 Nov 2011 21:58:03 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-6173</guid>
		<description>[...] this guy went into it in much more thorough detail if you&#8217;re interested!)   This entry was posted in [...]</description>
		<content:encoded><![CDATA[<p>[...] this guy went into it in much more thorough detail if you&#8217;re interested!)   This entry was posted in [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bruce Dawson</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-6070</link>
		<dc:creator>Bruce Dawson</dc:creator>
		<pubDate>Sun, 23 Oct 2011 05:22:12 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-6070</guid>
		<description>One part of the reason that the x87 square root is slower is that it is probably calculating square root to double precision. The x87 hardware defaults to calculating square root to 80-bit precision; the C run-time changes the setting so that it calculates to 64-bit precision, but if you change it to 32-bit precision (controlfp) then it will be a bit faster. Square root and float-divide timings on x87 are both affected by the selected precision. I don&#039;t think any other operations are.

Of course, changing a global (well, global per-thread) setting in order to get a local optimization is pretty crazy, so your points stand.</description>
		<content:encoded><![CDATA[<p>One part of the reason that the x87 square root is slower is that it is probably calculating square root to double precision. The x87 hardware defaults to calculating square root to 80-bit precision; the C run-time changes the setting so that it calculates to 64-bit precision, but if you change it to 32-bit precision (controlfp) then it will be a bit faster. Square root and float-divide timings on x87 are both affected by the selected precision. I don&#8217;t think any other operations are.</p>
<p>Of course, changing a global (well, global per-thread) setting in order to get a local optimization is pretty crazy, so your points stand.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jalf</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5272</link>
		<dc:creator>jalf</dc:creator>
		<pubDate>Thu, 11 Nov 2010 14:18:15 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5272</guid>
		<description>Did you ever get around to cleaning up the source code?</description>
		<content:encoded><![CDATA[<p>Did you ever get around to cleaning up the source code?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Yuhong Bao</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5243</link>
		<dc:creator>Yuhong Bao</dc:creator>
		<pubDate>Mon, 20 Sep 2010 07:14:03 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5243</guid>
		<description>&quot;If your customer’s PC has DirectX 9, it has SSE2.&quot;
Not necessarily, as the graphics card choice is independent of the CPU choice.</description>
		<content:encoded><![CDATA[<p>&#8220;If your customer’s PC has DirectX 9, it has SSE2.&#8221;<br />
Not necessarily, as the graphics card choice is independent of the CPU choice.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bit operations on different datatypes</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5230</link>
		<dc:creator>Bit operations on different datatypes</dc:creator>
		<pubDate>Wed, 08 Sep 2010 20:20:39 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5230</guid>
		<description>[...]  [...]</description>
		<content:encoded><![CDATA[<p>[...]  [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5165</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Wed, 24 Feb 2010 13:38:22 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5165</guid>
		<description>Hello Elan,

Thank you for the answer.

Since I posted my comment, I came across http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx so I think I&#039;m going to implement my high precision timer using QPC on Windows and gettimeofday on Linux and Mac.

Or the easy way seems to be Boost.PTime</description>
		<content:encoded><![CDATA[<p>Hello Elan,</p>
<p>Thank you for the answer.</p>
<p>Since I posted my comment, I came across <a href="http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx" rel="nofollow">http://msdn.microsoft.com/en-us/library/bb173458(VS.85).aspx</a> so I think I&#8217;m going to implement my high precision timer using QPC on Windows and gettimeofday on Linux and Mac.</p>
<p>Or the easy way seems to be Boost.PTime</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5163</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Tue, 23 Feb 2010 04:28:21 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5163</guid>
		<description>We use rdtsc to implement our cycle counter, since it seems to have the best resolution of any of the timers available.

It&#039;s best to put target code in a loop for the same reason you would weigh rice by the thousand rather than one grain at a time on your kitchen scale: systematic and random error.

First, there is a certain &lt;a href=&quot;http://en.wikipedia.org/wiki/Systematic_error&quot; rel=&quot;nofollow&quot;&gt;systematic error&lt;/a&gt; in querying the cycle counter: the StartCycleCounter() inline function call and the rdtsc op itself have some latency, and the timer is probably only accurate to within a couple of nanoseconds, so measuring a single iteration of an 86ns operation would have a large relative error. On the other hand, an error of 10&lt;sup&gt;-8&lt;/sup&gt; seconds in a 10&lt;sup&gt;-3&lt;/sup&gt; second measurement is relatively much smaller, and so the result is more accurate.

Also, timings &lt;em&gt;in vivo&lt;/em&gt; can be &lt;a href=&quot;http://en.wikipedia.org/wiki/Random_error&quot; rel=&quot;nofollow&quot;&gt;messy&lt;/a&gt;: any single iteration of the loop might take a little longer than expected because of other threads, memory bus contention, clock variability, operating system intervention, even CPU temperature. Taking multiple measurements, or a single measurement of multiple iterations, improves statistical significance and narrows the &lt;a href=&quot;http://en.wikipedia.org/wiki/Confidence_interval&quot; rel=&quot;nofollow&quot;&gt;error bars&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>We use rdtsc to implement our cycle counter, since it seems to have the best resolution of any of the timers available.</p>
<p>It&#8217;s best to put target code in a loop for the same reason you would weigh rice by the thousand rather than one grain at a time on your kitchen scale: systematic and random error.</p>
<p>First, there is a certain <a href="http://en.wikipedia.org/wiki/Systematic_error" rel="nofollow">systematic error</a> in querying the cycle counter: the StartCycleCounter() inline function call and the rdtsc op itself have some latency, and the timer is probably only accurate to within a couple of nanoseconds, so measuring a single iteration of an 86ns operation would have a large relative error. On the other hand, an error of 10<sup>-8</sup> seconds in a 10<sup>-3</sup> second measurement is relatively much smaller, and so the result is more accurate.</p>
<p>Also, timings <em>in vivo</em> can be <a href="http://en.wikipedia.org/wiki/Random_error" rel="nofollow">messy</a>: any single iteration of the loop might take a little longer than expected because of other threads, memory bus contention, clock variability, operating system intervention, even CPU temperature. Taking multiple measurements, or a single measurement of multiple iterations, improves statistical significance and narrows the <a href="http://en.wikipedia.org/wiki/Confidence_interval" rel="nofollow">error bars</a>.</p>
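<p>A minimal sketch of the loop-timing idea above, in C. It assumes the standard <code>clock()</code> function as a stand-in for the cycle counter (it is much coarser than rdtsc, which makes the point about timer granularity even more visible). The names <code>average_ns_per_call</code> and <code>workload</code> are hypothetical, and this is not the code used for the measurements in the post.</p>

```c
#include <assert.h>
#include <time.h>

/* Times `iterations` back-to-back calls of fn with a single start/stop
   pair, so the cost and granularity of reading the timer are spread
   across every call instead of being charged to one of them. */
double average_ns_per_call(void (*fn)(void), unsigned long iterations)
{
    assert(iterations > 0);

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; ++i)
        fn();
    clock_t stop = clock();

    /* Convert clock ticks to nanoseconds, then average over the loop. */
    double total_ns = (double)(stop - start) * 1e9 / (double)CLOCKS_PER_SEC;
    return total_ns / (double)iterations;
}

/* Hypothetical workload: the volatile store keeps the compiler from
   optimizing the arithmetic away, and `calls` counts the invocations. */
volatile float sink = 2.0f;
unsigned long calls = 0;

void workload(void)
{
    sink = sink * 0.5f + 1.0f;
    ++calls;
}
```

<p>Calling <code>average_ns_per_call(workload, 1000000)</code> pays for the two <code>clock()</code> reads once per million calls; timing a single call the same way would mostly measure the timer itself.</p>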
]]></content:encoded>
	</item>
	<item>
		<title>By: Gregory</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5155</link>
		<dc:creator>Gregory</dc:creator>
		<pubDate>Wed, 17 Feb 2010 19:14:26 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5155</guid>
		<description>I&#039;m curious about what&#039;s behind StartClockCycleCounter(); and StopClockCycleCounter();

What&#039;s the best way to benchmark a portion of code on PC? QueryPerformanceCounter (Windows only), RDTSC? Something else?

And is there a point putting the target code inside a loop?</description>
		<content:encoded><![CDATA[<p>I&#8217;m curious about what&#8217;s behind StartClockCycleCounter(); and StopClockCycleCounter();</p>
<p>What&#8217;s the best way to benchmark a portion of code on PC? QueryPerformanceCounter (Windows only), RDTSC? Something else?</p>
<p>And is there a point putting the target code inside a loop?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5081</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Thu, 29 Oct 2009 22:25:49 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5081</guid>
		<description>Elias: 4096 loops over 4096 floats with &lt;code&gt;x = powf(x, 0.5f)&lt;/code&gt; took 1449.538ms, or 86.4ns/float. This is 3.6 times worse than even the compiler&#039;s naive x87 &lt;code&gt;sqrt(x)&lt;/code&gt;, and twenty-seven times worse than rsqrtss with one step of Newton-Raphson iteration (which is equally accurate). Taking an exponent is a function call, and a very slow function call at that.</description>
		<content:encoded><![CDATA[<p>Elias: 4096 loops over 4096 floats with <code>x = powf(x, 0.5f)</code> took 1449.538ms, or 86.4ns/float. This is 3.6 times worse than even the compiler&#8217;s naive x87 <code>sqrt(x)</code>, and twenty-seven times worse than rsqrtss with one step of Newton-Raphson iteration (which is equally accurate). Taking an exponent is a function call, and a very slow function call at that.</p>
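<p>For reference, a sketch in C of what one Newton-Raphson step on a reciprocal square root estimate looks like. The function names are hypothetical, and the estimate is passed in as a plain float rather than taken from the rsqrtss instruction, so this shows only the arithmetic, not the timed code.</p>

```c
#include <assert.h>

/* One Newton-Raphson step for y ~= 1/sqrt(x), derived from
   f(y) = 1/(y*y) - x:  y' = y * (1.5 - 0.5 * x * y * y). */
float refine_rsqrt(float x, float estimate)
{
    return estimate * (1.5f - 0.5f * x * estimate * estimate);
}

/* Recovers sqrt(x) from a coarse 1/sqrt(x) estimate, using the
   identity sqrt(x) = x * (1/sqrt(x)). The caller supplies the
   estimate (assumption: in real code it would come from hardware). */
float sqrt_from_estimate(float x, float estimate)
{
    assert(x > 0.0f);
    return x * refine_rsqrt(x, estimate);
}
```

<p>Each step roughly squares the relative error of the estimate, which is why a single step is enough to refine a coarse hardware estimate.</p>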
]]></content:encoded>
	</item>
	<item>
		<title>By: Elias</title>
		<link>http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/comment-page-1/#comment-5079</link>
		<dc:creator>Elias</dc:creator>
		<pubDate>Thu, 29 Oct 2009 20:43:50 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=234#comment-5079</guid>
		<description>Here is one technique you apparently haven&#039;t tried: taking the number to the 0.5 power. How does that compare?</description>
		<content:encoded><![CDATA[<p>Here is one technique you apparently haven&#8217;t tried: taking the number to the 0.5 power. How does that compare?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

