<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: How Slow Are Virtual Functions Really?</title>
	<atom:link href="http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/feed/" rel="self" type="application/rss+xml" />
	<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/</link>
	<description>Technical Notes On Game Development</description>
	<lastBuildDate>Wed, 21 Jul 2010 18:22:16 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Mark&#8217;s Testblog &#187; Blog Archive &#187; Data oriented design links - &#8230;for these are testing times, indeed.</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-5224</link>
		<dc:creator>Mark&#8217;s Testblog &#187; Blog Archive &#187; Data oriented design links - &#8230;for these are testing times, indeed.</dc:creator>
		<pubDate>Wed, 21 Jul 2010 18:22:16 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-5224</guid>
		<description>[...] catch-all declarations of &#8220;virtual functions are slow!&#8221;  For the general case, this can be proved as a nonsense as virtual function calls are blatantly not slow!  They&#8217;re very, very fast.  However, if [...]</description>
		<content:encoded><![CDATA[<p>[...] catch-all declarations of &#8220;virtual functions are slow!&#8221;  For the general case, this can be proved as a nonsense as virtual function calls are blatantly not slow!  They&#8217;re very, very fast.  However, if [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: On _purecall and the Overhead(s) of Virtual Functions &#171; Ofek&#8217;s Visual C++ stuff</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-5213</link>
		<dc:creator>On _purecall and the Overhead(s) of Virtual Functions &#171; Ofek&#8217;s Visual C++ stuff</dc:creator>
		<pubDate>Thu, 03 Jun 2010 20:03:40 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-5213</guid>
		<description>[...] calls are known to be more costly than calls that are resolved at compile time.&#160; Elan Ruskin measured ~50% difference &#8211; I measured a bit less, but the difference is certainly there. For functions [...]</description>
		<content:encoded><![CDATA[<p>[...] calls are known to be more costly than calls that are resolved at compile time.&#160; Elan Ruskin measured ~50% difference &#8211; I measured a bit less, but the difference is certainly there. For functions [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Virtual functions &#8211; an experiment &#124; .mischief.mayhem.soap.</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-4930</link>
		<dc:creator>Virtual functions &#8211; an experiment &#124; .mischief.mayhem.soap.</dc:creator>
		<pubDate>Sun, 14 Jun 2009 13:17:19 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-4930</guid>
		<description>[...] months ago I&#8217;ve read this article by Ellan Ruskin, when he measures the overhead of virtual functions. There&#8217;s also a follow-up [...]</description>
		<content:encoded><![CDATA[<p>[...] months ago I&#8217;ve read this article by Ellan Ruskin, when he measures the overhead of virtual functions. There&#8217;s also a follow-up [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Allan</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-4424</link>
		<dc:creator>Allan</dc:creator>
		<pubDate>Thu, 30 Apr 2009 13:12:17 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-4424</guid>
		<description>If you have a tight loop, the first time you hit the virtual call, you are highly likely to have an L2 cache miss on the vtable lookup, which on 360 is ~610 cycles.
Then you are very likely to have a branch mispredict, which can be another 24 cycles while you fetch instructions from the *right* place, then carry on.
As Steven points out, the additional cost can be inconsequential and obviously many other criteria (ease of expression, how else can I do virtual style indirection without an L2 miss, overall engineering simplicity) come into play. 
But you do get small accessors appearing in performance profiles a lot, and often the author may not have given it any thought and assumed it would be inlined. Also on 360, we often spot overzealous use of virtual simply by spotting bctr penalties with the same target address 100% of the time. That&#039;s a penalty you quite likely can just drop (all else equal).
I&#039;m not sure I understand Steven&#039;s point on the compiler being smart enough to directly call virtuals on a leaf class - do you mean virtual calls within the class? This would imply the compiler is able to tell that there are no other derived classes declared anywhere else in the project...?
Calling a virtual externally via a base class pointer, I&#039;m not sure how the compiler can know to call directly. To my knowledge, the compiler can only inline when it definitely knows the type of the object.
Object object; object.Virtual();               // Inline away
Object* pObject = ptr;  pObject-&gt;Virtual();    // Can&#039;t inline, surely

Al</description>
		<content:encoded><![CDATA[<p>If you have a tight loop, the first time you hit the virtual call, you are highly likely to have an L2 cache miss on the vtable lookup, which on 360 is ~610 cycles.<br />
Then you are very likely to have a branch mispredict, which can be another 24 cycles while you fetch instructions from the *right* place, then carry on.<br />
As Steven points out, the additional cost can be inconsequential and obviously many other criteria (ease of expression, how else can I do virtual style indirection without an L2 miss, overall engineering simplicity) come into play.<br />
But you do get small accessors appearing in performance profiles a lot, and often the author may not have given it any thought and assumed it would be inlined. Also on 360, we often spot overzealous use of virtual simply by spotting bctr penalties with the same target address 100% of the time. That&#8217;s a penalty you quite likely can just drop (all else equal).<br />
I&#8217;m not sure I understand Steven&#8217;s point on the compiler being smart enough to directly call virtuals on a leaf class &#8211; do you mean virtual calls within the class? This would imply the compiler is able to tell that there are no other derived classes declared anywhere else in the project&#8230;?<br />
Calling a virtual externally via a base class pointer, I&#8217;m not sure how the compiler can know to call directly. To my knowledge, the compiler can only inline when it definitely knows the type of the object.<br />
Object object; object.Virtual();               // Inline away<br />
Object* pObject = ptr;  pObject-&gt;Virtual();    // Can&#8217;t inline, surely</p>
<p>Al</p>
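<p>A minimal sketch of the distinction drawn above, not code from the original post. The names Leaf, CallOnValue, CallViaBase and CallViaLeaf are hypothetical, and the final keyword is C++11, which postdates this thread. Whether a particular compiler really binds any of these calls directly is an assumption that should be checked in its generated code.</p>

```cpp
struct Object {
    virtual ~Object() = default;
    virtual int Virtual() const { return 1; }
};

// Hypothetical leaf class. Marking it final promises that nothing derives
// from it, which is what lets a compiler skip the vtable lookup.
struct Leaf final : Object {
    int Virtual() const override { return 2; }
};

// The dynamic type of a local object is known at compile time, so the
// compiler is free to call Object::Virtual directly or inline it.
int CallOnValue() {
    Object object;
    return object.Virtual();
}

// Through a base pointer the dynamic type is generally unknown here,
// so this is normally an indirect call through the vtable.
int CallViaBase(const Object* pObject) {
    return pObject->Virtual();
}

// Through a pointer to a final class the target can only be
// Leaf::Virtual, so a direct call is permitted.
int CallViaLeaf(const Leaf* pLeaf) {
    return pLeaf->Virtual();
}
```

<p>All three functions return the same values whether or not the call is devirtualized; only the generated call sequence may differ.</p>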
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark James</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-103</link>
		<dc:creator>Mark James</dc:creator>
		<pubDate>Thu, 12 Feb 2009 22:54:28 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-103</guid>
		<description>&quot;I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable&quot;

That&#039;s true if you make sure everything is in the cache.  If not, the vtable lookup is an extra cache miss, and you can get much worse performance.</description>
		<content:encoded><![CDATA[<p>&#8220;I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable&#8221;</p>
<p>That&#8217;s true if you make sure everything is in the cache.  If not, the vtable lookup is an extra cache miss, and you can get much worse performance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steven</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-76</link>
		<dc:creator>Steven</dc:creator>
		<pubDate>Tue, 20 Jan 2009 23:29:42 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-76</guid>
		<description>The problem with indirect jumps (such as virtual calls) is that they are not entirely mapped as code, and I&#039;m not sure how the various CPUs deal with that. For example, for an Intel processor, functions[i](args) is considered normal data until the call is executed. The functions[i] part is a normal data read from a data segment. If you somehow can make sure the values from functions[i] are read sequentially, the MMU should manage to bring functions[i+1] in cache (and i+2, i+3,...). As for the actual value of functions[i] which is used to perform the call, yes, I guess you are right: it messes the jump prediction good, because it&#039;s data that is read from memory that is turned into code (so to speak).

The ~2.4x just shows that the Xenon has a very different architecture than AMD and Intel x86/x86_64 chips.</description>
		<content:encoded><![CDATA[<p>The problem with indirect jumps (such as virtual calls) is that they are not entirely mapped as code, and I&#8217;m not sure how the various CPUs deal with that. For example, for an Intel processor, functions[i](args) is considered normal data until the call is executed. The functions[i] part is a normal data read from a data segment. If you somehow can make sure the values from functions[i] are read sequentially, the MMU should manage to bring functions[i+1] in cache (and i+2, i+3,&#8230;). As for the actual value of functions[i] which is used to perform the call, yes, I guess you are right: it messes the jump prediction good, because it&#8217;s data that is read from memory that is turned into code (so to speak).</p>
<p>The ~2.4x just shows that the Xenon has a very different architecture than AMD and Intel x86/x86_64 chips.</p>
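<p>As a rough illustration of the functions[i](args) pattern described above, the sketch below splits the call into its two steps: an ordinary data load of the target address, then an indirect call through it. The names Fn, AddOne, Twice and Dispatch are hypothetical and only serve the example.</p>

```cpp
// Hypothetical function-pointer type for the table entries.
using Fn = int (*)(int);

int AddOne(int x) { return x + 1; }
int Twice(int x) { return x * 2; }

int Dispatch(const Fn* functions, int i, int arg) {
    // Step 1: an ordinary data read. Until the call below executes, the
    // processor sees this value as data, not as a code address.
    Fn target = functions[i];
    // Step 2: an indirect call. The destination is only known once the
    // load above has produced its value.
    return target(arg);
}
```

<p>A virtual call has the same shape, with one extra load to fetch the vtable pointer from the object before indexing into the table.</p>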
]]></content:encoded>
	</item>
	<item>
		<title>By: Elan</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-74</link>
		<dc:creator>Elan</dc:creator>
		<pubDate>Tue, 20 Jan 2009 04:35:29 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-74</guid>
		<description>It&#039;s absolutely true that the more work the function itself does, the less relative weight the virtual function call overhead has compared to the work done inside the function. Still, it&#039;s not zero, and I was just curious to directly measure how much that overhead was.

Unfortunately the first version of this post had some incorrect numbers in the table for 50000 iterations: what I didn&#039;t realize until I looked my code over this morning was that the CPU-cycle counter I was using for a fast timer only had 32 bits of resolution, so it would alias for anything longer than about 1.5 seconds. It did seem odd that somehow the relative difference between virtual and inline function calls would get smaller the more times the test was run. Anyway I&#039;ve corrected the numbers above and it shows that the virtual overhead is the same for the larger test run as it is for the smaller test run, and a virtual call takes 2.37x as long as a direct call.

I think we&#039;re talking about the same thing when I say branch prediction and you say the MMU. In the Xenon, the instruction-fetcher is considered to be part of the CPU pipeline, not the MMU, and the address snooping is performed by the branch prediction stage of the CPU pipe. The MMU is responsible for hauling the instruction stream from main RAM into the cache, but in all of my test cases the code is small enough that it just stays in the icache after being loaded for the first time, and so cache misses don&#039;t figure into the calculation.

The problem with branch-predicting virtuals on the Xenon (and the Cell&#039;s PPC, which is nearly the same) is that the indirect branch op, &lt;code&gt;bctrl&lt;/code&gt;, can&#039;t snoop the branch address to report to the fetch stage of the pipeline, and so the fetcher continues to fetch instructions from the wrong address until the &lt;code&gt;bctrl&lt;/code&gt; executes. Static branches, on the other hand, are predicted quite early, and since the fetcher actually runs faster than the rest of the pipeline it can often fill the bubble before any CPU time is lost.  &lt;a href=&quot;http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars/3&quot; rel=&quot;nofollow&quot;&gt;This sketch of the CellPPC pipeline&lt;/a&gt; may make it slightly clearer: in that illustration, a static branch can be predicted by the eighth stage of the pipeline, meaning that only eight instructions need to be annulled when the fetcher goes to the new address; an indirect branch, on the other hand, doesn&#039;t start fetching from the new address until it hits the EX3 stage.

I&#039;ll try to post my test code up soon, and write a followup to explain the pipeline issues more clearly.</description>
		<content:encoded><![CDATA[<p>It&#8217;s absolutely true that the more work the function itself does, the less relative weight the virtual function call overhead has compared to the work done inside the function. Still, it&#8217;s not zero, and I was just curious to directly measure how much that overhead was.</p>
<p>Unfortunately the first version of this post had some incorrect numbers in the table for 50000 iterations: what I didn&#8217;t realize until I looked my code over this morning was that the CPU-cycle counter I was using for a fast timer only had 32 bits of resolution, so it would alias for anything longer than about 1.5 seconds. It did seem odd that somehow the relative difference between virtual and inline function calls would get smaller the more times the test was run. Anyway I&#8217;ve corrected the numbers above and it shows that the virtual overhead is the same for the larger test run as it is for the smaller test run, and a virtual call takes 2.37x as long as a direct call.</p>
<p>I think we&#8217;re talking about the same thing when I say branch prediction and you say the MMU. In the Xenon, the instruction-fetcher is considered to be part of the CPU pipeline, not the MMU, and the address snooping is performed by the branch prediction stage of the CPU pipe. The MMU is responsible for hauling the instruction stream from main RAM into the cache, but in all of my test cases the code is small enough that it just stays in the icache after being loaded for the first time, and so cache misses don&#8217;t figure into the calculation.</p>
<p>The problem with branch-predicting virtuals on the Xenon (and the Cell&#8217;s PPC, which is nearly the same) is that the indirect branch op, <code>bctrl</code>, can&#8217;t snoop the branch address to report to the fetch stage of the pipeline, and so the fetcher continues to fetch instructions from the wrong address until the <code>bctrl</code> executes. Static branches, on the other hand, are predicted quite early, and since the fetcher actually runs faster than the rest of the pipeline it can often fill the bubble before any CPU time is lost.  <a href="http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars/3" rel="nofollow">This sketch of the CellPPC pipeline</a> may make it slightly clearer: in that illustration, a static branch can be predicted by the eighth stage of the pipeline, meaning that only eight instructions need to be annulled when the fetcher goes to the new address; an indirect branch, on the other hand, doesn&#8217;t start fetching from the new address until it hits the EX3 stage.</p>
<p>I&#8217;ll try to post my test code up soon, and write a followup to explain the pipeline issues more clearly.</p>
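<p>The timer aliasing mentioned above can be sketched as follows. This is not the test code from the post: Elapsed32, Observed32 and SecondsUntilWrap32 are hypothetical helpers, and the clock rate passed to SecondsUntilWrap32 is an assumption supplied by the caller.</p>

```cpp
#include <cstdint>

// Difference of two 32-bit counter readings. Unsigned subtraction is
// modulo 2^32, so a single wrap between the readings is still correct.
std::uint32_t Elapsed32(std::uint32_t start, std::uint32_t end) {
    return end - start;
}

// What a 32-bit counter can report for a true interval of trueTicks:
// only the low 32 bits survive, so longer intervals alias to short ones.
std::uint32_t Observed32(std::uint64_t trueTicks) {
    return static_cast<std::uint32_t>(trueTicks);
}

// Longest interval, in seconds, that a 32-bit counter can measure
// unambiguously at a given tick rate in Hz.
double SecondsUntilWrap32(double ticksPerSecond) {
    return 4294967296.0 / ticksPerSecond;
}
```

<p>With an assumed 3.2 GHz tick rate, SecondsUntilWrap32(3.2e9) is a little over 1.3 seconds, in the same range as the limit described above. Reading a 64-bit counter, where one is available, avoids the problem.</p>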
]]></content:encoded>
	</item>
	<item>
		<title>By: Steven</title>
		<link>http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/comment-page-1/#comment-72</link>
		<dc:creator>Steven</dc:creator>
		<pubDate>Mon, 19 Jan 2009 21:20:16 +0000</pubDate>
		<guid isPermaLink="false">http://assemblyrequired.crashworks.org/?p=181#comment-72</guid>
		<description>In the experiment you point at, the overhead corresponds to about 50% of a normal, direct call. That is, a virtual call, due to all of the compiler optimization, is about 150% the time of a direct call, so it&#039;s very little additional cost for the added flexibility. In an experiment like yours (and mine, for that matter) the function itself does about nothing; barely enough not to be optimized away by the compiler, and that makes the call dominate the timing. But suppose each function now takes 1 microsecond to execute (instead of a few nanoseconds), and your 7 nanosecond overhead is now simply irrelevant.

Also, compilers are smart enough to call the virtual function directly when you&#039;re calling a virtual on a (pointer to a) leaf class. This may account for an important optimization, since there&#039;s simply no look-up. You still have to pass the this pointer, which adds instructions, even if very few. Additionally, the compiler may inline the call as well.

Branch prediction isn&#039;t the whole story either. CPUs (I don&#039;t know about the Xenon, but I would guess it does too) have MMUs (memory management units) that snoop on the addresses generated and try to prefetch the next read value into cache. The MMU can guess addresses when you scan sequentially, or by incremental, known steps (like every 16 bytes), but it cannot do anything special if it can&#039;t figure out the address generation pattern, and that&#039;s what happens when you scan a collection of objects with virtual methods: it cannot figure out what to load next until the address is actually read, leading to read faults/time penalties.</description>
		<content:encoded><![CDATA[<p>In the experiment you point at, the overhead corresponds to about 50% of a normal, direct call. That is, a virtual call, due to all of the compiler optimization, is about 150% the time of a direct call, so it&#8217;s very little additional cost for the added flexibility. In an experiment like yours (and mine, for that matter) the function itself does about nothing; barely enough not to be optimized away by the compiler, and that makes the call dominate the timing. But suppose each function now takes 1 microsecond to execute (instead of a few nanoseconds), and your 7 nanosecond overhead is now simply irrelevant.</p>
<p>Also, compilers are smart enough to call the virtual function directly when you&#8217;re calling a virtual on a (pointer to a) leaf class. This may account for an important optimization, since there&#8217;s simply no look-up. You still have to pass the this pointer, which adds instructions, even if very few. Additionally, the compiler may inline the call as well.</p>
<p>Branch prediction isn&#8217;t the whole story either. CPUs (I don&#8217;t know about the Xenon, but I would guess it does too) have MMUs (memory management units) that snoop on the addresses generated and try to prefetch the next read value into cache. The MMU can guess addresses when you scan sequentially, or by incremental, known steps (like every 16 bytes), but it cannot do anything special if it can&#8217;t figure out the address generation pattern, and that&#8217;s what happens when you scan a collection of objects with virtual methods: it cannot figure out what to load next until the address is actually read, leading to read faults/time penalties.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
