How Slow Are Virtual Functions Really?

Whenever I work with virtual functions I find myself wondering: how much is it costing me to perform all these vtable lookups and indirect calls? The usual truism is that computers are so fast now that it doesn’t matter and that the idea of virtuals being a problem is just another myth.

Our beloved Xenon CPU is in-order, however, so I got curious whether that myth is truly busted for us, and as any Mythbuster can tell you, the only way to know is to build it and test!

I’ll talk about the test results first and then try to explain them in a later article. I built a simple 4-dimensional vector class with accessor functions for x,y,z, and w. Then I set up three arrays (A, B, C) each containing 1024 of these classes (so everything fits into the L1 cache) and ran a loop that simply added them together one component at a time.

class Vector4Test {
public:
  float x,y,z,w;
  float GetX() { return x; }
  float SetX( float x_ ) { return x=x_; }
  // and so on
};

Vector4Test A[1024], B[1024], C[1024];

for (int n = 0 ; n < NUM_TESTS ; ++n)
  for (int i=0; i < 1024 ; ++i) {
    C[i].SetX( A[i].GetX() + B[i].GetX() );
    // and so on for y, z, and w
  }

By specifying whether the Get and Set functions are inline, direct, or virtual, it’s easy to compare the overhead of one kind of function call against another. Each run through the loop makes three function calls per component (two Gets and a Set), times four components, times 1024 elements in the array, for a total of 12,288 function calls. The inline version is essentially the control group, since it measures just the cost of the memory accesses, loop conditionals, and floating-point math without any function-call overhead at all. Here are the results:
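For the curious, the three variants differ only in how the accessors are declared. The full test source isn’t shown here, so this is just an illustrative sketch of the shape of each flavor (names are made up for the example):

```cpp
#include <cassert>

// Three flavors of the same accessor; only the declaration differs.
struct VecInline  { float x; inline float GetX() const { return x; } };
struct VecDirect  { float x; float GetX() const; };
struct VecVirtual { float x; virtual float GetX() const { return x; } };

// Out-of-line definition; in the real test this would live in another
// translation unit so the compiler can't quietly inline the "direct" call.
float VecDirect::GetX() const { return x; }
```

Note that the virtual flavor also grows each object by a vtable pointer, which matters when you’re packing thousands of them into the L1 cache.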

NOTE: The values below have been corrected from the first version of this post. See this comment for details.

1000 iterations over 1024 vectors
12,288,000 function calls
virtual: 159.856 ms
direct: 67.962 ms
inline: 8.040 ms


50000 iterations over 1024 vectors
614,400,000 function calls
virtual: 8080.708 ms
direct: 3406.297 ms
inline: 411.924 ms

A couple of things are immediately obvious. First, virtual functions are slower than direct function calls. But by how much? In the upper trial, the virtual-function test took 91.894 ms longer than the direct functions; divided by the 12.288×10⁶ function calls, that works out to a differential overhead of about 7 nanoseconds per call. So there is a definite cost there, but probably not something to worry about unless it’s a function that gets called thousands of times per frame.
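That arithmetic can be captured in a trivial helper (the numbers come straight from the table above; the function name is made up):

```cpp
#include <cassert>

// Per-call overhead of one call flavor over another:
// (slower time - faster time) / number of calls, converted ms -> ns.
double PerCallOverheadNs(double slowerMs, double fasterMs, double calls) {
    return (slowerMs - fasterMs) * 1.0e6 / calls;
}
```

Plugging in the 1000-iteration numbers, PerCallOverheadNs(159.856, 67.962, 12288000.0) comes out just under 7.5 ns of extra cost per virtual call.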

Later I’ll get further into the causes of these disparities: why virtual functions are slower than direct calls, and when inlining is advantageous. In the meantime I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable — that’s only a single, unproblematic load operation. Rather, the issues lie in branch prediction and in the way that marshalling parameters for the calling convention can get in the way of good instruction scheduling.


  1. Steven says:

    In the experiment you point at, the overhead corresponds to about 50% of a normal, direct call. That is, a virtual call, even with all of the compiler optimization, takes about 150% of the time of a direct call, so it’s very little additional cost for the added flexibility. In an experiment like yours (and mine, for that matter) the function itself does about nothing; barely enough not to be optimized away by the compiler, and that makes the call dominate the timing. But suppose each function now takes 1 microsecond to execute (instead of a few nanoseconds); your 7 nanosecond overhead is then simply irrelevant.

    Also, compilers are smart enough to call the virtual function directly when you’re calling a virtual on a (pointer to a) leaf class. This can be an important optimization, since there’s simply no look-up. You still have to pass the this pointer, which adds instructions, even if only a few. Additionally, the compiler may inline the call as well.

    Branch prediction isn’t the whole story either. CPUs (I don’t know about the Xenon, but I would guess it does too) have an MMU (memory management unit) that snoops on the addresses generated and tries to prefetch the next read value into cache. The MMU can guess addresses when you scan sequentially, or by incremental, known steps (like, every 16 bytes), but it cannot do anything special if it can’t figure out the address-generation pattern, and that’s what happens when you scan a collection of objects with virtual methods: it cannot figure out what to load next until the address is actually read, yielding read stalls/time penalties.

  2. Elan says:

    It’s absolutely true that the more work the function itself does, the less relative weight the virtual function call overhead has compared to the work done inside the function. Still, it’s not zero, and I was just curious to directly measure how much that overhead was.

    Unfortunately the first version of this post had some incorrect numbers in the table for 50000 iterations: what I didn’t realize until I looked my code over this morning was that the CPU-cycle counter I was using for a fast timer only had 32 bits of resolution, so it would alias for anything longer than about 1.5 seconds. It did seem odd that somehow the relative difference between virtual and inline function calls would get smaller the more times the test was run. Anyway I’ve corrected the numbers above and it shows that the virtual overhead is the same for the larger test run as it is for the smaller test run, and a virtual call takes 2.37x as long as a direct call.
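    To make that aliasing concrete: with an unsigned 32-bit counter, subtracting two raw readings still gives the right answer across a single wrap, but any interval longer than one wrap period gets silently folded back into the window. A sketch (the 3.2 GHz figure is the Xenon’s clock rate; the helper names are made up):

```cpp
#include <cassert>
#include <cstdint>

// A 32-bit cycle counter wraps after 2^32 ticks. Unsigned subtraction
// of two raw readings handles a single wrap correctly...
uint32_t Elapsed32(uint32_t start, uint32_t end) {
    return end - start;
}

// ...but anything longer than one wrap period aliases. At a given clock
// rate, that period is 2^32 / hz seconds -- about 1.34 s at 3.2 GHz.
double WrapPeriodSeconds(double hz) {
    return 4294967296.0 / hz;
}
```

    So a 50,000-iteration run taking several seconds blows well past the window, which is exactly how the original table ended up wrong.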

    I think we’re talking about the same thing when I say branch prediction and you say the MMU. In the Xenon, the instruction-fetcher is considered to be part of the CPU pipeline, not the MMU, and the address snooping is performed by the branch prediction stage of the CPU pipe. The MMU is responsible for hauling the instruction stream from main RAM into the cache, but in all of my test cases the code is small enough that it just stays in the icache after being loaded for the first time, and so cache misses don’t figure into the calculation.

    The problem with branch-predicting virtuals on the Xenon (and the Cell’s PPC, which is nearly the same) is that the branch address of the indirect branch op, bctrl, can’t be snooped and reported to the fetch stage of the pipeline, and so the fetcher continues to fetch instructions from the wrong address until the bctrl executes. Static branches, on the other hand, are predicted quite early, and since the fetcher actually runs faster than the rest of the pipeline it can often fill the bubble before any CPU time is lost. This sketch of the CellPPC pipeline may make it slightly clearer: in that illustration, a static branch can be predicted by the eighth stage of the pipeline, meaning that only eight instructions need to be annulled when the fetcher goes to the new address; an indirect branch, on the other hand, doesn’t start fetching from the new address until it hits the EX3 stage.

    I’ll try to post my test code up soon, and write a followup to explain the pipeline issues more clearly.

  3. Steven says:

    The problem with indirect jumps (such as virtuals) is that they are not entirely mapped as code, and I’m not sure how the various CPUs deal with that. For example, for an Intel processor, functions[i](args) is considered normal data until the call is executed. The functions[i] part is a normal data read from a data segment. If you can somehow make sure the values of functions[i] are read sequentially, the MMU should manage to bring functions[i+1] into cache (and i+2, i+3, …). As for the actual value of functions[i] which is used to perform the call, yes, I guess you are right: it messes up the jump prediction, because it’s data read from memory that is turned into code (so to speak).

    The ~2.4x just shows that the Xenon has a very different architecture than AMD and Intel x86/x86_64 chips.

  4. Mark James says:

    “I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable”

    That’s true if you make sure everything is in the cache. If not, the vtable lookup is an extra cache miss, and you can get much worse performance.

  5. Allan says:

    If you have a tight loop, the first time you hit the virtual call, you are highly likely to have an L2 cache miss on the vtable lookup, which on 360 is ~610 cycles.
    Then you are very likely to have a branch mispredict, which can be another 24 cycles while you fetch instructions from the *right* place, then carry on.
    As Steven points out, the additional cost can be inconsequential and obviously many other criteria (ease of expression, how else can I do virtual style indirection without an L2 miss, overall engineering simplicity) come into play.
    But you do get small accessors appearing in performance profiles a lot, and often the author may not have given it any thought and assumed it would be inlined. Also on 360, we often spot overzealous use of virtual simply by spotting bctr penalties with the same target address 100% of the time. That’s a penalty you quite likely can just drop (all else equal).
    I’m not sure I understand Steven’s point on the compiler being smart enough to directly call virtuals on a leaf class – do you mean virtual calls within the class? This would imply the compiler is able to tell that there are no other derived classes declared anywhere else in the project…?
    Calling a virtual externally via a base class pointer, I’m not sure how the compiler can know to call directly. To my knowledge, the compiler can only inline when it definitely knows the type of the object..
    Object object; object.Virtual(); // Inline away
    Object* pObject = ptr; pObject->Virtual(); // Can’t inline, surely
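    Allan’s two-line example comes down to whether the compiler can prove the dynamic type. A hedged sketch of the two cases (using C++11 syntax for clarity; whether a given compiler actually devirtualizes the first case is implementation-dependent):

```cpp
#include <cassert>

struct Base {
    virtual int Get() const { return 1; }
    virtual ~Base() {}
};
struct Leaf : Base {
    int Get() const override { return 2; }
};

int CallOnValue() {
    Leaf leaf;
    return leaf.Get();   // dynamic type provably Leaf: may be called
                         // directly or inlined
}

int CallViaBase(const Base* p) {
    return p->Get();     // dynamic type unknown in general: vtable
                         // dispatch through an indirect branch
}
```

    Marking a class or method final (C++11) is the usual way to hand the compiler the leaf-class guarantee Steven alludes to.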


  6. […] months ago I read this article by Elan Ruskin, where he measures the overhead of virtual functions. There’s also a follow-up […]

  7. […] calls are known to be more costly than calls that are resolved at compile time.  Elan Ruskin measured ~50% difference – I measured a bit less, but the difference is certainly there. For functions […]

  8. […] catch-all declarations of “virtual functions are slow!”  For the general case, this can be proved as a nonsense as virtual function calls are blatantly not slow!  They’re very, very fast.  However, if […]

  9. […] How Slow Are Virtual Functions Really ↩ […]

  10. stingoh says:

    Very late to the party, but Steven (original first comment), the PPU on Xenon and PS3 does not do data prefetching, only instruction prefetching.
