How Slow Are Virtual Functions Really?

Whenever I work with virtual functions I find myself wondering: how much is it costing me to perform all these vtable lookups and indirect calls? The usual truism is that computers are so fast now that it doesn’t matter and that the idea of virtuals being a problem is just another myth.

Our beloved Xenon CPU is in-order, however, so I got curious whether that myth is truly busted for us, and as any Mythbuster can tell you, the only way to know is to build it and test!

I’ll talk about the test results first and then try to explain them in a later article. I built a simple 4-dimensional vector class with accessor functions for x,y,z, and w. Then I set up three arrays (A, B, C) each containing 1024 of these classes (so everything fits into the L1 cache) and ran a loop that simply added them together one component at a time.

class Vector4Test {
  float x,y,z,w;
public:
  float GetX() { return x; }
  float SetX( float x_ ) { return x=x_; }
  // and so on
};
Vector4Test A[1024], B[1024], C[1024];
 
for (int n = 0 ; n < NUM_TESTS ; ++n)
for (int i=0; i < 1024 ; ++i) {
   C[i].SetX( A[i].GetX() + B[i].GetX() );
   // and so on for y, z, and w
}

By specifying whether the Get and Set functions are inline, direct, or virtual, it’s easy to compare the overhead of one kind of function call against another. Each run through the loop makes three function calls per component, times four components, times 1024 elements in the array, for a total of 12,288 function calls. The inline version is essentially the control group, since it measures just the cost of the memory accesses, loop conditionals, and floating-point math without any function call overhead at all.
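For reference, the three variants might be declared something like this (a sketch; the post doesn’t show the exact declarations):

class Vector4Inline  { public: float GetX() { return x; }         float x,y,z,w; }; // body in the header, inlined away
class Vector4Direct  { public: float GetX();                      float x,y,z,w; }; // body in a separate .cpp, so a true direct call
class Vector4Virtual { public: virtual float GetX() { return x; } float x,y,z,w; }; // dispatched through the vtable

Here are the results: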

NOTE: The values below have been corrected from the first version of this post. See this comment for details.

1000 iterations over 1024 vectors
12,288,000 function calls
virtual: 159.856 ms
direct: 67.962 ms
inline: 8.040 ms


50000 iterations over 1024 vectors
614,400,000 function calls
virtual: 8080.708 ms
direct: 3406.297 ms
inline: 411.924 ms

A couple of things are immediately obvious. First, virtual functions are slower than direct function calls. But by how much? In the first trial, the virtual-function test took 91.894ms longer than the direct functions; divided across the 12,288,000 function calls, that works out to a differential overhead of about 7 nanoseconds per call. So, there is a definite cost there, but probably not something to worry about unless it’s a function that gets called thousands of times per frame.

Later I’ll dig further into the causes of these disparities: why virtual functions are slower than direct calls, and when inlining is advantageous. In the meantime I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable, which is only a single unproblematic load operation. Rather, the issues lie in branch prediction and in the way that marshalling parameters for the calling convention can get in the way of good instruction scheduling.

More About Rounding On MSVC-x86

The feedback I got on yesterday’s article on float-to-int conversion prompted me to look more closely into all the different options MSVC actually gives you for rounding on the x86 architecture. It turns out that with /fp:fast set it can do one of three things (in addition to the magic-number rounding you can write yourself):

  • By default it will call a function, _ftol2_sse, which tests the CPU to see if it has SSE2 functionality. If so, it uses the native SSE2 instruction cvttsd2si; if not, it calls _ftol(). This is quite slow, both because it has to perform that CPU test for every single conversion and because of the overhead of the function call itself.
  • With /QIfist specified, the compiler simply emits a fistp opcode to convert the x87 floating point register to an integer in memory directly. It uses whatever rounding mode happens to be set in the CPU at the moment.
  • With /arch:SSE2 specified, the compiler assumes that the program will only run on CPUs with SSE2, so it emits the cvttsd2si opcode directly instead of calling _ftol2_sse. Like /QIfist, this replaces a function call with a single instruction, but it’s even faster and not deprecated. As commenter cb points out, the intrinsics also let you specify truncation or rounding without having to fool around with CPU modes.
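For reference, here’s what using those intrinsics directly might look like (a sketch assuming an SSE-capable target; _mm_cvtt_ss2si maps to the truncating cvttss2si opcode, while _mm_cvt_ss2si uses the current SSE rounding mode):

#include <xmmintrin.h>

inline int TruncateToInt( float f )
{
  return _mm_cvtt_ss2si( _mm_set_ss( f ) ); // cvttss2si: truncate toward zero, no mode change
}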

I raced the different techniques against each other and the clear winner was the function compiled with /arch:SSE2 set. Thus, if you can assume that your customer will have a CPU with SSE2 enabled, setting that simple compiler switch will provide you with superior performance for basically no work. The only caveat is that the SSE scalar operations operate at a maximum of double-precision floats, whereas the old x87 FPU instructions are internally 80-bit — but I’ve never seen a game application where that level of precision makes a difference.

According to the Steam Hardware Survey, 95% of our customers have SSE2-capable CPUs. The rest are probably not playing your most recent releases anyway.

Comparison of rounding speeds
8 trials of 1.024×10⁸ floats on a Core2

/fp:fast     magic number   /arch:sse2   /QIfist
312.944ms    184.534ms      96.978ms     178.732ms
314.255ms    182.105ms      91.390ms     178.363ms
311.359ms    181.397ms      89.606ms     182.709ms
309.149ms    181.023ms      87.732ms     180.485ms
309.828ms    181.405ms      91.891ms     184.785ms
309.595ms    176.970ms      86.886ms     178.501ms
309.081ms    179.109ms      86.885ms     177.811ms
308.208ms    176.873ms      86.796ms     178.051ms

Why You Should Never Cast Floats To Ints

The seemingly simple act of assigning an int to a float variable, or vice versa, can be astonishingly expensive. In some architectures it may cost 40 cycles or even more! The reasons for this have to do with rounding and the load-hit-store issue, but the outcome is simple: never cast a float to an int inside a calculation “because the math will be faster” (it isn’t), and if you must convert a float to an int for storage, do it only once, at the end of the function, when you write the final outcome to memory.

This particular performance suck has two major sources: register shuffling and rounding. The first one has to do with hardware and affects all modern CPUs; the second is older and more specific to x86 compilers.

In both cases, one simple rule holds true: whenever you find yourself typing *((int *)(&anything)), you’re in for some pain.

Register Sets: The Multiple-Personality CPU

Casting floats to ints and back is something that can happen in routine tasks, like

int a = GetSystemTimeAsInt(); 
float seconds = a;

Or occasionally you may be tempted to perform bit-twiddling operations on a floating point number; for example, this might seem like a fast way to determine whether a float is positive or negative:

bool IsFloatPositive(float f)
{
    int signBit = *((int *)(&f)) & 0x80000000;
    return signBit == 0;
}

The problem here is that casting from one register type to another like this is an almost sure way to induce a load-hit-store. In the x86 and the PPC, integers, floating-point numbers, and SIMD vectors are kept in three separate register sets. As Becky Heineman wrote in Gamasutra, you can “think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data.”

Integer operations (like add, sub, and bitwise ops) work only on integer registers, floating-point operations (like fadd, fmul, fsqrt) only on the FPU registers, and SIMD ops only touch vector registers. This is true of both the PowerPC and the x86 and nearly every modern CPU (with one exception, see below).

This makes the CPU designer’s job much easier (because each of these units can have a pipeline of different depth), but it means there is no way to move data directly from one kind of register to another. There is no “move” operation that has one integer and one float operand. Basically, there are simply no wires that run directly between the int, float, and vector registers.

So, whenever you move data from an int to a float, the CPU first stores the integer from the int register to memory, and then in the next instruction reads from that memory address into the float register. This is the very definition of a load-hit-store stall, because that first store may take as many as 40 cycles to make it all the way out to the L1 cache, and the subsequent load can’t proceed until it has finished. On an in-order processor like the 360 or the PS3’s PPC, that means everything stops dead for between 40 and 80 cycles; on an out-of-order x86, the CPU will try to skip ahead to some of the subsequent instructions, but can usually only hide a little bit of that latency.

It’s not the actual conversion of ints to floats that is slow — const int *pA; float f = *pA; can happen in two cycles if the contents of pA are already in memory — but moving data between the different kinds of registers that is slow because the data has to get to and from the memory first.

What this all boils down to is that you should simply avoid mixing ints, floats, and vectors in the same calculation. So, for example, instead of

struct Foo { int a, b; };
float func( Foo *data )
{
  float x = fsqrt((( data->a << 1 ) - data->b));
  return x;
}

you are really better off with

float func( Foo *data )
{
  float fA = data->a;
  float fB = data->b;
  float x = fsqrt((( fA * 2.0f ) - fB ));
  return x;
}

More importantly, if you have wrapped your native SIMD type in a union like

union vec4 {
  __vector V;  // native 128-bit VMX register
  struct { float x,y,z,w; };
};

then you really need to avoid accessing the individual float members after working with it as a vector. Never do this:

vec4 A,B,C;
C = VectorCrossProductInVMX(A, B);
float x = C.y * A.x * 2.0f;
return x;

There is one notable exception: the Cell processor SPU as found in the PS3. In the SPUs, all operations — integers, floats, and vectors — operate on the same set of registers and you can mix them as much as you like.

Now, to put this in context, 80 cycles isn’t the end of the world. If you’re performing a hundred float-to-int casts per frame in high level functions, it’ll amount to less than a microsecond. On the other hand it only takes 20,000 such casts to eat up 5% of your frame on the 360, so if this is the sort of thing you’re doing in, say, AccumulateSoundBuffersByAddingTogetherEverySample(), it may be something to look at.

Rounding: Say Hello To My Little fist

In times gone by one of the easiest performance speedups to be had on the PC and its x86-based architecture was usually found in the way your program rounded floats.

The most obvious and least hardware-specific cost of int a = float b is that you somehow have to get rid of the fractional part of the number to turn it into a whole integer; in other words, the CPU has to turn 3.14159 into 3 without involving the Indiana State Legislature. That’s simple enough, but what if the number is 3.5 — do you round it up, or down? How about -3.5 — up or down? And how about on x86-based chips, where floating point numbers are calculated inside the CPU as 80 bits but an int is 32 bits?

At the time the Intel x87 floating-point coprocessor was invented, the IEEE 754 floating point standard specified that rounding could happen in one of four ways:

  1. Round to nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest even number
    ( 3.3 → 3 , 3.5 → 4, -2.5 → -2 )
  2. Round to zero – also known as truncate, simply throws away everything after the decimal point
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -2 )
  3. Round up – rounds toward positive infinity
    ( 3.3 → 4 , 3.5 → 4, -2.5 → -2 )
  4. Round down – rounds toward negative infinity
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -3 ).

The x87 allows you to select any of these modes by setting or clearing a couple of bits in a special control register. Reading and writing that register is a very slow operation, because it means the processor has to totally throw away anything that came behind it in the pipeline and start over, so it’s best to change modes as little as possible, or not at all.
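For reference, here’s roughly how you would change that mode from C on MSVC (a sketch using _controlfp from <float.h>; as just noted, you want to do this rarely, if at all):

#include <float.h>

unsigned int oldMode = _controlfp( 0, 0 );  // read the current control word without changing it
_controlfp( _RC_CHOP, _MCW_RC );            // set the rounding mode to truncate ("chop")
// ... float-to-int conversions that rely on truncation ...
_controlfp( oldMode, _MCW_RC );             // restore the previous rounding mode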

The actual rounding operation can be done in one instruction (the amusingly named fist op, which means “float-to-int store”), but there’s a snag. The ANSI C standard decrees that one and only one of these modes may ever be used for int a = float b: truncate. But, because the compiler can never be sure what rounding mode is set when you enter any particular function (you might have called into some library that set it differently), it would call a function called _ftol(), which set this mode each and every time a number was rounded. In fact, what it actually did for every cast was:

  1. Call into _ftol()
  2. Check the old rounding mode and save it to memory
  3. Set the rounding mode to “truncate” (this causes a pipeline clear)
  4. Round the number
  5. Set the rounding mode back to the one it saved in step 2 (another pipeline clear)
  6. Return from _ftol()

Because of this it wasn’t unusual to see a game spending over 6% of its time inside _ftol() alone. (In fact I can say with a straight face that I once saw a profile where a game spent fully 8% of each frame on fisting.) This is an extreme case of the compiler choosing correctness over speed.

You’re thinking the answer is “well, how about I just set the rounding mode to start with and tell the compiler not to obsess so much about the exact correctness?” and you’re right. The solution in MSVC is to supply the /QIfist compiler option, which tells the compiler to assume the current rounding mode is correct and simply issue the hardware float-to-int op directly. This saves you the function call and two pipeline clears. If your rounding mode gets changed elsewhere in the program you might get unexpected results, but… you know… don't do that.

Microsoft’s documentation claims that /QIfist is “deprecated” because their floating-point code is much faster now, but if you try it out you’ll see they’re fibbing. What actually happens now is that the compiler calls _ftol2_sse(), which uses the modeless SSE conversion op cvttsd2si instead of the old _ftol(). This has some advantages (you can pick between truncation and rounding for each operation without having to change the CPU’s rounding mode), but it’s still a needless function call where a single opcode would do, and it shuffles data between the FPU and SSE registers, which brings us back to the load-hit-store issue mentioned above. On my Intel Core2 PC, a simple test of calling the function below is twice as fast with the compiler options /fp:fast /QIfist specified compared with /fp:fast alone.

void test1(volatile int *a, volatile float *f)
{
  for (int i=0; i < 1000000 ; ++i)
    *a = (int) *f;
}

On the other hand, in an absolute sense _ftol2_sse() is pretty fast so it may be good enough.

It’s also possible to convert floats to ints by adding them to a certain magic number, but this isn’t always a benefit. In times of yore the fistp op was slow, so there was an advantage to replacing the fist with a fadd, but this doesn’t seem to be the case any more. It is faster than an implicit call to _ftol2_sse, and it has the advantage of not depending on the CPU’s current rounding mode (since you can pick your magic number to choose between rounding and truncation). On the other hand if you’ve specified /arch:sse2 and the compiler is using SSE scalar operations instead of the x87 FPU generally, then it’s faster to let it use the native cvttss2si op.
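The classic version of that trick looks something like this (a sketch, not code from any particular engine; it assumes IEEE-754 doubles, a little-endian target, and the default round-to-nearest mode):

inline int FloatToIntMagic( double d )
{
  // 1.5 * 2^52: adding this shifts the integer part of d into the low 32 bits
  // of the double's mantissa, using the FPU's round-to-nearest behavior
  const double magic = 6755399441055744.0;
  union { double d; int i[2]; } u;
  u.d = d + magic;
  return u.i[0];   // the low word holds the result on a little-endian machine
}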

On the 360/PS3 CPU, the magic number technique is usually a performance hit, because most of the magic-number tricks involve an integer step on the floating-point number and run into the register-partitioning issue mentioned above.

Further Reading

Down With fcmp: Conditional Moves For Branchless Math

Although it’s often considered one of the dustier branches of C++, the ?: ternary operator can sometimes compile to code that actually runs much faster than the equivalent if/else. Some compilers translate it directly to a hardware operation called a conditional move which allows the compiler to select between two values without having to perform any comparison or branch operations. As I’ll show below, in some cases, particularly where floating point numbers are involved, this can result in really significant performance gains.

The Slow Tree Is Full Of Branches

In opcode parlance, any instruction that causes the CPU to skip from one instruction to another is called a branch. If the jump occurs based on some condition being met, like in an if ( x == y ) { doSomething() ; } block, it’s called a conditional branch: in this case, the program branches to either run doSomething() or not depending on the result of a “compare x to y” operation.

For various reasons, a branch instruction can be pretty slow to execute. The important one here is that modern CPUs process their instructions in many steps along a pipeline. Like parts moving along a conveyor belt in a factory, a CPU can have as many as eighty program instructions in flight at once, each of them being processed a little bit more at each step along the assembly line until it is finished and the final results are written back to memory.

When a conditional branch instruction executes, the CPU can’t know which instructions are supposed to come after the branch on the assembly line. Usually instructions are executed in sequence, so that opcode #1000 is followed by #1001 and that by #1002 and so on, but if #1002 turns out to be a branch to instruction #2000, then any work the processor might have done on instructions #1003-#1010 at earlier stages in the pipeline has to be thrown away. Modern CPUs and compilers have gotten quite good at mitigating this effect through branch prediction but in some cases, particularly involving floating-point numbers, branch prediction may be impossible in a way that causes the entire assembly line to be cleared out and restarted.

To understand this, first consider this pseudoassembly for how we could compile x = (a >= 0 ? b : c) using a compare followed by a branch instruction:

// assume four registers named ra, rb, rc, rx
// where rx is the output
// and r0 contains zero
COMPARE ra, r0 // sets "ge" condition bit to 1 if ra >= 0
JUMPGE cbranch // executes a GOTO cbranch if the "ge" bit is 1
MOVE rx, rc    // fell through: ra < 0, so rx = rc
JUMP finish
cbranch:
MOVE rx, rb    // ra >= 0, so rx = rb
finish:
RETURN rx

There are a couple of ways that this can slow down the instruction pipeline. Firstly, it always executes at least one branch operation, which is expensive: on the PowerPC such a branch can stall execution for between five and twenty cycles depending on whether it is predicted correctly. Also, the conditional jump occurs immediately after the COMPARE operation, which can lead to another stall if the result of the COMPARE isn’t ready within a single cycle.

But with floating-point compares the situation can be much worse because float pipelines are often much longer than the integer/branch pipeline. This means that the result of the COMPARE may not be available to the branch unit for many cycles after it dispatches. When the CPU tries to branch on a result that isn’t available yet, it has no choice but to flush the entire pipeline, wait for the compare to finish and start over. For a modern 40-cycle long pipeline this can be quite costly indeed! For example, consider this simplified pipeline of a hypothetical machine.

[Figure: a simplified pipeline. The fetch stage takes six cycles, then instructions are dispatched after a three-cycle delay to either an integer, branch, or float pipeline.]

[Figure: a float-compare enters the fetch stage, immediately followed by a branch-on-greater-or-equal.]

The fcmp instruction begins to execute, but its results won’t be ready until it leaves the eight-stage float pipeline. So, the branch instruction immediately following it can’t execute yet.

[Figure: the fcmp begins to execute, but the branch pipeline can't evaluate until the results are ready.]

It (and all the instructions behind it) gets flushed from the pipeline and has to start over from the beginning.

[Figure: the fcmp continues to execute, while everything behind it in the pipeline is flushed altogether.]

Once the fcmp is done, the branch is allowed to enter the pipe again,

[Figure: the branch instruction starts over while the fcmp executes.]

and then finally executes some time later. Even though the branch immediately follows the fcmp in the program, it doesn’t actually get executed until over twenty cycles later!

[Figure: the branch instruction finally executes after the fcmp completes.]

To solve all of these issues and others, hardware designers have invented instructions that can perform all this work in a single branchless operation. Unfortunately, compilers aren’t always smart enough to use them, so sometimes you need to do a little work yourself.

The conditional move

Because branches can be slow, many architectures implement a means of selecting between two different values based on a third without having to actually execute a comparison and jump operation. This is usually called a “conditional move” or “branchless select” and expresses the operation: if a ≥ 0 then b else c; in C++ that is

// int a, b, c;
int x = a >= 0 ? b : c ;

or, more LISPishly,

( if (>= a 0) b c )

There are numerous reasons why this is useful, but from my point of view the best is improving performance by eliminating pipeline bubbles. Unlike an if/else implemented as a branch op, a conditional move doesn’t change the flow of the program or cause it to jump from one instruction to another; it simply assigns either value b or c to the contents of register x in a single instruction. Thus, it can never cause the pipeline to clear or invalidate any of the instructions that come after it.

The floating-point conditional move operator on the PPC is called fsel, which works like:

fsel f0, f1, f2, f3 // f0 = ( f1 >= 0 ? f2 : f3 )

The most recent version of the compiler I use is pretty good at compiling this automatically for cases like this:

return a >= 0 ? b : c;
; compiles to:
fsel  fr1,fr1,fr2,fr3

and it even does the right thing with this:

// float a, b, c, d, e, f;
return ( a >= 0 ? b + 1 : c + 2 ) +
   ( d >= 0 ? e + 1 : f + 2 ) ;
; fr1 = a, fr2 = b, fr3 = c,
; fr4 = d, fr5 = e, fr6 = f
; fr0 = 1.0, fr13 = 2.0f
 fadds   fr12,fr2,fr0      ; fr12 = b + 1
 fadds   fr11,fr3,fr13     ; fr11 = c + 2
 fadds   fr10,fr5,fr0      ; fr10 = e + 1
 fadds   fr9,fr6,fr13      ; fr9  = f + 2
 fsel    fr8,fr1,fr12,fr11 ; fr8  = a >= 0 ? fr12 : fr11
 fsel    fr7,fr4,fr10,fr9  ; fr7  = d >= 0 ? fr10 : fr9
 fadds   fr1,fr8,fr7     ; return = fr8 + fr7

but it has trouble with more complicated scenarios like this:

// float a, b, c, d;

return a >= b ? c : d;

; fr1 = a, fr2 = b, fr3 = c, fr4 = d
fcmpu        cr6,fr1,fr2  ; compare fr1 and fr2
blt          cr6,$LN3     ; if compare result was 
                          ; "less than", GOTO $LN3
fmr          fr1,fr3      ; move fr3 to fr1
blr                       ; return
$LN3:
fmr          fr1,fr4      ; move fr4 to fr1
blr                       ; return

and if you use an if/else block instead of the ternary operator, all bets are off.

So, for these more complex cases, the compiler exposes an intrinsic with a prototype something like:

float fsel( float a, float x, float y );
// equivalent to { return a >= 0 ? x : y ; }

A compiler intrinsic looks like an ordinary C++ function but compiles directly to the native hardware instruction, providing you with direct access to the opcode without having to actually write assembly. The caveat is that the hardware fsel can only perform a ≥ 0 comparison (on the PPC anyway), so for complex conditionals like

if ( a / 2 < b + 3 ) { return c; } else { return d; }

you need to do a little algebra to transform the inequality a / 2 < b + 3 into a - 2b - 6 < 0 and then, because x < 0 ≡ ¬(x ≥ 0), flip the order of the operands to get return _fsel( a - 2*b - 6, d, c );. Once you’ve got everything fed into the intrinsic, the compiler can do a good job of scheduling your other instructions.
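Written out as a function, that transform looks like this (a sketch using the _fsel prototype above; the function name is mine):

// if ( a / 2 < b + 3 ) { return c; } else { return d; }
float PickBranchless( float a, float b, float c, float d )
{
  // a/2 < b+3 is the same as a - 2b - 6 < 0, so test the >= 0 condition and swap the outputs:
  return _fsel( a - 2.0f*b - 6.0f, d, c );
}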

Life beyond if

The advantage of conditional moves comes when you use them to eliminate branching from a complex mathematical operation altogether. The most obvious case is the good old float min( float a, float b ) { return a <= b ? a : b; }, which becomes return _fsel( b - a, a, b ); this means you can get through something wacky like x += min(a,b) * max(c,d) without any branches at all.
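Written out, branchless min and max built on the same intrinsic might look like this (a sketch; fmin_sel and fmax_sel are my names, and _fsel follows the prototype given above):

inline float fmin_sel( float a, float b ) { return _fsel( b - a, a, b ); } // b - a >= 0 means a <= b, so pick a
inline float fmax_sel( float a, float b ) { return _fsel( a - b, a, b ); } // a - b >= 0 means a >= b, so pick a

// the wacky expression above then compiles with no branches at all:
// x += fmin_sel( a, b ) * fmax_sel( c, d );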

Because math is costly, it may seem like a good idea to use an if statement to prevent unnecessary calculations from being performed, but this can be the wrong thing to do. Sometimes using a floating-point branch to early-out of some calculations may actually hurt performance. In some cases, the cost of the branch may be much greater than the cost of the calculation itself, and so if you’re using if/else to choose between performing one of two different calculations, it may be better to do both calculations and use fsel to choose between them afterwards.

Often the compiler can interleave the calculations so the cost of both is the same as doing only one, and you avoid the pipeline clear caused by the fcmp. For example, this code:

float out[N], inA[N], inB[N], cond[N];
for (int i = 0 ; i < N ; ++i )
{
  if ( cond[i] >= 0 )
    {     out[i] = cond[i] + inA[i];   }
  else
    {     out[i] = cond[i] + inB[i];   }
}

can be turned into:

for (int i = 0 ; i < N ; ++i )
{
    out[i] = fsel( cond[i], cond[i] + inA[i], cond[i] + inB[i] );
}

In the top version, we choose to do one of two additions based on the sign of cond[i]. In the bottom version, we perform two additions and throw away one result, but even so it is much faster! When I tested 200,000,000 iterations of the above code, the if/else version took 5.243 seconds compared with 0.724 seconds for the fsel version: a 7x speedup for not avoiding an addition!

Another example is when you have a large number of such assignments to do one after another, like

struct CFoo { float x, y, z, w; };
// CFoo out, A, B
// float cond[4]; (some number that acts as a selector)
out.x += fsel( cond[0], A.x, B.x );
out.y += fsel( cond[1], A.y, B.y );
out.z += fsel( cond[2], A.z, B.z );
out.w += fsel( cond[3], A.w, B.w );

In this case, because all of the fsel operations are independent (the result of one line does not depend on the result of another), they can all be scheduled to occur immediately after one another for a throughput of one per cycle. Had if/else been used instead (eg if ( cond[0] >= 0 ) out.x += A.x; else out.x += B.x;), then each branch would have depended on the result of the preceding comparison, meaning that we would need to pay the full 60-cycle fcmp/branch penalty for each line before starting the next one.

By now you’re probably looking at the example above and thinking, “for four floats packed together like that, I could use a SIMD instruction to compare and assign them all at once,” and if so you’re absolutely right. Both VMX (on the PowerPC) and SSE (on the x86) also have conditional move instructions that work very similarly to the scalar float operations I’ve described above. Consult your documentation for the exact semantics, but in general they work by performing a comparison operation to create a mask which is then fed into a subsequent “select” operation.
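As a sketch of that compare-then-select pattern with the SSE intrinsics (the function name is mine; it assumes an SSE-capable target):

#include <xmmintrin.h>

// out = ( cond >= 0 ? a : b ), four floats at a time
__m128 SelectGEZero( __m128 cond, __m128 a, __m128 b )
{
  __m128 mask = _mm_cmpge_ps( cond, _mm_setzero_ps() );  // all-ones in each lane where cond >= 0
  return _mm_or_ps( _mm_and_ps( mask, a ),               // keep a where the mask is set
                    _mm_andnot_ps( mask, b ) );          // keep b where it is clear
}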

The general principle is that it can be better to perform both arms of a calculation and then select between them afterwards than to use an if/else to perform only one arm of the calculation. This is even more true for SIMD units since the pipelines are often deeper.

Integer Conditional Move

All the discussion above has focused on floating-point conditional move operations. The x86 platform offers an analogous integer conditional move, called CMOV, but some PowerPC implementations lack this native opcode. Instead, you can manufacture one yourself by creating a mask and then using a bitwise-and to select between two values.

The key is to remember that for any signed int a < 0, the leftmost (most significant) bit of a will be 1, and that the C++ arithmetic-shift operator >> always preserves the sign of the word you are shifting. So, int mask = a >> 31; means the same as int mask = (a < 0 ? -1 : 0).

Once you have your mask, you can combine it with a bitwise-and and an add to make your conditional move. Given that x + ( y - x ) = y, you can generate your integer select function like so:

// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
};

The assembly for this works out to about four instructions. The performance gain usually isn’t as great as with fsel, because the integer pipeline tends to be much shorter and so there’s no pipeline clear on an integer compare followed by a branch; however, this construction does give the compiler more latitude in scheduling instructions, and it lets you avoid the prediction penalty inherent in any branch.

In addition to this, most PowerPCs have a “count leading zeroes” instruction that lets a smart compiler do something really cool with return a + ( (b - a) & ((Q == W) ? 0 : -1) );. Try it out in your disassembler and see!
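If you’re curious how that can be branchless, here’s one way it could work (a sketch using GCC’s __builtin_clz as a stand-in for the PPC cntlzw instruction, which conveniently returns 32 for an input of zero):

// returns 0 if Q == W, and -1 (all bits set) otherwise, with no branch
int EqualMask( unsigned int Q, unsigned int W )
{
  unsigned int x = Q ^ W;                          // zero iff Q == W
  unsigned int clz = x ? __builtin_clz( x ) : 32;  // cntlzw does this in a single instruction
  return (int)( clz >> 5 ) - 1;                    // 32 >> 5 = 1, minus 1 = 0; anything less gives -1
}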

Further Reading

Mike Acton writes extensively about branch elimination at his CellPerformance blog. In addition to fsel specifically, he describes many other techniques of branchless programming.

Happy Holidays! (no post today)

This is the week when all of Santa’s Elves who work in game-making-places all over the world take a hopefully well deserved break, so no game development post today. Instead, here’s some nice audio books and podcasts that I’ve enjoyed on my commutes in the past year. Happy Holidays!

  • Escape Pod‘s short-fiction podcast has not one but two Christmas specials.
  • X Minus One, The Lifeboat Mutiny. 50s-era science fiction radio play: two prospectors find themselves in possession of a lifeboat that’s more than it seems. Worth listening to for the Drone National Anthem alone!
  • G.K. Chesterton’s  The Wisdom Of Father Brown is available as a public-domain audio book from LibriVox, read by Martin Clifton. I found this collection of short stories about the eponymous priest-turned-private-detective to be fine treadmill listening.
  • WNYC’s science documentary special Radio Lab is always a singular pleasure.
  • And Subterranean Press was kind enough to produce a free audio book of Charlie Stross’ Trunk and Disorderly, a humorous adventure in a style that defies description (the closest anyone’s come is “Heinlein meets Wodehouse in space with a woolly mammoth”).

I’ll be back next week with some thoughts on replacing floating-point conditionals with branchless math.

Sentences That Should Be Carved Into Foreheads

In Paul Pedriana’s article on Electronic Arts’ custom STL I came across this Very Good Sentence (well, okay, pair of sentences):

Game applications cannot leak memory. If an application leaks even a small amount of memory, it eventually dies.

I say preach it, brother. I can think of a couple of titles where memory leaks actually hurt a game’s sales. In fact I’ve been known to blame reckless use of new for the death of entire studios. malloc() — there’s no evil too great to lay at its feet.

Paul’s sentence comes in the context of a group of justifications for EA building its own Standard Template Library, which are so concisely spot-on that I’ll reproduce them here. Call them Paul’s Commandments:


A better Windows environment variable editor

Doesn’t Windows’ built-in dialog for editing environment variables just suck? The one you get when you right-click “My Computer” and go to properties→advanced, I mean. It looks like a holdover from Windows 3.1 — a tiny, unresizable window that shows you at most part of five environment variables, and an “edit” button that leads to a similarly cramped and unresizable pair of forms which are never big enough.

[Screenshots: the built-in Windows environment variable dialogs]

Whenever I need to edit my %PATH% (which is often) I find myself copying the whole thing out of the Windows dialog, pasting it into Notepad, editing it there, and then copying-and-pasting back into the Windows dialog.

Enter the freeware Rapid Environment Editor, a sweet little tool that is infinitely better than the Windows builtin for editing %PATH% and all your other envvars without pain. Long semicolon-separated environment variables are displayed as drop-down trees (like my PATH below), and it even highlights broken folders with red text!

[Screenshot: Rapid Environment Editor]

It’s a quick 800kb download and the best thing to happen to me today.

Router QoS for Source games running on Steam

I recently upgraded the firmware on my home wireless router to Tomato, which has very configurable network quality-of-service settings so that the router knows that some traffic is more time-sensitive than others and that it should take priority when my Internet connection starts to fill up. This is so that I can download a Steam game and browse the web on my laptop while my girlfriend plays Team Fortress 2 on her PC without my causing her game to become laggy and unresponsive.

Surprisingly, when I looked around the ’net for the appropriate settings to use, I didn’t find very much useful information. The first thing I tried was to simply prioritize all TCP and UDP traffic on every port Source uses (27000-27050), but this meant that Steam game downloads were prioritized at the same level as TF2 game client traffic, which turned out to be bad: her gameplay bogged down whenever I downloaded an update from Steam (or even if she had a download running in the background on her own computer).

Fortunately it’s possible to differentiate traffic a little more finely than that: the present generation of Source games all use UDP to communicate with the server, while the Steam client uses TCP for downloads. This page on the Steam support website describes which ports get used for what purpose, and after a little experimentation (and maybe a trip or two into the L4D source code) I finally determined that these settings seem to work best for me:

If you’re having trouble with TF2 or Left4Dead behaving poorly on your network, you might want to try checking your own router’s QoS settings and maybe setting them similarly.

One more thing to remember about home router QoS is that you’re only prioritizing the traffic you uplink to the internet; the router can’t prioritize traffic you’ve already received from your ISP (but since downlink speeds are usually so much higher than uplink speeds, this usually isn’t an issue).

Also, for QoS to work properly you need to have an accurate estimate of your total upload speed. This is because all your traffic priorities are expressed as percentages of total bandwidth (eg “Low” priority traffic is guaranteed only 3% of your link while “High” gets, say, 50%), so a low upload speed estimate would cap all your connections too low, and a too-high one would allocate too much bandwidth to low priorities. High end routers such as the D-Link DGL-4300 measure your upload bandwidth automatically, but some firmwares (such as Tomato and DD-WRT) require you to measure your upload speed manually and then enter it into the control panel. You can measure your upload speed at speedtest or Speakeasy (and it’s best to enter a speed that’s 90% of your measured speed, to leave room for packet overhead).

Some routers also support Level 7 filtering, which prioritizes IP connections based on the application-level data inside them rather than simply their source and destination ports. It looks like there are filters for many games, including TF2, but looking at their implementation I suspect simple port-based QoS may work better.

December Reading List

In the realm of “things I really wish I had paid more attention to while we were crunching on Left 4 Dead”, I’ve recently been going through the talks from last June’s Gamefest. There is a lot of really great stuff in there, and I’ve only skimmed through a few of them so far, but among those few the standouts are:

  • Xbox 360 Compiler and PgoLite Update: where Bruce Dawson describes the pure awesome that has been poured into the XDK’s CPU performance-analysis tools, such as the Continuous Capture mode for PIX that lets you determine what caused a perf spike after it happens. (Yes, a perf tool can be said to contain pure awesome, for definitions of awesome encompassing branch prediction.) He also presents a lock-free pipe implementation added to the XDK libraries for interthread communication.
  • New CPU Features in the XDK: where Dawson continues his reign of awesome in a discussion of improvements to the 360 compiler. He covers faster asserts in array bounds-checking, an improvement to the __fsel intrinsic (the ?: operator isn’t so useless after all), caveats on certain VMX branching operations, and a better way to do SIMD constants. There’s also a bunch of improvements to the 360 compiler itself including better load-hit-store avoidance, better inline assembly, and a more useful /QXSTALLS pipeline simulator (which makes the compiler emit a nerd-readable assembly file for each .CPP with all your performance issues highlighted).
  • At Least We Aren’t Doing That: Real Life Performance Pitfalls: Allan Murphy lays out the fifteen most common performance screwups that the Microsoft support team sees. Each of them has been something I’ve gnashed my teeth over in the past ten years myself, and more than one is something game programmers have been dealing with since the days of the PS2 (so really we should know better by now).

I’m still working my way through these docs and planning to put together a précis on them for the devs at Valve, so hopefully I’ll be able to post that here along with some of my own thoughts on the topics as well. Half the things in “At Least We Aren’t Doing That” are certainly issues I’ve been known to get into long lunchtime rants over!

Some other interesting documents on my vast and teetering to-read pile are Gustavo Duarte’s articles on PC internals like x86 memory translation, Christer Ericson’s old GDC2003 talk on memory optimization, and Ulrich Drepper’s definitive but dense What Every Programmer Should Know About Memory. The last one in particular I’m reviewing as a background citation for some notes I’m writing myself on console data caches (because I’d rather have something online and free to link to than send everyone to the bookstore for a copy of Computer Organization And Design): it has a lot of good stuff in it, though much of it is specific to Linux running on x86 CPUs and very different from the way game consoles work. In particular, passages such as this one may bring a chuckle to PS3/Cell developers:

A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor as containing the SRAM and the rest the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM…. While this is a possible implementation it is not viable … this approach would require each process to administer in software the allocation of this memory region…. So instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors.

The transparent, all-knowing processor indeed.

Sorting Algorithms, Animated

David R. Martin of Boston College made this great webpage demonstrating eight sorting algorithms and their performance under best-case, worst-case, and average-case conditions. It’s a clever, elegant way to show how there is no such thing as “one best sorting algorithm”, but that each has its own advantages and disadvantages depending on the expected data and space available. 

 

[Animation: Quick Sort demonstration]

And, it’s kind of mesmerizing to see the little bars dancing busily back and forth as the algorithm does its thing. You can even race all the different algorithms against each other.