Down With fcmp: Conditional Moves For Branchless Math

Although it’s often considered one of the dustier branches of C++, the ?: ternary operator can sometimes compile to code that actually runs much faster than the equivalent if/else. Some compilers translate it directly to a hardware operation called a conditional move which allows the compiler to select between two values without having to perform any comparison or branch operations. As I’ll show below, in some cases, particularly where floating point numbers are involved, this can result in really significant performance gains.

The Slow Tree Is Full Of Branches

In opcode parlance, any instruction that causes the CPU to skip from one instruction to another is called a branch. If the jump occurs based on some condition being met, like in an if ( x == y ) { doSomething() ; } block, it’s called a conditional branch: in this case, the program branches to either run doSomething() or not depending on the result of a “compare x to y” operation.

For various reasons, a branch instruction can be pretty slow to execute. The important one here is that modern CPUs process their instructions in many steps along a pipeline. Like parts moving along a conveyor belt in a factory, a CPU may have as many as eighty program instructions in flight at once, each of them processed a little bit more at each of the eighty steps along the assembly line until it is finished and the final result is written back to memory.

When a conditional branch instruction executes, the CPU can’t know which instructions are supposed to come after the branch on the assembly line. Usually instructions are executed in sequence, so that opcode #1000 is followed by #1001 and that by #1002 and so on, but if #1002 turns out to be a branch to instruction #2000, then any work the processor might have done on instructions #1003-#1010 at earlier stages in the pipeline has to be thrown away. Modern CPUs and compilers have gotten quite good at mitigating this effect through branch prediction but in some cases, particularly involving floating-point numbers, branch prediction may be impossible in a way that causes the entire assembly line to be cleared out and restarted.
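The effect of prediction on an ordinary data-dependent branch can be sketched with a toy function (mine, not from the article; `sum_nonnegative` is a hypothetical name):

```cpp
#include <cstddef>

// A minimal sketch of a data-dependent branch. Summing only the
// non-negative elements takes one conditional branch per element; on
// random data that branch is mispredicted roughly half the time, while
// on sorted data the predictor gets it right almost always, so the same
// code can run several times faster on sorted input.
long long sum_nonnegative( const int* v, std::size_t n )
{
    long long total = 0;
    for ( std::size_t i = 0; i < n; ++i )
        if ( v[i] >= 0 )   // conditional branch: cheap only when predictable
            total += v[i];
    return total;
}
```

Timing this on shuffled versus sorted copies of the same array is an easy way to see the misprediction cost on your own hardware.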

To understand this, first consider this pseudoassembly for how we could compile x = (a >= 0 ? b : c) using a compare followed by a branch instruction:

// assume four registers named ra, rb, rc, rx
// where rx is the output
// and r0 contains zero
  COMPARE ra, r0  // sets the "ge" condition bit to 1 if ra >= 0
  JUMPGE  cbranch // GOTO cbranch if the "ge" bit is 1
  MOVE    rx, rc  // sets rx = rc (the ra < 0 case)
  JUMP    finish
cbranch:
  MOVE    rx, rb  // sets rx = rb (the ra >= 0 case)
finish:

There are a couple of ways that this can slow down the instruction pipeline. Firstly, it always executes at least one branch operation, which is expensive: on the PowerPC such a branch can stall execution for between five and twenty cycles depending on whether it is predicted correctly. Also, the conditional jump occurs immediately after the COMPARE operation, which can lead to another stall if the result of the COMPARE isn’t ready within a single cycle.

But with floating-point compares the situation can be much worse because float pipelines are often much longer than the integer/branch pipeline. This means that the result of the COMPARE may not be available to the branch unit for many cycles after it dispatches. When the CPU tries to branch on a result that isn’t available yet, it has no choice but to flush the entire pipeline, wait for the compare to finish and start over. For a modern 40-cycle long pipeline this can be quite costly indeed! For example, consider this simplified pipeline of a hypothetical machine.

The fetch stage takes six cycles, then instructions are dispatched after a three-cycle delay to either an integer, branch, or float pipeline.

A float-compare enters the fetch stage, immediately followed by a branch-on-greater-or-equal.

The fcmp instruction begins to execute, but its results won’t be ready until it leaves the eight-stage float pipeline, so the branch instruction immediately following it can’t evaluate yet.

Instead, the branch (and all the instructions behind it) gets flushed from the pipeline and has to start over from the beginning, while the fcmp continues to execute.

Once the fcmp is done, the branch is allowed to reenter the pipe, and then finally executes some time later. Even though the branch immediately follows the fcmp in the program, it doesn’t actually get executed until over twenty cycles later!

To solve all of these issues and others, hardware designers have invented instructions that can perform all this work in a single branchless operation. Unfortunately, compilers aren’t always smart enough to use them, so sometimes you need to do a little work yourself.

The conditional move

Because branches can be slow, many architectures implement a means of selecting between two different values based on a third without having to actually execute a comparison and jump operation. This is usually called a “conditional move” or “branchless select” and expresses the operation: if a ≥ 0 then b else c; in C++ that is

// int a, b, c;
int x = a >= 0 ? b : c ;

or, more LISPishly,

( if (>= a 0) b c )

There are numerous reasons why this is useful but from my point of view the best is improving performance by eliminating pipeline bubbles. Unlike an if/else implemented as a branch op, a conditional move doesn’t change the flow of the program or cause it to jump from one instruction to another; it simply assigns either b or c to register x in a single instruction. Thus, it can never cause the pipeline to clear or invalidate any of the instructions that come after it.

The floating-point conditional move operator on the PPC is called fsel, which works like:

fsel f0, f1, f2, f3 // f0 = ( f1 >= 0 ? f2 : f3 )

The most recent version of the compiler I use is pretty good at compiling this automatically for cases like this:

return a >= 0 ? b : c;

fsel fr1,fr1,fr2,fr3 ; fr1 = fr1 >= 0 ? fr2 : fr3

and it even does the right thing with this:

// float a, b, c, d, e, f;
return ( a >= 0 ? b + 1 : c + 2 ) +
   ( d >= 0 ? e + 1 : f + 2 ) ;
; fr1 = a, fr2 = b, fr3 = c,
; fr4 = d, fr5 = e, fr6 = f
; fr0 = 1.0, fr13 = 2.0f
 fadds   fr12,fr2,fr0      ; fr12 = b + 1
 fadds   fr11,fr3,fr13     ; fr11 = c + 2
 fadds   fr10,fr5,fr0      ; fr10 = e + 1
 fadds   fr9,fr6,fr13      ; fr9  = f + 2
 fsel    fr8,fr1,fr12,fr11 ; fr8  = a >= 0 ? fr12 : fr11
 fsel    fr7,fr4,fr10,fr9  ; fr7  = d >= 0 ? fr10 : fr9
 fadds   fr1,fr8,fr7     ; return = fr8 + fr7

but it has trouble with more complicated scenarios like this:

// float a, b, c, d;

return a >= b ? c : d;

; fr1 = a, fr2 = b, fr3 = c, fr4 = d
fcmpu        cr6,fr1,fr2  ; compare fr1 and fr2
blt          cr6,$LN3     ; if compare result was 
                          ; "less than", GOTO $LN3
fmr          fr1,fr3      ; move fr3 to fr1
blr                       ; return
$LN3:
fmr          fr1,fr4      ; move fr4 to fr1
blr                       ; return

and if you use an if/else block instead of the ternary operator, all bets are off.

So, for these more complex cases, the compiler exposes fsel as an intrinsic with a prototype something like:

float fsel( float a, float x, float y );
// equivalent to { return a >= 0 ? x : y ; }

A compiler intrinsic looks like an ordinary C++ function but compiles directly to the native hardware instruction, giving you direct access to the opcode without having to actually write assembly. The caveat is that the hardware fsel can only perform a ≥ 0 comparison (on the PPC anyway), so for complex conditionals like

if ( a / 2 < b + 3 ) { return c; } else { return d; }

you need to do a little algebra to transform the inequality a / 2 < b + 3 into a - 2b - 6 < 0 and then, because x < 0 ≡ ¬(x ≥ 0), flip the order of the operands to get return fsel( a - 2*b - 6, d, c );. Once you’ve got everything fed into the intrinsic, the compiler can do a good job of scheduling your other instructions.
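As a concrete sketch of that algebra — using a plain ternary as a portable stand-in for the intrinsic, since the intrinsic’s exact name and header vary by compiler; `fsel_ref` and `pick` are my names:

```cpp
// Portable stand-in for the fsel intrinsic: returns x when a >= 0, else y.
static inline float fsel_ref( float a, float x, float y )
{
    return a >= 0.0f ? x : y;
}

// if ( a / 2 < b + 3 ) { return c; } else { return d; }
// becomes, after rearranging the inequality against zero and swapping
// the last two operands (because t < 0 is the same as "not t >= 0"):
float pick( float a, float b, float c, float d )
{
    return fsel_ref( a - 2.0f*b - 6.0f, d, c );
}
```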

Life beyond if

The advantage of conditional moves comes when you use them to eliminate branching from a complex mathematical operation altogether. The most obvious cases are something like the good old float min( float a, float b ) { return a <= b ? a : b; }, which becomes return fsel( b - a, a, b );, which means you can get through something wacky like x += min(a,b) * max(c,d) without any branches at all.
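Sketched out (again with a plain ternary standing in for the intrinsic; `min_s`/`max_s` are hypothetical names so the snippet stands alone):

```cpp
// Stand-in for the fsel intrinsic: x when a >= 0, else y.
static inline float fsel_s( float a, float x, float y )
{
    return a >= 0.0f ? x : y;
}

static inline float min_s( float a, float b ) { return fsel_s( b - a, a, b ); }
static inline float max_s( float a, float b ) { return fsel_s( a - b, a, b ); }

// x += min_s( a, b ) * max_s( c, d );  // no branches anywhere
```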

Because math operations are costly, it may seem like a good idea to use an if statement to prevent unnecessary calculations from being performed, but this can be the wrong thing to do: using a floating-point branch to early-out of a calculation may actually hurt performance. The cost of the branch may be much greater than the cost of the calculation itself, so if you’re using if/else to choose between performing one of two different calculations, it may be better to do both calculations and use fsel to choose between the results afterwards.

Often the compiler can interleave the calculations so the cost of both is the same as doing only one, and you avoid the pipeline clear caused by the fcmp. For example, this code:

float out[N], inA[N], inB[N], cond[N];
for ( int i = 0 ; i < N ; ++i )
  if ( cond[i] >= 0 )
    {     out[i] = cond[i] + inA[i];   }
  else
    {     out[i] = cond[i] + inB[i];   }

can be turned into:

for (int i = 0 ; i < N ; ++i )
    out[i] = fsel( cond[i], cond[i] + inA[i], cond[i] + inB[i] );

In the top version, we choose to do one of two additions based on the sign of cond[i]. In the bottom version, we perform two additions and throw away one result, but even so it is much faster! When I tested 200,000,000 iterations of the above code, the if/else version took 5.243 seconds compared with 0.724 seconds for the fsel version: a 7x speedup for not avoiding an addition!

Another example is when you have a large quantity of such assignments to do one after another, like
struct CFoo { float x, y, z, w; };
// CFoo out, A, B
// float cond[4]; (some number that acts as a selector)
out.x += fsel( cond[0], A.x, B.x );
out.y += fsel( cond[1], A.y, B.y );
out.z += fsel( cond[2], A.z, B.z );
out.w += fsel( cond[3], A.w, B.w );

In this case, because all of the fsel operations are independent (the result of one line does not depend on the result of another), they can all be scheduled to occur immediately after one another for a throughput of one per cycle. Had if/else been used instead (eg if ( cond[0] >= 0 ) out.x += A.x; else out.x += B.x;), then each branch would have depended on the result of the preceding comparison, meaning that we would need to pay the full 60-cycle fcmp/branch penalty for each line before starting the next one.

By now you’re probably looking at the example above and thinking, “for four floats packed together like that, I could use a SIMD instruction to compare and assign them all at once,” and if so you’re absolutely right. Both VMX (on the PowerPC) and SSE (on the x86) also have conditional move instructions that work very similarly to the scalar float operations I’ve described above. Consult your documentation for the exact semantics, but in general they work by performing a comparison operation to create a mask which is then fed into a subsequent “select” operation.

The general principle is that it can be better to perform both arms of a calculation and then select between them afterwards than to use an if/else to perform only one arm of the calculation. This is even more true for SIMD units since the pipelines are often deeper.
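As a sketch of that mask-and-select pattern in SSE (my code, not the article’s; SSE1 has no blend instruction, so the select is built from and/andnot/or):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// For each of the four lanes: result = (cond >= 0) ? a : b.
static inline __m128 select_ge0( __m128 cond, __m128 a, __m128 b )
{
    // the compare produces an all-ones mask in each lane where cond >= 0
    __m128 mask = _mm_cmpge_ps( cond, _mm_setzero_ps() );
    // keep a where the mask is set, b where it is clear
    return _mm_or_ps( _mm_and_ps( mask, a ), _mm_andnot_ps( mask, b ) );
}
```

Four selects happen at once, and as with fsel nothing here can flush the pipeline.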

Integer Conditional Move

All the discussion above has focused on floating-point conditional move operations. The x86 platform offers an analogous integer conditional move, called CMOV, but some PowerPC implementations lack this native opcode. Instead, you can manufacture one yourself by creating a mask and then using a bitwise-and to select between two values.

The key is to remember that for any signed int a < 0, the leftmost (most significant) bit of a will be 1, and that the C++ arithmetic-shift operator >> always preserves the sign of the word you are shifting. So, int mask = a >> 31; means the same as int mask = (a < 0 ? -1 : 0).

Once you have your mask, you can combine it with a bitwise-and and an add to make your conditional move. Given that x + ( y - x ) = y, you can generate your integer select function like so:

// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right: splat the sign bit across the word
    // mask is 0xFFFFFFFF if (a < 0) and 0x00000000 otherwise.
    return x + ((y - x) & mask);
}

The assembly for this works out to about four instructions. The performance gain usually isn’t as great as with fsel, because the integer pipeline tends to be much shorter and so there’s no pipeline clear on an integer compare followed by a branch; however, this construction does give the compiler more latitude in scheduling instructions, and lets you avoid the misprediction penalty that comes with any hard-to-predict branch.
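For instance, integer min and max fall out of isel the same way they did for fsel (`imin`/`imax` are my names; isel is repeated here so the snippet stands alone, and note that a - b can overflow for inputs of extreme magnitude):

```cpp
// if a >= 0, return x, else y (assumes >> on a negative int is an
// arithmetic shift, which is implementation-defined in C++ -- true on
// the common game compilers discussed in the comments)
static inline int isel( int a, int x, int y )
{
    int mask = a >> 31;           // 0 if a >= 0, all ones otherwise
    return x + ((y - x) & mask);
}

// beware: a - b can overflow when a and b have very different magnitudes
static inline int imin( int a, int b ) { return isel( a - b, b, a ); }
static inline int imax( int a, int b ) { return isel( a - b, a, b ); }
```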

In addition, most PowerPCs have a “count leading zeroes” instruction that lets a smart compiler do something really cool with return a + ( (b - a) & ((Q == W) ? 0 : -1) );. Try it out in your disassembler and see!
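A sketch of what that can compile down to (`eq_select` is my name; __builtin_clz stands in for cntlzw, except that cntlzw(0) is defined as 32 on the PPC while the builtin is undefined for zero, hence the guard):

```cpp
// return a if Q == W, else b -- with no compare-and-branch
static inline int eq_select( int Q, int W, int a, int b )
{
    unsigned diff = static_cast<unsigned>( Q ^ W ); // zero exactly when Q == W
    int clz = diff ? __builtin_clz( diff ) : 32;    // cntlzw does this in one op
    int mask = ( clz >> 5 ) - 1;                    // 0 when Q == W, -1 otherwise
    return a + ((b - a) & mask);                    // the isel pattern again
}
```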

Further Reading

Mike Acton writes extensively about branch elimination at his CellPerformance blog. In addition to fsel specifically, he describes many other techniques of branchless programming.


  1. Sam says:

Don’t the C/C++ standards state that the result of right shifts on negative signed values is undefined? I believe the correctness of your isel function is at the mercy of the compiler/hardware platform.

    • Cody says:

I don’t remember for C++ (I suspect it’s the same), but for C it actually depends. That is to say, it is implementation-defined: right-shifting a negative signed int is a logical shift on some machines and an arithmetic shift on others. So indeed, if you rely on it you’re sacrificing portability. For unsigned values it is always a logical shift (fill with 0s), though.

  2. Elan says:

    In theory perhaps, but as game developers we have exactly four configurations to care about: MSVC for Windows; MSVC for XBox360; Sony-gcc for PS3; CodeWarrior for Wii. Each of these architectures sign-extends on right shift. In fact the whole discussion above is specific to the pipeline of the PPC inside the 360 and the PS3; on other platforms the situation might be entirely different.

    Optimization at this level really means taking advantage of the hardware and its particular implementation; if not, we wouldn’t be able to use SSE or VMX or the SPU either because they’re not perfectly portable.

  3. Adisak says:

    FWIW, if you code your isel to do a mask and mask complement, it will be faster on PowerPC since the compiler is smart enough to generate an ‘andc’ opcode. It’s the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly.

    return (x & (~mask)) + (y & mask);

  4. Mat Hendry says:

The transformation from

a >= b ? c : d

to

a - b >= 0 ? c : d

is not correct in all cases because of the possibility of overflow. That’s why the compiler won’t do it for you automatically. You need to tell it that it’s ok.

  5. […] [Edit] I’ve just been pointed at an excellent article on pipelines and branchless maths: “Down With fcmp: Conditional Moves For Branchless Math” […]

  6. Matthew W. S. Bell says:

Further to Mat Hendry, the transformation a / 2 < b + 3 → a - 2b - 6 < 0 assumes that the variables are reals, but they are floats, and exceptional cases need to be considered.

  7. […] we get a function like this (code shamelessly taken from here): [prettify class="cpp"] int isel( int a, int x, int y ) { int mask = a >> 31; // arithmetic […]
