The seemingly simple act of assigning an int to a float variable, or vice versa, can be astonishingly expensive. On some architectures it can cost 40 cycles or more! The reasons have to do with rounding and the load-hit-store issue, but the outcome is simple: never cast a float to an int inside a calculation “because the math will be faster” (it isn’t), and if you must convert a float to an int for storage, do it only once, at the end of the function, when you write the final result to memory.

This particular performance suck has two major sources: register shuffling and rounding. The first one has to do with hardware and affects all modern CPUs; the second is older and more specific to x86 compilers.

In both cases, one simple rule holds true: whenever you find yourself typing *((int *)(&anything)), you’re in for some pain.

Register Sets: The Multiple-Personality CPU

Casting floats to ints and back is something that can happen in routine tasks, like

int a = GetSystemTimeAsInt(); 
float seconds = a;

Or occasionally you may be tempted to perform bit-twiddling operations on a floating point number; for example, this might seem like a fast way to determine whether a float is positive or negative:

bool IsFloatPositive(float f)
{
    int signBit = *((int *)(&f)) & 0x80000000;
    return signBit == 0;
}
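If all you actually need is the sign test, the stall-free alternative is to keep the comparison in the float domain, where the compiler can use an ordinary floating-point compare instead of bouncing the value through memory. A minimal sketch (the name is mine; note that unlike the bit test it treats -0.0f as positive):

```c
#include <stdbool.h>

// Same test, expressed as a float compare: the value never has to leave
// the FPU/SSE register set, so there is no load-hit-store.
// (Caveat: unlike the sign-bit test, this treats -0.0f as positive.)
bool IsFloatPositiveFast(float f)
{
    return f >= 0.0f;
}
```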

The problem here is that casting from one register type to another like this is an almost sure way to induce a load-hit-store. In the x86 and the PPC, integers, floating-point numbers, and SIMD vectors are kept in three separate register sets. As Becky Heineman wrote in Gamasutra, you can “think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data.”

Integer operations (like add, sub, and bitwise ops) work only on integer registers, floating-point operations (like fadd, fmul, fsqrt) only on the FPU registers, and SIMD ops only touch vector registers. This is true of both the PowerPC and the x86 and nearly every modern CPU (with one exception, see below).

This makes the CPU designer’s job much easier (because each of these units can have a pipeline of different depth), but it means there is no means of directly moving data from one kind of register to another. There is no “move” operation that has one integer and one float operand. Basically, there are simply no wires that run directly between the int, float, and vector registers.

So, whenever you move data from an int to a float, the CPU first stores the integer from the int register to memory, and then in the next instruction reads from that memory address into the float register. This is the very definition of a load-hit-store stall, because that first store may take as many as 40 cycles to make it all the way out to the L1 cache, and the subsequent load can’t proceed until it has finished. On an in-order processor like the 360 or the PS3’s PPC, that means everything stops dead for between 40 and 80 cycles; on an out-of-order x86, the CPU will try to skip ahead to some of the subsequent instructions, but can usually only hide a little bit of that latency.
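In other words, an innocent-looking conversion compiles down to a store followed by a dependent load. The comments below sketch the rough shape of the generated code (exact opcodes vary by compiler and target):

```c
// What the hardware actually has to do for an int-to-float move, roughly:
//   store   a  -> some stack slot       (out of the integer register file)
//   load       <- that same stack slot  (into the float register file; this
//                                        load stalls until the store retires)
//   convert    in the FPU
float IntToFloat(int a)
{
    return (float)a;  // looks free, costs a potential load-hit-store
}
```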

It’s not the actual conversion of ints to floats that is slow (const int *pA; float f = *pA; can happen in a couple of cycles if the contents of pA are already in cache); what’s slow is moving data between the different kinds of registers, because the data has to make a round trip through memory first.

What this all boils down to is that you should simply avoid mixing ints, floats, and vectors in the same calculation. So, for example, instead of

struct Foo { int a, b; };
float func( Foo *data )
{
  float x = sqrtf( ( data->a << 1 ) - data->b );
  return x;
}

you are really better off with

float func( Foo *data )
{
  float fA = data->a;
  float fB = data->b;
  float x = sqrtf( ( fA * 2.0f ) - fB );
  return x;
}

More importantly, if you have wrapped your native SIMD type in a union like

typedef union {
  __vector V;                    // native 128-bit VMX register
  struct { float x, y, z, w; };  // scalar view of the same bits
} vec4;

then you really need to avoid accessing the individual float members after working with it as a vector. Never do this:

vec4 A,B,C;
C = VectorCrossProductInVMX(A, B);
float x = C.y * A.x * 2.0f;
return x;
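The same rule applies to SSE on x86. If you do need a scalar-flavored result out of a vector computation, the cheap way is to broadcast the lanes you need with shuffles and do the math in the vector unit, extracting a scalar only at the very last moment. A sketch (ScaledLane is a made-up name, and the inputs stand in for whatever cross-product result you computed):

```c
#include <xmmintrin.h>

// Compute c.y * a.x * 2.0f without ever reading individual floats back
// through memory: splat the lanes we need, multiply in the vector unit,
// and move one scalar out only at the end.
float ScaledLane(__m128 c, __m128 a)
{
    __m128 cy = _mm_shuffle_ps(c, c, _MM_SHUFFLE(1, 1, 1, 1)); // splat c.y
    __m128 ax = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, 0)); // splat a.x
    __m128 r  = _mm_mul_ps(_mm_mul_ps(cy, ax), _mm_set1_ps(2.0f));
    return _mm_cvtss_f32(r); // single move out, at the very end
}
```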

There is one notable exception: the Cell processor SPU as found in the PS3. In the SPUs, all operations — integers, floats, and vectors — operate on the same set of registers and you can mix them as much as you like.

Now, to put this in context, 80 cycles isn’t the end of the world. If you’re performing a hundred float-to-int casts per frame in high level functions, it’ll amount to less than a microsecond. On the other hand it only takes 20,000 such casts to eat up 5% of your frame on the 360, so if this is the sort of thing you’re doing in, say, AccumulateSoundBuffersByAddingTogetherEverySample(), it may be something to look at.

Rounding: Say Hello To My Little fist

In times gone by one of the easiest performance speedups to be had on the PC and its x86-based architecture was usually found in the way your program rounded floats.

The most obvious and least hardware-specific cost of int a = float b is that you somehow have to get rid of the fractional part of the number to turn it into a whole integer; in other words, the CPU has to turn 3.14159 into 3 without involving the Indiana State Legislature. That’s simple enough, but what if the number is 3.5 — do you round it up, or down? How about -3.5 — up or down? And how about on x86-based chips, where floating point numbers are calculated inside the CPU as 80 bits but an int is 32 bits?

At the time the Intel x87 floating-point coprocessor was invented, the IEEE 754 floating point standard specified that rounding could happen in one of four ways:

  1. Round to nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest even number
    ( 3.3 → 3 , 3.5 → 4, -2.5 → -2 )
  2. Round to zero – also known as truncate, simply throws away everything after the decimal point
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -2 )
  3. Round up – rounds toward positive infinity
    ( 3.3 → 4 , 3.5 → 4, -2.5 → -2 )
  4. Round down – rounds toward negative infinity
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -3 ).

The x87 allows you to select any of these modes by setting or clearing a couple of bits in a special control register. Reading and writing that register is a very slow operation, because it means the processor has to totally throw away anything that came behind it in the pipeline and start over, so it’s best to change modes as little as possible, or not at all.

The actual rounding operation can be done in one instruction (the amusingly named fist op, which means “float-to-int store”), but there’s a snag. The ANSI C standard decrees that one and only one of these modes may ever be used for int a = float b: truncate. But, because the compiler can never be sure what rounding mode is set when you enter any particular function (you might have called into some library that set it differently), it would call a function called _ftol(), which set this mode each and every time a number was rounded. In fact, what it actually did for every cast was:

  1. Call into _ftol()
  2. Check the old rounding mode and save it to memory
  3. Set the rounding mode to “truncate” (this causes a pipeline clear)
  4. Round the number
  5. Set the rounding mode back to the one it saved in step 2 (another pipeline clear)
  6. Return from _ftol()

Because of this it wasn’t unusual to see a game spending over 6% of its time inside _ftol() alone. (In fact I can say with a straight face that I once saw a profile where a game spent fully 8% of each frame on fisting.) This is an extreme case of the compiler choosing correctness over speed.

You’re thinking the answer is “well, how about I just set the rounding mode to start with and tell the compiler not to obsess so much about exact correctness?” and you’re right. The solution in MSVC is to supply the /QIfist compiler option, which tells the compiler to assume the current rounding mode is correct and simply issue the hardware float-to-int op directly. This saves you the function call and two pipeline clears. If your rounding mode gets changed elsewhere in the program you might get unexpected results, but… you know… don’t do that.

Microsoft’s documentation claims that /QIfist is “deprecated” because their floating-point code is much faster now, but if you try it out you’ll see they’re fibbing. What happens now is that the compiler calls _ftol2_sse(), which uses the modeless SSE conversion op cvttsd2si instead of the old _ftol(). This has some advantages (you can pick between truncation and rounding for each operation without having to change the CPU’s rounding mode), but it’s still a needless function call where an opcode would do, and it shuffles data between the FPU and SSE registers, which brings us back to the LHS issue mentioned above. On my Intel Core2 PC, a simple test of calling the function below is twice as fast with the compiler options /fp:fast /QIfist specified compared with /fp:fast alone.

void test1(volatile int *a, volatile float *f)
{
  for (int i=0; i < 1000000 ; ++i)
    *a = (int) *f;
}

On the other hand, in an absolute sense _ftol2_sse() is pretty fast so it may be good enough.

It’s also possible to convert floats to ints by adding them to a certain magic number, but this isn’t always a benefit. In times of yore the fistp op was slow, so there was an advantage to replacing the fist with a fadd, but this doesn’t seem to be the case any more. It is faster than an implicit call to _ftol2_sse, and it has the advantage of not depending on the CPU’s current rounding mode (since you can pick your magic number to choose between rounding and truncation). On the other hand if you’ve specified /arch:sse2 and the compiler is using SSE scalar operations instead of the x87 FPU generally, then it’s faster to let it use the native cvttss2si op.
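For the curious, the classic double-precision form of the magic-number trick (as popularized by the Stereopsis “xs” conversions discussed in the comments below) looks something like this; it relies on the default round-to-nearest mode and on little-endian layout, and the helper name is mine:

```c
// Magic-number round-to-nearest: adding 1.5 * 2^52 forces the rounded
// integer into the low mantissa bits of the double, where we can read it
// back as a plain int. Assumes round-to-nearest (the default mode), a
// little-endian CPU, and a result that fits in 32 bits.
static int MagicRound(double d)
{
    const double magic = 6755399441055744.0;  // 1.5 * 2^52
    union { double d; long long i; } u;
    u.d = d + magic;
    return (int)u.i;  // the low 32 bits hold the rounded result
}
```

Note that this rounds to nearest (ties to even) rather than truncating; a different magic number gives you truncation, which is exactly the “pick your magic number” flexibility mentioned above.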

On the 360/PS3 CPU, the magic number technique is usually a performance hit, because most of the magic-number tricks involve an integer step on the floating-point number and run into the register-partitioning issue mentioned above.

12 Comments

  1. cb says:

    Good article, but this -

    “It’s also possible to convert floats to ints by adding them to a certain magic number, but usually this isn’t a benefit. In times of yore the fistp op was slow, so there was an advantage to replacing the fist with a fadd, but this hasn’t been the case since the Pentium 3. Furthermore, most of the magic-number tricks involve an integer step on the floating-point number, which is disastrously slow because of the way the registers are partitioned.”

    is just not true on x86 PC CPU’s. My measurements roughly match the Stereopsis XS measurements – the magic number based rounding is still faster than fist by a little bit, and it has the advantage of not relying on compiler settings or FPU rounding mode. (of course on Xenon and Cell it’s a different story)

    note that’s only true for round-to-int, the XS truncs are a little bit slower than fist or cvtt2sse.

    Also the big advantage of the sse instructions is that you have truncate & round both available at any time without changing modes.

  2. Elan says:

    Hmm, let me test that again and if so I’ll revise the article. I tried it once on my Core2, but it’s possible the compiler may have actually optimized out my profile loop. Thanks for the note.

  3. Elan says:

    What hardware are you running, cb? I just tried a test that converts an array of 1024 floats 100,000 times (i.e., performs 1.024×10^8 conversions while fitting in cache) and here are the results for my Intel Core2 @2.4ghz:


    _ftol2_sse versus magic-number truncation (4 trials):

        _ftol2_sse    magic number
        313.932ms     182.816ms
        319.619ms     181.992ms
        311.646ms     178.677ms
        310.179ms     177.646ms

    fistp (via /QIfist) versus magic-number truncation (4 trials):

        fistp         magic number
        185.479ms     179.934ms
        180.314ms     182.722ms
        183.951ms     179.802ms
        178.270ms     178.007ms

    It looks like the magic-number trick varies between 1% slower and 2% faster than native fistp, which is basically the same.

  4. syskill says:

    I did some digging into whether and how this is handled in my world; it seems that the trick to avoiding fldcw in GCC is to use the lrint() function from the C99 standard, and then compile with -ffast-math so that it will be inlined.

    Maybe lrint() with /fp:fast will give the desired results in MSVC without resorting to deprecated switches?

  5. Elan says:

    Wow, MSVC fail:

    error C3861: ‘lrint’: identifier not found

    =(

  6. cb says:

    I’m a P3 and get numbers about the same as the stereopsis numbers (eg. the xs addition trick is slightly faster than fistp). My objection was to this part :

    “Furthermore, most of the magic-number tricks involve an integer step on the floating-point number, which is disastrously slow because of the way the registers are partitioned”

    which is true on consoles but not on x86. The big win of using the magic number trick on PC’s is that you can distribute code or put it in new projects and not worry about what compiler settings they’re using. Also the advantage over fistp is you don’t have to worry about how the FPU rounding mode is set.

    Note that the magic number truncate is really slow, though, only the magic number round-to-int is fast.

  7. Elan says:

    Thanks for the clarification, cb! I’ve corrected that passage.

  8. Zavie says:

    Thank you for this detailed and interesting read.
    I guess figures comes with experience, but this is a warning worth knowing.

  9. GameCoder.it − Il cast float→int says:

    [...] why-you-should-never-cast-floats-to-ints [1] fast-floating-point-to-integer-conversions [...]

  10. lubos says:

    Hey, thank you for this write up. I was profiling some code today using the AMD Code Analyst and I noticed that the single slowest operation in the program was the line with “i=(int)f”. My reaction was, what, why is that taking so long? So I started digging around. Gotta say I had absolutely no idea that casting float to int would take so long, it seems like such a simple operation.

  11. FeepingCreature says:

    I used your fastfloor code for my Simplex Noise impl. It works very well; thanks a bunch for the article :)

  12. Teaching the high-performance mindset » AltDevBlogADay (Staging Site) says:

    [...] resources to pass around: cast article and avoiding LHS using the restrict [...]
