The seemingly simple act of assigning an int to a float variable, or vice versa, can be astonishingly expensive. In some architectures it may cost 40 cycles or even more! The reasons for this have to do with rounding and the load-hit-store issue, but the outcome is simple: never cast a float to an int inside a calculation “because the math will be faster” (it isn’t), and if you must convert a float to an int for storage, do it only once, at the end of the function, when you write the final outcome to memory.
This particular performance suck has two major sources: register shuffling and rounding. The first one has to do with hardware and affects all modern CPUs; the second is older and more specific to x86 compilers.
In both cases, one simple rule holds true: whenever you find yourself typing *((int *)(&anything)), you’re in for some pain.
Register Sets: The Multiple-Personality CPU
Casting floats to ints and back is something that can happen in routine tasks, like
int a = GetSystemTimeAsInt(); float seconds = a;
Or occasionally you may be tempted to perform bit-twiddling operations on a floating point number; for example, this might seem like a fast way to determine whether a float is positive or negative:
bool IsFloatPositive(float f) { int signBit = *((int *)(&f)) & 0x80000000; return signBit == 0; }
The problem here is that casting from one register type to another like this is an almost sure way to induce a load-hit-store. In the x86 and the PPC, integers, floating-point numbers, and SIMD vectors are kept in three separate register sets. As Becky Heineman wrote in Gamasutra, you can “think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data.”
Integer operations (like add, sub, and bitwise ops) work only on integer registers, floating-point operations (like fadd, fmul, fsqrt) only on the FPU registers, and SIMD ops only touch vector registers. This is true of both the PowerPC and the x86 and nearly every modern CPU (with one exception, see below).
This makes the CPU designer’s job much easier (because each of these units can have a pipeline of different depth), but it means there is no means of directly moving data from one kind of register to another. There is no “move” operation that has one integer and one float operand. Basically, there are simply no wires that run directly between the int, float, and vector registers.
So, whenever you move data from an int to a float, the CPU first stores the integer from the int register to memory, and then in the next instruction reads from that memory address into the float register. This is the very definition of a load-hit-store stall, because that first store may take as many as 40 cycles to make it all the way out to the L1 cache, and the subsequent load can’t proceed until it has finished. On an in-order processor like the 360 or the PS3′s PPC, that means everything stops dead for between 40 and 80 cycles; on an out-of-order x86, the CPU will try to skip ahead to some of the subsequent instructions, but can usually only hide a little bit of that latency.
It’s not the actual conversion of ints to floats that is slow — const int *pA; float f = *pA; can happen in two cycles if the contents of pA are already in memory — but moving data between the different kinds of registers that is slow because the data has to get to and from the memory first.
What this all boils down to is that you should simply avoid mixing ints, floats, and vectors in the same calculation. So, for example, instead of
struct {int a,b} Foo; float func( Foo *data ) { float x = fsqrt((( data->a << 1 ) - data->b)); return x; }
you are really better off with
float func( Foo *data ) { float fA = data->a; float fB = data->b; float x = fsqrt((( fA * 2.0f ) - fB )); return x; }
More importantly, if you have wrapped your native SIMD type in a union like
union { __vector V; // native 128-bit VMX register struct { float x,y,z,w; } } vec4; |
then you really need to avoid accessing the individual float members after working with it as a vector. Never do this:
vec4 A,B,C; C = VectorCrossProductInVMX(A, B); float x = C.y * A.x * 2.0f; return x;
There is one notable exception: the Cell processor SPU as found in the PS3. In the SPUs, all operations — integers, floats, and vectors — operate on the same set of registers and you can mix them as much as you like.
Now, to put this in context, 80 cycles isn’t the end of the world. If you’re performing a hundred float-to-int casts per frame in high level functions, it’ll amount to less than a microsecond. On the other hand it only takes 20,000 such casts to eat up 5% of your frame on the 360, so if this is the sort of thing you’re doing in, say, AccumulateSoundBuffersByAddingTogetherEverySample(), it may be something to look at.
Rounding: Say Hello To My Little fist
In times gone by one of the easiest performance speedups to be had on the PC and its x86-based architecture was usually found in the way your program rounded floats.
The most obvious and least hardware-specific cost of int a = float b is that you somehow have to get rid of the fractional part of the number to turn it into a whole integer; in other words, the CPU has to turn 3.14159 into 3 without involving the Indiana State Legislature. That’s simple enough, but what if the number is 3.5 — do you round it up, or down? How about -3.5 — up or down? And how about on x86-based chips, where floating point numbers are calculated inside the CPU as 80 bits but an int is 32 bits?
At the time the Intel x87 floating-point coprocessor was invented, the IEEE 754 floating point standard specified that rounding could happen in one of four ways:
- Round to nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest even number
( 3.3 → 3 , 3.5 → 4, -2.5 → -2 ) - Round to zero – also known as truncate, simply throws away everything after the decimal point
( 3.3 → 3 , 3.5 → 3, -2.5 → -2 ) - Round up –
( 3.3 → 4 , 3.5 → 4, -2.5 → -2 ) - Round down –
( 3.3 → 3 , 3.5 → 3, -2.5 → -3 ).
The x87 allows you to select any of these modes by setting or clearing a couple of bits in a special control register. Reading and writing that register is a very slow operation, because it means the processor has to totally throw away anything that came behind it in the pipeline and start over, so it’s best to change modes as little as possible, or not at all.
The actual rounding operation can be done in one instruction (the amusingly named fist op, which means “float-to-int store”), but there’s a snag. The ANSI C standard decrees that one and only one of these modes may ever be used for int a = float b: truncate. But, because the compiler can never be sure what rounding mode is set when you enter any particular function (you might have called into some library that set it differently), it would call a function called _ftol(), which set this mode each and every time a number was rounded. In fact, what it actually did for every cast was:
- Call into _ftol()
- Check the old rounding mode and save it to memory
- Set the rounding mode to “truncate” (this causes a pipeline clear)
- Round the number
- Set the rounding mode back to the one it saved in step one (another pipeline clear)
- Return from _ftol()
Because of this it wasn’t unusual to see a game spending over 6% of its time inside _ftol() alone. (In fact I can say with a straight face that I once saw a profile where a game spent fully 8% of each frame on fisting.) This is an extreme case of the compiler choosing correctness over speed.
You’re thinking the answer is “well, how about I just set the rounding mode to start with and tell the compiler not to obsess so much about the exact correctness?” and you’re right. The solution in MSVC is to supply the /QIfist compiler option, which tells the compiler to assume the current rounding mode is correct and simply issue the hardware float-to-int op directly. This saves you the function call and two pipeline clears. If your rounding mode gets changed elsewhere in the program you might get unexpected results, but… you know.. don’t do that.
Microsoft’s documentation claims that /QIfist is “deprecated” due to their floating-point code being much faster now, but if you try it out you’ll see they’re fibbing. What happens now is that they call to _ftol2_sse() which uses the modeless SSE conversion op cvttsd2si instead of old _ftol(). This has some advantages — you can pick between truncation and rounding for each operation without having to change the CPU’s rounding mode — it’s still a needless function call where an opcode would do, and it shuffles data between the FPU and SSE registers which brings us back to the LHS issue mentioned above. On my Intel Core2 PC, a simple test of calling the function below is twice as fast with compiler options /fp:fast /QIfist specified compared with only /fp:fast.
void test1(volatile int *a, volatile float *f) { for (int i=0; i < 1000000 ; ++i) *a = (int) *f; }
On the other hand, in an absolute sense _ftol2_sse() is pretty fast so it may be good enough.
It’s also possible to convert floats to ints by adding them to a certain magic number, but this isn’t always a benefit. In times of yore the fistp op was slow, so there was an advantage to replacing the fist with a fadd, but this doesn’t seem to be the case any more. It is faster than an implicit call to _ftol2_sse, and it has the advantage of not depending on the CPU’s current rounding mode (since you can pick your magic number to choose between rounding and truncation). On the other hand if you’ve specified /arch:sse2 and the compiler is using SSE scalar operations instead of the x87 FPU generally, then it’s faster to let it use the native cvttss2si op.
On the 360/PS3 CPU, the magic number technique is usually a performance hit, because most of the magic-number tricks involve an integer step on the floating-point number and run into register-partitioning issue mentioned above.
Further Reading
- My primary source for the register-type load-hit-store issue was the XDK documentation; consult your own for more details.
- David Goldberg’s What Every Computer Scientist Should Know About Floating-Point Arithmetic is the definitive survival guide for the IEEE754 wilds.
- You can learn more about the “magic integer” technique for converting between floats and ints here and here, and I also found an interesting article about the state of affairs circa 2000.