Posts tagged ‘LHS’

The seemingly simple act of assigning an int to a float variable, or vice versa, can be astonishingly expensive. In some architectures it may cost 40 cycles or even more! The reasons for this have to do with rounding and the load-hit-store issue, but the outcome is simple: never cast a float to an int inside a calculation “because the math will be faster” (it isn’t), and if you must convert a float to an int for storage, do it only once, at the end of the function, when you write the final outcome to memory.

This particular performance suck has two major sources: register shuffling and rounding. The first one has to do with hardware and affects all modern CPUs; the second is older and more specific to x86 compilers.

In both cases, one simple rule holds true: whenever you find yourself typing *((int *)(&anything)), you’re in for some pain.

Register Sets: The Multiple-Personality CPU

Casting floats to ints and back is something that can happen in routine tasks, like

int a = GetSystemTimeAsInt(); 
float seconds = a;

Or occasionally you may be tempted to perform bit-twiddling operations on a floating point number; for example, this might seem like a fast way to determine whether a float is positive or negative:

bool IsFloatPositive(float f)
{
    int signBit = *((int *)(&f)) & 0x80000000;
    return signBit == 0;
}

The problem here is that casting from one register type to another like this is an almost sure way to induce a load-hit-store. In the x86 and the PPC, integers, floating-point numbers, and SIMD vectors are kept in three separate register sets. As Becky Heineman wrote in Gamasutra, you can “think of the PowerPC as three completely separate CPUs, each with its own instruction set, register set, and ways of performing operations on the data.”

Integer operations (like add, sub, and bitwise ops) work only on integer registers, floating-point operations (like fadd, fmul, fsqrt) only on the FPU registers, and SIMD ops only touch vector registers. This is true of both the PowerPC and the x86 and nearly every modern CPU (with one exception, see below).

This makes the CPU designer’s job much easier (because each of these units can have a pipeline of different depth), but it means there is no means of directly moving data from one kind of register to another. There is no “move” operation that has one integer and one float operand. Basically, there are simply no wires that run directly between the int, float, and vector registers.

So, whenever you move data from an int to a float, the CPU first stores the integer from the int register to memory, and then in the next instruction reads from that memory address into the float register. This is the very definition of a load-hit-store stall, because that first store may take as many as 40 cycles to make it all the way out to the L1 cache, and the subsequent load can’t proceed until it has finished. On an in-order processor like the 360 or the PS3′s PPC, that means everything stops dead for between 40 and 80 cycles; on an out-of-order x86, the CPU will try to skip ahead to some of the subsequent instructions, but can usually only hide a little bit of that latency.

It’s not the actual conversion of ints to floats that is slow — const int *pA; float f = *pA; can happen in two cycles if the contents of pA are already in memory — but moving data between the different kinds of registers that is slow because the data has to get to and from the memory first.

What this all boils down to is that you should simply avoid mixing ints, floats, and vectors in the same calculation. So, for example, instead of

struct {int a,b} Foo;
float func( Foo *data )
{
  float x = fsqrt((( data->a << 1 ) - data->b));
  return x;
}

you are really better off with

float func( Foo *data )
{
  float fA = data->a;
  float fB = data->b;
  float x = fsqrt((( fA * 2.0f ) - fB ));
  return x;
}

More importantly, if you have wrapped your native SIMD type in a union like

union {
  __vector V;  // native 128-bit VMX register
  struct { float x,y,z,w; }
} vec4;

then you really need to avoid accessing the individual float members after working with it as a vector. Never do this:

vec4 A,B,C;
C = VectorCrossProductInVMX(A, B);
float x = C.y * A.x * 2.0f;
return x;

There is one notable exception: the Cell processor SPU as found in the PS3. In the SPUs, all operations — integers, floats, and vectors — operate on the same set of registers and you can mix them as much as you like.

Now, to put this in context, 80 cycles isn’t the end of the world. If you’re performing a hundred float-to-int casts per frame in high level functions, it’ll amount to less than a microsecond. On the other hand it only takes 20,000 such casts to eat up 5% of your frame on the 360, so if this is the sort of thing you’re doing in, say, AccumulateSoundBuffersByAddingTogetherEverySample(), it may be something to look at.

Rounding: Say Hello To My Little fist

In times gone by one of the easiest performance speedups to be had on the PC and its x86-based architecture was usually found in the way your program rounded floats.

The most obvious and least hardware-specific cost of int a = float b is that you somehow have to get rid of the fractional part of the number to turn it into a whole integer; in other words, the CPU has to turn 3.14159 into 3 without involving the Indiana State Legislature. That’s simple enough, but what if the number is 3.5 — do you round it up, or down? How about -3.5 — up or down? And how about on x86-based chips, where floating point numbers are calculated inside the CPU as 80 bits but an int is 32 bits?

At the time the Intel x87 floating-point coprocessor was invented, the IEEE 754 floating point standard specified that rounding could happen in one of four ways:

  1. Round to nearest – rounds to the nearest value; if the number falls midway it is rounded to the nearest even number
    ( 3.3 → 3 , 3.5 → 4, -2.5 → -2 )
  2. Round to zero – also known as truncate, simply throws away everything after the decimal point
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -2 )
  3. Round up –
    ( 3.3 → 4 , 3.5 → 4, -2.5 → -2 )
  4. Round down –
    ( 3.3 → 3 , 3.5 → 3, -2.5 → -3 ).

The x87 allows you to select any of these modes by setting or clearing a couple of bits in a special control register. Reading and writing that register is a very slow operation, because it means the processor has to totally throw away anything that came behind it in the pipeline and start over, so it’s best to change modes as little as possible, or not at all.

The actual rounding operation can be done in one instruction (the amusingly named fist op, which means “float-to-int store”), but there’s a snag. The ANSI C standard decrees that one and only one of these modes may ever be used for int a = float b: truncate. But, because the compiler can never be sure what rounding mode is set when you enter any particular function (you might have called into some library that set it differently), it would call a function called _ftol(), which set this mode each and every time a number was rounded. In fact, what it actually did for every cast was:

  1. Call into _ftol()
  2. Check the old rounding mode and save it to memory
  3. Set the rounding mode to “truncate” (this causes a pipeline clear)
  4. Round the number
  5. Set the rounding mode back to the one it saved in step one (another pipeline clear)
  6. Return from _ftol()

Because of this it wasn’t unusual to see a game spending over 6% of its time inside _ftol() alone. (In fact I can say with a straight face that I once saw a profile where a game spent fully 8% of each frame on fisting.) This is an extreme case of the compiler choosing correctness over speed.

You’re thinking the answer is “well, how about I just set the rounding mode to start with and tell the compiler not to obsess so much about the exact correctness?” and you’re right. The solution in MSVC is to supply the /QIfist compiler option, which tells the compiler to assume the current rounding mode is correct and simply issue the hardware float-to-int op directly. This saves you the function call and two pipeline clears. If your rounding mode gets changed elsewhere in the program you might get unexpected results, but… you know.. don’t do that.

Microsoft’s documentation claims that /QIfist is “deprecated” due to their floating-point code being much faster now, but if you try it out you’ll see they’re fibbing. What happens now is that they call to _ftol2_sse() which uses the modeless SSE conversion op cvttsd2si instead of old _ftol(). This has some advantages — you can pick between truncation and rounding for each operation without having to change the CPU’s rounding mode — it’s still a needless function call where an opcode would do, and it shuffles data between the FPU and SSE registers which brings us back to the LHS issue mentioned above. On my Intel Core2 PC, a simple test of calling the function below is twice as fast with compiler options /fp:fast /QIfist specified compared with only /fp:fast.

void test1(volatile int *a, volatile float *f)
{
  for (int i=0; i < 1000000 ; ++i)
    *a = (int) *f;
}

On the other hand, in an absolute sense _ftol2_sse() is pretty fast so it may be good enough.

It’s also possible to convert floats to ints by adding them to a certain magic number, but this isn’t always a benefit. In times of yore the fistp op was slow, so there was an advantage to replacing the fist with a fadd, but this doesn’t seem to be the case any more. It is faster than an implicit call to _ftol2_sse, and it has the advantage of not depending on the CPU’s current rounding mode (since you can pick your magic number to choose between rounding and truncation). On the other hand if you’ve specified /arch:sse2 and the compiler is using SSE scalar operations instead of the x87 FPU generally, then it’s faster to let it use the native cvttss2si op.

On the 360/PS3 CPU, the magic number technique is usually a performance hit, because most of the magic-number tricks involve an integer step on the floating-point number and run into register-partitioning issue mentioned above.

Further Reading

An anonymous commenter asks:

I thought that only compatable types could alias? So for example, an int array can’t alias a float array, and a CFoo* can’t alias it’s own member?

Something to remember about the __restrict keyword is that it doesn’t affect how the compiler itself aliases pointers: instead, it’s about what the compiler can assume you, the progammer, may or may not have aliased.

If there is any chance two pointers alias each other, the compiler is forced to reload the contents of each from memory after every store operation.

In the case of class member functions, the compiler of course knows that different data members of this can’t alias one another. The issue with MemberFunc in CFoo below isn’t whether &m_iBar and &m_iBaz alias one another, but whether a == &m_iBar.

class CFoo
{
public:
	int MemberFunc( int *a, int x );
	int MemberFuncWithRestrict( int * __restrict a, int x );
	int m_iBar;
	int m_iBaz;
};
 
int CFoo::MemberFunc( int *a, int x )
{
	m_iBar = x;
	*a = 7; // if a == &amp;m_iBar, then m_iBar might be 7 now.
	        // the processor must load it back from memory.
	return m_iBar + *a; // load-hit-store
}

MemberFunc compiles to something like:

; int CFoo::MemberFunc( int *a, int b );
; by convention, CFoo * this is on register 3,
; int *a is on register 4, and
; x is on register 5
li r10,7       ; set r10 to 7
stw  r5,0(r3)  ; store r11 to this->m_iBar
stw  r10,0(r4) ; store r10 through pointer a
; now, because a might or might not point to m_iBar,
; we need to load back the contents of m_iBar just in case.
; the compiler *cannot* assume that a != &m_iBar, and
; the load below will stall until the previous store to
; m_iBar is completely finished -- at least forty cycles!
lwz r11,0(r3)  ; load this->m_iBar from memory to r11
add r3,r11,r10 ; r3 = r11 + r10
blr            ; return r3

Now, if we use the __restrict keyword to promise the compiler that a doesn’t alias anything under this — in particular, that a != &m_iBar, it can make some assumptions and avoid loading m_iBar back from memory after it has been stored:

int CFoo::MemberFuncWithRestrict( int * __restrict a, int x ) __restrict
{
	m_iBar = x;
	*a = 7; // __restrict THIS promises that a != &amp;m_iBar
	return m_iBar + *a; // no load-hit-store
}

compiles to

li r10,7        ; set r10 to 7
stw r5,0(r3)    ; store r5 to this->m_iBar
stw r10,0(r4)   ; store r10 through pointer a
;; because the compiler knows that the write to a
;; cannot affect the contents of m_iBar, there is
;; no need for it to load m_iBar back into memory;
;; the store above cannot have changed it
add r3, r5, r10 ; r3 = r5 + r10
blr             ; return r3

This function runs much faster, avoiding the load-hit store on reading back m_iBar, but it is only correct so long as the promise is true. If you call the function like this:

CFoo foo;
out = foo.MemberFuncWithRestrict( &amp;foo.m_iBar, 5 );

you will get incorrect results, in this case a return value of 12 instead of 14.

Even if two pointers are different types, the compiler still has no way of knowing whether they actually point to different locations in memory.

EDIT: the following section is only true if your compiler does not have strict aliasing turned on. In GCC ≥4 it is on by default unless you specify -f-no-strict-aliasing. In MSVC it is off by default and I could not find how to turn it on. Mike Acton discusses strict aliasing in great detail here. Thanks, Arseny!

For example, consider the function

float slow( int *a, float *b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

This function could be called from a different cpp in a way that aliased the pointers to the same location:

int foo[8]; float bar[8];
bar[0] = slow( foo + 0, bar + 0 ); // pointers are not aliased
foo[0] = slow( foo + 1, (float *)(foo + 1) ); // pointers are aliased

Granted, this is a contrived example, but also one that’s completely legal. It underscores how the compiler can’t make any assumptions about pointer aliasing unless you hold its hand. Even though a and b are different types, the programmer can still deliberately alias them. The function slow above compiles to something like this on the PowerPC:

; assembly for float slow( int *a, float *b )
; by convention, parameter a is stored on r3 and parameter b on r4.
lis r11,__real@40e00000 ; load address of constant 7.0f
li           r10,5     ; load 5 onto r10
stw          r10,0(r3) ; save r10 through pointer a
lfs          fr0,__real@40e00000(r11) ; load 7.0f onto fr0
stfs         fr0,0(r4) ; save 7.0f through pointer b
;; the following line causes the load-hit store:
;; because a and b might be aliased, the compiler must
;; now load back through pointer a in case the write to
;; b overwrote its contents!
;; the following instruction will cause a pipeline stall.
lwz          r9,0(r3)    ; load integer pointed to by a
std          r9,-10h(r1) ; convert it to a float and save to memory
;; here is another load-hit-store: most processors have no way
;; to move data between integer and floating point registers
;; other than by saving to memory and loading back again.
;; this load will stall the pipeline:
lfd          fr13,-10h(r1) ; load from that address again into fr13
fadds        fr1,fr13,fr0  ; add a and b and
blr                        ; return on fr1

This simple function has two load hit store stalls, each of which stalls the processor for the entire length of the pipeline. By contrast, if we write our function like this:

float slowWithRestrict( int * __restrict a, float * __restrict b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

… then we will get incorrect results if a and b happen to point to the same address, but in every other case it will avoid stalls completely. This is because we have promised the compiler that the write to b cannot possibly overwrite the contents of *a, something it could not know otherwise. It compiles to something like this:

; assembly for float slowWithRestrict( )
; pointer a is on r3. pointer b is on r4.
lis r10,__real@40e00000 ; load address of constant "7.0f" to r10
lis r11,__real@41400000 ; load address of constant "12.0f" to r11
li  r9,5                ; set r9 to "5"
stw r9,0(r3)            ; store r9 through pointer a
lfs fr0,__real@40e00000(r10) ; load constant "7.0" from memory onto fr0
lfs fr1,__real@41400000(r11) ; load constant "12.0" from memory onto fr1
stfs fr0,0(r4)          ; store fr0 through pointer b
blr                     ; return fr1

No load-hit-stores at all.

The load-hit-store is one of those quirky CPU implementation details that can cause significant performance problems in high-level code without it really being clear why, especially on in-order cores such as the PowerPCs inside the 360 and PS3. It’s especially insidious because it’s exactly the sort of thing we expect our compilers to transparently handle for us, and yet the compiler can’t handle correctly without explicit hinting from the programmer. Fortunately, once you know what’s going on underneath the hood, there is a C++ keyword that helps avoid this problem without having to resort to inline assembly and darker arts.

Basically, a load-hit-store is a large stall that occurs when the CPU writes data to an address x and then tries to load that data from x again too soon afterwards. The reason for this problem has to do with the deep instruction pipeline of a modern processor, but the consequence is clear: the processor’s execution comes to a complete halt for between 40 and 80 cycles. The simplest way to think of it is is that the data has to “bounce off of the L1 cache”.  That is, when you write data from a CPU register to memory, the ”store” operation has to complete, writing the data all the way out to main memory, before the data can be read back into a register again. Here’s a trivial example in C:

int CauseLHS(int *ptrA)
{
   int a, b;
   int *ptrB = ptrA; // B and A point to the same address
   *ptrA = 5; // write data to address ptrA
   b = *ptrB; // read that data back in again (won't be
                  // available for 80 cycles)
   a = b + 10; // stalls! the data isn't available yet.
}

This seems like the sort of thing the compiler should notice and fix by simply keeping the contents of *ptrA in a register, but in most real-world cases the compiler won’t be able to tell that ptrA and ptrB point to the same address. Thus, it’s obliged to read memory back from a pointer every time you dereference it, because any other pointer in the function might have aliased and modified that data. There is, however, a way to help out the compiler a little: the __restrict keyword.

__restrict is a compiler directive that helps avoid load-hit-store stalls.

__restrict on a pointer promises the compiler that it has no aliases: nothing else in the function points to that same data. Thus the compiler knows that if it writes data to a pointer, it doesn’t need to read it back into a register later on because nothing else could have written to that address. Without __restrict, the compiler is forced to read data from every pointer every time it is used, because another pointer may have aliased x.

For example, this code will run slowly:

int slow(int *a, int *b)
{
   *a = 5;
   *b = 7;
   return *a + *b; // LHS stall: the compiler doesn't
                  // know whether a == b, so it has to
                  //  reload both before the add
}

Whereas this code will run quickly:

int fast(int * __restrict a, int * __restrict b)
{
   *a = 5;
   *b = 7; // RESTRICT promises that a != b
   return *a + *b; // no stall; the compiler hangs onto
                         // 5 and 7 in the registers.
}

It bears repeating that __restrict is a promise you make to your compiler. If you break your promise, you can get incorrect results. If pointers pA and pB are __restrict, and pA == pB, that will cause mysterious bugs.

For example, in the fast() function above, if a does equal b, then the correct return value would be 14. However, the compiler won’t know that *b = 7 has changed the value of *a, and so you might actually get a return value of 12.

When working with C++ member functions, you can tell the compiler that the implicit this pointer is __restrict like so:

struct CFoo
{
  int MemberFunc();
  int m_iBar;
}
int CFoo::MemberFunc( int *a ) __restrict
{
  m_iBar = 5;
  *a = 7; // __restrict THIS promises that a != &amp;m_iBar
  return m_iBar + 12; // no load-hit-store
}

There is unfortunately no way to mark a C++ reference as __restrict, so function parameters declared like int foo(int &a, int &b) cannot benefit from __restrict. In those cases, either copy a and b to local variables inside your function, then write the final values back out again at the end; or change your function signature to use pointers instead.

In summary, __restrict makes code involving pointers faster, so long as the pointers never alias one another. It is the usual first step when your profiler complains that a function has too many load-hit-stores, but it is not magic: it depends on the programmer’s attention to where all the data is going, and on that no-alias promise being kept.

See also the followup to this article here.