Posts tagged ‘C++’

An anonymous commenter asks:

I thought that only compatable types could alias? So for example, an int array can’t alias a float array, and a CFoo* can’t alias it’s own member?

Something to remember about the __restrict keyword is that it doesn’t affect how the compiler itself aliases pointers: instead, it’s about what the compiler can assume you, the progammer, may or may not have aliased.

If there is any chance two pointers alias each other, the compiler is forced to reload the contents of each from memory after every store operation.

In the case of class member functions, the compiler of course knows that different data members of this can’t alias one another. The issue with MemberFunc in CFoo below isn’t whether &m_iBar and &m_iBaz alias one another, but whether a == &m_iBar.

class CFoo
{
public:
	int MemberFunc( int *a, int x );
	int MemberFuncWithRestrict( int * __restrict a, int x );
	int m_iBar;
	int m_iBaz;
};
 
int CFoo::MemberFunc( int *a, int x )
{
	m_iBar = x;
	*a = 7; // if a == &m_iBar, then m_iBar might be 7 now.
	        // the processor must load it back from memory.
	return m_iBar + *a; // load-hit-store
}

MemberFunc compiles to something like:

; int CFoo::MemberFunc( int *a, int b );
; by convention, CFoo * this is on register 3,
; int *a is on register 4, and
; x is on register 5
li r10,7       ; set r10 to 7
stw  r5,0(r3)  ; store r11 to this->m_iBar
stw  r10,0(r4) ; store r10 through pointer a
; now, because a might or might not point to m_iBar,
; we need to load back the contents of m_iBar just in case.
; the compiler *cannot* assume that a != &m_iBar, and
; the load below will stall until the previous store to
; m_iBar is completely finished -- at least forty cycles!
lwz r11,0(r3)  ; load this->m_iBar from memory to r11
add r3,r11,r10 ; r3 = r11 + r10
blr            ; return r3

Now, if we use the __restrict keyword to promise the compiler that a doesn’t alias anything under this — in particular, that a != &m_iBar, it can make some assumptions and avoid loading m_iBar back from memory after it has been stored:

int CFoo::MemberFuncWithRestrict( int * __restrict a, int x ) __restrict
{
	m_iBar = x;
	*a = 7; // __restrict THIS promises that a != &m_iBar
	return m_iBar + *a; // no load-hit-store
}

compiles to

li r10,7        ; set r10 to 7
stw r5,0(r3)    ; store r5 to this->m_iBar
stw r10,0(r4)   ; store r10 through pointer a
;; because the compiler knows that the write to a
;; cannot affect the contents of m_iBar, there is
;; no need for it to load m_iBar back into memory;
;; the store above cannot have changed it
add r3, r5, r10 ; r3 = r5 + r10
blr             ; return r3

This function runs much faster, avoiding the load-hit store on reading back m_iBar, but it is only correct so long as the promise is true. If you call the function like this:

CFoo foo;
out = foo.MemberFuncWithRestrict( &foo.m_iBar, 5 );

you will get incorrect results, in this case a return value of 12 instead of 14.

Even if two pointers are different types, the compiler still has no way of knowing whether they actually point to different locations in memory.

EDIT: the following section is only true if your compiler does not have strict aliasing turned on. In GCC ≥4 it is on by default unless you specify -f-no-strict-aliasing. In MSVC it is off by default and I could not find how to turn it on. Mike Acton discusses strict aliasing in great detail here. Thanks, Arseny!

For example, consider the function

float slow( int *a, float *b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

This function could be called from a different cpp in a way that aliased the pointers to the same location:

int foo[8]; float bar[8];
bar[0] = slow( foo + 0, bar + 0 ); // pointers are not aliased
foo[0] = slow( foo + 1, (float *)(foo + 1) ); // pointers are aliased

Granted, this is a contrived example, but also one that’s completely legal. It underscores how the compiler can’t make any assumptions about pointer aliasing unless you hold its hand. Even though a and b are different types, the programmer can still deliberately alias them. The function slow above compiles to something like this on the PowerPC:

; assembly for float slow( int *a, float *b )
; by convention, parameter a is stored on r3 and parameter b on r4.
lis r11,__real@40e00000 ; load address of constant 7.0f
li           r10,5     ; load 5 onto r10
stw          r10,0(r3) ; save r10 through pointer a
lfs          fr0,__real@40e00000(r11) ; load 7.0f onto fr0
stfs         fr0,0(r4) ; save 7.0f through pointer b
;; the following line causes the load-hit store:
;; because a and b might be aliased, the compiler must
;; now load back through pointer a in case the write to
;; b overwrote its contents!
;; the following instruction will cause a pipeline stall.
lwz          r9,0(r3)    ; load integer pointed to by a
std          r9,-10h(r1) ; convert it to a float and save to memory
;; here is another load-hit-store: most processors have no way
;; to move data between integer and floating point registers
;; other than by saving to memory and loading back again.
;; this load will stall the pipeline:
lfd          fr13,-10h(r1) ; load from that address again into fr13
fadds        fr1,fr13,fr0  ; add a and b and
blr                        ; return on fr1

This simple function has two load hit store stalls, each of which stalls the processor for the entire length of the pipeline. By contrast, if we write our function like this:

float slowWithRestrict( int * __restrict a, float * __restrict b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

… then we will get incorrect results if a and b happen to point to the same address, but in every other case it will avoid stalls completely. This is because we have promised the compiler that the write to b cannot possibly overwrite the contents of *a, something it could not know otherwise. It compiles to something like this:

; assembly for float slowWithRestrict( )
; pointer a is on r3. pointer b is on r4.
lis r10,__real@40e00000 ; load address of constant "7.0f" to r10
lis r11,__real@41400000 ; load address of constant "12.0f" to r11
li  r9,5                ; set r9 to "5"
stw r9,0(r3)            ; store r9 through pointer a
lfs fr0,__real@40e00000(r10) ; load constant "7.0" from memory onto fr0
lfs fr1,__real@41400000(r11) ; load constant "12.0" from memory onto fr1
stfs fr0,0(r4)          ; store fr0 through pointer b
blr                     ; return fr1

No load-hit-stores at all.

The load-hit-store is one of those quirky CPU implementation details that can cause significant performance problems in high-level code without it really being clear why, especially on in-order cores such as the PowerPCs inside the 360 and PS3. It’s especially insidious because it’s exactly the sort of thing we expect our compilers to transparently handle for us, and yet the compiler can’t handle correctly without explicit hinting from the programmer. Fortunately, once you know what’s going on underneath the hood, there is a C++ keyword that helps avoid this problem without having to resort to inline assembly and darker arts.

Basically, a load-hit-store is a large stall that occurs when the CPU writes data to an address x and then tries to load that data from x again too soon afterwards. The reason for this problem has to do with the deep instruction pipeline of a modern processor, but the consequence is clear: the processor’s execution comes to a complete halt for between 40 and 80 cycles. The simplest way to think of it is is that the data has to “bounce off of the L1 cache”.  That is, when you write data from a CPU register to memory, the ”store” operation has to complete, writing the data all the way out to main memory, before the data can be read back into a register again. Here’s a trivial example in C:

int CauseLHS(int *ptrA)
{
   int a, b;
   int *ptrB = ptrA; // B and A point to the same address
   *ptrA = 5; // write data to address ptrA
   b = *ptrB; // read that data back in again (won't be
                  // available for 80 cycles)
   a = b + 10; // stalls! the data isn't available yet.
}

This seems like the sort of thing the compiler should notice and fix by simply keeping the contents of *ptrA in a register, but in most real-world cases the compiler won’t be able to tell that ptrA and ptrB point to the same address. Thus, it’s obliged to read memory back from a pointer every time you dereference it, because any other pointer in the function might have aliased and modified that data. There is, however, a way to help out the compiler a little: the __restrict keyword.

__restrict is a compiler directive that helps avoid load-hit-store stalls.

__restrict on a pointer promises the compiler that it has no aliases: nothing else in the function points to that same data. Thus the compiler knows that if it writes data to a pointer, it doesn’t need to read it back into a register later on because nothing else could have written to that address. Without __restrict, the compiler is forced to read data from every pointer every time it is used, because another pointer may have aliased x.

For example, this code will run slowly:

int slow(int *a, int *b)
{
   *a = 5;
   *b = 7;
   return *a + *b; // LHS stall: the compiler doesn't
                  // know whether a == b, so it has to
                  //  reload both before the add
}

Whereas this code will run quickly:

int fast(int * __restrict a, int * __restrict b)
{
   *a = 5;
   *b = 7; // RESTRICT promises that a != b
   return *a + *b; // no stall; the compiler hangs onto
                         // 5 and 7 in the registers.
}

It bears repeating that __restrict is a promise you make to your compiler. If you break your promise, you can get incorrect results. If pointers pA and pB are __restrict, and pA == pB, that will cause mysterious bugs.

For example, in the fast() function above, if a does equal b, then the correct return value would be 14. However, the compiler won’t know that *b = 7 has changed the value of *a, and so you might actually get a return value of 12.

When working with C++ member functions, you can tell the compiler that the implicit this pointer is __restrict like so:

struct CFoo
{
  int MemberFunc();
  int m_iBar;
}
int CFoo::MemberFunc( int *a ) __restrict
{
  m_iBar = 5;
  *a = 7; // __restrict THIS promises that a != &m_iBar
  return m_iBar + 12; // no load-hit-store
}

There is unfortunately no way to mark a C++ reference as __restrict, so function parameters declared like int foo(int &a, int &b) cannot benefit from __restrict. In those cases, either copy a and b to local variables inside your function, then write the final values back out again at the end; or change your function signature to use pointers instead.

In summary, __restrict makes code involving pointers faster, so long as the pointers never alias one another. It is the usual first step when your profiler complains that a function has too many load-hit-stores, but it is not magic: it depends on the programmer’s attention to where all the data is going, and on that no-alias promise being kept.

See also the followup to this article here.