Archive for the ‘Programming’ Category

An anonymous commenter asks:

I thought that only compatable types could alias? So for example, an int array can’t alias a float array, and a CFoo* can’t alias it’s own member?

Something to remember about the __restrict keyword is that it doesn’t affect how the compiler itself aliases pointers: instead, it’s about what the compiler can assume you, the progammer, may or may not have aliased.

If there is any chance two pointers alias each other, the compiler is forced to reload the contents of each from memory after every store operation.

In the case of class member functions, the compiler of course knows that different data members of this can’t alias one another. The issue with MemberFunc in CFoo below isn’t whether &m_iBar and &m_iBaz alias one another, but whether a == &m_iBar.

class CFoo
{
public:
	int MemberFunc( int *a, int x );
	int MemberFuncWithRestrict( int * __restrict a, int x );
	int m_iBar;
	int m_iBaz;
};
 
int CFoo::MemberFunc( int *a, int x )
{
	m_iBar = x;
	*a = 7; // if a == &m_iBar, then m_iBar might be 7 now.
	        // the processor must load it back from memory.
	return m_iBar + *a; // load-hit-store
}

MemberFunc compiles to something like:

; int CFoo::MemberFunc( int *a, int b );
; by convention, CFoo * this is on register 3,
; int *a is on register 4, and
; x is on register 5
li r10,7       ; set r10 to 7
stw  r5,0(r3)  ; store r11 to this->m_iBar
stw  r10,0(r4) ; store r10 through pointer a
; now, because a might or might not point to m_iBar,
; we need to load back the contents of m_iBar just in case.
; the compiler *cannot* assume that a != &m_iBar, and
; the load below will stall until the previous store to
; m_iBar is completely finished -- at least forty cycles!
lwz r11,0(r3)  ; load this->m_iBar from memory to r11
add r3,r11,r10 ; r3 = r11 + r10
blr            ; return r3

Now, if we use the __restrict keyword to promise the compiler that a doesn’t alias anything under this — in particular, that a != &m_iBar, it can make some assumptions and avoid loading m_iBar back from memory after it has been stored:

int CFoo::MemberFuncWithRestrict( int * __restrict a, int x ) __restrict
{
	m_iBar = x;
	*a = 7; // __restrict THIS promises that a != &m_iBar
	return m_iBar + *a; // no load-hit-store
}

compiles to

li r10,7        ; set r10 to 7
stw r5,0(r3)    ; store r5 to this->m_iBar
stw r10,0(r4)   ; store r10 through pointer a
;; because the compiler knows that the write to a
;; cannot affect the contents of m_iBar, there is
;; no need for it to load m_iBar back into memory;
;; the store above cannot have changed it
add r3, r5, r10 ; r3 = r5 + r10
blr             ; return r3

This function runs much faster, avoiding the load-hit store on reading back m_iBar, but it is only correct so long as the promise is true. If you call the function like this:

CFoo foo;
out = foo.MemberFuncWithRestrict( &foo.m_iBar, 5 );

you will get incorrect results, in this case a return value of 12 instead of 14.

Even if two pointers are different types, the compiler still has no way of knowing whether they actually point to different locations in memory.

EDIT: the following section is only true if your compiler does not have strict aliasing turned on. In GCC ≥4 it is on by default unless you specify -f-no-strict-aliasing. In MSVC it is off by default and I could not find how to turn it on. Mike Acton discusses strict aliasing in great detail here. Thanks, Arseny!

For example, consider the function

float slow( int *a, float *b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

This function could be called from a different cpp in a way that aliased the pointers to the same location:

int foo[8]; float bar[8];
bar[0] = slow( foo + 0, bar + 0 ); // pointers are not aliased
foo[0] = slow( foo + 1, (float *)(foo + 1) ); // pointers are aliased

Granted, this is a contrived example, but also one that’s completely legal. It underscores how the compiler can’t make any assumptions about pointer aliasing unless you hold its hand. Even though a and b are different types, the programmer can still deliberately alias them. The function slow above compiles to something like this on the PowerPC:

; assembly for float slow( int *a, float *b )
; by convention, parameter a is stored on r3 and parameter b on r4.
lis r11,__real@40e00000 ; load address of constant 7.0f
li           r10,5     ; load 5 onto r10
stw          r10,0(r3) ; save r10 through pointer a
lfs          fr0,__real@40e00000(r11) ; load 7.0f onto fr0
stfs         fr0,0(r4) ; save 7.0f through pointer b
;; the following line causes the load-hit store:
;; because a and b might be aliased, the compiler must
;; now load back through pointer a in case the write to
;; b overwrote its contents!
;; the following instruction will cause a pipeline stall.
lwz          r9,0(r3)    ; load integer pointed to by a
std          r9,-10h(r1) ; convert it to a float and save to memory
;; here is another load-hit-store: most processors have no way
;; to move data between integer and floating point registers
;; other than by saving to memory and loading back again.
;; this load will stall the pipeline:
lfd          fr13,-10h(r1) ; load from that address again into fr13
fadds        fr1,fr13,fr0  ; add a and b and
blr                        ; return on fr1

This simple function has two load hit store stalls, each of which stalls the processor for the entire length of the pipeline. By contrast, if we write our function like this:

float slowWithRestrict( int * __restrict a, float * __restrict b )
{
	*a = 5;
	*b = 7.0;
	return *a+*b;
}

… then we will get incorrect results if a and b happen to point to the same address, but in every other case it will avoid stalls completely. This is because we have promised the compiler that the write to b cannot possibly overwrite the contents of *a, something it could not know otherwise. It compiles to something like this:

; assembly for float slowWithRestrict( )
; pointer a is on r3. pointer b is on r4.
lis r10,__real@40e00000 ; load address of constant "7.0f" to r10
lis r11,__real@41400000 ; load address of constant "12.0f" to r11
li  r9,5                ; set r9 to "5"
stw r9,0(r3)            ; store r9 through pointer a
lfs fr0,__real@40e00000(r10) ; load constant "7.0" from memory onto fr0
lfs fr1,__real@41400000(r11) ; load constant "12.0" from memory onto fr1
stfs fr0,0(r4)          ; store fr0 through pointer b
blr                     ; return fr1

No load-hit-stores at all.

The load-hit-store is one of those quirky CPU implementation details that can cause significant performance problems in high-level code without it really being clear why, especially on in-order cores such as the PowerPCs inside the 360 and PS3. It’s especially insidious because it’s exactly the sort of thing we expect our compilers to transparently handle for us, and yet the compiler can’t handle correctly without explicit hinting from the programmer. Fortunately, once you know what’s going on underneath the hood, there is a C++ keyword that helps avoid this problem without having to resort to inline assembly and darker arts.

Basically, a load-hit-store is a large stall that occurs when the CPU writes data to an address x and then tries to load that data from x again too soon afterwards. The reason for this problem has to do with the deep instruction pipeline of a modern processor, but the consequence is clear: the processor’s execution comes to a complete halt for between 40 and 80 cycles. The simplest way to think of it is is that the data has to “bounce off of the L1 cache”.  That is, when you write data from a CPU register to memory, the ”store” operation has to complete, writing the data all the way out to main memory, before the data can be read back into a register again. Here’s a trivial example in C:

int CauseLHS(int *ptrA)
{
   int a, b;
   int *ptrB = ptrA; // B and A point to the same address
   *ptrA = 5; // write data to address ptrA
   b = *ptrB; // read that data back in again (won't be
                  // available for 80 cycles)
   a = b + 10; // stalls! the data isn't available yet.
}

This seems like the sort of thing the compiler should notice and fix by simply keeping the contents of *ptrA in a register, but in most real-world cases the compiler won’t be able to tell that ptrA and ptrB point to the same address. Thus, it’s obliged to read memory back from a pointer every time you dereference it, because any other pointer in the function might have aliased and modified that data. There is, however, a way to help out the compiler a little: the __restrict keyword.

__restrict is a compiler directive that helps avoid load-hit-store stalls.

__restrict on a pointer promises the compiler that it has no aliases: nothing else in the function points to that same data. Thus the compiler knows that if it writes data to a pointer, it doesn’t need to read it back into a register later on because nothing else could have written to that address. Without __restrict, the compiler is forced to read data from every pointer every time it is used, because another pointer may have aliased x.

For example, this code will run slowly:

int slow(int *a, int *b)
{
   *a = 5;
   *b = 7;
   return *a + *b; // LHS stall: the compiler doesn't
                  // know whether a == b, so it has to
                  //  reload both before the add
}

Whereas this code will run quickly:

int fast(int * __restrict a, int * __restrict b)
{
   *a = 5;
   *b = 7; // RESTRICT promises that a != b
   return *a + *b; // no stall; the compiler hangs onto
                         // 5 and 7 in the registers.
}

It bears repeating that __restrict is a promise you make to your compiler. If you break your promise, you can get incorrect results. If pointers pA and pB are __restrict, and pA == pB, that will cause mysterious bugs.

For example, in the fast() function above, if a does equal b, then the correct return value would be 14. However, the compiler won’t know that *b = 7 has changed the value of *a, and so you might actually get a return value of 12.

When working with C++ member functions, you can tell the compiler that the implicit this pointer is __restrict like so:

struct CFoo
{
  int MemberFunc();
  int m_iBar;
}
int CFoo::MemberFunc( int *a ) __restrict
{
  m_iBar = 5;
  *a = 7; // __restrict THIS promises that a != &m_iBar
  return m_iBar + 12; // no load-hit-store
}

There is unfortunately no way to mark a C++ reference as __restrict, so function parameters declared like int foo(int &a, int &b) cannot benefit from __restrict. In those cases, either copy a and b to local variables inside your function, then write the final values back out again at the end; or change your function signature to use pointers instead.

In summary, __restrict makes code involving pointers faster, so long as the pointers never alias one another. It is the usual first step when your profiler complains that a function has too many load-hit-stores, but it is not magic: it depends on the programmer’s attention to where all the data is going, and on that no-alias promise being kept.

See also the followup to this article here.

My friend Cort pointed me at this interesting Developing for Developers article on how to make integer division faster when the denominator is a fixed constant. As Cort says, if your processor lacks a dedicated integer divide op, and

…you want to divide an arbitrary X by an arbitrary Y, long division is pretty much the only way to go. However, if the divisor is known at compile time, there is very often a much more efficient way to compute the quotient. The simplest example, familiar to most students of computer science, is the good old x / 2n = x >> n .

For those who are curious, here’s an article that’s chock full of similar tricks for efficient division by other small integers (3, 5, 7, 9, 10, 11, etc.). Even the non-assembly programmers in the audience might benefit from these, as they’re not as commonly implemented by compiler authors.

These tricks are especially useful on the in-order PowerPC CPUs found in most modern game consoles, where the integer divide operation is microcoded: that is to say, even though it appears to be only a single opcode in the data stream, it actually causes the processor to execute a little subroutine stored in ROM. This means it stops the CPU pipeline dead in its tracks until the microcode is done executing (about twenty cycles): until the divide is finished, the processor can do nothing else.

In cases where a divide by a small constant can be replaced by five or six pipelined instructions, this can be a big speed benefit: first, because those five instructions will probably execute faster than the long-division stored in microcode; and second, because the compiler can interleave other useful work in between the dependent operations of the divide. (If that other work is on, say, the load or vector pipeline, it may even dual-dispatch and get you two operations per clock cycle.)

Derrick Coetzee hasn’t updated Developing for Developers recently, but it looks like there’s quite a few interesting articles in the archives worth checking out.

From Half-Life 2 to The Orange Box 

The slides for my talk at the 2008 Game Developers’ Conference, “How To Go From PC To Console Development Without Killing Your Studio,” are now available online at the Valve website. 

The talk describes some of the many pitfalls that face developers who are used to developing their games exclusively for the home personal computer, but are now about to embark upon their first crossplatform title, shipping on both PC and console at the same time. These pitfalls have brought some studios to their doom, but fear not: I also provide techniques for how to deal with each of these issues, without having to rewrite most of your game from scratch.  Some of the techniques are production issues, like dealing with certification. Others address programming problems like improving framerate on the PowerPC — some assembly is required.

My speaker’s notes are embedded in the slides: hover over or click the small dialog bubble that Adobe Acrobat superimposes at the top left.

I’ve also started a Q&A page to address questions people have asked me about this talk, either in person at the conference, or through email since then. If you have questions of your own, please feel free to leave a comment.