The load-hit-store is one of those quirky CPU implementation details that can cause significant performance problems in high-level code without it really being clear why, especially on in-order cores such as the PowerPCs inside the 360 and PS3. It’s especially insidious because it’s exactly the sort of thing we expect our compilers to transparently handle for us, and yet the compiler can’t handle correctly without explicit hinting from the programmer. Fortunately, once you know what’s going on underneath the hood, there is a C++ keyword that helps avoid this problem without having to resort to inline assembly and darker arts.

Basically, a load-hit-store is a large stall that occurs when the CPU writes data to an address x and then tries to load that data from x again too soon afterwards. The reason for this problem has to do with the deep instruction pipeline of a modern processor, but the consequence is clear: the processor’s execution comes to a complete halt for between 40 and 80 cycles. The simplest way to think of it is is that the data has to “bounce off of the L1 cache”.  That is, when you write data from a CPU register to memory, the “store” operation has to complete, writing the data all the way out to main memory, before the data can be read back into a register again. Here’s a trivial example in C:

int CauseLHS(int *ptrA)
{
   int a, b;
   int *ptrB = ptrA; // B and A point to the same address
   *ptrA = 5; // write data to address ptrA
   b = *ptrB; // read that data back in again (won't be
                  // available for 80 cycles)
   a = b + 10; // stalls! the data isn't available yet.
}

This seems like the sort of thing the compiler should notice and fix by simply keeping the contents of *ptrA in a register, but in most real-world cases the compiler won’t be able to tell that ptrA and ptrB point to the same address. Thus, it’s obliged to read memory back from a pointer every time you dereference it, because any other pointer in the function might have aliased and modified that data. There is, however, a way to help out the compiler a little: the __restrict keyword.

__restrict is a compiler directive that helps avoid load-hit-store stalls.

__restrict on a pointer promises the compiler that it has no aliases: nothing else in the function points to that same data. Thus the compiler knows that if it writes data to a pointer, it doesn’t need to read it back into a register later on because nothing else could have written to that address. Without __restrict, the compiler is forced to read data from every pointer every time it is used, because another pointer may have aliased x.

For example, this code will run slowly:

int slow(int *a, int *b)
{
   *a = 5;
   *b = 7;
   return *a + *b; // LHS stall: the compiler doesn't
                  // know whether a == b, so it has to
                  //  reload both before the add
}

Whereas this code will run quickly:

int fast(int * __restrict a, int * __restrict b)
{
   *a = 5;
   *b = 7; // RESTRICT promises that a != b
   return *a + *b; // no stall; the compiler hangs onto
                         // 5 and 7 in the registers.
}

It bears repeating that __restrict is a promise you make to your compiler. If you break your promise, you can get incorrect results. If pointers pA and pB are __restrict, and pA == pB, that will cause mysterious bugs.

For example, in the fast() function above, if a does equal b, then the correct return value would be 14. However, the compiler won’t know that *b = 7 has changed the value of *a, and so you might actually get a return value of 12.

When working with C++ member functions, you can tell the compiler that the implicit this pointer is __restrict like so:

struct CFoo
{
  int MemberFunc();
  int m_iBar;
}
int CFoo::MemberFunc( int *a ) __restrict
{
  m_iBar = 5;
  *a = 7; // __restrict THIS promises that a != &m_iBar
  return m_iBar + 12; // no load-hit-store
}

There is unfortunately no way to mark a C++ reference as __restrict, so function parameters declared like int foo(int &a, int &b) cannot benefit from __restrict. In those cases, either copy a and b to local variables inside your function, then write the final values back out again at the end; or change your function signature to use pointers instead.

In summary, __restrict makes code involving pointers faster, so long as the pointers never alias one another. It is the usual first step when your profiler complains that a function has too many load-hit-stores, but it is not magic: it depends on the programmer’s attention to where all the data is going, and on that no-alias promise being kept.

See also the followup to this article here.

9 Comments

  1. realtimecollisiondetection.net - the blog » Posts and links you should have read says:

    [...] Load-hit-stores and the __restrict keyword, Elan Ruskin talks about what a gigantic performance suck load-hit-stores can be on the (pretty [...]

  2. anonymouse says:

    I thought that only compatable types could alias? So for example, an int array can’t alias a float array, and a CFoo* can’t alias it’s own member?

  3. aruslan says:

    Yes, only pointers to compatible types and char* can alias.
    int* can’t alias float*, and this is the reason to use unions when making dirty int-float hacks.

  4. Ruskin says:

    Yeah, unfortunately that’s one of the many gulfs between what the C++ spec says should happen and what does happen in the compiler. The code in my other entry is what actually came out of VC++; I guess it’s more paranoid and assumes the programmer might have aliased any pointer to any other unless you tell it otherwise.

    GCC on the PS3 does support strict aliasing rules, I’m told.

  5. Robert says:

    Actually, it’s assumed that any pointer can alias a variable of any type unless you compile with a strict-aliasing flag. Look up type-punning.

    Here’s a good article: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

  6. Foo Bar says:

    Uhm, what about x86?

  7. Elan says:

    “x86″ is a very large family of processors with very different implementations. You can make a prediction about how an Intel Core Duo might behave, but it wouldn’t necessarily be true about an i7 or an AMD chip, because their internal pipelines are dissimilar. The upshot is you need to run your own timings on the processor you’re targeting.

    One general prediction I can make is that every x86 processor I’ve timed seems to have a huge latency when shuffling data between x87 float and general-purpose registers.

  8. Aliasing, the silent killer » AltDevBlogADay (Staging Site) says:

    [...] have been changed somewhere else. This cause a “Load hits Store” (More about it here : http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/). The memory containing item_index is written just before the same memory is read. The CPU just [...]

Leave a Reply