Under Reconstruction

Due to a failure at my previous webhost, I’ve moved Some Assembly Required to a new hosting service. Most of the text has come over, but a lot of the images and media need to be restored manually. Leave a comment here if you find anything missing.

“AI-Driven Dynamic Dialog” at GDC 2012

At last week’s Game Developers’ Conference I delivered a talk titled “AI-driven Dynamic Dialog”, describing the dialog system used in Left4Dead, Dota, and basically all of Valve’s games since The Orange Box.

A PDF export of the slides for my talk is now available here. I’ve also created a support page where I’ll gather all information, bibliography, and followup associated with that particular talk.

I’ve also posted the videos mentioned in the slides (since obviously they can’t be embedded into a PDF):

  1. Two Bots: Environment-Aware Speech
  2. Two Bots: Starting a Conversation
  3. Two Bots: Memory and Context
  4. Left4Dead2: Variety
  5. Left4Dead2: Automatic Barks
  6. Left4Dead2: Environmentally Triggered Dialog

Thanks to everyone who stuck with me through late flights and laptop failures to the last session of GDC 2012!

Why is it called a “core dump” anyway?

The diagnostic file emitted by a crashing process in a modern operating system can contain a variety of useful information, including exception type, current instruction, CPU state, call stack, and sometimes the entire contents of the current thread’s stack or even the entire process heap. So why is it called a “core dump”?

For years I thought this was an amusing Star Trek reference by the original implementors of UNIX, after all the episodes in which the Enterprise‘s reactor threatens to explode and Geordi has to save them by “dumping the warp core,” but it turns out the actual explanation is much more prosaic.
Read the rest of this entry »

Annotated slides for GDC11 “Forensic Debugging”

The annotated slides for my GDC talk on Forensic Debugging and Crash Analysis, containing my speaker’s notes and some narration, are now available for download in PDF format here. The PowerPoint is on the GDC Vault as well.

Forensic Debugging: Thanks Are Due

While debugging a smashed stack may seem like a heroic feat, the most heroic thing about my talk is the amount of time, effort, and care my friends spent to help me put it together. I would never have made it to the GDC, let alone made any sense whatsoever onstage, without the support of all my friends inside and outside Valve.

A special badge of courage is due to those who bravely offered to sit through my rehearsals, gave me details for slides, or in some other way helped distill ninety minutes of inane gibbering into one hour of assembly and win:

  • Jeep Barnett
  • Dan Berger
  • Iestyn Bleasdale-Shepherd
  • Bank Charnchaichujit
  • John Cook
  • Kerry Davis
  • Bruce Dawson
  • Michelle Garrison
  • Bronwen Grimes
  • Dave Kircher
  • Tejeev Kohli
  • Joe Ludwig
  • Jason Mitchell
  • Kyle Monroe
  • Marc Nagel
  • Olivier Nallet
  • Alfred Reynolds
  • Dave Riller
  • Mike Sartain
  • Dave Saunders

Thanks, guys and gals — I wouldn’t have done it without you.

Preliminary GDC slides posted

Thanks to everyone who came to my “Forensic Debugging” talk at the 2011 Game Developers’ Conference. I hope it was valuable to all who attended.

The lecture covered a great deal of ground in a short time, and so the slides necessarily had to go by rather quickly. Eventually a video of my presentation will be at the GDC vault. In the meantime, I’ve exported most of the deck as a series of annotated PDF images here, to help fill in the notes of anyone who might have attended but missed a point or two.

The intent of my talk was to give a general overview of the forensic mindset and the tools available, and to demonstrate that, rather than being a dark art, crash analysis is a science that everyone can learn.

Square Roots in vivo: normalizing vectors

Following my earlier article on timing various square-root functions on the x86, commenter LeeN suggested that it would be useful to also test their impact on a more realistic scenario than square-rooting long arrays of independent numbers. In real gameplay code the most common use for sqrts is in finding the length of a vector or normalizing it, like when you need to perform a distance check between two characters to determine whether they can see/shoot/etc each other. So, I wrote up a group of normalize functions, each using a different sqrt technique, and timed them.

The testbed was, as last time, an array of 2048 single-precision floating point numbers, this time interpreted as a packed list of 682 three-dimensional vectors. This number was chosen so that both it and the output array were sure to fit in the L1 cache; however, because three floats add up to twelve bytes, this means that three out of four vectors were not aligned to a 16-byte boundary, which is significant for the SIMD test case as I had to use the movups unaligned load op. Each timing case consisted of looping over this array of vectors 2048 times, normalizing each and writing the result to memory.

Each normalize function computed the reciprocal of the vector’s length, 1/√(x² + y² + z²), multiplied each component by that reciprocal, and then wrote the result back through an output pointer. The main difference was in how the reciprocal square root was computed:

  • via the x87 FPU, by simply compiling 1.0f/sqrt( x*x + y*y + z*z )
  • via the SSE scalar unit, by compiling 1.0f/sqrt( x*x + y*y + z*z ) with the /arch:SSE2 option set; this causes the compiler to issue a sqrtss followed by an fdiv; that is, it computes the square root and then divides one by it
  • via the SSE scalar unit, by using the estimated reciprocal square root intrinsic and then performing one step of Newton-Raphson iteration (a rough sketch of this variant follows the list)
  • via the SSE SIMD unit, working on the whole vector at once
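
As a rough illustration of the third variant — my own reconstruction under the assumption of SSE intrinsics and a simple three-float struct, not the exact code from the test (which is below the jump):

#include <xmmintrin.h>   // SSE intrinsics

struct Vec3 { float x, y, z; };   // assumed layout: three packed floats

// Normalizes v through pOut and returns the original length, using the
// rsqrtss estimate refined by one Newton-Raphson step.
inline float Normalize_RsqrtNR( const Vec3 &v, Vec3 *pOut )
{
   float lenSq = v.x*v.x + v.y*v.y + v.z*v.z;
   __m128 vl = _mm_set_ss( lenSq );
   __m128 y  = _mm_rsqrt_ss( vl );               // ~12-bit estimate of 1/sqrt(lenSq)
   // one Newton-Raphson step: y' = 0.5 * y * (3 - lenSq * y * y)
   y = _mm_mul_ss( _mm_mul_ss( _mm_set_ss( 0.5f ), y ),
                   _mm_sub_ss( _mm_set_ss( 3.0f ), _mm_mul_ss( vl, _mm_mul_ss( y, y ) ) ) );
   float invLen = _mm_cvtss_f32( y );
   pOut->x = v.x * invLen;
   pOut->y = v.y * invLen;
   pOut->z = v.z * invLen;
   return lenSq * invLen;                         // lenSq / sqrt(lenSq) = length
}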

In all cases the results were accurate to 22 bits of precision. The results for 1,396,736 vector normalizations were:

Method                                       Total time    Time per vector
Compiler 1.0/sqrt(x), x87 FPU FSQRT           52.469ms         37.6ns
Compiler 1.0/sqrt(x), SSE scalar sqrtss       27.233ms         19.5ns
SSE scalar ops, rsqrtss with one NR step      21.631ms         15.5ns
SSE SIMD ops, rsqrtss with one NR step        20.034ms         14.3ns

Two things jump out here. First, even when the square root op is surrounded by lots of other math — multiplies, adds, loads, stores — optimizations such as this can make a huge difference. It’s not just the cost of the sqrt itself, but also that it’s unpipelined, which means it ties up an execution unit and prevents any other work from being done until it’s entirely completed.

Second, in this case, SIMD is only a very modest benefit. That’s because the input vectors are unaligned, and the two key steps of this operation, the dot product and the square root, are scalar in nature. (This is what’s meant by “horizontal” SIMD computation — operations between the components of one vector, rather than between the corresponding words of two vectors. Given a vector V = <x, y, z>, the sum x + y + z is horizontal, but given two vectors V1 = <x1, y1, z1> and V2 = <x2, y2, z2>, the sum V3 = <x1+x2, y1+y2, z1+z2> is vertical.) So it really doesn’t play to SIMD’s strengths at all.

On the other hand, if I were to normalize four vectors at a time, so that four dot products and four rsqrts could be performed in parallel in the four words of a vector register, then the speed advantage of SIMD would be much greater. But, again, my goal wasn’t to test performance in tight loops over packed data — it was to figure out the best way to do something like an angle check in the middle of a character’s AI, where you usually deal with one vector at a time.
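
For comparison, the four-at-a-time (structure-of-arrays) version would look roughly like this — a sketch of the idea under my own naming, not code from the timed test:

#include <xmmintrin.h>

// xs, ys, zs each hold the same component of four different vectors, so the
// four squared lengths and the four rsqrt estimates all compute "vertically"
// across one register.
inline void NormalizeFour_SoA( __m128 &xs, __m128 &ys, __m128 &zs )
{
   __m128 lenSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( xs, xs ),
                                          _mm_mul_ps( ys, ys ) ),
                              _mm_mul_ps( zs, zs ) );
   __m128 invLen = _mm_rsqrt_ps( lenSq );   // four estimates in parallel
   // (a Newton-Raphson step could be added here just as in the scalar case)
   xs = _mm_mul_ps( xs, invLen );
   ys = _mm_mul_ps( ys, invLen );
   zs = _mm_mul_ps( zs, invLen );
}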

Source code for my testing functions below the jump. Note that each function writes the normalized vector through an out pointer, but also returns the original vector’s length. The hand-written intrinsic versions probably aren’t totally optimal, but they ought to be good enough to make the point.
Read the rest of this entry »

Timing square root

The square root is one of those basic mathematical operations that’s totally ubiquitous in any game’s source code, and yet also has many competing implementations and performance superstitions around it. The compiler offers a sqrt() builtin function, and so do some CPUs, but some programmers insist on writing their own routines in software. And often it’s really the reciprocal square root you want, for normalizing a vector, or trigonometry. But I’ve never had a clear answer for which technique is really fastest, or exactly what accuracy-vs-speed tradeoffs we make with “estimating” intrinsics.

What is the fastest way to compute a square root? It would seem that if the CPU has a native square-root opcode, there’s no beating the hardware, but is it really true?

Such questions vex me, so I went and measured all the different means of computing the square root of a scalar single-precision floating point number that I could think of. I ran trials on my Intel Core 2 and on the Xenon, comparing each technique for both speed and accuracy, and some of the results were surprising!

In this article I’ll describe my results for the Intel hardware; next week I’ll turn to the Xenon PPC.

Experimental setup

I’ll post the whole source code for my tests elsewhere, but basically each of these trials consists of iterating N times over an array of floating point numbers, calling square root upon each of them and writing it to a second output array.


inline float TestedFunction( float x ) 
{
   return sqrt(x); // one of many implementations..
}
void TimeSquareRoot()
{
   float numbersIn[ ARRAYSIZE ];    // ARRAYSIZE chosen so that both arrays 
   float numbersOut[ ARRAYSIZE ];  // fit in L1 cache
   // assume that numbersIn is filled with random positive numbers, and both arrays are 
   // prefetched to cache...
   StartClockCycleCounter();
   for ( int i = 0 ; i < NUMITERATIONS ; ++i ) 
      for ( int j = 0 ; j < ARRAYSIZE ; ++j ) // in some cases I unroll this loop
      {
         numbersOut[j] = TestedFunction( numbersIn[j] ); 
      }
   StopClockCycleCounter();
   printf( "%.3f millisec for %d floats\n",  
             ClockCycleCounterInMilliseconds(), ARRAYSIZE * NUMITERATIONS ); 
 
   // now measure accuracy
   float error = 0;
   for ( int i = 0 ; i < ARRAYSIZE ; ++i )
   {
       double knownAccurate = PerfectSquareRoot( numbersIn[i] );
       error += fabs( numbersOut[i] - knownAccurate  ) / knownAccurate ;
   }
   error /= ARRAYSIZE ;
    printf( "Average error: %.5f%%\n", error * 100.0f );
}


In each case I verified that the compiler was not eliding any computations (it really was performing ARRAYSIZE × NUMITERATIONS many square roots), that it was properly inlining the tested function, and that all the arrays fit into L1 cache so that memory latency wasn’t affecting the results. I also only tested scalar square root functions — SIMD would clearly be the fastest way of working on large contiguous arrays, but I wanted to measure the different techniques of computing one square root at a time, as is usually necessary in gameplay code.

Because some of the speedup techniques involve trading off accuracy, I compared the resulting numbers against the perfectly-accurate double-precision square root library routine to get an average error for each test run.

And I performed each run multiple times with different data, averaging the final results together.

x86 results

I ran my tests on a 2.66 GHz Intel Core 2 workstation. An x86 chip actually has two different means of performing scalar floating-point math. By default, the compiler uses the old x87 FPU, which dates back to 1980 and has a stack-based instruction set like one of those old RPN calculators. In 1999, Intel introduced SSE, which added a variety of new instructions to the processor. SSE is mostly thought of as a SIMD instruction set — for operating on four 32-bit floats in a single op — but it also includes an entire set of scalar floating-point instructions that operate on only one float at a time. It’s faster than the x87 operations and was meant to deprecate the old x87 pathway. However, both the MSVC and GCC compilers default to using the x87 exclusively for scalar math, so unless you edit the “code generation” project properties panel (MSVC) or provide an obscure command-line option (GCC), you’ll be stuck with code that uses the old, slow way.

I timed the following techniques for square root:

  1. The compiler’s built in sqrt() function (which compiled to the x87 FSQRT opcode)
  2. The SSE “scalar single square root” opcode sqrtss, which MSVC emits if you use the _mm_sqrt_ss intrinsic or if you set /arch:SSE2
  3. The “magic number” approximation technique invented by Greg Walsh at Ardent Computer and made famous by John Carmack in the Quake III source code (sketched, along with method 4, just after this list).
  4. Taking the estimated reciprocal square root of x via the SSE opcode rsqrtss, and multiplying it by x to get the square root via the identity x × 1/√x = x/√x = √x.
  5. Method (4), with one additional step of Newton-Raphson iteration to improve accuracy.
  6. Method (5), with the inner loop of the pseudocode above unrolled to process four floats per iteration.
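
For concreteness, here are minimal sketches of methods 3 and 4 — my own reconstructions assuming MSVC-style SSE intrinsics, not the exact code from my test harness:

#include <cstring>       // memcpy
#include <xmmintrin.h>   // SSE intrinsics

// Method 3: the "magic number" reciprocal square root as popularized by the
// Quake III source, followed by its usual single Newton-Raphson step.
inline float CarmackRsqrt( float x )
{
   float xhalf = 0.5f * x;
   int i;
   memcpy( &i, &x, sizeof( i ) );        // reinterpret the float's bits as an int
   i = 0x5f3759df - ( i >> 1 );          // initial magic-number guess
   memcpy( &x, &i, sizeof( x ) );
   return x * ( 1.5f - xhalf * x * x );  // one Newton-Raphson step
}

// Method 4: estimated reciprocal square root via rsqrtss, multiplied back by
// x, since x * 1/sqrt(x) = sqrt(x). Accurate to roughly twelve bits.
inline float SqrtViaRsqrtss( float x )
{
   __m128 vx  = _mm_set_ss( x );
   __m128 est = _mm_rsqrt_ss( vx );
   return _mm_cvtss_f32( _mm_mul_ss( vx, est ) );
}

For the square-root timings, Carmack’s rsqrt is likewise multiplied back by x; method 5 just inserts one Newton-Raphson step, y = y * (1.5f - 0.5f * x * y * y), between the rsqrtss estimate and the final multiply.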

I also tested three ways of getting the reciprocal square root: Carmack’s technique, the rsqrtss SSE op via compiler intrinsic, and rsqrtss with one Newton-Raphson step.

The results, for 4096 loops over 4096 single-precision floats, were:

SQUARE ROOT

Method                                                Total time    Time per float    Avg Error
Compiler sqrt(x) / x87 FPU FSQRT                       404.029ms        24ns            0.0000%
SSE intrinsic sqrtss                                   200.395ms        11.9ns          0.0000%
Carmack’s Magic Number rsqrt * x                        72.682ms         4.33ns         0.0990%
SSE rsqrtss * x                                         20.495ms         1.22ns         0.0094%
SSE rsqrtss * x with one NR step                        53.401ms         3.18ns         0.0000%
SSE rsqrtss * x with one NR step, unrolled by four      48.701ms         2.90ns         0.0000%

RECIPROCAL SQRT

Method                              Total time    Time per float    Avg Error
Carmack’s Magic Number rsqrt          59.378ms        3.54ns          0.0990%
SSE rsqrtss                           14.202ms        0.85ns          0.0094%
SSE rsqrtss with one NR step          45.952ms        2.74ns          0.0000%

Discussion

Looking at these results, it’s clear that there’s a dramatic difference in performance between different approaches to performing square root; which one you choose really can have a significant impact on framerate and accuracy. My conclusions are:

Don’t trust the compiler to do the right thing. The received wisdom on performance in math functions is usually “don’t reinvent the wheel; the library and compiler are smart and optimal.” We see here that this is completely wrong, and in fact calling the library sqrt(x) causes the compiler to do exactly the worst possible thing. The compiler’s default output for y = sqrt(x); is twice as slow as the SSE scalar opcode and roughly an order of magnitude slower than the rsqrtss-based alternatives tested here.

The x87 FPU is really very slow. Intel has been trying to deprecate the old x87 FPU instructions for a decade now, but no compiler in the business defaults to using the new, faster SSE scalar opcodes in place of emulating a thirty-year-old 8087. In the case of y = sqrt(x) , by default MSVC and GCC emit something like

fld DWORD PTR [ecx]
fsqrt  ;; slow x87 flop
fstp DWORD PTR [eax]

But if I set the /arch:SSE2 option flag, telling the compiler “assume this code will run on a machine with SSE2”, it will instead emit the following, which is 2x faster.

sqrtss xmm0, DWORD PTR [ecx]  ;; faster SSE scalar flop
movss DWORD PTR [eax], xmm0

There was a time when not every PC on the market had SSE2, meaning that there was some sense in using the older, more backwards-compatible operations, but that time has long since passed. SSE2 was introduced in 2001 with the Pentium 4. No one is ever going to try to play your game on a machine that doesn’t support it. If your customer’s PC has DirectX 9, it has SSE2.

You can beat the hardware. The most surprising thing about these results for me was that it is faster to take a reciprocal square root and multiply it, than it is to use the native sqrt opcode, by an order of magnitude. Even Carmack’s trick, which I had assumed was obsolete in an age of deep pipelines and load-hit-stores, proved faster than the native SSE scalar op. Part of this is that the reciprocal sqrt opcode rsqrtss is an estimate, accurate to twelve bits; but it only takes one step of Newton’s Method to converge that estimate to an accuracy of 24 bits while still being four times faster than the hardware square root opcode.

The question that then bothered me was, why is SSE’s built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations? The first hint came when I tried unrolling the loop so that it performed four ops inside the inner for():

   for ( int i = 0 ; i < NUMITERATIONS ; ++i ) 
      for ( int j = 0 ; j < ARRAYSIZE ; j += 4  ) // unrolled to process four floats per iteration
      {
         numbersOut[j + 0] = TestedSqrt( numbersIn[j + 0] );
         numbersOut[j + 1] = TestedSqrt( numbersIn[j + 1] ); 
         numbersOut[j + 2] = TestedSqrt( numbersIn[j + 2] ); 
         numbersOut[j + 3] = TestedSqrt( numbersIn[j + 3] );  
      }
 

As you can see from the results above, when TestedSqrt was the rsqrtss followed by a multiply and one step of Newton iteration, unrolling the loop this way provided a modest 8.8% improvement in speed. But when I tried the same thing with the “precise square root” op sqrtss, the difference was negligible:

SSE sqrt: 200.395 msec
average error 0.0000%

SSE sqrt, unrolled four: 196.741 msec
average error 0.0000%

What this suggests is that unrolling the loop this way allowed the four rsqrt paths to be pipelined, so that while an individual rsqrtss might take 6 cycles to execute before its result was ready, other work could proceed during that time so that the four square root operations in the loop overlapped. On the other hand, the non-estimated sqrtss op apparently cannot be overlapped; one sqrt must finish before the next can begin. A look at the Intel® 64 and IA-32 Architectures Optimization Reference Manual confirms: sqrtss is an unpipelined instruction.

Pipelined operations make a big difference. When the CPU hits an unpipelined instruction, every other instruction in the pipeline has to stop and wait for it to retire before proceeding, so it’s like putting the handbrake on your processor. You can identify nonpipelined operations in appendix C of the Optimization Reference Manual as the ones that have a throughput equal to latency and greater than 4 cycles.

In the case of sqrtss, the processor is probably doing the same thing internally that I’m doing in my “fast” function — taking an estimated reciprocal square root, improving it with Newton’s method, and then multiplying it by the input parameter. Taken all together, this is far too much work to fit into a single execution unit, so the processor stalls until it’s all done. But if you break up the work so that each of those steps is its own instruction, then the CPU can pipeline them all, and get a much higher throughput even if the latency is the same.

Pipeline latency and microcoded instructions are a much bigger deal on the 360 and PS3, whose CPUs don’t reorder operations to hide bubbles; there the benefit from unrolling is much greater, as you’ll see next week.

Conclusion

Not all square root functions are created equal, and writing your own can have very real performance benefits over trusting the compiler to optimize your code for you (at which it fails miserably). In many cases you can trade off some accuracy for a massive increase in speed, but even in those places where you need full accuracy, writing your own function to leverage the rsqrtss op followed by Newton’s method can still give you 24 bits of precision at a 4x-8x improvement over what you will get with the built-in sqrtf() function.

And if you have lots of numbers you need to square root, of course SIMD (rsqrtps) will be faster still.

Code For Testing Virtual Function Speed

I’ve just updated my prior article on virtual function overhead with corrected timing numbers — I hadn’t noticed that my CPU cycle counts were only 32 bits wide, so timings of more than 2 seconds would wrap back around to zero.

If you want to run this test on your own hardware, I’ve put my code below the jump. You’ll have to build your own CFastTimer class, but it should be pretty clear what it does — it simply reads out of the CPU clock-cycle counter and computes a difference.
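
For reference, a minimal sketch of the shape such a class might take on x86 — this is my own assumption of what CFastTimer looks like, using MSVC’s __rdtsc intrinsic, not the actual class from the test:

#include <intrin.h>   // __rdtsc (MSVC)

class CFastTimer
{
public:
   void Start()                       { m_begin = __rdtsc(); }
   void End()                         { m_end = __rdtsc(); }
   unsigned __int64 GetCycles() const { return m_end - m_begin; }
   // convert cycles to milliseconds, given the CPU clock rate in GHz
   double GetMilliseconds( double cpuGHz ) const
   {
      return ( m_end - m_begin ) / ( cpuGHz * 1.0e6 );   // cycles / (cycles per ms)
   }
private:
   unsigned __int64 m_begin, m_end;
};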

Read the rest of this entry »

How Slow Are Virtual Functions Really?

Whenever I work with virtual functions I find myself wondering: how much is it costing me to perform all these vtable lookups and indirect calls? The usual truism is that computers are so fast now that it doesn’t matter and that the idea of virtuals being a problem is just another myth.

Our beloved Xenon CPU is in-order, however, so I got curious whether that myth is truly busted for us, and as any Mythbuster can tell you, the only way to know is to build it and test!

I’ll talk about the test results first and then try to explain them in a later article. I built a simple 4-dimensional vector class with accessor functions for x,y,z, and w. Then I set up three arrays (A, B, C) each containing 1024 of these classes (so everything fits into the L1 cache) and ran a loop that simply added them together one component at a time.

class Vector4Test {
  float x,y,z,w;
public:
  float GetX() { return x; }
  float SetX( float x_ ) { return x=x_; }
  // and so on
};
Vector4Test A[1024], B[1024], C[1024];
 
for (int n = 0 ; n < NUM_TESTS ; ++n)
for (int i=0; i < 1024 ; ++i) {
   C[i].SetX( A[i].GetX() + B[i].GetX() );
   // and so on for y, z, and w
}
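
The post doesn’t show how the three variants were produced, so treat this as my assumption: the inline build defines the accessors in the class body as above, the direct build moves the definitions out of line (or marks them noinline) so a genuine call is emitted, and the virtual build simply adds the virtual keyword:

class Vector4TestVirtual {
  float x,y,z,w;
public:
  virtual float GetX() { return x; }
  virtual float SetX( float x_ ) { return x=x_; }
  // and so on for y, z, and w
};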

By specifying whether the Get and Set functions are inline, direct, or virtual, it’s easy to compare the overhead of one kind of function call versus another. Each run through the loop would make three function calls per component, times four components, times 1024 elements in the array, for a total of 12,288 function calls. The inline function is essentially the control group, since it measures just the cost of the memory accesses, loop conditionals, and floating-point math without any function call overhead at all. Here are the results:

NOTE: The values below have been corrected from the first version of this post. See this comment for details.


1000 iterations over 1024 vectors
12,288,000 function calls
virtual: 159.856 ms
direct: 67.962 ms
inline: 8.040 ms

 

50000 iterations over 1024 vectors
614,400,000 function calls
virtual: 8080.708 ms
direct: 3406.297 ms
inline: 411.924 ms

A couple of things are immediately obvious. First, virtual functions are slower than direct function calls. But by how much? In the upper trial, the virtual-function test took 91.894ms longer than the direct functions; divided by the 12,288,000 function calls, that works out to a differential overhead of about 7.5 nanoseconds per call. So, there is a definite cost there, but probably not something to worry about unless it’s a function that gets called thousands of times per frame.

Later I’ll get further into the causes of these disparities, why virtual functions are slower than direct calls, and when inlining is advantageous. In the meantime I can tell you for sure that the problem is not the cost of looking up the indirect function pointer from the vtable — that’s only a single unproblematic load operation. Rather the issues lie in branch prediction and the way that marshalling parameters for the calling convention can get in the way of good instruction scheduling.