The feedback I got on yesterday’s article on float-to-int conversion prompted me to look more closely into all the different options MSVC actually gives you for rounding on the x86 architecture. It turns out that with /fp:fast set it can do one of three things (in addition to the magic-number rounding you can write yourself):

  • By default it will call a function _ftol2_sse, which tests the CPU to see if it has SSE2 functionality. If so, it uses the native SSE2 instruction cvttsd2si. If not, it calls _ftol(). This is quite slow, both because it has to perform that CPU test on every single conversion and because of the overhead of a function call.
  • With /QIfist specified, the compiler simply emits a fistp opcode to convert the x87 floating point register to an integer in memory directly. It uses whatever rounding mode happens to be set in the CPU at the moment.
  • With /arch:SSE2 specified, the compiler assumes that the program will only run on CPUs with SSE2, so it emits the cvttsd2si opcode directly instead of calling _ftol2_sse. Like /QIfist, this replaces a function call with a single instruction, but it’s even faster and not deprecated. As commenter cb points out, the SSE conversion intrinsics also let you choose truncation (cvttsd2si) or rounding (cvtsd2si) explicitly, without having to fool around with CPU modes.
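
To make the truncation-versus-rounding point concrete, here is a minimal sketch using the standard SSE/SSE2 intrinsics from <emmintrin.h>. The wrapper function names are made up for illustration, and the rounding variants assume the MXCSR rounding mode is still at its default (round to nearest even).

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */

/* Hypothetical wrappers, named for illustration only. */

/* Truncate toward zero (cvttss2si), like a plain C cast. */
static int float_to_int_trunc(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Round using the current MXCSR rounding mode (cvtss2si).
   Assumes the default mode, round to nearest even. */
static int float_to_int_round(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Double-precision equivalents (cvttsd2si and cvtsd2si). */
static int double_to_int_trunc(double d)
{
    return _mm_cvttsd_si32(_mm_set_sd(d));
}

static int double_to_int_round(double d)
{
    return _mm_cvtsd_si32(_mm_set_sd(d));
}
```

With these, float_to_int_trunc(-2.7f) gives -2 while float_to_int_round(-2.7f) gives -3, and neither needs a sign check or a change of FPU control word.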

I raced the different techniques against each other and the clear winner was the function compiled with /arch:SSE2 set. Thus, if you can assume that your customer will have an SSE2-capable CPU, setting that simple compiler switch will provide you with superior performance for basically no work. The only caveat is that the SSE scalar operations work at double precision at most, whereas the old x87 FPU instructions are internally 80-bit, but I’ve never seen a game application where that level of precision makes a difference.
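
For anyone who wants to run a similar race, a timing loop might look roughly like the sketch below. This is not the harness behind the numbers in this post: the names time_conversion and cast_convert are hypothetical, it uses the coarse standard clock() timer, and calling through a function pointer adds overhead of its own, so for a serious measurement each conversion would be inlined into its own loop and built with the compiler switch under test.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for a conversion routine under test. */
typedef int (*convert_fn)(float);

/* Baseline: a plain C cast. What this compiles to depends on the
   compiler switches discussed above. */
static int cast_convert(float f)
{
    return (int)f;
}

/* Converts n floats with fn and returns the elapsed time in
   milliseconds. The results are summed into *sum so the compiler
   cannot discard the conversions as dead code. */
static double time_conversion(convert_fn fn, const float *in, size_t n,
                              long long *sum)
{
    long long acc = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) {
        acc += fn(in[i]);
    }
    clock_t stop = clock();
    *sum = acc;
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Comparing the checksum across techniques also shows quickly whether two conversions disagree, for example because one truncates and the other rounds.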

According to the Steam Hardware Survey, 95% of our customers have SSE2-capable CPUs. The rest are probably not playing the most recent releases anyway.

Comparison of rounding speeds
8 trials of 1.024 × 10^8 floats on a Core 2

/fp:fast      magic number   /arch:sse2    /QIfist
312.944 ms    184.534 ms     96.978 ms     178.732 ms
314.255 ms    182.105 ms     91.390 ms     178.363 ms
311.359 ms    181.397 ms     89.606 ms     182.709 ms
309.149 ms    181.023 ms     87.732 ms     180.485 ms
309.828 ms    181.405 ms     91.891 ms     184.785 ms
309.595 ms    176.970 ms     86.887 ms     178.501 ms
309.081 ms    179.109 ms     86.885 ms     177.811 ms
308.208 ms    176.873 ms     86.796 ms     178.051 ms
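
For readers who have not seen it, the "magic number" technique is the hand-written rounding trick mentioned at the top of this post. One common form of it is sketched below. It may differ from the exact code that was timed, the name magic_round is hypothetical, and it assumes IEEE 754 doubles, an addition carried out in double precision with round-to-nearest, and inputs smaller than 2^31 in magnitude.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round a double to the nearest 32-bit integer
   by adding 1.5 * 2^52. After the addition the integer part of d sits
   in the low bits of the mantissa in two's complement form. */
static int32_t magic_round(double d)
{
    const double magic = 6755399441055744.0;  /* 2^52 + 2^51 */
    double biased = d + magic;   /* the FPU rounds to an integer here */
    uint64_t bits;
    memcpy(&bits, &biased, sizeof bits);      /* read the raw bit pattern */
    /* Keep the low 32 bits. Converting back to a signed value relies on
       the usual two's complement behaviour of the compiler. */
    return (int32_t)(uint32_t)bits;
}
```

Because the rounding is done by the floating-point add itself, ties go to the nearest even integer: magic_round(2.5) is 2 and magic_round(3.5) is 4. If the sum is evaluated in 80-bit x87 precision first, it can be rounded twice and the result may be off by one in rare cases.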


  1. cb says:

    IMO the biggest win from the sse conversions is that you have direct access to round-to-int and trunc-to-int without changing FPU mode or checking for negatives.

  2. Yuhong Bao says:

    SSE3 has FISTTP, which gives direct access to truncation when using the x87 stack.
