More About Rounding On MSVC-x86
The feedback I got on yesterday’s article on float-to-int conversion prompted me to look more closely into all the different options MSVC actually gives you for rounding on the x86 architecture. It turns out that with `/fp:fast` set it can do one of three things (in addition to the magic-number rounding you can write yourself):
- By default it will call a function, `_ftol2_sse`, which tests the CPU to see if it has SSE2 functionality. If so, it uses the native SSE2 instruction `cvttsd2si`. If not, it calls `_ftol()`. This is quite slow because it has to perform that CPU test for every single conversion, and because there is the overhead of a function call.
- With `/QIfist` specified, the compiler simply emits a `fistp` opcode to convert the x87 floating point register to an integer in memory directly. It uses whatever rounding mode happens to be set in the CPU at the moment.
- With `/arch:SSE2` specified, the compiler assumes that the program will only run on CPUs with SSE2, so it emits the `cvttsd2si` opcode directly instead of calling `_ftol2_sse`. Like `/QIfist`, this replaces a function call with a single instruction, but it’s even faster and not deprecated. As commenter cb points out, the intrinsics also let you specify truncation or rounding without having to fool around with CPU modes.
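
To make that last point concrete, here is a minimal sketch of the two SSE2 conversions reached through intrinsics instead of a plain cast. The intrinsics `_mm_set_sd`, `_mm_cvttsd_si32` and `_mm_cvtsd_si32` come from `<emmintrin.h>`. The wrapper names and the `HAVE_SSE2_INTRINSICS` macro are made up for this illustration, and the non-SSE2 fallback is an assumption added only so the snippet still builds on other targets.

```c
#include <assert.h>

/* HAVE_SSE2_INTRINSICS is a hypothetical macro for this sketch only. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>            /* SSE2 intrinsics */
#define HAVE_SSE2_INTRINSICS 1
#else
#include <math.h>                 /* lrint() for the fallback path */
#define HAVE_SSE2_INTRINSICS 0
#endif

/* Truncate toward zero. On SSE2 targets this maps to cvttsd2si. */
static int double_to_int_trunc(double x)
{
    /* Values outside the 32-bit range are not handled by this sketch. */
    assert(x >= -2147483648.0 && x <= 2147483647.0);
#if HAVE_SSE2_INTRINSICS
    return _mm_cvttsd_si32(_mm_set_sd(x));
#else
    return (int)x;                /* a C cast always truncates */
#endif
}

/* Round using the current rounding mode (round-to-nearest-even unless
   the program changed it). On SSE2 targets this maps to cvtsd2si. */
static int double_to_int_round(double x)
{
    assert(x >= -2147483648.0 && x <= 2147483647.0);
#if HAVE_SSE2_INTRINSICS
    return _mm_cvtsd_si32(_mm_set_sd(x));
#else
    return (int)lrint(x);
#endif
}
```

With the default rounding mode, `double_to_int_trunc(2.7)` gives 2 and `double_to_int_round(2.7)` gives 3, while `double_to_int_round(2.5)` gives 2 because ties go to the even neighbour. On the SSE2 path neither call touches the x87 control word.
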
I raced the different techniques against each other and the clear winner was the function compiled with `/arch:SSE2` set: in the timings below it runs in roughly half the time of either `/QIfist` or the magic number, and in under a third of the time of the default function call. Thus, if you can assume that your customer will have a CPU with SSE2 enabled, setting that simple compiler switch will provide you with superior performance for basically no work. The only caveat is that the SSE scalar operations operate at a maximum of double-precision floats, whereas the old x87 FPU instructions are internally 80-bit, but I’ve never seen a game application where that level of precision makes a difference.
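
The race itself is not shown in this post, so the following is only a guess at its general shape, not the harness that produced the timings. It assumes the conversion under test is an ordinary `(int)` cast, since that is the code whose output the compiler switches change. The function names are hypothetical, and `clock()` is used only because it is portable.

```c
#include <stddef.h>
#include <time.h>

/* Convert every element with a plain cast and sum the results.
   The pointer is volatile-qualified so every element is re-read on
   every pass instead of being cached by the optimizer. */
static long long sum_as_ints(const volatile double *values, size_t count)
{
    long long total = 0;
    for (size_t i = 0; i < count; ++i)
        total += (int)values[i];   /* the conversion being timed */
    return total;
}

/* Run `repeats` passes over the array. Returns the elapsed CPU time in
   milliseconds and stores the accumulated sum in *checksum so the
   result is observably used. */
static double time_conversions_ms(const volatile double *values, size_t count,
                                  int repeats, long long *checksum)
{
    long long total = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        total += sum_as_ints(values, count);
    clock_t stop = clock();
    *checksum = total;
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building the same file three times, with `/fp:fast` alone, with `/fp:fast /QIfist`, and with `/fp:fast /arch:SSE2`, would then exercise the three code paths described above without changing a line of source.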
According to the Steam Hardware Survey, 95% of our customers have SSE2-capable CPUs. The rest are probably not playing your most recent releases anyway.

| `/fp:fast` (default) | magic number | `/arch:SSE2` | `/QIfist` |
|---|---|---|---|
| 312.944ms | 184.534ms | 96.978ms | 178.732ms |
| 314.255ms | 182.105ms | 91.390ms | 178.363ms |
| 311.359ms | 181.397ms | 89.606ms | 182.709ms |
| 309.149ms | 181.023ms | 87.732ms | 180.485ms |
| 309.828ms | 181.405ms | 91.891ms | 184.785ms |
| 309.595ms | 176.970ms | 86.886ms | 178.501ms |
| 309.081ms | 179.109ms | 86.885ms | 177.811ms |
| 308.208ms | 176.873ms | 86.796ms | 178.051ms |

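
For readers who have not met the magic-number technique timed in the second column, here is a minimal sketch of one common form of it, which may differ from the variant that was actually measured. It assumes IEEE 754 doubles, arithmetic carried out at double precision, and the default round-to-nearest mode. The function name `round_magic` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a double to the nearest 32-bit integer without a conversion
   instruction. Adding 1.5 * 2^52 pushes the value into a range where
   doubles can only represent whole numbers, so the addition itself does
   the rounding and the low 32 bits of the result hold the integer in
   two's complement form. */
static int32_t round_magic(double x)
{
    const double magic = 6755399441055744.0;   /* 1.5 * 2^52 */
    double shifted;
    uint64_t bits;
    uint32_t low;
    int32_t result;

    /* Only meaningful for values that fit in 32 bits. */
    assert(x >= -2147483648.0 && x <= 2147483647.0);

    shifted = x + magic;                        /* rounding happens here */
    memcpy(&bits, &shifted, sizeof bits);       /* view the raw bit pattern */
    low = (uint32_t)bits;                       /* keep the low 32 bits */
    memcpy(&result, &low, sizeof result);       /* reinterpret as signed */
    return result;
}
```

Because the rounding is done by the floating-point add, ties go to the even neighbour: `round_magic(2.5)` yields 2 and `round_magic(3.5)` yields 4. That matches the default behaviour of `fistp` and `cvtsd2si`, not the truncation of a C cast.
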
IMO the biggest win from the SSE conversions is that you have direct access to round-to-int and trunc-to-int without changing FPU mode or checking for negatives.
SSE3 has FISTTP, which allows direct access to truncation mode when using the x87 stack.