FPU “fun”
This small document summarizes various FPU issues I recently ran into.
1) FPU rounding modes
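On the PC, the FPU has four different rounding modes. They determine how the fist and fistp instructions convert a float to an integer. Strictly speaking, “Chop” truncates toward zero, which matches a floor only for non-negative values; the true floor is “Down”: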
| Value | Chop (“Floor”) | Up (“Ceil”) | Down | Near (“Best”) |
|-------|----------------|-------------|------|---------------|
| 1.2   | 1              | 2           | 1    | 1             |
| 1.6   | 1              | 2           | 1    | 2             |
| -1.2  | -1             | -1          | -2   | -1            |
| -1.6  | -1             | -1          | -2   | -2            |
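As a cross-check, the four columns can be reproduced with plain integer arithmetic, without touching the FPU state. The sketch below is only an illustration: the function names are made up for this example, and it assumes the input fits in an int.

```c
#include <assert.h>

/* Chop: truncate toward zero, which is what a plain C cast does. */
int round_chop(float x)
{
    /* (int)x is undefined outside the int range, so reject such inputs. */
    assert(x >= -2147483648.0f && x < 2147483648.0f);
    return (int)x;
}

/* Down: largest integer <= x. */
int round_down(float x)
{
    int t = round_chop(x);
    return (x < (float)t) ? t - 1 : t;
}

/* Up: smallest integer >= x. */
int round_up(float x)
{
    int t = round_chop(x);
    return (x > (float)t) ? t + 1 : t;
}

/* Near: closest integer; an exact tie goes to the even neighbour. */
int round_near(float x)
{
    int lo = round_down(x);
    float frac = x - (float)lo;          /* always in [0, 1) */
    if (frac > 0.5f) return lo + 1;
    if (frac < 0.5f) return lo;
    return (lo % 2 == 0) ? lo : lo + 1;  /* tie: pick the even one */
}
```

For instance, round_near(1.6f) gives 2 while round_chop(1.6f) gives 1, matching the 1.6 row.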
You can select a particular rounding mode by including <float.h> and using one of the following calls:
_controlfp(_RC_CHOP, _MCW_RC);
_controlfp(_RC_UP, _MCW_RC);
_controlfp(_RC_DOWN, _MCW_RC);
_controlfp(_RC_NEAR, _MCW_RC);
The default FPU rounding mode is “Near”. If you have never touched the FPU state, your programs most likely all run in this mode.
2) The float-to-int problem
This is a classic problem that has been around since VC6, and probably before. It has to do with the way C-style float-to-int casts are compiled.
Compile this in VC6:
volatile float f = 1.5f;
int i = (int)f;
You get this:
volatile float f = 1.5f;
00401001 mov dword ptr [esp],3FC00000h
int i = (int)f;
00401009 fld dword ptr [esp]
0040100D call __ftol (004010c4)
Trace into __ftol() and you’ll see this:
__ftol:
004010C4 push ebp
004010C5 mov ebp,esp
004010C7 add esp,0FFFFFFF4h
004010CA wait
004010CB fnstcw word ptr [ebp-2]
004010CE wait
004010CF mov ax,word ptr [ebp-2]
004010D3 or ah,0Ch
004010D6 mov word ptr [ebp-4],ax
004010DA fldcw word ptr [ebp-4]
004010DD fistp qword ptr [ebp-0Ch]
004010E0 fldcw word ptr [ebp-2]
004010E3 mov eax,dword ptr [ebp-0Ch]
004010E6 mov edx,dword ptr [ebp-8]
004010E9 leave
004010EA ret
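So the code stores the FPU control word (fnstcw), switches the rounding mode to chop (“floor”) by setting both rounding-control bits (or ah,0Ch, then fldcw), performs the conversion (fistp), then restores the original control word (fldcw). This happens on every cast, and reloading the control word twice per conversion makes it very slow.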
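To make the "or ah,0Ch" step concrete: on the x87, bits 10 and 11 of the control word form the rounding-control field, and setting both selects chop. The sketch below mirrors that bit manipulation in plain C. The macro and function names are invented for this example, and it only manipulates a 16-bit value; it does not read or write the real FPU control word.

```c
#include <stdint.h>

/* x87 control word: bits 10-11 hold the rounding control (RC) field. */
#define RC_MASK 0x0C00u
#define RC_NEAR 0x0000u   /* 00: round to nearest */
#define RC_DOWN 0x0400u   /* 01: round toward -infinity */
#define RC_UP   0x0800u   /* 10: round toward +infinity */
#define RC_CHOP 0x0C00u   /* 11: truncate toward zero */

/* Same effect as "or ah,0Ch" in __ftol: force the RC field to chop
   and leave every other bit of the control word alone. */
uint16_t force_chop(uint16_t control_word)
{
    return (uint16_t)(control_word | RC_CHOP);
}

/* Extract the RC field from a control word. */
uint16_t rounding_bits(uint16_t control_word)
{
    return (uint16_t)(control_word & RC_MASK);
}
```

For example, a control word of 0x027F has its RC field set to near; force_chop(0x027F) returns 0x0E7F, whose RC field is chop.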
In VC7, the same code compiles to this:
volatile float f = 1.5f;
00401093 mov dword ptr [esp],3FC00000h
int i = (int)f;
0040109A fld dword ptr [esp]
0040109D call _ftol2 (4013D4h)
And _ftol2() is like this:
_ftol2:
004013D4 push ebp
004013D5 mov ebp,esp
004013D7 sub esp,20h
004013DA and esp,0FFFFFFF0h
004013DD fld st(0)
004013DF fst dword ptr [esp+18h]
004013E3 fistp qword ptr [esp+10h]
004013E7 fild qword ptr [esp+10h]
004013EB mov edx,dword ptr [esp+18h]
004013EF mov eax,dword ptr [esp+10h]
004013F3 test eax,eax
004013F5 je integer_QnaN_or_zero (401433h)
arg_is_not_integer_QnaN:
004013F7 fsubp st(1),st
004013F9 test edx,edx
004013FB jns positive (40141Bh)
004013FD fstp dword ptr [esp]
00401400 mov ecx,dword ptr [esp]
00401403 xor ecx,80000000h
00401409 add ecx,7FFFFFFFh
0040140F adc eax,0
00401412 mov edx,dword ptr [esp+14h]
00401416 adc edx,0
00401419 jmp localexit (401447h)
positive:
0040141B fstp dword ptr [esp]
0040141E mov ecx,dword ptr [esp]
00401421 add ecx,7FFFFFFFh
00401427 sbb eax,0
0040142A mov edx,dword ptr [esp+14h]
0040142E sbb edx,0
00401431 jmp localexit (401447h)
integer_QnaN_or_zero:
00401433 mov edx,dword ptr [esp+14h]
00401437 test edx,7FFFFFFFh
0040143D jne arg_is_not_integer_QnaN (4013F7h)
0040143F fstp dword ptr [esp+18h]
00401443 fstp dword ptr [esp+18h]
localexit:
00401447 leave
00401448 ret
The code looks longer, but it is actually faster than __ftol() because usually only the first part is executed. It does not touch the FPU control word, yet it still works regardless of the FPU state: it converts with fistp in whatever mode is active, then corrects the result by one if the rounding moved away from zero. It is faster than the VC6 version, but it is still a function call for each cast. Ideally we would like to remove the function call and use the fist instruction directly.
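The correction that _ftol2() applies after fistp can be sketched in C. This is a simplified reading of the listing, not a translation of it: it ignores the zero and NaN special cases, works in float rather than in the FPU's internal precision, and fix_to_truncation is a name made up for this example. The caller passes in the integer that the rounding step produced, whatever mode was active.

```c
#include <assert.h>

/* rounded: the integer obtained by rounding x under any of the four modes.
   Returns x truncated toward zero, which is what a C cast must produce. */
long long fix_to_truncation(float x, long long rounded)
{
    /* Signed error introduced by the rounding step (fild + fsubp in the listing). */
    float diff = x - (float)rounded;
    assert(diff > -1.0f && diff < 1.0f);   /* rounded must really come from x */

    if (x > 0.0f && diff < 0.0f)
        return rounded - 1;   /* rounded up past x: step back toward zero */
    if (x < 0.0f && diff > 0.0f)
        return rounded + 1;   /* rounded down past x: step back toward zero */
    return rounded;           /* already truncated, or x is an exact integer */
}
```

With 1.6 rounded to 2 in near mode, fix_to_truncation(1.6f, 2) gives 1; with -1.2 rounded to -2 in down mode, fix_to_truncation(-1.2f, -2) gives -1.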
3) /QIfist
The typical solution to enable fast casts is to use the /QIfist compilation flag, which works both in VC6 and VC7. The compiled code then becomes:
volatile float f = 1.6f;
00401003 mov dword ptr [esp],3FCCCCCDh
int i = (int)f;
0040100B fld dword ptr [esp]
0040100F fistp qword ptr [esp]
00401013 mov eax,dword ptr [esp]
All calls to external functions have disappeared, and the cast is performed directly by an inlined fistp instruction. The problem is that the result now depends on the current FPU rounding mode. For example, with the value 1.6 the cast typically gives 2 instead of 1 here, because the default FPU rounding mode is near (“best”), not chop (“floor”).
So one usual solution is to set the FPU rounding mode to chop (“floor”) once, at the start of your program, and let /QIfist handle all the fast casts afterwards. This works:
_controlfp(_RC_CHOP, _MCW_RC);
volatile float f = 1.6f;
int i = (int)f;
An alternative to /QIfist is to use your own cast function, like this:
__forceinline int MyCast(float f)
{
    int i;
    _asm fld f      // push f onto the FPU stack
    _asm fistp i    // store as int using the current rounding mode, then pop
    return i;
}
But it’s not really better than /QIfist, just a different way to get the same result.
So, is that all? Well, no, the real problem starts now.
4) Internal rounding mode
The FPU rounding mode also changes the results of internal FPU computations, even when no float-to-int conversion is involved, because every arithmetic result that is not exactly representable gets rounded according to that same mode. Check this out:
volatile float Size = 512.0f;
float Val = logf(Size) / logf(2.0);
printf("%f\n", Val);
In Debug mode, the result differs between the near mode and the chop (“floor”) mode. There is apparently no difference in Release. In any case, it proves that the rounding mode can have a visible impact on FPU computations, even when no float-to-int cast is involved.
This is a very strong result. It means all our previous strategies to enable fast casts are actually dangerous, and might have unexpected side effects everywhere, as soon as some FPU computation is involved.
At this point, I have not found a good way to get both the fast casts and the “safe” rounding mode, i.e. the default one, “near”.
5) Threads
Threads introduce another issue: the FPU state is saved and restored on each context switch. So the FPU state is thread-dependent, and one must make sure it is correctly set up in each thread.
Pierre Terdiman