FPU “fun”

 

This small document summarizes various FPU issues I recently ran into.

 

1)     FPU rounding modes

 

On the PC, the FPU has four different rounding modes. Those modes determine the behavior of float-to-int conversions performed by the fist and fistp instructions:

 

 

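Value    Chop (toward zero)    Up (“Ceil”)    Down    Near (“Best”)
 1.2              1                 2           1           1
 1.6              1                 2           1           2
-1.2             -1                -1          -2          -1
-1.6             -1                -1          -2          -2

“Chop” truncates toward zero. It is often loosely called “floor”, but it only matches floor() for positive values.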

 

You can select a particular rounding mode by including <float.h> and using one of the following calls:

 

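      _controlfp(_RC_CHOP,    _MCW_RC);   // chop: truncate toward zero
      _controlfp(_RC_UP,      _MCW_RC);   // up: toward +infinity
      _controlfp(_RC_DOWN,    _MCW_RC);   // down: toward -infinity
      _controlfp(_RC_NEAR,    _MCW_RC);   // near: to nearest (default)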

 

The default FPU rounding mode is “Near”. If you have never touched the FPU state, your programs most likely all run in this mode.
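
The table above can be reproduced without any compiler-specific call. The sketch below is a portable approximation, assuming a C++11 compiler: it uses the standard <cfenv> functions instead of _controlfp (FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD and FE_TONEAREST playing the roles of chop, up, down and near), and std::lrint as a stand-in for fistp, since both convert using the current rounding mode. The helper name convert_in_mode is made up for this example.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: converts 'value' to an integer using the given
// rounding mode (FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD or FE_TONEAREST),
// then restores whatever mode was active before the call.
long convert_in_mode(float value, int mode)
{
    const int saved = std::fegetround();
    const int status = std::fesetround(mode);
    assert(status == 0);                 // the platform must support 'mode'
    (void)status;

    // volatile keeps the compiler from folding the conversion at compile
    // time or moving it across the rounding-mode changes.
    volatile float input = value;
    volatile long result = std::lrint(input);

    std::fesetround(saved);
    return result;
}
```

For instance, convert_in_mode(-1.6f, FE_DOWNWARD) gives -2 while convert_in_mode(-1.6f, FE_TOWARDZERO) gives -1, matching the last row of the table.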

 

 

2)     The float-to-int problem:

 

This is a classic problem that has been around since VC6, and probably before. It has to do with the way C-style float-to-int casts are compiled.

 

Compile this in VC6:

 

            volatile float f = 1.5f;
            int i = (int)f;

 

You get this:

 

           volatile float f = 1.5f;
00401001   mov         dword ptr [esp],3FC00000h
           int i = (int)f;
00401009   fld         dword ptr [esp]
0040100D   call        __ftol (004010c4)

 

Trace into __ftol() and you’ll see this:

 

__ftol:
004010C4   push        ebp
004010C5   mov         ebp,esp
004010C7   add         esp,0FFFFFFF4h
004010CA   wait
004010CB   fnstcw      word ptr [ebp-2]
004010CE   wait
004010CF   mov         ax,word ptr [ebp-2]
004010D3   or          ah,0Ch
004010D6   mov         word ptr [ebp-4],ax
004010DA   fldcw       word ptr [ebp-4]
004010DD   fistp       qword ptr [ebp-0Ch]
004010E0   fldcw       word ptr [ebp-2]
004010E3   mov         eax,dword ptr [ebp-0Ch]
004010E6   mov         edx,dword ptr [ebp-8]
004010E9   leave
004010EA   ret

 

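So the code saves the FPU control word (fnstcw), sets the rounding mode to “chop”, i.e. truncation toward zero, which is what a C cast requires (or ah,0Ch followed by fldcw), does the actual conversion (fistp), then restores the original control word (the second fldcw). This is done for each cast, and of course it is very slow: two control-word reloads plus a function call per conversion.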
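
The “or ah,0Ch” step is easier to follow once the layout of the control word is spelled out: on the x87, bits 10 and 11 hold the rounding-control field, and the value 3 in that field selects chop. The sketch below only mimics that bit arithmetic on a plain integer and never touches the real FPU. The function names are invented for illustration, and 0x027F is just a sample control word whose rounding field is set to near.

```cpp
#include <cassert>
#include <cstdint>

// x87 control word: bits 10-11 are the rounding-control (RC) field.
//   0 = near, 1 = down, 2 = up, 3 = chop
const std::uint16_t kRoundingMask = 0x0C00;

// Extracts the RC field from a control word.
unsigned rounding_field(std::uint16_t control_word)
{
    return (control_word & kRoundingMask) >> 10;
}

// Same effect as "or ah,0Ch" on AX: forces both RC bits to 1 (chop) and
// leaves every other bit untouched.
std::uint16_t with_chop(std::uint16_t control_word)
{
    const std::uint16_t result =
        static_cast<std::uint16_t>(control_word | kRoundingMask);
    assert(rounding_field(result) == 3u);
    return result;
}
```

With the sample value, with_chop(0x027F) yields 0x0E7F: only the two rounding bits change, which is why __ftol can restore the previous state simply by reloading the saved word.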

 

In VC7, the same code compiles to this:

 

      volatile float f = 1.5f;
00401093  mov         dword ptr [esp],3FC00000h
      int i = (int)f;
0040109A  fld         dword ptr [esp]
0040109D  call        _ftol2 (4013D4h)

 

And _ftol2() is like this:

 

_ftol2:
004013D4  push        ebp
004013D5  mov         ebp,esp
004013D7  sub         esp,20h
004013DA  and         esp,0FFFFFFF0h
004013DD  fld         st(0)
004013DF  fst         dword ptr [esp+18h]
004013E3  fistp       qword ptr [esp+10h]
004013E7  fild        qword ptr [esp+10h]
004013EB  mov         edx,dword ptr [esp+18h]
004013EF  mov         eax,dword ptr [esp+10h]
004013F3  test        eax,eax
004013F5  je          integer_QnaN_or_zero (401433h)
arg_is_not_integer_QnaN:
004013F7  fsubp       st(1),st
004013F9  test        edx,edx
004013FB  jns         positive (40141Bh)
004013FD  fstp        dword ptr [esp]
00401400  mov         ecx,dword ptr [esp]
00401403  xor         ecx,80000000h
00401409  add         ecx,7FFFFFFFh
0040140F  adc         eax,0
00401412  mov         edx,dword ptr [esp+14h]
00401416  adc         edx,0
00401419  jmp         localexit (401447h)
positive:
0040141B  fstp        dword ptr [esp]
0040141E  mov         ecx,dword ptr [esp]
00401421  add         ecx,7FFFFFFFh
00401427  sbb         eax,0
0040142A  mov         edx,dword ptr [esp+14h]
0040142E  sbb         edx,0
00401431  jmp         localexit (401447h)
integer_QnaN_or_zero:
00401433  mov         edx,dword ptr [esp+14h]
00401437  test        edx,7FFFFFFFh
0040143D  jne         arg_is_not_integer_QnaN (4013F7h)
0040143F  fstp        dword ptr [esp+18h]
00401443  fstp        dword ptr [esp+18h]
localexit:
00401447  leave
00401448  ret

 

 

The code looks longer, but it is actually faster than __ftol() since usually only the first part is executed. It doesn’t touch the FPU control word: it converts with whatever rounding mode is active and then corrects the result, so it still works regardless of the FPU state. It’s faster than the VC6 version, but it’s still a function call for each cast. Ideally we would like to remove the function call and use the fist instruction directly.
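
The idea behind _ftol2 can be approximated in portable code: convert once in whatever rounding mode is active, then look at the sign of the rounding error and step the result back toward zero when the conversion overshot. The sketch below is a simplified illustration of that idea, not a transcription of the routine above. It assumes C++11 <cfenv> and <cmath>, uses std::lrint in place of fistp, does not handle NaN or out-of-range inputs, and truncate_any_mode is a made-up name.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: truncates toward zero without changing the rounding
// mode. NaN and values outside the range of 'long' are not handled.
long truncate_any_mode(float value)
{
    volatile float input = value;        // block compile-time folding
    const float f = input;
    assert(!std::isnan(f));

    long rounded = std::lrint(f);        // uses the current rounding mode
    const float error = f - static_cast<float>(rounded);

    // If the conversion moved away from zero, undo that step.
    if (f > 0.0f && error < 0.0f)
        --rounded;                       // e.g. 1.6 -> 2 becomes 1
    else if (f < 0.0f && error > 0.0f)
        ++rounded;                       // e.g. -1.6 -> -2 becomes -1
    return rounded;
}
```

Whatever mode is active when it is called, truncate_any_mode(1.6f) gives 1 and truncate_any_mode(-1.6f) gives -1, at the price of a few extra instructions and a branch after the conversion.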

 

 

 

3)     /QIfist

 

The typical solution to enable fast casts is to use the /QIfist compilation flag, which works both in VC6 and VC7. The compiled code then becomes:

 

           volatile float f = 1.6f;
00401003   mov         dword ptr [esp],3FCCCCCDh
           int i = (int)f;
0040100B   fld         dword ptr [esp]
0040100F   fistp       qword ptr [esp]
00401013   mov         eax,dword ptr [esp]

 

We see that all calls to external functions have disappeared, and the cast is performed directly by the “fistp” instruction, inlined in the main code. The problem is that the result now depends on the current FPU rounding mode. For example, with the 1.6 floating-point value the cast gives 2 instead of 1 here, because the default FPU rounding mode is “near” rather than “chop”.

 

So one usual solution is to set the FPU rounding mode to “chop” once, at the start of your program, and let /QIfist handle all the fast casts afterwards. This works:

 

            _controlfp(_RC_CHOP,          _MCW_RC);

            volatile float f = 1.6f;
            int i = (int)f;

 

An alternative to /QIfist is to use your own cast function, like this:

 

            __forceinline int MyCast(float f)
            {
                        int i;
                        // Push f onto the FPU stack
                        _asm    fld        f
                        // Pop it as an integer, rounded with the current FPU rounding mode
                        _asm    fistp      i
                        return   i;
            }

 

But it’s not really better than /QIfist, just a different way to get the same result.
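
The difference between a C-style cast and a raw fistp can also be shown without inline assembly. In the sketch below, which assumes C++11 and is not MSVC-specific, std::lrint plays the role of MyCast because it also converts with the current rounding mode, while the plain cast always truncates. All function names are invented for the example.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Language-defined conversion: always truncates toward zero.
int cast_truncating(float f)
{
    volatile float input = f;            // block compile-time folding
    return static_cast<int>(input);
}

// Stand-in for a fistp-based cast: follows the current rounding mode.
int cast_current_mode(float f)
{
    volatile float input = f;
    return static_cast<int>(std::lrint(input));
}

// Returns true when both casts agree on 'f' under rounding mode 'mode'.
// The previous rounding mode is restored before returning.
bool casts_agree(float f, int mode)
{
    const int saved = std::fegetround();
    const int status = std::fesetround(mode);
    assert(status == 0);                 // the platform must support 'mode'
    (void)status;

    volatile bool same = (cast_truncating(f) == cast_current_mode(f));

    std::fesetround(saved);
    return same;
}
```

With the default “near” mode the two disagree on 1.6 (1 versus 2), so casts_agree(1.6f, FE_TONEAREST) is false. Once the mode is chop they agree, which is why a fistp-based cast needs the rounding mode to be set first.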

 

So, is that all? Well, no, the real problem starts now.

 

 

4)     Internal rounding mode

 

It turns out that the FPU rounding mode also changes the results of internal FPU computations, even when no float-to-int conversion is involved: the same control bits decide how the result of every arithmetic operation is rounded. Check this out:

 

      volatile float Size = 512.0f;
      float Val = logf(Size) / logf(2.0);

      printf("%f\n", Val);

 

In Debug mode, the printed result differs between the “near” mode and the “chop” mode. Note that there’s apparently no problem in Release. In any case, it proves the rounding mode can have a visible impact on FPU computations, even when no float-to-int cast is involved.
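
A smaller experiment that isolates the effect is a single division whose exact result is not representable as a float, such as 1/3. The sketch below assumes C++11 <cfenv> and relies on volatile variables so that the compiler performs the division at run time; divide_in_mode is a hypothetical helper, not something taken from the test above.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: computes a / b with the given rounding mode active,
// then restores the previous mode.
float divide_in_mode(float a, float b, int mode)
{
    const int saved = std::fegetround();
    const int status = std::fesetround(mode);
    assert(status == 0);                 // the platform must support 'mode'
    (void)status;

    volatile float x = a;
    volatile float y = b;
    volatile float quotient = x / y;     // rounded per the current mode

    std::fesetround(saved);
    return quotient;
}
```

Dividing 1.0f by 3.0f with FE_UPWARD gives a strictly larger float than with FE_DOWNWARD, although no integer conversion is involved. An exactly representable quotient such as 1.0f / 4.0f is the same in every mode.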

 

This is a very strong result. It means all our previous strategies to enable fast casts are actually dangerous, and might have unexpected side effects anywhere FPU computations are involved.

 

At this point, I haven’t found a good way to have both the fast casts and the “safe” rounding mode, i.e. the default one, “near”.

 

 

5)     Threads

 

Threads introduce another issue: the FPU state is saved and restored on each context switch. So the FPU state is per-thread, and one must make sure it is set up correctly in every thread.
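
A per-thread setup can be sketched with standard C++11 threads. The example below is an illustration under several assumptions: it uses std::thread and <cfenv> instead of the Win32 thread API and _controlfp, and the function names are made up. The worker sets its own rounding mode on entry, and the caller checks that its own mode was not affected. The mode a new thread starts with can differ between platforms, which is why the worker sets it explicitly rather than relying on inheritance.

```cpp
#include <cassert>
#include <cfenv>
#include <thread>

// Hypothetical worker entry point: sets the rounding mode it needs before
// doing any floating-point work, and reports the mode it ended up with.
void worker(int wanted_mode, int* mode_seen)
{
    std::fesetround(wanted_mode);
    *mode_seen = std::fegetround();
}

// Runs a worker that switches to 'wanted_mode' and returns true when the
// calling thread's own rounding mode was left untouched by it.
bool caller_mode_unchanged(int wanted_mode)
{
    const int before = std::fegetround();
    int mode_in_worker = -1;

    std::thread t(worker, wanted_mode, &mode_in_worker);
    t.join();

    assert(mode_in_worker == wanted_mode);
    return std::fegetround() == before;
}
```

Calling caller_mode_unchanged(FE_TOWARDZERO) from a thread running in the default mode returns true: the worker ran in chop mode, and the caller is still in “near” afterwards.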

 

 

Pierre Terdiman