A quick post about a little-known feature of the “restrict” keyword…
I assume you all know about “restrict” already, but if not, let’s start with a simple example of what it’s useful for.
Say we have a function in some class looking like this:
class RTest
{
public:
    RTest() : mMember(0) {}
    void DoStuff(int nb, int* target);
    int mMember;
};

void RTest::DoStuff(int nb, int* target)
{
    while(nb--)
    {
        *target++ = mMember;
        mMember++;
    }
}
Looking at the disassembly in Release mode, you get something like the following (the isolated block in the middle is the loop):
00E9EEA0 mov eax,dword ptr [esp+4]
00E9EEA4 test eax,eax
00E9EEA6 je RTest::DoStuff+1Fh (0E9EEBFh)
00E9EEA8 mov edx,dword ptr [esp+8]
00E9EEAC push esi
00E9EEAD lea ecx,[ecx]
00E9EEB0 mov esi,dword ptr [ecx] // Load mMember
00E9EEB2 mov dword ptr [edx],esi // *target = mMember
00E9EEB4 inc dword ptr [ecx] // mMember++
00E9EEB6 dec eax
00E9EEB7 add edx,4 // target++
00E9EEBA test eax,eax
00E9EEBC jne RTest::DoStuff+10h (0E9EEB0h)
00E9EEBE pop esi
00E9EEBF ret 8
So as you can see, there is a read-modify-write operation on mMember each time, and then mMember is reloaded once again to write it to the target buffer. This is not very efficient. Loads & writes to memory are slower than loads & writes to registers, for example. But more importantly, this creates a lot of LHS (load-hit-stores), since we clearly load what we just wrote. On a platform like the Xbox, where an LHS is a ~60-cycle penalty on average, this is a killer. Generally speaking, any piece of code doing “mMember++” is a potential LHS, and something to keep an eye on.
There are various ways to do better than that. One way would be to simply rewrite the code so that mMember is explicitly kept in a local variable:
void RTest::DoStuffLocal(int nb, int* target)
{
    int local = mMember;
    while(nb--)
    {
        *target++ = local;
        local++;
    }
    mMember = local;
}
This produces the following disassembly:
010AEED0 mov edx,dword ptr [esp+4]
010AEED4 mov eax,dword ptr [ecx] // Load mMember
010AEED6 test edx,edx
010AEED8 je RTest::DoStuffLocal+1Ch (10AEEECh)
010AEEDA push esi
010AEEDB mov esi,dword ptr [esp+0Ch]
010AEEDF nop
010AEEE0 mov dword ptr [esi],eax // *target = mMember
010AEEE2 dec edx
010AEEE3 add esi,4 // target++
010AEEE6 inc eax // mMember++
010AEEE7 test edx,edx
010AEEE9 jne RTest::DoStuffLocal+10h (10AEEE0h)
010AEEEB pop esi
010AEEEC mov dword ptr [ecx],eax // Store mMember
010AEEEE ret 8
This is pretty much what you expect from the source code: you see that the load has been moved outside of the loop, our local variable has been mapped to the eax register, the LHS are gone, and mMember is properly updated only once, after the loop has ended.
Note that the compiler inserted a nop just before the loop. This is simply because loops should be aligned to 16-byte boundaries to be the most efficient.
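As a quick sanity check, here is a minimal sketch that puts the original loop and the local-variable loop side by side and compares them on the same input. The `Counter` class and the `sameResult` helper are names made up for this illustration, and the sketch assumes the output buffers never overlap the objects themselves.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for RTest, holding both versions of the loop.
class Counter
{
public:
    Counter() : mMember(0) {}

    // Original version: mMember is accessed through 'this' on every iteration.
    void DoStuff(int nb, int* target)
    {
        while(nb--)
        {
            *target++ = mMember;
            mMember++;
        }
    }

    // Local-variable version: mMember is read once and written back once.
    void DoStuffLocal(int nb, int* target)
    {
        int local = mMember;
        while(nb--)
        {
            *target++ = local;
            local++;
        }
        mMember = local;
    }

    int mMember;
};

// Hypothetical helper: runs both versions on separate buffers and compares
// the buffers and the final member values. Assumes nb >= 0.
bool sameResult(int nb)
{
    assert(nb >= 0);
    std::vector<int> a(nb), b(nb);
    Counter c0, c1;
    c0.DoStuff(nb, a.data());
    c1.DoStuffLocal(nb, b.data());
    return a == b && c0.mMember == c1.mMember;
}
```

Both versions should agree whenever the buffer and the object are separate, which is the case the rewrite is meant for.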
Another way to achieve the same result without modifying the main code is to use the restrict keyword. Just mark the target pointer as restricted, like this:
void RTest::DoStuffRestricted(int nb, int* __restrict target)
{
    while(nb--)
    {
        *target++ = mMember;
        mMember++;
    }
}
This produces the following disassembly:
010AEF00 mov edx,dword ptr [esp+4]
010AEF04 test edx,edx
010AEF06 je RTest::DoStuffRestricted+1Eh (10AEF1Eh)
010AEF08 mov eax,dword ptr [ecx] // Load mMember
010AEF0A push esi
010AEF0B mov esi,dword ptr [esp+0Ch]
010AEF0F nop
010AEF10 mov dword ptr [esi],eax // *target = mMember
010AEF12 dec edx
010AEF13 add esi,4 // target++
010AEF16 inc eax // mMember++
010AEF17 test edx,edx
010AEF19 jne RTest::DoStuffRestricted+10h (10AEF10h)
010AEF1B mov dword ptr [ecx],eax // Store mMember
010AEF1D pop esi
010AEF1E ret 8
In other words, this is almost exactly the same disassembly as for the solution using the local variable - but without the need to actually modify the main source code.
What happened here should not be a surprise: without __restrict, the compiler had no way to know that the target pointer was not potentially pointing to mMember itself. So it had to assume the worst and generate “safe” code that would work even in that unlikely scenario. Using __restrict, however, told the compiler that the memory pointed to by “target” was accessed through that pointer only (and pointers copied from it). In particular, it promised the compiler that “this”, the implicit pointer from the RTest class, could not point to the same memory as “target”. And thus, it is now safe to keep mMember in a register for the duration of the loop.
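To see why the compiler has to be so careful, here is a small sketch using a deliberately simpler function than the loop above. Both `addTwice` and `addTwiceCached` are hypothetical names invented for this illustration. They compute the same thing when the two pointers are distinct, but give different results when they alias, so the compiler cannot turn the first into the second on its own.

```cpp
// Hypothetical example: adds *b to *a twice, reading *b each time.
void addTwice(int* a, const int* b)
{
    *a += *b;
    *a += *b;   // *b must be reloaded: the previous line may have changed it
}

// Same computation with *b kept in a local, which is what the compiler
// would like to do, and what a no-aliasing promise allows it to do.
void addTwiceCached(int* a, const int* b)
{
    const int cached = *b;
    *a += cached;
    *a += cached;
}
```

With distinct variables both functions turn 1 into 3. When both pointers point at the same int holding 1, addTwice produces 4 while addTwiceCached produces 3.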
So far, so good. This is pretty much a textbook example of how to use __restrict and what it is useful for. The only important point until now, really, is this: as you can see from the disassembly, __restrict has a clear, real impact on generated code. Just in case you had any doubts…
Now the reason for this post is something more subtle than this: how do we “restrict this”? How do we restrict the implicit “this” pointer in C++?
Consider the following, modified example, where our target pointer is now a class member:
class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}
    int DoStuffClassMember(int nb);
    int mMember;
    int* mTarget;
};

int RTest::DoStuffClassMember(int nb)
{
    while(nb--)
    {
        *mTarget++ = mMember;
        mMember++;
    }
    return mMember;
}
Suddenly we can’t easily mark the target pointer as restricted anymore, and the generated code looks pretty bad:
0141EF60 mov eax,dword ptr [esp+4]
0141EF64 test eax,eax
0141EF66 je RTest::DoStuffClassMember+23h (141EF83h)
0141EF68 push esi
0141EF69 mov edx,4
0141EF6E push edi
0141EF6F nop
0141EF70 mov esi,dword ptr [ecx+4] // mTarget
0141EF73 mov edi,dword ptr [ecx] // mMember
0141EF75 mov dword ptr [esi],edi // *mTarget = mMember;
0141EF77 add dword ptr [ecx+4],edx // mTarget++
0141EF7A inc dword ptr [ecx] // mMember++
0141EF7C dec eax
0141EF7D test eax,eax
0141EF7F jne RTest::DoStuffClassMember+10h (141EF70h)
0141EF81 pop edi
0141EF82 pop esi
0141EF83 mov eax,dword ptr [ecx]
0141EF85 ret 4
That’s pretty much as bad as it gets: 2 loads, 2 read-modify-writes, 2 LHS for each iteration of that loop. This is what Christer Ericson refers to as the “C++ abstraction penalty”: generally speaking, accessing class members within loops is a very bad idea. It is usually much better to load those class members into local variables before the loop starts, or pass them to the function as external parameters.
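For reference, the “load the members into locals” approach could look like the following sketch. The class name `RTestLocal` and the function name `DoStuffClassMemberLocal` are made up here to keep the example self-contained. Like the earlier local-variable version, it assumes mTarget never points into the object itself.

```cpp
// Hypothetical stand-in for RTest with a hand-written local-copy loop.
class RTestLocal
{
public:
    RTestLocal() : mMember(0), mTarget(0) {}
    int DoStuffClassMemberLocal(int nb);
    int  mMember;
    int* mTarget;
};

int RTestLocal::DoStuffClassMemberLocal(int nb)
{
    // Read both members once, before the loop.
    int  member = mMember;
    int* target = mTarget;
    while(nb--)
    {
        *target++ = member;
        member++;
    }
    // Write both members back once, after the loop.
    mTarget = target;
    mMember = member;
    return member;
}
```

The price is the extra bookkeeping: every member touched by the loop has to be copied in and copied back by hand.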
As we saw in the previous example, an alternative would be to mark the target pointer as restricted. In this particular case though, it seems difficult to do since the pointer is a class member. But let’s try this anyway, since it compiles:
class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}
    int DoStuffClassMember(int nb);
    int mMember;
    int* __restrict mTarget;
};
Generated code is:
00A8EF60 mov eax,dword ptr [esp+4]
00A8EF64 test eax,eax
00A8EF66 je RTest::DoStuffClassMember+23h (0A8EF83h)
00A8EF68 push esi
00A8EF69 mov edx,4
00A8EF6E push edi
00A8EF6F nop
00A8EF70 mov esi,dword ptr [ecx+4]
00A8EF73 mov edi,dword ptr [ecx]
00A8EF75 mov dword ptr [esi],edi
00A8EF77 add dword ptr [ecx+4],edx
00A8EF7A inc dword ptr [ecx]
00A8EF7C dec eax
00A8EF7D test eax,eax
00A8EF7F jne RTest::DoStuffClassMember+10h (0A8EF70h)
00A8EF81 pop edi
00A8EF82 pop esi
00A8EF83 mov eax,dword ptr [ecx]
00A8EF85 ret 4
Nope, didn’t work, this is exactly the same code as before.
What we really want here is to mark “this” as restricted, since “this” is the pointer we use to access both mTarget and mMember. With that goal in mind, a natural thing to try is, well, exactly that:
int RTest::DoStuffClassMember(int nb)
{
    RTest* __restrict RThis = this;
    while(nb--)
    {
        *RThis->mTarget++ = RThis->mMember;
        RThis->mMember++;
    }
    return RThis->mMember;
}
This produces the following code:
0114EF60 push esi
0114EF61 mov esi,dword ptr [esp+8]
0114EF65 test esi,esi
0114EF67 je RTest::DoStuffClassMember+26h (114EF86h)
0114EF69 mov edx,dword ptr [ecx] // mMember
0114EF6B mov eax,dword ptr [ecx+4] // mTarget
0114EF6E mov edi,edi
0114EF70 mov dword ptr [eax],edx // *mTarget = mMember
0114EF72 dec esi
0114EF73 add eax,4 // mTarget++
0114EF76 inc edx // mMember++
0114EF77 test esi,esi
0114EF79 jne RTest::DoStuffClassMember+10h (114EF70h)
0114EF7B mov dword ptr [ecx+4],eax // Store mTarget
0114EF7E mov dword ptr [ecx],edx // Store mMember
0114EF80 mov eax,edx
0114EF82 pop esi
0114EF83 ret 4
0114EF86 mov eax,dword ptr [ecx]
0114EF88 pop esi
0114EF89 ret 4
It actually works! Going through a restricted this, despite the unusual and curious syntax, does solve all the problems from the original code. Both mMember and mTarget are loaded into registers, kept there for the duration of the loop, and stored back only once in the end.
Pretty cool.
If we ignore the horrible syntax, that is. Imagine a whole codebase full of “RThis->mMember++;”, this wouldn’t be very nice.
There is actually another way to “restrict this”. I thought it only worked with GCC, but this is not true. The following syntax actually compiles and does the expected job with Visual Studio as well. Just mark the function itself as restricted:
class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}
    int DoStuffClassMember(int nb) __restrict;
    int mMember;
    int* mTarget;
};

int RTest::DoStuffClassMember(int nb) __restrict
{
    while(nb--)
    {
        *mTarget++ = mMember;
        mMember++;
    }
    return mMember;
}
This generates exactly the same code as with our fake “this” pointer:
0140EF60 push esi
0140EF61 mov esi,dword ptr [esp+8]
0140EF65 test esi,esi
0140EF67 je RTest::DoStuffClassMember+26h (140EF86h)
0140EF69 mov edx,dword ptr [ecx]
0140EF6B mov eax,dword ptr [ecx+4]
0140EF6E mov edi,edi
0140EF70 mov dword ptr [eax],edx
0140EF72 dec esi
0140EF73 add eax,4
0140EF76 inc edx
0140EF77 test esi,esi
0140EF79 jne RTest::DoStuffClassMember+10h (140EF70h)
0140EF7B mov dword ptr [ecx+4],eax
0140EF7E mov dword ptr [ecx],edx
0140EF80 mov eax,edx
0140EF82 pop esi
0140EF83 ret 4
0140EF86 mov eax,dword ptr [ecx]
0140EF88 pop esi
0140EF89 ret 4
This is the official way to “restrict this”, and until recently I didn’t know it worked in Visual Studio. Yay!
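If the same code has to build with several compilers, one option is to hide the keyword behind a macro. The sketch below is only an illustration: `RTEST_RESTRICT` and `RTestPortable` are hypothetical names, and it assumes that MSVC, GCC and Clang all accept the __restrict spelling in this position, with an empty fallback for anything else.

```cpp
#include <cassert>

// Hypothetical macro name. Assumption: MSVC, GCC and Clang accept __restrict;
// any other compiler gets an empty fallback and simply loses the hint.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
    #define RTEST_RESTRICT __restrict
#else
    #define RTEST_RESTRICT
#endif

// Hypothetical stand-in for RTest.
class RTestPortable
{
public:
    RTestPortable() : mMember(0), mTarget(0) {}
    int DoStuffClassMember(int nb) RTEST_RESTRICT;
    int  mMember;
    int* mTarget;
};

// 'this' is marked as restricted through the macro.
int RTestPortable::DoStuffClassMember(int nb) RTEST_RESTRICT
{
    assert(nb >= 0);
    while(nb--)
    {
        *mTarget++ = mMember;
        mMember++;
    }
    return mMember;
}
```

Whether a given compiler actually keeps both members in registers is something to check in the disassembly, as above. The macro only makes sure the code compiles everywhere.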
A few closing comments about the above code… Astute readers will have noticed a few things that I haven’t mentioned yet:
The curious “mov edi, edi” clearly doesn’t do anything, and it would be easy to blame the compiler here for being stupid. Well, the compiler is stupid and does generate plenty of foolish things, but this is not one of them. Notice how it happens right before the loop starts? This is the equivalent of the “nop” we previously saw. The reason the compiler chose not to use nops here is that a nop takes only 1 byte (its opcode is “90”), so we would have needed 2 of them here to align the loop to 16 bytes. Using a useless 2-byte instruction achieves the same goal, but with a single instruction.
Finally, note that the main loop actually touches 3 registers instead of 2:
- esi, the loop counter (nb--)
- eax, the target address mTarget
- edx, the data member mMember
This is not optimal: there is no need to touch the loop counter there. It would probably have been more efficient to precompute the final value of edx in esi, something like:
    add esi, edx // esi = loop limit
Loop:
    mov dword ptr [eax], edx
    add eax, 4
    inc edx
    cmp edx, esi
    jne Loop
This moves all ‘dec esi’ operations out of the loop, which might have been a better strategy. Oh well. Maybe the compiler is stupid after all.
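The same idea can be written at the source level, as in the sketch below. `FillUpTo` is a hypothetical free function, not part of the class above, and it assumes that nb is not negative and that start + nb does not overflow. There is no guarantee that a given compiler turns it into the exact loop shown above.

```cpp
#include <cassert>

// Hypothetical free-function version of the loop: the running value itself
// is compared against a precomputed limit, so no separate counter is needed.
// Assumes nb >= 0 and that start + nb does not overflow.
int FillUpTo(int nb, int* target, int start)
{
    assert(nb >= 0);
    const int limit = start + nb;   // plays the role of esi
    int value = start;              // plays the role of edx
    while(value != limit)
    {
        *target++ = value;
        value++;
    }
    return value;
}
```

It returns the final value so the caller can store it back into the member after the loop.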