Archive for the 'Programming' Category

SIMD box-triangle overlap function

Thursday, March 20th, 2014

A long time ago (around 2001), I helped Tomas Möller optimize a triangle-vs-box, SAT-based overlap test. At that time I was working on the Opcode library, and such a function was needed to achieve my goals.

When working on Opcode 2.0, I wrote a SIMD version of that same function. Since then I have found several other SIMD implementations of that code, and while benchmarking them I found that most of them were surprisingly slow - sometimes slower than the original code! I cannot really explain why, but it seems to indicate that writing the SIMD version is not as straightforward as one could have imagined. And thus, I am now releasing my implementation, in the hope somebody will find it useful.

There are two versions of the original code, labelled “old code” and “new, faster code” online. Despite the claims, on my Windows-based machine the “faster code” is significantly slower. It seems correct and expected that doing the “class III” tests last is best, but I suppose this depends on both the architecture and the test scenario. In any case, both versions (more or less verbatim) are included in the test benchmark.

The original code contains a lot of early exits, and thus the time it takes depends a lot on which codepath is taken. The best case is when you reach the first early exit. The worst case is when you reach the end of the function, i.e. when the triangle and the box overlap. I believe one of the biggest mistakes made in the slow SIMD implementations was to get rid of branches and early exits at all costs (“Hey look! Zero branches! It must be super fast!”). Of course that is a very naive view; things are way more subtle and complex than that. The fastest code is the code that is never executed, and thus you should embrace early exits, not avoid them. In fact, my implementation contains one more early exit than the original code, for when the box fully contains the triangle (this was an important case for me in some applications).

I have 4 different test scenarios:
- best case
- worst case
- no hit
- mixed

“Best case” is the aforementioned new early exit, for when the box fully contains the triangle. This is basically a point-in-AABB test, it is the first early exit in the code, and thus the “best case” is when we reach that codepath.

“Worst case” is when the box touches the triangle, but the triangle is not fully inside. In that case we reach the end of the overlap function without taking any early exits. This should be the slowest codepath.

“No hit” is when the triangle does not touch the box, and we reach one of the “regular” early exits (i.e. the ones that also existed in the non-SIMD code). Time depends on what codepath is reached.

“Mixed” is just a mix of the above cases.

Here are typical results on my test PC:

Version1 (Möller’s “old code”):

Time (default): 5072
Time (default): 3773
Time (default): 953
Time (default): 5088
Time (optimized): 246
Time (optimized): 1898
Time (optimized): 329
Time (optimized): 1544
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 20.617886
Worst case: Speedup: 1.987882
No hit:     Speedup: 2.896657
Mixed:      Speedup: 3.295337

Version2 (Möller’s “new, faster code”):

Time (default): 6209
Time (default): 4714
Time (default): 1304
Time (default): 5199
Time (optimized): 237
Time (optimized): 1822
Time (optimized): 316
Time (optimized): 1540
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 26.198313
Worst case: Speedup: 2.587267
No hit:     Speedup: 4.126582
Mixed:      Speedup: 3.375974

So what we see here:

- the “new, faster code” is in fact slower. Thus, ignore “Version2” from now on.

- the non-SIMD and SIMD versions all find the same number of hits (as they should). I could have better tests for this, but I’ve been using the SIMD version for a while now and it should be quite solid.

- the new, additional early exit gives me a 20X speedup. Not bad for a SIMD version whose theoretical maximum speedup is 4X. In the SIMD code, this is taken care of by this snippet:

if(1)
{
    // Early exit: if the (box-space) vertex v0 lies inside the box, the
    // triangle and the box necessarily overlap, and we can return at once.
    // Clear the sign bits to get |v0|...
    const unsigned int MaskI = 0x7fffffff;
    __m128 cV = _mm_and_ps(v0V, _mm_load1_ps((const float*)&MaskI));
    // ...then |v0| - extents is negative on x, y and z iff v0 is inside.
    const unsigned int test = _mm_movemask_ps(_mm_sub_ps(cV, extentsV));
    if((test&7)==7)
        return 1;
}

That is well worth it if you ask me. It helped enormously in some past projects of mine.

- Speedups otherwise vary between 2X and 3X, which is what you usually expect from a good SIMD implementation.

And that’s about it. This is straightforward, but apparently it is easy to write a bad SIMD version that ends up slower than the original. I suppose some people just write the code, assume it’s good, and never run benchmarks? I don’t know.

Bitcoin tip jar is here if you end up using that code. Project files are here.

Realtime fracture demo using GRB

Wednesday, March 27th, 2013

Here’s the sequel to the previous fracture video I posted some time ago:

http://www.youtube.com/watch?v=ATU6IGCMpUA&feature=youtu.be

It’s using “GRB”, which is basically the GPU-version of PhysX.

CUDA procedural terrain

Thursday, January 10th, 2013

You may remember I have a thing for fractal landscapes. So here’s a CUDA “hello world” (my first CUDA test, I never bothered learning it before) featuring a good old multifractal straight from Ken Musgrave’s book. Not optimized or anything, nothing special really, but damn, graphics cards are fast these days (this runs at 60 fps on my GTX 480).
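For reference, the multifractal in question is the multiplicative one from “Texturing & Modeling: A Procedural Approach”. A minimal sketch of that function (this is not the actual demo code; noise3() is an assumed Perlin-style noise returning values around [-1, 1]):

#include <math.h>

float noise3(float x, float y, float z); // assumed Perlin-style noise

// Musgrave-style multiplicative multifractal, evaluated per terrain sample.
float multifractal(float x, float y, float z,
                   float H, float lacunarity, int octaves, float offset)
{
    float value = 1.0f;
    float frequency = 1.0f;
    for(int i=0;i<octaves;i++)
    {
        // Each octave scales the running product by a noise value
        // weighted by pow(frequency, -H)
        value *= (noise3(x, y, z) + offset) * powf(frequency, -H);
        frequency *= lacunarity;
        x *= lacunarity;
        y *= lacunarity;
        z *= lacunarity;
    }
    return value;
}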

Click to download.

A fractal landscape in CUDA

Precomputed node sorting in BV-tree traversal

Friday, November 23rd, 2012

Here is another post about a small optimization I just came up with. This time the context is BV-tree traversal, for raycasts or sweeps.

So let’s say you have some raycast traversal code for an AABB-tree. If you do not do any node sorting to drive the tree traversal, your code may look like this (non-recursive version):

const AABBTreeNode* node = root;
udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            Stack[Nb++] = node->GetNeg();
            Stack[Nb++] = node->GetPos();
        }
    }
}

This should be pretty clear. We fetch nodes from the stack, perform segment-AABB tests against the nodes’ bounding boxes.

If the node is a leaf, we test its primitive(s). In case of a hit, we reduce the length of the query segment, to reduce the total number of visited nodes.

If the node is not a leaf, we simply push its children to our stack (in reverse order, so that “Pos” gets tested first) and continue. Easy peasy.
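ShrinkRay() is left abstract in these snippets. A minimal sketch of the idea, with a hypothetical query structure (the real code presumably threads this state through the traversal):

// Hypothetical query state for the traversal above.
struct RaycastQuery
{
    Point mOrigin;
    Point mDir;     // unit ray direction
    float mMaxDist; // current segment length
};

// After a leaf reports a hit at distance t (t <= query.mMaxDist), clamp the
// segment: subsequent segment-AABB tests then cull any node lying entirely
// beyond the closest hit found so far.
inline void ShrinkRay(RaycastQuery& query, float t)
{
    query.mMaxDist = t;
}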

Now this is a version without “node sorting”: we always push the 2 children to the stack in the same order, and thus we will always visit a node’s “Positive” child P before the node’s “Negative” child N. This is sometimes just fine, in particular for overlap tests where ordering does not usually matter. But for raycasts, and especially for sweeps, it is better to “sort the nodes” and make sure we visit the “closest node” first, i.e. the one that is closer to the ray’s origin. The reason is obvious: because we “shrink the ray” when a hit is found, if we visit the closest node P first and shrink the ray there, the shrunk segment may not collide with N at all, and we will thus avoid visiting an entire sub-tree.

Node sorting is not strictly necessary. But it is a good way to “optimize the worst case”, and make sure the code performs adequately for all raycast directions. It has, nonetheless, an overhead, and it is likely to make the best case a little bit slower. A good read about this is the recently released thesis from Jacco Bikker, which contains a nice little code snippet to implement SIMD node sorting for packet tracing.

When dealing with simpler one-raycast-at-a-time traversals, there are usually 2 ways to implement the sorting, depending on your implementation choice for the segment-AABB test. If your segment-AABB test produces “near” and “far” distance values as a side-effect of the test, all you need to do is compare the “near” values. If however you are using a SAT-based segment-AABB, those near and far values are typically not available and an extra distance computation has to be performed. It is not necessary to use a very accurate distance test, so one option is simply to project the nodes’ centers on the ray direction, and use the resulting values. If we modify the code above to do that, we now get something like:

const AABBTreeNode* node = root;

udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            const Point& BoxCenterP = node->GetPos()->mBoxCenter;
            const Point& BoxCenterN = node->GetNeg()->mBoxCenter;
            if(((BoxCenterP - BoxCenterN).Dot(RayDir))<0.0f)
            {
                Stack[Nb++] = node->GetNeg();
                Stack[Nb++] = node->GetPos();
            }
            else
            {
                Stack[Nb++] = node->GetPos();
                Stack[Nb++] = node->GetNeg();
            }
        }
    }
}

The above code could be improved: the branch could be removed, the last push to the stack could be avoided (since it would otherwise probably create an LHS), etc. But this post is about node sorting, so I will only focus on this part.

It does not look like much, but it turns out that it can have a very measurable performance impact when the rest of the function is already highly optimized. It fetches the 2 children nodes (cache misses), it has a float compare (very slow on Xbox), and that dot product is annoying.

So, let’s get rid of all of these.

In order to do that, we will need to go back in time a bit. To the days of the painter’s algorithm, before Z-Buffers, when it was mandatory to render opaque polygons back-to-front. At that time even radix-sorting all triangles was considered too slow, so we often just… precomputed the sorting. We had 8 precomputed “index buffers” for 8 possible main view directions, and the whole sorting business became free. There are still various traces of those early algorithms online. This thread mentions both Iq’s version, called “Volumetric sort”, and the similar article I wrote some 10 years before that. That was back in 1995, so the idea itself is nothing new.

What is new however, I think, is applying the same strategy to BV-tree traversals. I did not see this before.

So there are 8 possible main view directions. For each of them, and for each node, we need to precompute the closest child. Since each node of the binary tree has only 2 children, a single bit is enough to say which one is closest, and thus we need 8 bits per node to encode the precomputed sorting. That is the memory overhead of the technique, and it may or may not be acceptable to you, depending on how easy it is to squeeze one more byte into your nodes.
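For illustration, the node structure could look something like this (a sketch only; mBoxCenter and mCode match the snippets below, the rest is hypothetical):

struct AABBTreeNode
{
    AABB          mBox;       // node bounds
    Point         mBoxCenter; // cached box center, used by the precomputation
    AABBTreeNode* mPos;       // "Positive" child (GetPos)
    AABBTreeNode* mNeg;       // "Negative" child (GetNeg)
    // ...leaf data, etc.
    udword        mCode;      // precomputed sort bits (only 8 of them are needed)
};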

The precomputation part is trivial. A vanilla non-optimized version could look like the following, performed on each node after the tree has been built:

static bool gPrecomputeSort(AABBTreeNode* node)
{
    if(node->IsLeaf())
        return true;

    const AABBTreeNode* P = node->GetPos();
    const AABBTreeNode* N = node->GetNeg();
    const Point& C0 = P->mBoxCenter;
    const Point& C1 = N->mBoxCenter;
    const Point D = C0 - C1;

    // One reference direction per octant (normalizing is not strictly needed
    // for a sign test, but it does not hurt in a precomputation).
    Point DirPPP(1.0f, 1.0f, 1.0f);     DirPPP.Normalize();
    Point DirPPN(1.0f, 1.0f, -1.0f);    DirPPN.Normalize();
    Point DirPNP(1.0f, -1.0f, 1.0f);    DirPNP.Normalize();
    Point DirPNN(1.0f, -1.0f, -1.0f);   DirPNN.Normalize();
    Point DirNPP(-1.0f, 1.0f, 1.0f);    DirNPP.Normalize();
    Point DirNPN(-1.0f, 1.0f, -1.0f);   DirNPN.Normalize();
    Point DirNNP(-1.0f, -1.0f, 1.0f);   DirNNP.Normalize();
    Point DirNNN(-1.0f, -1.0f, -1.0f);  DirNNN.Normalize();

    // For each octant direction, the "Positive" child is closer
    // when the dot product is negative.
    const bool bPPP = (D.Dot(DirPPP))<0.0f;
    const bool bPPN = (D.Dot(DirPPN))<0.0f;
    const bool bPNP = (D.Dot(DirPNP))<0.0f;
    const bool bPNN = (D.Dot(DirPNN))<0.0f;
    const bool bNPP = (D.Dot(DirNPP))<0.0f;
    const bool bNPN = (D.Dot(DirNPN))<0.0f;
    const bool bNNP = (D.Dot(DirNNP))<0.0f;
    const bool bNNN = (D.Dot(DirNNN))<0.0f;

    // Bit indices must match ComputeDirMask() below:
    // bit = Z | (Y<<1) | (X<<2), where X/Y/Z are the direction's sign bits.
    udword Code = 0;
    if(bPPP)    Code |= (1<<0); // Bit 0: PPP
    if(bPPN)    Code |= (1<<1); // Bit 1: PPN
    if(bPNP)    Code |= (1<<2); // Bit 2: PNP
    if(bPNN)    Code |= (1<<3); // Bit 3: PNN
    if(bNPP)    Code |= (1<<4); // Bit 4: NPP
    if(bNPN)    Code |= (1<<5); // Bit 5: NPN
    if(bNNP)    Code |= (1<<6); // Bit 6: NNP
    if(bNNN)    Code |= (1<<7); // Bit 7: NNN

    node->mCode = Code;
    return true;
}

Then the traversal code simply becomes:

const AABBTreeNode* node = root;

udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

const udword DirMask = ComputeDirMask(RayDir);

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            if(node->mCode & DirMask)
            {
                Stack[Nb++] = node->GetNeg();
                Stack[Nb++] = node->GetPos();
            }
            else
            {
                Stack[Nb++] = node->GetPos();
                Stack[Nb++] = node->GetNeg();
            }
        }
    }
}

As you can see, all the bad bits are gone, and node sorting is now a single AND. The “direction mask” is precomputed once before the traversal starts, so its overhead is completely negligible. An implementation could be:

//! Integer representation of a floating-point value.
#define IR(x) ((udword&)(x))

static udword ComputeDirMask(const Point& dir)
{
    // Extract the sign bit of each component of the ray direction...
    const udword X = IR(dir.x)>>31;
    const udword Y = IR(dir.y)>>31;
    const udword Z = IR(dir.z)>>31;
    // ...and combine them into an octant index (0 to 7), matching the
    // bit layout used in gPrecomputeSort().
    const udword BitIndex = Z|(Y<<1)|(X<<2);
    return 1<<BitIndex;
}

And that’s it. Gains vary from one scene to another, and especially from one platform to another, but this is another useful trick in our quest for “speed of light” BV-tree traversals.

Restrict this

Tuesday, November 20th, 2012

A quick post about a little-known feature of the “restrict” keyword…

I assume you all know about “restrict” already, but if not, let’s start with a simple example of what it’s useful for.

Say we have a function in some class looking like this:

class RTest
{
public:
    RTest() : mMember(0) {}

    void DoStuff(int nb, int* target);

    int mMember;
};

void RTest::DoStuff(int nb, int* target)
{
    while(nb--)
    {
        *target++ = mMember;
        mMember++;
    }
}

Looking at the disassembly in Release mode, you get something like the following (the isolated block in the middle is the loop):

00E9EEA0  mov         eax,dword ptr [esp+4]
00E9EEA4  test        eax,eax
00E9EEA6  je          RTest::DoStuff+1Fh (0E9EEBFh)
00E9EEA8  mov         edx,dword ptr [esp+8]
00E9EEAC  push        esi
00E9EEAD  lea         ecx,[ecx]

00E9EEB0 mov         esi,dword ptr [ecx] // Load mMember
00E9EEB2  mov         dword ptr [edx],esi // *target = mMember
00E9EEB4 inc         dword ptr [ecx] // mMember++
00E9EEB6  dec         eax
00E9EEB7  add         edx,4 // target++
00E9EEBA  test        eax,eax
00E9EEBC  jne         RTest::DoStuff+10h (0E9EEB0h)

00E9EEBE  pop         esi
00E9EEBF  ret         8

So as you can see, there is a read-modify-write operation on mMember each time, and then mMember is reloaded once again to write it to the target buffer. This is not very efficient. Loads & writes to memory are slower than loads & writes to registers, for example. But more importantly, this creates a lot of load-hit-store (LHS) stalls, since we clearly load what we just wrote. On a platform like the Xbox, where an LHS is a ~60-cycle penalty on average, this is a killer. Generally speaking, any piece of code doing “mMember++” is a potential LHS, and something to keep an eye on.

There are various ways to do better than that. One way would be to simply rewrite the code so that mMember is explicitly kept in a local variable:

void RTest::DoStuffLocal(int nb, int* target)
{
    int local = mMember;
    while(nb--)
    {
        *target++ = local;
        local++;
    }
    mMember = local;
}

This produces the following disassembly:

010AEED0  mov         edx,dword ptr [esp+4]
010AEED4 mov         eax,dword ptr [ecx] // Load mMember
010AEED6  test        edx,edx
010AEED8  je          RTest::DoStuffLocal+1Ch (10AEEECh)
010AEEDA  push        esi
010AEEDB  mov         esi,dword ptr [esp+0Ch]
010AEEDF nop

010AEEE0  mov         dword ptr [esi],eax // *target = mMember
010AEEE2  dec         edx
010AEEE3  add         esi,4 // target++
010AEEE6 inc         eax // mMember++
010AEEE7  test        edx,edx
010AEEE9  jne         RTest::DoStuffLocal+10h (10AEEE0h)

010AEEEB  pop         esi
010AEEEC mov         dword ptr [ecx],eax // Store mMember
010AEEEE  ret         8

This is pretty much what you expect from the source code: you see that the load has been moved outside of the loop, our local variable has been mapped to the eax register, the LHS are gone, and mMember is properly updated only once, after the loop has ended.

Note that the compiler inserted a nop just before the loop. This is simply because loops should be aligned to 16-byte boundaries to run most efficiently.

Another way to achieve the same result without modifying the main code is to use the restrict keyword. Just mark the target pointer as restricted, like this:

void RTest::DoStuffRestricted(int nb, int* __restrict target)
{
    while(nb--)
    {
        *target++ = mMember;
        mMember++;
    }
}

This produces the following disassembly:

010AEF00  mov         edx,dword ptr [esp+4]
010AEF04  test        edx,edx
010AEF06  je          RTest::DoStuffRestricted+1Eh (10AEF1Eh)
010AEF08 mov         eax,dword ptr [ecx] // Load mMember
010AEF0A  push        esi
010AEF0B  mov         esi,dword ptr [esp+0Ch]
010AEF0F  nop

010AEF10  mov         dword ptr [esi],eax // *target = mMember
010AEF12  dec         edx
010AEF13  add         esi,4 // target++
010AEF16 inc         eax // mMember++
010AEF17  test        edx,edx
010AEF19  jne         RTest::DoStuffRestricted+10h (10AEF10h)

010AEF1B mov         dword ptr [ecx],eax // Store mMember
010AEF1D  pop         esi
010AEF1E  ret         8

In other words, this is almost exactly the same disassembly as for the solution using the local variable - but without the need to actually modify the main source code.

What happened here should not be a surprise: without __restrict, the compiler had no way to know that the target pointer was not potentially pointing to mMember itself. So it had to assume the worst and generate “safe” code that would work even in that unlikely scenario. Using __restrict, however, told the compiler that the memory pointed to by “target” was accessed through that pointer only (and pointers copied from it). In particular, it promised the compiler that “this”, the implicit pointer from the RTest class, could not point to the same memory as “target”. And thus, it became safe to keep mMember in a register for the duration of the loop.

So far, so good. This is pretty much a textbook example of how to use __restrict and what it is useful for. The only important point until now, really, is this: as you can see from the disassembly, __restrict has a clear, real impact on generated code. Just in case you had any doubts…

Now the reason for this post is something more subtle than this: how do we “restrict this”? How do we restrict the implicit “this” pointer in C++?

Consider the following, modified example, where our target pointer is now a class member:

class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}

    int DoStuffClassMember(int nb);

    int  mMember;
    int* mTarget;
};

int RTest::DoStuffClassMember(int nb)
{
    while(nb--)
    {
        *mTarget++ = mMember;
        mMember++;
    }
    return mMember;
}

Suddenly we can’t easily mark the target pointer as restricted anymore, and the generated code looks pretty bad:

0141EF60  mov         eax,dword ptr [esp+4]
0141EF64  test        eax,eax
0141EF66  je          RTest::DoStuffClassMember+23h (141EF83h)
0141EF68  push        esi
0141EF69  mov         edx,4
0141EF6E  push        edi
0141EF6F  nop

0141EF70  mov         esi,dword ptr [ecx+4] // mTarget
0141EF73  mov         edi,dword ptr [ecx] // mMember
0141EF75  mov         dword ptr [esi],edi // *mTarget = mMember;
0141EF77  add         dword ptr [ecx+4],edx // mTarget++
0141EF7A  inc         dword ptr [ecx] // mMember++
0141EF7C  dec         eax
0141EF7D  test        eax,eax
0141EF7F  jne         RTest::DoStuffClassMember+10h (141EF70h)

0141EF81  pop         edi
0141EF82  pop         esi
0141EF83  mov         eax,dword ptr [ecx]
0141EF85  ret         4

That’s pretty much as bad as it gets: 2 loads, 2 read-modify-writes, 2 LHS for each iteration of that loop. This is what Christer Ericson refers to as the “C++ abstraction penalty”: generally speaking, accessing class members within loops is a very bad idea. It is usually much better to load those class members into local variables before the loop starts, or to pass them to the function as external parameters.
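For reference, here is a minimal sketch of that last option (a hypothetical free function, not from the original code). The member is passed in and returned explicitly, so there is no aliasing question left for the compiler to worry about:

// Hypothetical variant: mMember is passed in and returned explicitly,
// so the loop never touches a class member at all.
static int DoStuffParams(int nb, int* __restrict target, int member)
{
    while(nb--)
    {
        *target++ = member;
        member++;
    }
    return member; // the caller stores this back to mMember once
}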

As we saw in the previous example, an alternative would be to mark the target pointer as restricted. In this particular case though, it seems difficult to do since the pointer is a class member. But let’s try this anyway, since it compiles:

class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}

    int DoStuffClassMember(int nb);

    int mMember;
    int* __restrict mTarget;
};

Generated code is:

00A8EF60  mov         eax,dword ptr [esp+4]
00A8EF64  test        eax,eax
00A8EF66  je          RTest::DoStuffClassMember+23h (0A8EF83h)
00A8EF68  push        esi
00A8EF69  mov         edx,4
00A8EF6E  push        edi
00A8EF6F  nop

00A8EF70  mov         esi,dword ptr [ecx+4]
00A8EF73  mov         edi,dword ptr [ecx]
00A8EF75  mov         dword ptr [esi],edi
00A8EF77  add         dword ptr [ecx+4],edx
00A8EF7A  inc         dword ptr [ecx]
00A8EF7C  dec         eax
00A8EF7D  test        eax,eax
00A8EF7F  jne         RTest::DoStuffClassMember+10h (0A8EF70h)

00A8EF81  pop         edi
00A8EF82  pop         esi
00A8EF83  mov         eax,dword ptr [ecx]
00A8EF85  ret         4

Nope, didn’t work, this is exactly the same code as before.

What we really want here is to mark “this” as restricted, since “this” is the pointer we use to access both mTarget and mMember. With that goal in mind, a natural thing to try is, well, exactly that:

int RTest::DoStuffClassMember(int nb)
{
    RTest* __restrict RThis = this;
    while(nb--)
    {
        *RThis->mTarget++ = RThis->mMember;
        RThis->mMember++;
    }
    return RThis->mMember;
}

This produces the following code:

0114EF60  push        esi
0114EF61  mov         esi,dword ptr [esp+8]
0114EF65  test        esi,esi
0114EF67  je          RTest::DoStuffClassMember+26h (114EF86h)
0114EF69 mov         edx,dword ptr [ecx] // mMember
0114EF6B mov         eax,dword ptr [ecx+4] // mTarget
0114EF6E  mov         edi,edi

0114EF70  mov         dword ptr [eax],edx // *mTarget = mMember
0114EF72  dec         esi
0114EF73 add         eax,4 // mTarget++
0114EF76 inc         edx // mMember++
0114EF77  test        esi,esi
0114EF79  jne         RTest::DoStuffClassMember+10h (114EF70h)

0114EF7B mov         dword ptr [ecx+4],eax // Store mTarget
0114EF7E mov         dword ptr [ecx],edx // Store mMember
0114EF80  mov         eax,edx
0114EF82  pop         esi
0114EF83  ret         4
0114EF86  mov         eax,dword ptr [ecx]
0114EF88  pop         esi
0114EF89  ret         4

It actually works! Going through a restricted this, despite the unusual and curious syntax, does solve all the problems from the original code. Both mMember and mTarget are loaded into registers, kept there for the duration of the loop, and stored back only once in the end.

Pretty cool.

If we ignore the horrible syntax, that is. Imagine a whole codebase full of “RThis->mMember++;” lines: this wouldn’t be very nice.

There is actually another way to “restrict this”. I thought it only worked with GCC, but this is not true. The following syntax actually compiles and does the expected job with Visual Studio as well. Just mark the function itself as restricted:

class RTest
{
public:
    RTest() : mMember(0), mTarget(0) {}

    int DoStuffClassMember(int nb) __restrict;

    int  mMember;
    int* mTarget;
};

int RTest::DoStuffClassMember(int nb) __restrict
{
    while(nb--)
    {
        *mTarget++ = mMember;
        mMember++;
    }
    return mMember;
}

This generates exactly the same code as with our fake “this” pointer:

0140EF60  push        esi
0140EF61  mov         esi,dword ptr [esp+8]
0140EF65  test        esi,esi
0140EF67  je          RTest::DoStuffClassMember+26h (140EF86h)
0140EF69  mov         edx,dword ptr [ecx]
0140EF6B  mov         eax,dword ptr [ecx+4]
0140EF6E mov         edi,edi

0140EF70  mov         dword ptr [eax],edx
0140EF72  dec         esi
0140EF73  add         eax,4
0140EF76  inc         edx
0140EF77  test        esi,esi
0140EF79  jne         RTest::DoStuffClassMember+10h (140EF70h)

0140EF7B  mov         dword ptr [ecx+4],eax
0140EF7E  mov         dword ptr [ecx],edx
0140EF80  mov         eax,edx
0140EF82  pop         esi
0140EF83  ret         4
0140EF86  mov         eax,dword ptr [ecx]
0140EF88  pop         esi
0140EF89  ret         4

This is the official way to “restrict this”, and until recently I didn’t know it worked in Visual Studio. Yay!

A few closing comments about the above code… Astute readers will have noticed a few things that I didn’t mention yet:

The curious “mov edi, edi” clearly doesn’t do anything, and it would be easy to blame the compiler here for being stupid. Well, the compiler is stupid and does generate plenty of foolish things, but this is not one of them. Notice how it happens right before the loop starts? This is the equivalent of the “nop” we previously saw. The reason the compiler chose not to use nops here is that a nop takes only 1 byte (its opcode is “90”), so we would have needed 2 of them to align the loop to a 16-byte boundary. Using a useless 2-byte instruction achieves the same goal, but with a single instruction.

Finally, note that the main loop actually touches 3 registers instead of 2:

  • esi, the loop counter (nb--)
  • eax, the target address mTarget
  • edx, the data member mMember

This is not optimal: there is no need to touch the loop counter there. It would probably have been more efficient to keep the loop limit in esi and compare edx against it, something like:

add   esi, edx              // esi = loop limit
Loop:
mov   dword ptr [eax], edx
add   eax, 4
inc   edx
cmp   edx, esi
jne   Loop

This moves the ‘dec esi’ out of the loop entirely, which might have been a better strategy. Oh well. Maybe the compiler is stupid after all :)

Batching incoherent queries

Saturday, April 7th, 2012

For a binary tree, a typical recursive collision detection function usually looks like this (in this example for a raycast):

void MyTree::Raycast(const Ray& ray, const Node* node)
{
    // Perform ray-AABB overlap test
    if(!RayAABBOverlap(ray, node->GetAABB()))
        return;

    // Ray touches the AABB. If the node is a leaf, test against triangle
    if(node->IsLeaf())
    {
        if(RayTriangleOverlap(ray, node->GetTriangle()))
            RegisterHit();
    }
    else
    {
        // Internal node => recurse to left & right children
        Raycast(ray, node->GetLeftChild());
        Raycast(ray, node->GetRightChild());
    }
}

This traditional tree traversal code has a few problems:

  • A vanilla version suffers from many cache misses since we keep jumping from one node to its children, and there is no guarantee that those child nodes will be close to us.
  • Prefetching might help. However, a typical prefetch instruction has a relatively high latency, i.e. you need to wait a certain number of cycles after the prefetch instruction has been issued before the requested address is effectively “free” to read. The problem with a binary tree such as the above is that there is just not enough work to do in one node (basically just one ray-AABB test) to hide that latency.
  • The problem is the same on platforms where you need to DMA the next nodes. Even if you start your DMA right when entering the function, it may not be finished by the time you reach the end of it.

There are different ways to combat this issue. You can try a more cache-friendly tree layout, say van Emde Boas. You can try a more cache-friendly tree, say N-ary instead of binary. But those alternatives often come with their own set of problems, and most lack the simplicity of a binary tree.

So instead, I would like to do my Mike Acton and ask: when was the last time you had to raycast a single ray against that tree? This is not the typical case. The typical case, the generic case even, is when you end up with many raycasts against the same tree.

This kind of batched query has been done before for raycasts; I think it is called a “packet query”. But as far as I know, it has always been described for “coherent rays” (and only 4 at a time, IIRC), i.e. the kind of rays you get in raytracing when firing N rays from the same pixel (say for supersampling), or from close pixels. In other words the rays are very similar to each other, going in the same direction, and thus they are likely to touch the same nodes during the recursive tree traversal.

But where is this restriction coming from? Who said the rays had to be coherent? They do not have to. It is equally simple to collide N incoherent rays against a mesh.

Here is how.

We need a mesh, i.e. a binary tree, and then a linear array of rays. I will give the code first and then comment it. The code looks roughly like this:

void MyTree::MultiRaycast(int nbRays, Ray* rays, const Node* node)
{
    // Collide all rays against current AABB
    int Offset;
    {
        const AABB& Box = node->GetAABB();

        Offset = 0;
        int Last = nbRays;

        while(Offset!=Last)
        {
            if(RayAABBOverlap(rays[Offset], Box))
            {
                Offset++;
            }
            else
            {
                // Do a 'move-to-front' transform on rays that touch the box
                Last--;
                Swap(rays[Offset], rays[Last]);
            }
        }
        if(!Offset)
            return;
    }

    // Here, Offset = number of rays that touched the box. The array has been
    // reorganized so that those rays that passed the test are now at the
    // beginning of the array. The rays that did not touch the box are all
    // at the end.

    // If the node is a leaf, test surviving rays against triangle
    if(node->IsLeaf())
    {
        const Triangle T = node->GetTriangle();
        for(int i=0;i<Offset;i++)
        {
            if(RayTriangleOverlap(rays[i], T))
                RegisterHit();
        }
    }
    else
    {
        // Internal node => recurse to left & right children
        MultiRaycast(Offset, rays, node->GetLeftChild());
        MultiRaycast(Offset, rays, node->GetRightChild());
    }
}

So first we test all incoming rays against the node’s AABB. When doing so, we reorganize the rays so that the ones that passed the test are now at the beginning of the array. The rays that did not touch the box are all at the end. This is easily done with a simple ‘move-to-front’ operation. Astute readers will probably notice that this is similar to how the tree itself was built in Opcode, when the ‘positive’ and ‘negative’ primitives got reorganized at build time, in a similar fashion.

After the AABB test, if the node is a leaf, we simply test all surviving rays against the node’s triangle. Otherwise we just recurse as usual, with a simple twist: we pass down the number of surviving rays, not the original number.

And that’s it, done. Simple enough, right? The only vaguely subtle conceptual point is that the recursive descent does not invalidate the re-ordering we did in the parent node. This is simply because regardless of how the child nodes reorganize the array, they only reorganize the number of rays we give them, i.e. the surviving rays. And no matter how those get shuffled down the tree, when the code comes back from the recursion (say after visiting the left child), we still pass a valid array of surviving rays to the right child. They are not in the same order anymore, but the permutation has only been done on the first Offset entries of the array, so we are still good to go.

So what do we get out of this? Well, a lot:

  • We are now working on N rays at a time, not just one. So that is a lot of work to do in each node, and there is plenty of time for your prefetch or your DMAs to finish.
  • The N rays we operate on are stored in a contiguous array, so accessing them is still very cache friendly. It would not be the case if we would have put all our rays in a hierarchy for example, and collided one ray-tree against the mesh.
  • We have N rays at a time to collide against the same box or the same triangle. This is a perfect SIMD-friendly situation (see the sketch after this list).
  • We traverse the tree only once now, so even if your tree layout is not cache-friendly at all, the number of traversal-related cache misses is minimized overall. In fact you can imagine it has just been divided by the number of batched rays…
  • We traverse each node only once now, so any extra work that you sometimes must do to access the AABB, like dequantization, is now only performed once instead of N times. Think about it: if you raycast N rays against the same mesh, you are dequantizing the top level AABB N times for example. With this new approach however, you do that only once.
  • In a similar way, we fetch a candidate triangle only once and raycast N rays against it. So the penalties associated with getting the triangle data (and there are quite a few, usually) are also reduced when multiple rays end up touching the same triangle.
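To illustrate that SIMD-friendly point, here is a minimal sketch of a 4-wide ray-vs-AABB slab test in SSE. It assumes the batched rays are stored in SoA form with precomputed reciprocal directions (the structure and names are hypothetical, not from the original code), and it ignores the usual slab-test caveat about zero direction components:

#include <xmmintrin.h>

// Hypothetical SoA batch of 4 rays, with precomputed reciprocal directions.
struct RayBatch4
{
    __m128 ox, oy, oz;    // origins
    __m128 rdx, rdy, rdz; // 1/direction, per component
    __m128 maxT;          // current segment lengths (shrunk on hits)
};

// Tests 4 rays against a single AABB whose min/max have been broadcast into
// __m128 registers. Returns a 4-bit mask, one bit per ray touching the box.
static int RayAABBOverlap4(const RayBatch4& r,
                           __m128 minx, __m128 miny, __m128 minz,
                           __m128 maxx, __m128 maxy, __m128 maxz)
{
    // Entry/exit distances for each pair of slabs, for all 4 rays at once
    const __m128 t1x = _mm_mul_ps(_mm_sub_ps(minx, r.ox), r.rdx);
    const __m128 t2x = _mm_mul_ps(_mm_sub_ps(maxx, r.ox), r.rdx);
    const __m128 t1y = _mm_mul_ps(_mm_sub_ps(miny, r.oy), r.rdy);
    const __m128 t2y = _mm_mul_ps(_mm_sub_ps(maxy, r.oy), r.rdy);
    const __m128 t1z = _mm_mul_ps(_mm_sub_ps(minz, r.oz), r.rdz);
    const __m128 t2z = _mm_mul_ps(_mm_sub_ps(maxz, r.oz), r.rdz);

    // Largest entry distance and smallest exit distance over the 3 slabs
    __m128 tNear = _mm_max_ps(_mm_max_ps(_mm_min_ps(t1x, t2x), _mm_min_ps(t1y, t2y)), _mm_min_ps(t1z, t2z));
    __m128 tFar  = _mm_min_ps(_mm_min_ps(_mm_max_ps(t1x, t2x), _mm_max_ps(t1y, t2y)), _mm_max_ps(t1z, t2z));

    // Clamp to the segment [0, maxT]
    tNear = _mm_max_ps(tNear, _mm_setzero_ps());
    tFar  = _mm_min_ps(tFar, r.maxT);

    // A ray touches the box iff tNear <= tFar
    return _mm_movemask_ps(_mm_cmple_ps(tNear, tFar));
}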

I illustrated this idea with raycasts, but of course it works equally well with overlap tests or sweeps.

I do not know if it is a new idea or not, but I haven’t seen it described anywhere before.

Fracture demo from GDC

Friday, March 16th, 2012

http://www.youtube.com/watch?v=QxJacxI4_oE&feature=share

New broad-phase demo

Thursday, December 22nd, 2011

We recently had a friendly-yet-furious internal competition at work, where the goal was to create the fastest broad-phase algorithm ever.

We slowly reached the limits of what we can do with a 3-axes sweep-and-prune (SAP), and so we were trying to finally move away from it. This is not very easy, as SAP is very efficient in typical game scenes. Unfortunately, as the total number of objects in games increases, and game worlds become more and more dynamic, SAP performance increasingly becomes an issue.

Several people attacked the problem with several (and very different) approaches. I submitted two entries: one was the multi-SAP (MSAP) broad-phase I previously mentioned; the other one is something I probably cannot talk about yet, so let’s just call it “broad-phase X” for now. Long story short: this last algorithm won.

Out of curiosity, and since I had done it before with MSAP already, I tried it out in Bullet’s “CDTestFramework” to compare it against Bullet’s algorithms. On my machine the new broad-phase is a little more than 4X faster than any of Bullet’s versions, and more than 2X faster than MSAP. Of course the relative performance of all those algorithms varies with the total number of objects in the scene, how many of them are static, how many are moving, how fast they are moving, etc. So there isn’t usually a clear winner for all possible scenarios. Still, at work we tested those new algorithms in dozens and dozens of scenes with different characteristics, and this “broad-phase X” always performed remarkably well.

The demo is straight out of Bullet 2.78, compiled in Release mode with full optimizations and the SSE2 flag enabled.

Click here to download

The new broad-phase should be available in PhysX 3.3. Stay tuned! :)

Pensées du jour

Friday, September 30th, 2011

To iterate is human, to recurse is divine, and the Big Bang was a stack overflow.

—-

I should have a patent on patents. People would pay me recursively and I would get a cash overflow!

—-

The true meaning of LHS is “Little Hidden Stall”.

—-

A stupid magician is a Garcimoron!  (*)

(*) I realize only 1% of the audience will get that one :)

Static/Bipartite SAP

Wednesday, September 14th, 2011

Another idea for optimizing the SAP when we have a small number of dynamic objects moving amongst a large number of static objects:

Put the dynamics in their own vanilla SAP. Get the dynamic-vs-dynamic overlaps from there, nothing new.

Put the dynamics & statics in a special SAP (“Static SAP”, or “Bipartite SAP”) modified to only report dynamic-vs-static overlaps:

  • The statics are stored in the SAP as usual (same box/endpoint structures as before)
  • The dynamics however are stored in a separate array, using a new box structure.
  • When the dynamic objects are inserted in the SAP, their endpoints are not inserted into the same arrays as the statics’. They are stored directly in the new box structure, i.e. in the (dynamic) box class itself, and they link to the static arrays of endpoints (recording the virtual positions they would occupy within those arrays, had they been inserted there as usual).
  • When running “updateObject” on those dynamic objects, we move their endpoints virtually instead of for real. That is, we still look for the new sorted positions within the arrays of endpoints, but we don’t actually shift the buffers anymore. The update is a const function for the main SAP data-structure; the only thing that gets modified is the data within the dynamic box we’re updating. In other words we don’t have real “swaps” anymore: no more memmoves, thus a reduced number of LHS, etc. (see the sketch after this list).
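A minimal sketch of what the data layout could look like (field names are hypothetical; there is no actual code for this yet):

typedef unsigned int udword;

// Endpoints of the static objects, stored in sorted arrays as in a regular SAP.
struct Endpoint
{
    float  mValue; // min or max of a static box on this axis
    udword mData;  // owner box ID, min/max flag, etc.
};

// A dynamic box in the bipartite SAP: its endpoints live in the box itself,
// and only link "virtually" into the static endpoint arrays.
struct DynamicBox
{
    float  mMin[3];
    float  mMax[3];
    // Virtual positions of the 6 endpoints within the 3 sorted static arrays,
    // i.e. the indices where they would sit if they were actually inserted.
    udword mVirtualIndex[3][2];
};

With this layout, updateObject() would only recompute mVirtualIndex for the moved box (a search within the const static arrays) and report overlaps against the statics whose endpoints fall between the virtual bounds.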

The drawback is that we update dynamic objects twice, once in the “static SAP”, once in the normal SAP.

Didn’t try it, don’t know about the perf, but might be interesting.
