Here is another post about a small optimization I just came up with. This time the context is BV-tree traversal, for raycasts or sweeps.
So let’s say you have some raycast traversal code for an AABB-tree. If you do not do any node sorting to drive the tree traversal, your code may look like this (non-recursive version):
const AABBTreeNode* node = root;
udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;
while(Nb)
{
node = Stack[--Nb];
if(TestSegmentAABBOverlap(node))
{
if(node->IsLeaf())
{
if(TestLeaf(node))
ShrinkRay();
}
else
{
Stack[Nb++] = node->GetNeg();
Stack[Nb++] = node->GetPos();
}
}
}
This should be pretty clear. We fetch nodes from the stack, perform segment-AABB tests against the nodes’ bounding boxes.
If the node is a leaf, we test its primitive(s). In case of a hit, we reduce the length of the query segment, to reduce the total number of visited nodes.
If the node is not a leaf, we simply push its children to our stack – in reverse order so that “Pos” gets tested first -, and continue. Easy peasy.
Now this is a version without “node sorting”: we always push the 2 children to the stack in the same order, and thus we will always visit a node’s “Positive” child P before the node’s “Negative” child N. This is sometimes just fine, in particular for overlap tests where ordering does not usually matter. But for raycasts, and especially for sweeps, it is better to “sort the nodes” and make sure we visit the “closest node” first, i.e. the one that is “closer” to the ray’s origin. The reason is obvious: because we “shrink the ray” when a hit is found, if we visit the the closest node P first and shrink the ray there, the shrunk segment may not collide with N at all, and we will thus avoid visiting an entire sub-tree.
Node sorting is not strictly necessary. But it is a good way to “optimize the worst case”, and make sure the code performs adequately for all raycast directions. It has, nonetheless, an overhead, and it is likely to make the best case a little bit slower. A good read about this is the recently released thesis from Jacco Bikker, which contains a nice little code snippet to implement SIMD node sorting for packet tracing.
When dealing with simpler one-raycast-at-a-time traversals, there are usually 2 ways to implement the sorting, depending on your implementation choice for the segment-AABB test. If your segment-AABB test produces “near” and “far” distance values as a side-effect of the test, all you need to do is compare the “near” values. If however you are using a SAT-based segment-AABB, those near and far values are typically not available and an extra distance computation has to be performed. It is not necessary to use a very accurate distance test, so one option is simply to project the nodes’ centers on the ray direction, and use the resulting values. If we modify the code above to do that, we now get something like:
const AABBTreeNode* node = root;
udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;
while(Nb)
{
node = Stack[--Nb];
if(TestSegmentAABBOverlap(node))
{
if(node->IsLeaf())
{
if(TestLeaf(node))
ShrinkRay();
}
else
{
const Point& BoxCenterP = node->GetPos()->mBoxCenter;
const Point& BoxCenterN = node->GetNeg()->mBoxCenter;
if(((BoxCenterP - BoxCenterN).Dot(RayDir))<0.0f)
{
Stack[Nb++] = node->GetNeg();
Stack[Nb++] = node->GetPos();
}
else
{
Stack[Nb++] = node->GetPos();
Stack[Nb++] = node->GetNeg();
}
}
}
}
The above code could be improved, the branch could be removed, the last push to the stack could be avoided since it will otherwise probably create an LHS, etc. But this post is about node sorting, so I will only focus on this part.
It does not look like much, but it turns out that it can have a very measurable performance impact when the rest of the function is already highly optimized. It fetches the 2 children nodes (cache misses), it has a float compare (very slow on Xbox), and that dot product is annoying.
So, let’s get rid of all of these.
In order to do that, we will need to go back in time a bit. To the days of the painter’s algorithm, before Z-Buffers, when it was mandatory to render opaque polygons back-to-front. At that time even radix-sorting all triangles was considered too slow, so we often just… precomputed the sorting. We had 8 precomputed “index buffers” for 8 possible main view directions, and the whole sorting business became free. There are still various traces of those early algorithms online. This thread mentions both Iq’s version, called “Volumetric sort” and the similar article I wrote some 10 years before that. That was back in 1995, so the idea itself is nothing new.
What is new however, I think, is applying the same strategy to BV-tree traversals. I did not see this before.
So there are 8 possible main view directions. For each of them, and for each node, we need to precompute the closest child. Since we have only 2 nodes in the binary tree, we need only one bit to determine which one is closest, and thus we need 8 bits per node to encode the precomputed sorting. That’s the memory overhead for the technique, and it may or may not be acceptable to you depending on how easy it is to squeeze one more byte in your nodes.
The precomputation part is trivial. A vanilla non-optimized version could look like the following, performed on each node after the tree has been built:
static bool gPrecomputeSort(AABBTreeNode* node)
{
if(node->IsLeaf())
return true;
const AABBTreeNode* P = node->GetPos();
const AABBTreeNode* N = node->GetNeg();
const Point& C0 = P->mBoxCenter;
const Point& C1 = N->mBoxCenter;
Point DirPPP(1.0f, 1.0f, 1.0f); DirPPP.Normalize();
Point DirPPN(1.0f, 1.0f, -1.0f); DirPPN.Normalize();
Point DirPNP(1.0f, -1.0f, 1.0f); DirPNP.Normalize();
Point DirPNN(1.0f, -1.0f, -1.0f); DirPNN.Normalize();
Point DirNPP(-1.0f, 1.0f, 1.0f); DirNPP.Normalize();
Point DirNPN(-1.0f, 1.0f, -1.0f); DirNPN.Normalize();
Point DirNNP(-1.0f, -1.0f, 1.0f); DirNNP.Normalize();
Point DirNNN(-1.0f, -1.0f, -1.0f); DirNNN.Normalize();
const bool bPPP = ((C0 - C1).Dot(DirPPP))<0.0f;
const bool bPPN = ((C0 - C1).Dot(DirPPN))<0.0f;
const bool bPNP = ((C0 - C1).Dot(DirPNP))<0.0f;
const bool bPNN = ((C0 - C1).Dot(DirPNN))<0.0f;
const bool bNPP = ((C0 - C1).Dot(DirNPP))<0.0f;
const bool bNPN = ((C0 - C1).Dot(DirNPN))<0.0f;
const bool bNNP = ((C0 - C1).Dot(DirNNP))<0.0f;
const bool bNNN = ((C0 - C1).Dot(DirNNN))<0.0f;
udword Code = 0;
if(!bPPP)
Code |= (1<<7); // Bit 0: PPP
if(!bPPN)
Code |= (1<<6); // Bit 1: PPN
if(!bPNP)
Code |= (1<<5); // Bit 2: PNP
if(!bPNN)
Code |= (1<<4); // Bit 3: PNN
if(!bNPP)
Code |= (1<<3); // Bit 4: NPP
if(!bNPN)
Code |= (1<<2); // Bit 5: NPN
if(!bNNP)
Code |= (1<<1); // Bit 6: NNP
if(!bNNN)
Code |= (1<<0); // Bit 7: NNN
node->mCode = Code;
return true;
}
Then the traversal code simply becomes:
const AABBTreeNode* node = root;
udword Nb=1;
const AABBTreeNode* Stack[256];
Stack[0] = node;
const udword DirMask = ComputeDirMask(RayDir);
while(Nb)
{
node = Stack[--Nb];
if(TestSegmentAABBOverlap(node))
{
if(node->IsLeaf())
{
if(TestLeaf(node))
ShrinkRay();
}
else
{
if(node->mCode & DirMask)
{
Stack[Nb++] = node->GetNeg();
Stack[Nb++] = node->GetPos();
}
else
{
Stack[Nb++] = node->GetPos();
Stack[Nb++] = node->GetNeg();
}
}
}
}
As you can see, all the bad bits are gone, and node sorting is now a single AND. The “direction mask” is precomputed once before the traversal starts, so its overhead is completely negligible. An implementation could be:
//! Integer representation of a floating-point value.
#define IR(x) ((udword&)(x))
static udword ComputeDirMask(const Point& dir)
{
const udword X = IR(dir.x)>>31;
const udword Y = IR(dir.y)>>31;
const udword Z = IR(dir.z)>>31;
const udword BitIndex = Z|(Y<<1)|(X<<2);
return 1<<BitIndex;
}
And that’s it. Gains vary from one scene to another and especially from one platform to another, but this is another useful trick in our quest to “speed of light” BV-tree traversals.