Archive for the 'Physics' Category

The evolution of PhysX (2/12) - Rigid bodies (box stacks)

Saturday, May 11th, 2013

So let’s start with some basic rigid body simulation: stacks, piles, and mesh collisions. There are two versions of PhysX 3.3 included in these rigid body tests: one using the regular contact generation, and one using the “PCM” method, introduced in 3.3. PCM stands for Persistent Contact Manifold and this algorithm builds contact manifolds iteratively, over several frames. The regular contact generation, on the other hand, builds a full manifold each frame. Both algorithms have pros & cons. Note that as far as I know, both Bullet and Havok have been using the PCM approach for a long time. Anyway the PhysX 3.3 version using PCM is the one marked with “(PVD)” (for completely uninteresting reasons that are too long to explain).


----

The first scene (“LargeBoxStack30”) is a classical box stack, with a 30-boxes-large base. This isn’t terribly useful per se in a game, but it is a standard test for the stability and performance of physics engines.

I am only testing fairly mature engines here, so there is little surprise: they all handle this quite well. In terms of performance we see that each successive version of PhysX is a little bit faster than the previous one, which is what we ideally expect to see in all tests. Bullet gives decent results but remains slower than all PhysX versions - and in particular up to 2X slower than PhysX 3.3/PCM.

Nothing very interesting yet in terms of memory usage, all libraries are close, which is expected for such a simple scene.

If you look at the graph closely you will see that a whole bunch of things seem to happen in the first frame, which takes significantly more time to execute than subsequent frames. This is due both to cache effects (nothing is in the cache the first time) and, mainly, to the initial creation of various scene-related structures (broadphase, etc.). This affects all engines, and it only affects the first simulation frame, so we can ignore it in this case.

One interesting thing to note is that this box stack eventually collapses after a certain time, for all engines except PhysX 2.8.4. This is not because PhysX 2.8.4 has more stable algorithms. This is only because it uses the “adaptive force” feature, which was a rather crude hack we used to increase the stability of stacks. It is still available in PhysX 3.x, but we found out it created various side-effects in various cases. Since large stacks aren’t very common in games, we just disabled this feature by default.

Finally, it is worth noting that we have several R&D projects in the pipe, making this stack (and much larger ones) completely stable. I may release a small proof-of-concept demo later, it’s good stuff.

----

The second scene (“MediumBoxStacks20”) is quite similar: there are now 10 stacks, each of them with a 20-boxes-large base.

Contrary to the larger stack seen before, these stacks don’t collapse in any of the engines involved. Compared to the first test, this one captures performance issues in engines that have a high per-simulation-island cost (since there are 10 islands here instead of 1). This does not actually affect any of the engines tested here though, and the profile and performance ratios for this test look similar to the first one. No big surprise there, but one must always double-check instead of assuming, right?

In terms of memory usage, we see that PhysX 3.2 is less efficient than the other engines, but the numbers look quite similar otherwise. Comparing memory usage is not always easy since some engines may pre-allocate various buffers to avoid runtime memory allocations, producing an artificially high memory usage number – i.e. they allocate memory that they do not actually need yet. In any case the numbers are roughly similar here. No big winner.

The evolution of PhysX (1/12) - PEEL

Saturday, May 11th, 2013

PEEL is a tool designed to evaluate, compare and benchmark physics engines - it’s “PEEL” for “Physics Engine Evaluation Lab”.

In a way, it is very similar to the old PAL project (Physics Abstraction Layer). But PEEL supports several things that were missing from PAL.

It was initially written to compare PhysX versions against each other, and catch performance regressions. One of the recurrent questions we got about PhysX 3.x was “how much faster is it than 2.x?”. PEEL was originally built to easily answer this question. However it quickly became clear that adding support for entirely different engines was easy. And so, I did just that.

There is a virtual interface called “PINT” (for Physics INTerface), and an implementation of this interface for each physics engine. A number of tests (more than 250 so far) have been written, and they all talk to this interface. As a result, the same test can run on an arbitrary number of physics engines, as long as they properly implement the interface. This can also be used as a complement to PhysX’s migration guide, for people porting 2.x apps to 3.x: you can just look up how the interface is implemented in both.
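To make the idea concrete, here is a toy sketch of what such an interface-plus-shared-test setup can look like. This is purely illustrative - the names (Pint, CreateBox, RunBoxStackTest) are mine, and the real PINT interface is much richer:

```cpp
// Hypothetical sketch of a PINT-style abstraction: tests talk only to
// this virtual interface, never to a specific physics engine.
class Pint
{
public:
    virtual ~Pint() {}
    virtual const char* GetName() const = 0;
    virtual bool CreateBox(float ex, float ey, float ez) = 0;
    virtual void Simulate(float dt) = 0;
    virtual int GetNbObjects() const = 0;
};

// One implementation per engine. A real plug-in would forward these
// calls to Bullet, PhysX, etc. This one just counts objects.
class DummyEngine : public Pint
{
    int mNbObjects = 0;
public:
    const char* GetName() const override { return "Dummy"; }
    bool CreateBox(float, float, float) override { ++mNbObjects; return true; }
    void Simulate(float) override {}
    int GetNbObjects() const override { return mNbObjects; }
};

// The same test runs unchanged against any engine implementing the interface.
static int RunBoxStackTest(Pint& engine, int baseSize)
{
    int nb = 0;
    for (int i = baseSize; i > 0; --i)      // one row per level: base, base-1, ... 1
        for (int j = 0; j < i; ++j)
            if (engine.CreateBox(1.0f, 1.0f, 1.0f))
                ++nb;
    engine.Simulate(1.0f / 60.0f);
    return nb;
}
```

For a 30-boxes-large base this creates 30+29+…+1 = 465 boxes, regardless of which engine sits behind the interface.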

At the time of writing, PEEL supports:

  • Bullet 2.79
  • Bullet 2.81
  • Havok 6.6.0
  • Havok 2011_3_0
  • Havok 2011_3_1
  • Havok 2012_1_0
  • Havok 2012_2_0
  • ICE Physics
  • NovodeX 2.1.1
  • Opcode 1.3
  • Opcode 2.0
  • PhysX 2.8.4
  • PhysX 3.1
  • PhysX 3.2
  • PhysX 3.3 (various branches of it)
  • GRB (GPU rigid bodies)

PEEL uses all those physics engines and collision libraries in the same application. This was in itself quite an interesting engineering challenge, since you can’t just link to different versions of the same engine without running into compile and/or link errors about ambiguous or redundant symbols, function names, etc. Namespaces don’t really help when different versions of the same lib use the same namespace - or have the same DLL names, for that matter.

To make this work, each engine ended up in its own PEEL-related DLL (containing the implementation of the PINT interface), and each of these “PINT DLLs” is a plug-in for the PEEL application (reminiscent of the format plug-ins from Flexporter, same story).

For physics engines providing static libs, like Bullet, the whole Bullet-related code ends up in a PINT DLL, and that’s it.

For physics engines providing their own DLLs, like PhysX, the PhysX PINT wrapper ends up in a PINT DLL, which in turn loads the PhysX DLLs. Delay loading is used to be able to rename the PhysX DLLs, e.g. in order for the PhysX 3.2 and PhysX 3.3 DLLs to have different names.

The tests cover a wide range of features. There are API tests to just check how the basic API works. There are performance tests. There are behavior tests. There are tests for rigid body scenes, joints, CCD, raycasts, sweep tests, overlap tests, memory usage, for corner cases or degenerate cases, for multi-threaded or single-threaded simulations, etc. There is even a raytracing test rendering the scene from the current camera viewpoint, using the engines’ raycasting functions. This is what I meant above when I said that PEEL supported more things than PAL.

An interesting feature is that all engines run at the same time. All simulations are performed for all selected engines, in the same PEEL frame. Results are rendered on screen with a different color for each engine, making it extremely easy to spot divergences in behavior. Running all engines at the same time also helps to keep the performance numbers “realistic”. In a real game you never run just the physics engine and simple graphics, you have the whole game behind, and a number of subsystems using up resources, thrashing the cache, etc. Running all physics engines at the same time replicates this to some extent. In any case, it is always possible to run one engine at a time by just selecting a single one in the initial selection dialog.

Excel graphs with benchmark results can be saved by simply pressing a key. PEEL also supports simple scripts that can run the desired tests and save the results automatically.

Each engine has a dedicated UI dialog for options & settings. A variety of “actions” are implemented (picking, shooting new objects, applying impulses to existing objects, etc) to let you interact with the scene and double check that things behave as expected.

----

So, now that the stage is set, let’s see what PEEL reveals about PhysX. The following results are for PhysX 2.8.4, PhysX 3.2 and PhysX 3.3. For all these versions I just grabbed the code from their respective trunks at the time of writing. It may or may not exactly map to an officially released build of PhysX, but in any case the results should be reliable enough to give a rough idea of how the library evolved from one version to another.

I felt compelled to also include the latest version of Bullet (2.81), to provide an objective external reference. The thing is, there are many people out there who still think that “PhysX is not optimized for CPU”, “PhysX does not use SIMD”, “PhysX is crippled on purpose”, and so on. So providing performance graphs for just PhysX would not prove much to them; they could still pretend that all versions are slow anyway. Thus, adding a well-known engine like Bullet – which is often proposed as an alternative to PhysX by the same naysayers – seemed like a good reality check. (I’m talking to you, the guy who wrote “if you’re going cpu, bullet is much more optimized there“.)

I have been very fair here, and recompiled the library with the same optimization flags as PhysX (in fact I even reported on the Bullet forums that the default compile options were not optimal). I also wrote PEEL’s Bullet plug-in as well as I possibly could, though admittedly I may not be a Bullet expert. I will release the Bullet plug-in source code later, so that you can double-check whether I did something wrong.

I could not include the Havok results since, well, publishing them is forbidden by their license. Maybe they are afraid of something here, I don’t know.

In any case, all the graphs in the following posts capture the performance of single-threaded mode. Time units are K-Cycles. Sleeping is disabled for all engines.

Realtime fracture demo using GRB

Wednesday, March 27th, 2013

Here’s the sequel to the previous fracture video I posted some time ago:

http://www.youtube.com/watch?v=ATU6IGCMpUA&feature=youtu.be

It’s using “GRB”, which is basically the GPU-version of PhysX.

Precomputed node sorting in BV-tree traversal

Friday, November 23rd, 2012

Here is another post about a small optimization I just came up with. This time the context is BV-tree traversal, for raycasts or sweeps.

So let’s say you have some raycast traversal code for an AABB-tree. If you do not do any node sorting to drive the tree traversal, your code may look like this (non-recursive version):

const AABBTreeNode* node = root;
udword Nb = 1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            Stack[Nb++] = node->GetNeg();
            Stack[Nb++] = node->GetPos();
        }
    }
}

This should be pretty clear. We fetch nodes from the stack and perform segment-AABB tests against the nodes’ bounding boxes.

If the node is a leaf, we test its primitive(s). In case of a hit, we reduce the length of the query segment, to reduce the total number of visited nodes.

If the node is not a leaf, we simply push its children to our stack – in reverse order so that “Pos” gets tested first -, and continue. Easy peasy.

Now this is a version without “node sorting”: we always push the 2 children to the stack in the same order, and thus we will always visit a node’s “Positive” child P before the node’s “Negative” child N. This is sometimes just fine, in particular for overlap tests where ordering does not usually matter. But for raycasts, and especially for sweeps, it is better to “sort the nodes” and make sure we visit the “closest” node first, i.e. the one that is closer to the ray’s origin. The reason is obvious: because we “shrink the ray” when a hit is found, if we visit the closest node P first and shrink the ray there, the shrunk segment may not collide with N at all, and we will thus avoid visiting an entire sub-tree.

Node sorting is not strictly necessary. But it is a good way to “optimize the worst case”, and make sure the code performs adequately for all raycast directions. It has, nonetheless, an overhead, and it is likely to make the best case a little bit slower. A good read about this is the recently released thesis from Jacco Bikker, which contains a nice little code snippet to implement SIMD node sorting for packet tracing.
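One common building block for this kind of sorting is a segment-AABB test that also returns the distance at which the segment enters the box. A minimal slab-based sketch (my own version, with invented names; directions with zero components are ignored to keep it short):

```cpp
#include <algorithm>

struct Pt { float x, y, z; };

// Slab-based segment-AABB test. On hit, tNear receives the entry distance
// along the segment, which can then be used directly as a sort key.
// 'dir' is assumed to have no zero component in this sketch.
static bool SegmentAABB(const Pt& orig, const Pt& dir, float maxDist,
                        const Pt& boxMin, const Pt& boxMax, float& tNear)
{
    float tmin = 0.0f, tmax = maxDist;
    const float o[3]    = { orig.x, orig.y, orig.z };
    const float d[3]    = { dir.x, dir.y, dir.z };
    const float bmin[3] = { boxMin.x, boxMin.y, boxMin.z };
    const float bmax[3] = { boxMax.x, boxMax.y, boxMax.z };
    for (int i = 0; i < 3; ++i)
    {
        const float inv = 1.0f / d[i];
        float t0 = (bmin[i] - o[i]) * inv;
        float t1 = (bmax[i] - o[i]) * inv;
        if (t0 > t1)
            std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmin > tmax)
            return false;   // slabs do not overlap => no hit
    }
    tNear = tmin;
    return true;
}
```

Comparing the returned tNear values of the two children then gives the visit order.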

When dealing with simpler one-raycast-at-a-time traversals, there are usually two ways to implement the sorting, depending on your implementation choice for the segment-AABB test. If your segment-AABB test produces “near” and “far” distance values as a side-effect of the test, all you need to do is compare the “near” values. If however you are using a SAT-based segment-AABB test, those near and far values are typically not available, and an extra distance computation has to be performed. It is not necessary to use a very accurate distance test, so one option is simply to project the nodes’ centers onto the ray direction and use the resulting values. If we modify the code above to do that, we now get something like:

const AABBTreeNode* node = root;
udword Nb = 1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            const Point& BoxCenterP = node->GetPos()->mBoxCenter;
            const Point& BoxCenterN = node->GetNeg()->mBoxCenter;
            if(((BoxCenterP - BoxCenterN).Dot(RayDir))<0.0f)
            {
                Stack[Nb++] = node->GetNeg();
                Stack[Nb++] = node->GetPos();
            }
            else
            {
                Stack[Nb++] = node->GetPos();
                Stack[Nb++] = node->GetNeg();
            }
        }
    }
}

The above code could be improved, the branch could be removed, the last push to the stack could be avoided since it will otherwise probably create an LHS, etc. But this post is about node sorting, so I will only focus on this part.

It does not look like much, but it turns out that it can have a very measurable performance impact when the rest of the function is already highly optimized. It fetches the 2 children nodes (cache misses), it has a float compare (very slow on Xbox), and that dot product is annoying.

So, let’s get rid of all of these.

In order to do that, we will need to go back in time a bit. To the days of the painter’s algorithm, before Z-Buffers, when it was mandatory to render opaque polygons back-to-front. At that time even radix-sorting all triangles was considered too slow, so we often just… precomputed the sorting. We had 8 precomputed “index buffers” for 8 possible main view directions, and the whole sorting business became free. There are still various traces of those early algorithms online. This thread mentions both Iq’s version, called “Volumetric sort” and the similar article I wrote some 10 years before that. That was back in 1995, so the idea itself is nothing new.

What is new however, I think, is applying the same strategy to BV-tree traversals. I did not see this before.

So there are 8 possible main view directions. For each of them, and for each node, we need to precompute the closest child. Since we have only 2 nodes in the binary tree, we need only one bit to determine which one is closest, and thus we need 8 bits per node to encode the precomputed sorting. That’s the memory overhead for the technique, and it may or may not be acceptable to you depending on how easy it is to squeeze one more byte in your nodes.

The precomputation part is trivial. A vanilla non-optimized version could look like the following, performed on each node after the tree has been built:

static bool gPrecomputeSort(AABBTreeNode* node)
{
    if(node->IsLeaf())
        return true;

    const AABBTreeNode* P = node->GetPos();
    const AABBTreeNode* N = node->GetNeg();
    const Point& C0 = P->mBoxCenter;
    const Point& C1 = N->mBoxCenter;

    // Normalizing is not strictly needed (it does not change the sign of the
    // dot products below), but this is the vanilla, non-optimized version.
    Point DirPPP(1.0f, 1.0f, 1.0f);     DirPPP.Normalize();
    Point DirPPN(1.0f, 1.0f, -1.0f);    DirPPN.Normalize();
    Point DirPNP(1.0f, -1.0f, 1.0f);    DirPNP.Normalize();
    Point DirPNN(1.0f, -1.0f, -1.0f);   DirPNN.Normalize();
    Point DirNPP(-1.0f, 1.0f, 1.0f);    DirNPP.Normalize();
    Point DirNPN(-1.0f, 1.0f, -1.0f);   DirNPN.Normalize();
    Point DirNNP(-1.0f, -1.0f, 1.0f);   DirNNP.Normalize();
    Point DirNNN(-1.0f, -1.0f, -1.0f);  DirNNN.Normalize();

    // Each bool is true when the "Positive" child is closer along that direction
    const bool bPPP = ((C0 - C1).Dot(DirPPP))<0.0f;
    const bool bPPN = ((C0 - C1).Dot(DirPPN))<0.0f;
    const bool bPNP = ((C0 - C1).Dot(DirPNP))<0.0f;
    const bool bPNN = ((C0 - C1).Dot(DirPNN))<0.0f;
    const bool bNPP = ((C0 - C1).Dot(DirNPP))<0.0f;
    const bool bNPN = ((C0 - C1).Dot(DirNPN))<0.0f;
    const bool bNNP = ((C0 - C1).Dot(DirNNP))<0.0f;
    const bool bNNN = ((C0 - C1).Dot(DirNNN))<0.0f;

    // One bit per octant, matching the bit layout produced by ComputeDirMask()
    udword Code = 0;
    if(bPPP)    Code |= (1<<0); // Bit 0: PPP
    if(bPPN)    Code |= (1<<1); // Bit 1: PPN
    if(bPNP)    Code |= (1<<2); // Bit 2: PNP
    if(bPNN)    Code |= (1<<3); // Bit 3: PNN
    if(bNPP)    Code |= (1<<4); // Bit 4: NPP
    if(bNPN)    Code |= (1<<5); // Bit 5: NPN
    if(bNNP)    Code |= (1<<6); // Bit 6: NNP
    if(bNNN)    Code |= (1<<7); // Bit 7: NNN

    node->mCode = Code;
    return true;
}

Then the traversal code simply becomes:

const AABBTreeNode* node = root;
udword Nb = 1;
const AABBTreeNode* Stack[256];
Stack[0] = node;

const udword DirMask = ComputeDirMask(RayDir);

while(Nb)
{
    node = Stack[--Nb];

    if(TestSegmentAABBOverlap(node))
    {
        if(node->IsLeaf())
        {
            if(TestLeaf(node))
                ShrinkRay();
        }
        else
        {
            if(node->mCode & DirMask)
            {
                Stack[Nb++] = node->GetNeg();
                Stack[Nb++] = node->GetPos();
            }
            else
            {
                Stack[Nb++] = node->GetPos();
                Stack[Nb++] = node->GetNeg();
            }
        }
    }
}

As you can see, all the bad bits are gone, and node sorting is now a single AND. The “direction mask” is precomputed once before the traversal starts, so its overhead is completely negligible. An implementation could be:

//! Integer representation of a floating-point value.
#define IR(x) ((udword&)(x))

static udword ComputeDirMask(const Point& dir)
{
    const udword X = IR(dir.x)>>31;
    const udword Y = IR(dir.y)>>31;
    const udword Z = IR(dir.z)>>31;
    const udword BitIndex = Z|(Y<<1)|(X<<2);
    return 1<<BitIndex;
}
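As a side note, the IR macro type-puns through a reference cast, which modern compilers can reject under strict aliasing. A portable equivalent (my own variant, behaviorally the same) that also makes the octant-to-bit mapping easy to check:

```cpp
#include <cstdint>
#include <cstring>

// Portable version of the direction-mask computation: memcpy instead of
// a reinterpreting cast. The sign bits of x/y/z select one of 8 octants,
// and the returned mask has exactly that bit set.
static uint32_t ComputeDirMaskPortable(float x, float y, float z)
{
    uint32_t ix, iy, iz;
    std::memcpy(&ix, &x, 4);
    std::memcpy(&iy, &y, 4);
    std::memcpy(&iz, &z, 4);
    const uint32_t bitIndex = (iz >> 31) | ((iy >> 31) << 1) | ((ix >> 31) << 2);
    return 1u << bitIndex;
}
```

For example, a (+,+,+) direction maps to bit 0 and a (-,-,-) direction to bit 7, matching the bit labels used in the precomputation.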

And that’s it. Gains vary from one scene to another and especially from one platform to another, but this is another useful trick in our quest to “speed of light” BV-tree traversals.

People are confused

Tuesday, May 15th, 2012

It’s amazing how people are clueless about PhysX. It’s not the first time I read this kind of stuff on forums, but since this one is recent I’ll make a note here:

http://www.gamespot.com/forums/topic/29130252/what-do-you-expect-next-gen-for-dx1211.1-opengl-physx-havok-and-other-apis

“Havok has the advantage on PC for physics since Nvdia GPU accelerated physx is limited to their cards”

This is bullshit, plain and simple. Both Havok and PhysX work perfectly fine in software. PhysX works on anything from PC to PS3/Xbox/Mobile/etc. Basically they are similar libraries doing similar things.

But on top of that, PhysX accelerates some features on Nvidia GPU, while Havok does not. And PhysX is fully free, while you need to pay for Havok in commercial games. PhysX has the advantage here.

There are also differences in terms of performance, memory usage, robustness, features, API user-friendliness, etc. No clear winner in any of those.

Batching incoherent queries

Saturday, April 7th, 2012

For a binary tree, a typical recursive collision detection function usually looks like this (in this example for a raycast):

void MyTree::Raycast(const Ray& ray, const Node* node)
{
    // Perform ray-AABB overlap test
    if(!RayAABBOverlap(ray, node->GetAABB()))
        return;

    // Ray touches the AABB. If the node is a leaf, test against triangle
    if(node->IsLeaf())
    {
        if(RayTriangleOverlap(ray, node->GetTriangle()))
            RegisterHit();
    }
    else
    {
        // Internal node => recurse to left & right children
        Raycast(ray, node->GetLeftChild());
        Raycast(ray, node->GetRightChild());
    }
}

This traditional tree traversal code has a few problems:

  • A vanilla version suffers from many cache misses since we keep jumping from one node to its children, and there is no guarantee that those child nodes will be close to us.
  • Prefetching might help. However a typical prefetch instruction has a relatively high latency, i.e. you need to wait a certain amount of cycles after the prefetch instruction has been issued, before the requested address is effectively “free” to read. The problem with a binary tree such as the above is that there is just not enough work to do in one node (basically just one ray-AABB test).
  • The problem is the same on platforms where you need to DMA the next nodes. Even if you start your DMA right when entering the function, it may not be finished by the time you reach the end of it.

There are different ways to combat this issue. You can try a more cache-friendly tree layout, say van Emde Boas. You can try a more cache-friendly tree, say N-ary instead of binary. But those alternatives often come with their own set of problems, and most lack the simplicity of a binary tree.

So instead, I would like to do my best Mike Acton impression and ask: when was the last time you had to raycast a single ray against that tree? This is not the typical case. The typical case – the generic case, even – is when you end up with many raycasts against the same tree.

This kind of batched query has been done before for raycasts. I think it is called a “packet query”. But as far as I know, it has always been described for “coherent rays” (and only 4 at a time, IIRC), i.e. the kind of rays you get in raytracing when firing N rays from the same pixel (say for supersampling), or from close pixels. In other words the rays are very similar to each other, going in the same direction, and thus they are likely to touch the same nodes during the recursive tree traversal.

But where is this restriction coming from? Who said the rays had to be coherent? They do not have to. It is equally simple to collide N incoherent rays against a mesh.

Here is how.

We need a mesh, i.e. a binary tree, and then a linear array of rays. I will give the code first and then comment it. The code looks roughly like this:

void MyTree::MultiRaycast(int nbRays, Ray* rays, const Node* node)
{
    // Collide all rays against current AABB
    int Offset;
    {
        const AABB& Box = node->GetAABB();

        Offset = 0;
        int Last = nbRays;

        while(Offset!=Last)
        {
            if(RayAABBOverlap(rays[Offset], Box))
            {
                Offset++;
            }
            else
            {
                // Do a 'move-to-front' transform on rays that touch the box
                Last--;
                Swap(rays[Offset], rays[Last]);
            }
        }
        if(!Offset)
            return;
    }

    // Here, Offset = number of rays that touched the box. The array has been
    // reorganized so that those rays that passed the test are now at the
    // beginning of the array. The rays that did not touch the box are all at
    // the end.

    // If the node is a leaf, test surviving rays against triangle
    if(node->IsLeaf())
    {
        const Triangle T = node->GetTriangle();
        for(int i=0;i<Offset;i++)
        {
            if(RayTriangleOverlap(rays[i], T))
                RegisterHit();
        }
    }
    else
    {
        // Internal node => recurse to left & right children
        MultiRaycast(Offset, rays, node->GetLeftChild());
        MultiRaycast(Offset, rays, node->GetRightChild());
    }
}

So first we test all incoming rays against the node’s AABB. When doing so, we reorganize the rays so that the ones that passed the test are now at the beginning of the array. The rays that did not touch the box are all at the end. This is easily done with a simple ‘move-to-front’ operation. Astute readers will probably notice that this is similar to how the tree itself was built in Opcode, when the ‘positive’ and ‘negative’ primitives got reorganized at build time, in a similar fashion.

After the AABB test, if the node is a leaf, we simply test all surviving rays against the node’s triangle. Otherwise we just recurse as usual, with a simple twist: we pass down the number of surviving rays, not the original number.

And that’s it, done. Simple enough, right? The only vaguely subtle conceptual difficulty is how the recursive descent does not invalidate the re-ordering we did in the parent node. This is simply because regardless of how the child nodes reorganize the array, they are only going to reorganize the number we give them, i.e. the surviving rays. And no matter how those guys get shuffled down the tree, when the code comes back from the recursion (say after visiting the left child), we still pass a valid array of surviving rays to the right child. They are not in the same order anymore, but the permutation has only been done on the first half of the array, so we are still good to go.
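The ‘move-to-front’ pass itself is just an in-place partition. Isolated from the traversal, it can be sketched like this (a generic stand-in of my own, with a predicate replacing the ray-AABB test):

```cpp
#include <utility>

// In-place 'move-to-front' pass: moves the elements passing the test to
// the front of the array and returns how many passed. This is the only
// reordering the recursion ever does, and it only ever touches the first
// 'nb' entries - which is why parent orderings stay valid.
template<class T, class Pred>
static int MoveToFront(T* items, int nb, Pred pass)
{
    int offset = 0, last = nb;
    while (offset != last)
    {
        if (pass(items[offset]))
            offset++;
        else
            std::swap(items[offset], items[--last]);
    }
    return offset;
}
```

For example, partitioning {1, 8, 3, 9, 2} with the predicate “value < 5” returns 3, with the three passing values moved (in some order) to the front.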

So what do we get out of this? Well, a lot:

  • We are now working on N rays at a time, not just one. So that is a lot of work to do in each node, and there is plenty of time for your prefetch or your DMAs to finish.
  • The N rays we operate on are stored in a contiguous array, so accessing them is still very cache friendly. It would not be the case if we had put all our rays in a hierarchy, for example, and collided one ray-tree against the mesh.
  • We have N rays at a time, to collide against the same box or the same triangle. This is a perfect SIMD-friendly situation.
  • We traverse the tree only once now, so even if your tree layout is not cache-friendly at all, the number of traversal-related cache misses is minimized overall. In fact you can imagine it has just been divided by the number of batched rays…
  • We traverse each node only once now, so any extra work that you sometimes must do to access the AABB, like dequantization, is now only performed once instead of N times. Think about it: if you raycast N rays against the same mesh, you are dequantizing the top level AABB N times for example. With this new approach however, you do that only once.
  • In a similar way, we fetch a candidate triangle only once and raycast N rays against it. So the penalties associated with getting the triangle data (and there are quite a few, usually) are also reduced when multiple rays end up touching the same triangle.

I illustrated this idea with raycasts, but of course it works equally well with overlap tests or sweeps.

I do not know if it is a new idea or not, but I haven’t seen it described anywhere before.

Fracture demo from GDC

Friday, March 16th, 2012

http://www.youtube.com/watch?v=QxJacxI4_oE&feature=share

New broad-phase demo

Thursday, December 22nd, 2011

We recently had a friendly-yet-furious internal competition at work, where the goal was to create the fastest broad-phase algorithm ever.

We slowly reached the limits of what we can do with a 3-axes sweep-and-prune (SAP), and so we were trying to finally move away from it. This is not very easy, as SAP is very efficient in typical game scenes. Unfortunately, as the total number of objects in games increases, and game worlds become more and more dynamic, SAP performance increasingly becomes an issue.

Several people attacked the problem with several (and very different) approaches. I submitted two entries: one was the multi-SAP (MSAP) broad-phase I previously mentioned; the other one is something I probably cannot talk about yet, so let’s just call it “broad-phase X” for now. Long story short: this last algorithm won.

Out of curiosity, and since I had done it before with MSAP already, I tried it out in Bullet’s “CDTestFramework” to compare it against Bullet’s algorithms. On my machine the new broad-phase is a little more than 4X faster than any of Bullet’s versions, and more than 2X faster than MSAP. Of course the relative performance of all those algorithms varies with the total number of objects in the scene, how many of them are static, how many of them are moving, how fast they are moving, etc. So there usually isn’t a clear winner for all possible scenarios. Still, at work we tested those new algorithms in dozens and dozens of scenes with different characteristics, and this “broad-phase X” always performed remarkably well.

The demo is straight out of Bullet 2.78, compiled in Release mode with full optimizations and the SSE2 flag enabled.

Click here to download

The new broad-phase should be available in PhysX 3.3. Stay tuned! :)

Static/Bipartite SAP

Wednesday, September 14th, 2011

Another idea for optimizing the SAP when we have a small number of dynamic objects moving amongst a large number of static objects:

Put the dynamics in their own vanilla SAP. Get the dynamic-vs-dynamic overlaps from there, nothing new.

Put the dynamics & statics in a special SAP (“Static SAP”, or “Bipartite SAP”) modified to only report dynamic-vs-static overlaps:

  • The statics are stored in the SAP as usual (same box/endpoint structures as before)
  • The dynamics however are stored in a separate array, using a new box structure.
  • When the dynamic objects are inserted in the SAP, their endpoints are not inserted into the same arrays as the static endpoints. Instead they are stored directly in the new (dynamic) box structure itself, as links into the static arrays of endpoints: the virtual positions the endpoints would occupy if they had been inserted there as usual.
  • When running “updateObject” on those dynamic objects, we move their endpoints virtually instead of for real. That is, we still look for the new sorted positions within the arrays of endpoints, but we don’t actually shift the buffers anymore. The update is a const function for the main SAP data-structure; the only thing that gets modified is the data within the dynamic box we’re updating. In other words we don’t have real “swaps” anymore. No more memmoves, thus a reduced number of LHS (load-hit-stores), etc.
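A minimal one-axis sketch of this “virtual endpoint” idea (my own reconstruction, with invented names): the static endpoints stay in one sorted, never-modified array, and updating a dynamic box boils down to two binary searches.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A dynamic box only stores the virtual positions its endpoints would
// occupy in the (sorted, immutable) static endpoint array for one axis.
struct DynamicBox1D
{
    std::size_t mVirtualMin = 0;    // index of first static endpoint >= box min
    std::size_t mVirtualMax = 0;    // index of first static endpoint >  box max
};

// "updateObject": const with respect to the static SAP data - only the
// dynamic box itself is written, so no memmoves, no endpoint shifting.
static void UpdateObject(const std::vector<float>& staticEndpoints,
                         DynamicBox1D& box, float boxMin, float boxMax)
{
    box.mVirtualMin = std::lower_bound(staticEndpoints.begin(),
                                       staticEndpoints.end(),
                                       boxMin) - staticEndpoints.begin();
    box.mVirtualMax = std::upper_bound(staticEndpoints.begin(),
                                       staticEndpoints.end(),
                                       boxMax) - staticEndpoints.begin();
}
```

The static endpoints falling strictly between the two virtual positions then belong to static objects overlapping the dynamic box on that axis.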

The drawback is that we update dynamic objects twice, once in the “static SAP”, once in the normal SAP.

Didn’t try it, don’t know about the perf, but might be interesting.

Multi-SAP redux

Saturday, August 27th, 2011

Some years ago I wrote a bit about the Multi-SAP approach in my SAP document. I think I also said I would release some code, but never did. Erwin Coumans tried to implement one version in Bullet, but it’s still marked as “experimental” some years later, and it looks a bit like everybody gave up on that technique - including us in PhysX 3.0.

Well I didn’t.

I still think it’s a very nice method, and very easy to implement once you have a regular sweep-and-prune working. So I’m restarting this effort, refreshing my old prototype, adding some new tricks learnt since last time, etc.

As a teaser, and to show non-believers that yes, it rocks, I quickly added it to my old CDTestFramework, as found in Bullet 2.78. I’ll let you guys test it on your respective machines, but at least here at home, it beats the alternative algorithms in Bullet’s default scene.

msap_demo

Stay tuned…
