FLEXINE - a flexible engine
Random informal unsorted incomplete notes - last updated October 9, 2001
Flexine is my new engine, based on my former ICE framework. Its
main goal is, you guessed it, flexibility. Performance is important as
well, but it's definitely a runner-up here. Basically, I want to be able to
throw whatever I want at the engine, and let it handle the gory details. I
want to be able to test any whacked algorithm in a second. I want to be able
to replay levels from various games - e.g. Quake 3 and Oni - regardless of
their respective native data structures (BSP for Q3, octree for Oni). I want
to be able to handle exotic file formats such as ROK files, from the Japanese
modeler Rokkaku-Daioh. I want to be able to replace a DX7 rendering module
with an OpenGL one the day OpenGL 2.0 is out. I want to be able to use DX7,
DX8 or DX9 in a transparent way. I want that engine to last. I want it to be
future-proof.
What I don't want is to start a new engine from scratch once again, next
year or the year after, because this one was unable to evolve and keep up with
new trends. What I don't want is to ban some specific algorithms just because
the architecture can't support them. What I don't want is to be format-dependent.
What I don't want is to build one more eye-candy pusher with nothing
underneath. What I really don't want is to recode the same things again and
again, all over the place.
Various choices have been made in order to reach my goals. Below are some of
them, but keep in mind this is only a tiny informal overview.
Main goals in order of importance:
- Flexibility
- Performance
- Ease of use
- Code reuse
- Portability
Language choices
- everything is written in C++, and makes good use of the language's features.
In 2001, there are no valid reasons to be shy about virtual methods or multiple
inheritance. I accepted them once and for all, and don't want to lose my time
in endless, useless, pointless holy wars about this. In short, it pays off in
the long run and the price to pay is minimal - CPU time wise - as long as you
know what you're doing. If you're a tiny bit serious about professional
development anyway, you already know - and agree with - that.
- more specifically:
- inline assembly has not been ruled out (you know I'm not afraid of it -
worse, I actually like the smell of the metal), but it's only there for very
specific tasks (SIMD or pixel shader instructions come to mind), and only
added late in the day - not blindly.
- the project has been cut into different dedicated modules right from the
start. It's always tempting to hardcode some maths in the places you need
them. But it's bad. Here, everything is included in a dedicated maths DLL. It
is painful when the project starts, because you get plenty of DLLs with not
much in them, and the whole thing looks overkill. But before you know it you
have hundreds of files in the project, and without the initial separation into
multiple modules, it becomes an infernal mess. Current modules in Flexine are
listed below.
ICE basic blocks:
- IceCore.dll
- IceMaths.dll
- IceImageWork.dll
Rendering facilities:
- IceRenderer.dll
- IceDX7Renderer.dll
- IceDX8Renderer.dll
Dedicated modules:
- IceCharacter.dll
- IceTerrain.dll
- IceSound.dll
Computational geometry:
- Meshmerizer.dll
Collision detection:
- Opcode.dll
- ZCollide.dll
Physics:
- ZForce.dll
"Flexine", or the glue using all of the above:
- IceRenderManager.dll
Additional libs:
- zlib
- bzip2
- various formats
Documentation and code-style choices
- I have a very precise code style, and I stick to it. Whether it's the
best or not is totally irrelevant, and often people don't see this and start
endless wars about the best way to write readable code. Useless. I don't claim
anything here except coherency. The only rule is to stick to your rules.
- I chose Doxygen as the automatic documentation tool. Some people
reported they discarded it because they didn't like the way they had to write
their comments. I say this is actually the best thing about inflexible
documentation tools: you lose the tiny bit of liberty which happens to be a
PITA, responsible for many useless holy wars out there. You no longer have to
ponder contradictory options endlessly. Doxygen wants it that way, you write
it that way, period. I have better things to consider.
Design choices
- Code reuse is vital. That's probably my heaviest burden so far. When I write
new code, I simply can't stand not to share the maximum of it, even if the
whole project needs rebuilding afterwards. I also have a very personal strategy
to design my classes: I always do as if someone else were supposed to include
them directly in their own engine. Of course it is not always possible, but
theoretically it leads to cute, independent modules. I already released some of
them just to check that theory - e.g. my triangle stripper, the consolidation
code, or what's now known as Opcode. In short, I think I could extract most ICE
/ Flexine modules in the same way, and one could still include them and use
them in another engine. That's not altruism. That's a way to build neat
interfaces.
- Design patterns are vital. This goes along with code reuse - something like
idea-reuse maybe. As a single example, the publish-subscribe design pattern is
used extensively to avoid looping through all your objects in vain.
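For what it's worth, here's a minimal sketch of the pattern - all names are
made up for the example, not the actual ICE interfaces:

#include <vector>
#include <algorithm>
#include <cstdio>

// Subscribers register themselves with a publisher; the publisher only
// touches registered listeners, instead of looping over every object.
class Subscriber
{
public:
    virtual ~Subscriber() {}
    virtual void OnEvent(int event_code) = 0;
};

class Publisher
{
    std::vector<Subscriber*> mSubs;
public:
    void Subscribe(Subscriber* s)   { mSubs.push_back(s); }
    void Unsubscribe(Subscriber* s)
    {
        mSubs.erase(std::remove(mSubs.begin(), mSubs.end(), s), mSubs.end());
    }
    void Publish(int event_code)
    {
        // Only interested parties get called back.
        for(unsigned int i=0; i<mSubs.size(); i++)
            mSubs[i]->OnEvent(event_code);
    }
};

// e.g. a list of visible objects that wants to know when one of them dies
class VisibleList : public Subscriber
{
public:
    virtual void OnEvent(int event_code)
    {
        printf("element %d is gone, removing it\n", event_code);
    }
};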
- Lazy evaluation is vital. As the guys from dPVS/Umbra said, it is the key to
great performance. I couldn't agree more, and my primary design strategy is
always to think "how can I lazy-evaluate this?".
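The typical shape of that strategy is a dirty flag plus a cached result -
here's a tiny sketch with made-up names, not Flexine's actual interfaces:

// The bounding radius is only recomputed when someone actually asks for
// it, and only if the object moved since the last query.
class Object
{
    float   mPos[3];
    float   mCachedRadius;
    bool    mDirty;
public:
    Object() : mCachedRadius(0.0f), mDirty(true) { mPos[0]=mPos[1]=mPos[2]=0.0f; }

    void SetPosition(float x, float y, float z)
    {
        mPos[0]=x; mPos[1]=y; mPos[2]=z;
        mDirty = true;              // don't recompute anything here
    }

    float GetBoundingRadius()
    {
        if(mDirty)                  // pay the cost only when needed...
        {
            mCachedRadius = ComputeRadius();
            mDirty = false;         // ...and only once per change
        }
        return mCachedRadius;
    }
private:
    float ComputeRadius() const { /* expensive in real life */ return 1.0f; }
};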
- Robustness is vital. Long, long years ago, I was a coder-wannabe on Amstrad
CPC. And there was that French magazine (I think it was "AMSTAR")
with some coding lessons in that good old Locomotive Basic. And there was that
first, simple, beautiful rule: "Un bon programme est implantable." -
read: "a good program never crashes". I know it may sound dubious in
these days of random BSODs, but it did happen in the past, and even if it was
on old machines with limited RAM, limited users, limited resources, limited
possibilities for the code to crash, it did the job nonetheless. A PC program
can crash for millions of new reasons, but I strongly feel we, as developers,
are responsible for most of them. So I fight for robustness. When I was at
ESME, one of my teachers (Bernard Ourghanlian, technical director at DIGITAL)
clearly pointed out the problem: the current trend is to write "disposable
software", later superseded by new versions or more-or-less fixed with
patches. We know the reasons for this, and I'm not here to criticize or discuss
them. But I definitely try not to play that game, and it shows in my code,
where I often - if not always - try to make any given function bullet-proof.
This was one of the key requirements when I started coding ICE, and it has
evolved into a pathological crusade by now. In Flexine for example, as
long as you play by the rules and let ICE monitor everything, you can safely
mess up a lot of things without crashing - ICE's kernel does the cleaning. As a
simple example, if you delete an object anywhere ("delete ptr")
instead of calling a Release() method, without checking a possible reference
counter, bypassing all safety nets, it usually recovers gracefully anyway. The
object gets wiped out. Then, if you stored that pointer in some container (for
example in a list of meshes, a list of visible objects, whatever), the
container automatically knows one of its elements is now invalid, and removes
it from the list - calling you back for notification if needed. Perhaps even
sicker is the ability of the underlying kernel to fix now-invalid pointers all
by itself, even if that pointer lies somewhere as a private member of some
class. The pointer becomes null, automagically - of course you shouldn't have
savagely deleted the referenced object in the first place, the point is it
doesn't crash even if you do stupid things. On top of that, you're usually
supposed not to use pointers anymore, but IDs. The ID-to-pointer translation is
supposed to be done on-the-fly each time it's needed, and the overhead is
virtually free on PC thanks to a carefully crafted implementation.
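To give an idea, here's a deliberately naive sketch of such an ID-to-pointer
table. The real ICE kernel is certainly more involved (slot recycling,
generation counters to catch ID reuse, and so on), and these names are mine,
not ICE's - the point is just why the translation can be nearly free:

#include <vector>
#include <cstddef>

typedef unsigned int ID;

class HandleManager
{
    std::vector<void*> mSlots;
public:
    ID Add(void* ptr)
    {
        mSlots.push_back(ptr);
        return ID(mSlots.size()-1);
    }

    // Deleting an object just nulls its slot, so every stale ID now
    // translates to null instead of a dangling pointer.
    void Remove(ID id)
    {
        if(id < mSlots.size()) mSlots[id] = NULL;
    }

    // The on-the-fly translation: one bounds check and one array lookup,
    // cheap enough to redo at each access.
    void* Translate(ID id) const
    {
        return id < mSlots.size() ? mSlots[id] : NULL;
    }
};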
- Smartness is vital. But not too much of it! Often you build a complex
system, and in the end it runs slower than a brute-force, or at least less
clever, approach. Often you optimize one thing, and another thing starts
suffering from the change. It's easy to optimize one particular routine to the
max. It's harder when everything gets interconnected. There's a balance to
find, and no single good way to the top. This is now known as "smartness
wrap-around", following one of my posts on the Algorithms list:
"Everything is precision-limited with computers, even smartness. Too
much of it and you wrap around."
a.k.a.
"Curiosity killed the cat,
Complexity killed the cache!"
So keep profiling, low profile. It's harder and harder to foresee what will
come out best, and a good technique one day becomes bad practice the next.
Learn, adapt, evolve, survive. "I got no idols", Juliana said!
Beware of self-proclaimed gurus. Believe no one but yourself. Yes, that's a
design choice! It leads to flexibility as you don't put all your bets on a
single "best" design or data structure.
- Shipping is vital. Highest priority. Don't
trust a guy who's never shipped anything. Don't. Do not.
Data format choices
- Automatic serialization, usually implemented thanks to virtual
import / export methods and clever OFFSETOF macros, is cute and handy.
Unfortunately it also has two painful drawbacks: it usually produces large
files, and it implies a strong dependency between the file format and your
internal classes. On top of that, versioning is difficult to make automatic.
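For the record, here's roughly what I mean - a toy version of the field-table
approach, using the standard offsetof macro and illustrative names. Note how
the file layout gets welded to the class layout, which is exactly the
dependency problem:

#include <cstdio>
#include <cstddef>

enum FieldType { FIELD_FLOAT, FIELD_INT };

struct FieldDesc
{
    FieldType   Type;
    size_t      Offset;
};

struct Camera
{
    float   mFOV;
    int     mWidth;
    static const FieldDesc Fields[2];
};

// Each class publishes a table of (type, offset) entries...
const FieldDesc Camera::Fields[2] =
{
    { FIELD_FLOAT, offsetof(Camera, mFOV)   },
    { FIELD_INT,   offsetof(Camera, mWidth) },
};

// ...and one generic exporter walks the table and dumps raw fields.
void Export(const void* object, const FieldDesc* fields, int nb, FILE* fp)
{
    for(int i=0; i<nb; i++)
    {
        const char* base = (const char*)object + fields[i].Offset;
        size_t size = fields[i].Type==FIELD_FLOAT ? sizeof(float) : sizeof(int);
        fwrite(base, size, 1, fp);
    }
}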
- So in Flexine, I used the old way. I have my proprietary format, and also
support some other classic formats - for the sake of, you guessed it,
flexibility. They all use the same model:
- an X-importer class for a particular X format
- X-factory ISA X-importer
- scene HASA X-factory
Usually those factories only fill creation structures with incoming
format-dependent data, then call the standard scene creation methods (a
minimal sketch follows the list below). On one hand, that way of dealing with
data has a major drawback: you need to write and maintain a lot of code. On
the other hand, it has several interesting advantages:
- the X-importers can (and should) be written as independent modules, say
static libs. Other people can re-use them in their own engines, since only the
X-factories depend on my internal classes. In the same way, I can reuse those
loaders in other projects.
- if there are some changes in the internal classes (even radical ones),
already existing art remains valid. It doesn't need reexporting or anything.
The only piece of code which needs updating is the factory, the actual
interface between the data and the engine. This is not the case with automatic
serialization.
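Here's the promised sketch of the importer / factory / scene model. All names
are made up (a hypothetical ROK importer), and the actual parsing is elided:

// Engine-side creation structure, filled by factories.
struct MeshCreate
{
    int             mNbVerts;
    const float*    mVerts;
};

// The importer knows the ROK format and nothing about my engine. It can
// ship as an independent static lib and be reused elsewhere.
class RokImporter
{
public:
    virtual ~RokImporter() {}
    // Parses the file and calls NewMesh() once per mesh found.
    // (Parsing elided - this only shows the shape of the model.)
    bool Import(const char* filename) { return filename!=0; }
protected:
    virtual void NewMesh(int nb_verts, const float* verts) = 0;
};

class Scene
{
public:
    // The standard, format-independent creation path.
    bool CreateMesh(const MeshCreate& create) { return create.mNbVerts>=0; }
};

// X-factory ISA X-importer: the only piece that knows both sides.
class RokFactory : public RokImporter
{
    Scene* mScene;
public:
    RokFactory(Scene* scene) : mScene(scene) {}
protected:
    virtual void NewMesh(int nb_verts, const float* verts)
    {
        MeshCreate mc;          // translate incoming data...
        mc.mNbVerts = nb_verts;
        mc.mVerts   = verts;
        mScene->CreateMesh(mc); // ...and call the standard creation method
    }
};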
- Some developers build their custom format so that it's optimal for hardware
T&L. While I understand the purpose, I don't think it's very future-proof,
as the nature of hardware-friendly things tends to radically change over the
years. As a single example, there's still no clear answer whether strips are
better than lists or not. As a result, in Flexine I chose to build
hardware-friendly data at runtime, on loading or sometimes on-the-fly when
needed - following the lazy-evaluation paradigm. Triangle strips are not
stored to disk but built on-the-fly. Consolidation is performed on demand as
well. Care must be taken to ensure it doesn't slow down the loading process,
but in the end it allows me to keep the art relatively independent of hardware
trends. The transform between on-disk and in-engine data can furthermore be
driven by a runtime performance analysis, to make sure it really fits the
underlying hardware.
- All bets are off when it comes to streaming, nonetheless. Not all data
formats support streaming, and you really need to design things carefully
here. In any case, I wouldn't want to support streaming of automatically
serialized classes...
Rendering choices
- The low level rendering API is currently
DirectX, for better or for worse. I'm not claiming it's better than OpenGL,
let's leave those useless discussions behind. At one point it just seemed to
evolve faster.
- This is a problem as far as compatibility
is concerned. The right way to solve that problem is probably to build a
rendering abstraction layer, upon which you can plug whatever you want. This is
not an easy task, and the trick is probably to design relatively high-level
interfaces. That way you can implement any exposed function with any low-level
API, even if a given function has a native counterpart on one API, and must be
recoded with multiple calls on the other. In Flexine there's a first wrapper
exposing those high-level interfaces, then a particular implementation of those
interfaces for each low-level API (one for DX7, one for DX8, etc).
- This is not a matter of simply wrapping a
DX call with a one-liner "high-level" method. It sometimes made me
rewrite complete DX functionalities. For example, I have my own DX7 version of
what has been later introduced in DX8 under the name "Effects &
Techniques". Exposing the E&T interfaces allows me to use the native
E&T stuff in DX8, but also to keep the same application code working with,
say, OpenGL. Wrapping also allows one to minimize redundant render state
changes. Here's a snippet of one of my posts on the GD-Algorithms list about
this:
"One of the reasons I chose to wrap
everything in the first place was to avoid redundant state changes by caching
all states at the app level - hence avoiding to even call DX SetRenderState
methods when not needed. From what I read on the DXML - and you probably know
those bits better than I do - it *should* be useless because the driver
*should* do it as well. But several reality-provided things have since been
added on top of that cute scenario:
- when I started implementing this on DX7 (or was it DX6?) the driver
behaviours were quite random, to say the least. Does that one cache things
correctly? Does that other one filter redundant changes? Uh, nobody was able to
provide a definitive answer, and I doubt such a thing even existed. Better safe
than sorry, wrapping everything was faster than looking for dubious
IHV-provided-answers-yeah-of-course-we-do-it-what-do-you-think.
- if I understand correctly (and since I'm still using DX7 I might be wrong),
you can't use GetRenderState() methods with pure devices. So the only way to
know the current states is to cache them at the app level anyway. Whether you
should have to worry about the current states is another question, the point
is: if you want to know, you have to code the thing anyway.
- but if you start playing with state blocks, it becomes messy: the state
block gets applied and your app-level caches don't know about it. Screwed.
Hence you need to wrap state blocks as well as render states.
- now comes the NVIDIA case: state blocks are good and fine, except... doh, on
NVIDIA cards. What the hell? What am I supposed to do? Duplicate my code
everywhere, supporting state blocks or not depending on the card? Big mess
ahead. Once again, it's way cleaner to do your own state blocks once and for
all, and use them everywhere. On NVIDIA boards, they end up calling the crude
render state methods. On all other boards, they end up using real state blocks.
You can even profile both and choose what's best at runtime.
For all those reasons (which basically come down to a single one: peace of
mind), biting the bullet is, IMHO, worth it. Extra advantages:
- the delta stuff is actually one C++ line:
StateBlock Delta = SB0 - SB1;
- you can use Effects & Techniques files in your DX7 app
- you don't bother anymore about what's best / fastest: the runtime profiler
does that for you for every piece of new hardware you can imagine. I don't want
to put my nose in that code again, that's not interesting, that's vain, that's
painful.
- the wrapper also checks the caps for you - another painful thing to do IMHO.
etc, etc...
There are two downsides:
- it's a lot of code
- it's admittedly slower than calling the DX methods directly. (and I
definitely don't care, to me it's very very worth the price)"
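To illustrate - not the actual Flexine code, just a minimal sketch with
made-up names - app-level caching plus home-made state blocks could look like
this, including the one-line delta mentioned above:

#include <map>

typedef unsigned int udword;

// App-level render state cache: filters redundant changes before they
// reach the API. The real device call is stubbed out here.
class RenderStateCache
{
    std::map<udword, udword> mCurrent;  // last value actually sent
public:
    void SetRenderState(udword state, udword value)
    {
        std::map<udword, udword>::iterator it = mCurrent.find(state);
        if(it!=mCurrent.end() && it->second==value)
            return;                     // redundant change, filtered out
        mCurrent[state] = value;
        // pDevice->SetRenderState(...) would go here
    }
};

// Home-made state block: just a collection of (state, value) pairs,
// replayed through the cache. Works the same on every board.
class StateBlock
{
    std::map<udword, udword> mStates;
public:
    void Set(udword state, udword value) { mStates[state] = value; }

    // The delta: states of *this whose value differs in the other block.
    StateBlock operator-(const StateBlock& other) const
    {
        StateBlock delta;
        std::map<udword, udword>::const_iterator it;
        for(it=mStates.begin(); it!=mStates.end(); ++it)
        {
            std::map<udword, udword>::const_iterator f = other.mStates.find(it->first);
            if(f==other.mStates.end() || f->second!=it->second)
                delta.mStates[it->first] = it->second;
        }
        return delta;
    }

    void Apply(RenderStateCache& cache) const
    {
        std::map<udword, udword>::const_iterator it;
        for(it=mStates.begin(); it!=mStates.end(); ++it)
            cache.SetRenderState(it->first, it->second);    // filtered replay
    }
};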
- Batching and dynamic lighting: here's another of my posts from the GDA:
> Does anyone know a decent order in general for renderstate costliness?
> (including turning lights on and off)
"Lights can be handled by the collision detection module. I just send both
meshes + lights to the sweep-and-prune, basically. Since I already send meshes
out there to handle all meshes vs meshes collisions, adding some lights to the
call is virtually free. (...as long as they are point lights) Then for each
mesh I keep N lights (usually N=8 but it depends on your hardware). I don't use
a scene graph, I batch & sort everything as you propose (a reasonable way
IMHO - and I don't think there's a "best" way). I usually sort by
texture first, or by material/shader. Sometimes you want to sort by VB instead,
but it depends on your card, geometries, viewpoint, anything. So the best way
is to be able to change the sort keys at runtime, and do a little runtime
analysis to figure out what's best. Usually textures win, but I admittedly only
ever tested this on NVIDIA cards and with limited scenes. Now, if you have a
lot of dynamic lights around, it may be better to sort by "mesh"
first, activate N lights for the mesh, render it - possibly with multiple
texture changes -, deactivate the lights and go on. If you sort by texture
first, you decrease the number of texture switches but increase the number of
light switches. Since light activation is really fast, it may be a win. But
since a given mesh may be lit by N lights, you're actually trading one texture
switch for N light switches, and all in all the winner is unclear... In any
case, the sweep-and-prune initial phase is good to determine the mesh/light
interactions, regardless of how you handle them afterwards. The same method is
used in Umbra, under the name "Regions Of Influence" (even if I
prefer the good old sweep-and-prune label). It works reasonably well anyway. I
must have some test scenes with 128 meshes lit by 128 lights, and determining
what is influencing what is virtually free (just an O(n) algorithm on 256
thingies).
Well, in any case I wouldn't use a scene-graph, and I wouldn't bother too much
about what's best since it has always been evolving/changing over the years, is
currently unclear, and is probably going to change with future versions of DX,
future cards, future whatever. So don't make it "best", make it
flexible. Meanwhile: batch, sort by texture, you'll be fine - correct me people
if I'm wrong, but it seems to work well here.
I also would like to back up Tom's comments about cooking your own
state-block wrapper and delta-compressing state changes. This is a lot of code
indeed, not too exciting and even pretty boring. But that's probably one of
the best routes I've ever taken as far as rendering is concerned. It just
makes life simpler."
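Here's a rough sketch of what "changing the sort keys at runtime" can mean in
practice - illustrative names, and I'm assuming IDs fit in 20 bits each so
that three of them pack into a 64-bit sort key:

#include <vector>
#include <algorithm>

typedef unsigned int udword;
typedef unsigned long long uqword;

// One render call waiting to be batched.
struct RenderCall
{
    udword  mTextureID;
    udword  mVBID;
    udword  mMeshID;
    uqword  mKey;
};

enum SortPolicy { SORT_BY_TEXTURE, SORT_BY_VB };

// The policy only changes how the key is packed - major field in the
// high bits - not the render loop itself.
inline uqword BuildKey(const RenderCall& rc, SortPolicy policy)
{
    if(policy==SORT_BY_TEXTURE)
        return (uqword(rc.mTextureID)<<40) | (uqword(rc.mVBID)<<20) | rc.mMeshID;
    return (uqword(rc.mVBID)<<40) | (uqword(rc.mTextureID)<<20) | rc.mMeshID;
}

inline bool SortCalls(const RenderCall& a, const RenderCall& b) { return a.mKey < b.mKey; }

void SortBatches(std::vector<RenderCall>& calls, SortPolicy policy)
{
    for(unsigned int i=0; i<calls.size(); i++)
        calls[i].mKey = BuildKey(calls[i], policy);     // keys rebuilt per pass
    std::sort(calls.begin(), calls.end(), SortCalls);
    // A small runtime analysis can time both policies on the actual
    // scene and hardware, and keep whichever wins.
}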
- As in numerous engines, there's a distinction between the actual objects to
render (meshes, but also various helpers) and their rendering properties
(materials, textures, illumination model, in a word: shaders). Basically
you can batch by shader to minimize render state changes, batch by object to
minimize VB / IB switches, or batch by light to apply each of them in a
separate pass - effectively bypassing the usual hardware limit of 8 lights.
You're of course supposed to batch by shader or mesh as well within a light
batch... It gets worse when visibility enters the picture, as you also want to
render things in a rough front-to-back order to reduce overdraw. And it becomes
really messy when all of this relies on traversing a scene-graph. There's no
perfect pipeline, it all depends on the situation.
- Actual meshes are made of several
submeshes. A submesh is a group of faces sharing the same rendering properties.
Each submesh has the usual geometry and topology, both locked in their
respective vertex and index buffers. Most of the time a system memory copy of
both buffers is kept, and used in various cases such as picking or collision
detection. Hardwired topology is best expressed as indexed triangle strips -
usually with a single degenerate indexed strip for each submesh -, but both
strips and lists are of course supported. The system copy is always a list,
strippified on-the-fly when needed. Vertex and index buffers can be either
static (and optimized under DX7) or dynamic. Most of the time anyway, they're
semi-dynamic: multiple meshes are stored in shared buffers, managed as LRU
caches. Meshes are not stored in hardwired buffers at loading time: they're
actually sent to the renderer only when they become visible for the first time.
All of this is kept under the hood, in the abstract renderer, so that the
process is totally invisible to the app, which doesn't want to know about the
details. The day vertex buffers become obsolete, the rendering interfaces will
hopefully remain valid nonetheless. Now, this management is exposed in a
RenderableSurface at Renderer level, but you're not forced to use it if you
don't want to. You still can access the simpler wrappers and build your own
system.
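Roughly, the shape of the thing - member names are made up for the example,
not Flexine's actual classes:

typedef unsigned int udword;
typedef unsigned short uword;

// One submesh = one group of faces sharing the same rendering properties,
// one range in the shared vertex / index buffers, plus an optional
// system-memory copy.
struct SubMesh
{
    udword          mShaderID;      // shared rendering properties
    udword          mVBOffset;      // where we live in the shared vertex buffer
    udword          mIBOffset;      // where we live in the shared index buffer
    udword          mNbFaces;
    bool            mIsStrip;       // hardwired topology: strip or list

    // System-memory copy, always an indexed triangle list. Used for
    // picking and collision detection, strippified on-the-fly when the
    // renderer wants strips.
    const float*    mSysVerts;
    const uword*    mSysIndices;
};

struct Mesh
{
    udword          mNbSubMeshes;
    SubMesh*        mSubMeshes;
    bool            mInVRAM;        // only uploaded once first visible
};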
- The classic MAX model has been followed,
where each face references a material, and each material references one or more
textures. Easy, handy, works well. But it's not flexible enough. The new model
supports E&T-like shaders. They can be compiled from text files or directly
from memory, and they produce byte-code further captured in state blocks. A
fast state block emulation path is provided - which happens to be faster on
NVIDIA cards, in accordance with what Richard Huddy repeated many times on the
DXML.
- DXTC compression is supported simply because there's no reason not to
support it. It reduces bandwidth, speeds things up - best of all worlds.
- Parametric geometry and high-order surfaces have been taken into account,
not for the sake of it, but really because the more polygons we can draw, the
more bandwidth becomes an issue. I don't believe in B-patches so far: as long
as they're not hardware-accelerated, they only look like geometry compression
to me. If you want geometry compression, see the next paragraph. N-patches and
RT-patches are "supported" simply because I wrap & expose all DX
render states. Theoretically you can set up those render states in a Flexine
shader and it should work like the proverbial charm.
Now, I don't have a GeForce3 or a Radeon to test this! Meanwhile, I have
software subdivision surfaces using the modified Butterfly algorithm. It's been
designed so that it cohabits well with other things like cloth or skinning.
Hence you can subdivide the result of a skinning algorithm on-the-fly, for
example.
- Dedicated geometry compression algorithms
have been implemented, in order to be Internet-friendly. Decompressing a mesh
into a vertex buffer on-the-fly can sometimes be a good move. [more about that
later]
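As a toy illustration of the decompress-into-a-VB idea - a real scheme is
smarter than 16-bit bounding-box quantization, and these names are
illustrative only:

#include <vector>

typedef unsigned short uword;

struct QuantizedMesh
{
    float               mMin[3], mMax[3];   // bounding box
    std::vector<uword>  mQuantized;         // 3 words per vertex
};

// Quantize positions to 16 bits per component inside the mesh's box.
void Compress(const float* verts, int nb_verts, QuantizedMesh& out)
{
    for(int j=0; j<3; j++) { out.mMin[j] = out.mMax[j] = verts[j]; }
    for(int i=1; i<nb_verts; i++)
        for(int j=0; j<3; j++)
        {
            float v = verts[i*3+j];
            if(v < out.mMin[j]) out.mMin[j] = v;
            if(v > out.mMax[j]) out.mMax[j] = v;
        }
    out.mQuantized.resize(nb_verts*3);
    for(int i=0; i<nb_verts; i++)
        for(int j=0; j<3; j++)
        {
            float range = out.mMax[j] - out.mMin[j];
            float t = range > 0.0f ? (verts[i*3+j] - out.mMin[j]) / range : 0.0f;
            out.mQuantized[i*3+j] = uword(t * 65535.0f);
        }
}

// Cheap enough to run on loading, or on-the-fly when filling a dynamic
// vertex buffer - hence "decompressing a mesh into a vertex buffer".
void Decompress(const QuantizedMesh& in, float* vb, int nb_verts)
{
    for(int i=0; i<nb_verts; i++)
        for(int j=0; j<3; j++)
        {
            float t = float(in.mQuantized[i*3+j]) / 65535.0f;
            vb[i*3+j] = in.mMin[j] + t * (in.mMax[j] - in.mMin[j]);
        }
}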
- Shadow maps & shadow volumes are here.
8 different silhouette extraction routines and a software renderer in case
render-to-texture is whacked.
More to come:
Architecture hell
Collision detection choices
Visibility & culling
Even more to say about:
- LOD & simulation LOD
- Terrains
- Particles
- Characters
Lookin' for a planet with 96h a day. Someone?