Let’s face it, performance on modern processors (be it PCs, consoles or mobiles) is mostly governed by memory access patterns. Still, data-oriented design is considered something new and novel, and only slowly creeps into programmers’ brains, and this really needs to change. Having co-workers fix your code and improving its performance really is no excuse for writing crappy code (from a performance point-of-view) in the first place.
This post is the first in an ongoing series about how certain things in the Molecule Engine are done in a data-oriented fashion, while still making use of OOP concepts. A common misconception about data-oriented design is that it is “C-like”, and “not OOP”, and therefore less maintainable – but that need not be the case. The concrete example we are going to look at today is how to organize your mesh data, but let’s start with the pre-requisites first.
Data-oriented design
Much has been said and written about data-oriented design already, and there’s some nice resources on the web. Mike Acton (Technical Director at Insomniac Games) is a well-known advocate of data-oriented design, and has some interesting slides on the web as well.
I will not go into detail about data-oriented design in general, so let me quickly sum up what for me data-oriented design is all about:
- Think about data first, and code second. Class hierarchies aren’t important, but data access patterns are.
- Think about how data in your game is accessed, how it is transformed, and what you end up doing with it, e.g. particles, skinned characters, rigid bodies, and tons of other examples.
- When there’s one, there’s many. Think in streams of data.
- Be aware of the overhead of virtual functions, pointers to functions, and pointers to member functions.
Just to make everybody aware of the importance of point 1 and point 2, consider a very simple example:
char* data = pointerToSomeData;
unsigned int sum = 0;
for (unsigned int i=0; i<1000000; ++i, ++data)
{
sum += *data;
}
In the above example, we take the sum of a million bytes, nothing more. No fancy tricks here, no hidden C++ costs. On my PC this loop takes about 0.7ms.
Let’s slightly alter the loop, and measure its performance again:
char* data = pointerToSomeData;
unsigned int sum = 0;
for (unsigned int i=0; i<1000000; ++i, data += 16)
{
sum += *data;
}
The only thing that has changed is that we are summing every 16th element now, that is we have changed ++data to data += 16. Note that we still take the sum of exactly 1 million elements, only different elements this time!
The time this loop took? 5ms. Let me spell that out for you: This loop is more than 7 times slower than the original, eventhough we’re accessing the same amount of elements. Performance is all about memory access, and how to make good use of your processor’s caches.
Remember point 3, “When there’s one, there’s many. Think in streams of data.”? What does this mean? It’s simple: if you have one texture, mesh, sound sample, rigid body, etc. in your game, you’re going to have many of them. Write your code in a way that it can deal with many of them at once – think in terms of streams of objects. That doesn’t mean that you need to get rid of your Mesh and Texture classes, and turn them into a MeshList and TextureList, respectively. This will be dealt with in our later example.
Last but not least, point 4: “Be aware of the overhead of virtual functions, pointers to functions, and pointers to member functions.” is something that is very close to my heart. It is something often ignored (or forgotten about) by less experienced programmers, and I always make sure to teach my students the underlying costs of virtual function calls and the likes. Still, sooner or later they will end up being used in an inner loop, and will kill your performance.
To state it bluntly, you should never use a virtual function, a pointer-to-function, or a pointer-to-member-function on a per-object basis – not for each primitive group of a mesh, not for each particle in your particle system, not for each texel in a texture; you get the idea. They will likely cause both instruction and data cache misses, and kill your performance – be aware of it!
Dealing with static mesh data in a data-oriented way
Having said that, let’s start with today’s example of how to deal with static mesh data in a data-oriented way. There’s certainly many ways in which you can organize your data, so what we’re interested in is the difference in performance between a completely object-oriented design, and a design more concerned about data access patterns.
The scenario of our example is the following:
- We want to render 500 static meshes, and perform view-frustum culling on them.
- Each mesh has a vertex buffer, index buffer, and about 3 sub-meshes on average.
- Each sub-mesh stores a starting index and the number of indices used in the vertex buffer/index buffer of the mesh it belongs to. Additionally, each sub-mesh has a material and an axis-aligned bounding-box.
- A material contains a diffuse texture, and a light map.
Certainly nothing extremely complicated, which makes it easier to see the difference between the two approaches.
The by-the-book OOP approach
Let’s start with the class design of the object-oriented approach:
class ICullable
{
public:
virtual bool Cull(Frustum* frustum) = 0;
};
class SubMesh : public ICullable
{
public:
virtual bool Cull(Frustum* frustum)
{
m_isVisible = frustum->IsVisible(aabb);
}
void Render(void)
{
if (m_isVisible)
{
m_material->Bind();
context->Draw(m_startIndex, m_numIndices);
}
}
private:
unsigned int m_startIndex;
unsigned int m_numIndices;
Material* m_material;
AABB m_boundingBox;
bool m_isVisible;
};
class IRenderable
{
public:
virtual void Render(void) = 0;
};
class Mesh : public IRenderable
{
public:
virtual void Render(void)
{
context->Bind(m_vertexBuffer);
context->Bind(m_indexBuffer);
for (size_t i=0; i<m_subMeshes.size(); ++i)
{
m_subMeshes[i]->Render();
}
}
private:
VertexBuffer* m_vertexBuffer;
IndexBuffer* m_indexBuffer;
std::vector<SubMesh*> m_subMeshes;
};
Hopefully, some of you will already cry at the sight of such a design. One might think that the above is an exaggeration of OOP design gone wild, but let me assure you that I’ve seen far worse in shipped games. The contender for the worst example was something along the lines of the following:
class GameObject : public IPositionable, public IRenderable, public IScriptable, ...
I cannot remember the exact details, but the actual implementation inherited from 6 or 7 classes, some of them having implementations, others just having pure virtual functions, others having virtual functions for the sole purpose of making dynamic_casts work on them… I don’t want to point fingers, but rather reinstate that such designs can be found in shipped, commercial games and are not made-up examples – but let’s not digress.
Back to the original example, what is so bad about it? Well, there’s several things:
- The two interfaces IRenderable and ICullable cause a virtual function call hit on each call to Render() and Cull(), respectively. Furthermore, virtual functions almost never can be inlined, further deteriorating performance.
- Bad memory access patterns when culling meshes. In order to read the AABB, a whole cache line (64 bytes on modern architectures) will be read into the processor’s cache, whether you need it or not. This boils down to accessing more data than necessary, which is bad for performance, as could be seen in our very first simple example (summing 1 million bytes).
- Bad memory access patterns when rendering sub-meshes. Each sub-mesh checks its own m_isVisible flag, which again will pull more data into the cache than necessary. Same bad behaviour as in the example above.
- Bad memory access patterns in general. Points 3 and 4 above assume that all the meshes’ and sub-meshes’ data is linearly laid out in memory. In practice, this is seldom the case, because programmers often just slam a new/delete in there, and be done with it.
- Bad memory access patterns for future algorithms. As an example, in order to render geometry into a shadow-map, no materials need to be bound, only the start index and number of indices are needed. But each time we access them, we pull the other data into the cache as well.
- Bad threadability. How would you perform the culling process of several sub-meshes on different CPUs? How would you perform the culling on SPUs?
So how can this situation be improved? How can we organize our data so that it offers good performance now and in the future?
The data-oriented design approach
The obvious things to get rid of are the two interfaces, and the virtual function calls associated with them. Believe me, you don’t need those interfaces.
The other thing we can change is to make sure that data that needs to be accessed together actually is allocated next to each other in memory. Furthermore, it would be nice if this was somehow done implicitly, without the programmer having to worry about memory allocations that much.
Going down the DOD route, such a design could look like this:
struct SubMesh
{
unsigned int startVertex;
unsigned int numIndices;
};
class Frustum
{
public:
void Cull(const float* in_aabbs, unsigned bool* out_visible, unsigned int numAABBs)
{
// for each AABB in in_aabbs:
// - determine visibility
// - store visibility in out_visible
}
};
class Mesh
{
public:
void Cull(Frustum* frustum)
{
frustum->Cull(&m_boundingBoxes[0], &m_visibility[0], m_boundingBoxes.size());
}
void Render(void)
{
context->Bind(m_vertexBuffer);
context->Bind(m_indexBuffer);
for (size_t i=0; i<m_visibleSubMeshes.size(); ++i)
{
if (m_visibility[i])
{
m_materials[i]->Bind();
const SubMesh& sm = m_subMeshes[i];
context->Draw(sm.startIndex, sm.numIndices);
}
}
}
private:
VertexBuffer* m_vertexBuffer;
IndexBuffer* m_indexBuffer;
std::vector<float> m_boundingBoxes;
std::vector<Material*> m_materials;
std::vector<SubMesh> m_subMeshes;
std::vector<bool> m_visibility;
};
Let’s examine this approach’s characteristics:
- Good memory access patterns when culling sub-meshes. In order to cull N elements, exactly N*6 floats are accessed in memory. Visibility is written to a seperate destination in memory, so no overhead there either.
- Good memory access patterns when rendering sub-meshes. All the data needed is stored contiguously in memory, and we only load data into the cache which is actually needed.
- Good memory access patterns for future use. If we want to render sub-meshes into a shadow-map, we only need to access m_subMeshes, nothing more.
- Good memory access patterns in general. Memory need not be allocated manually in order for the elements to be contiguous. A std::vector always is, and so is a simple array. You can go further and put all your Mesh instances into a seperate container as well, if you want.
- Easier threadability. If culling (or any other operation, for that matter) needs to be split across multiple threads (or SPUs), we can divide the data into chunks, and feed each chunk of data to different threads. No synchronization, mutexes, or similar needed in those cases, and the solution scales almost perfectly, making it easy to take advantage of 4-core, 8-core or N-core processors, which will be the future of PCs and consoles.
I don’t claim that the presented solution is totally perfect – it certainly isn’t, it just served as an example. Points of improvement from the top of my head:
- Instead of storing bools into the out_visible array, we could directly store the sub-meshes’ data, saving us both the branch and further indirections when rendering the visible meshes in Mesh::Render().
- Taking it even further, we can directly store rendering commands into the GPU buffer if we want, e.g. by writing into a command-buffer on the PS3.
- If your content pipeline supports it, bake all the static data into one single, big Mesh. All the static mesh data for a whole level is then guaranteed to be contiguous in memory, so you can reap the benefits of the above solution automatically.
What performance improvement can we expect to see from such a design change? In the experiments I conducted (500 meshes, 3 sub-meshes each, not counting Direct3D11 API call overhead), the difference between the two presented solutions was about 1.5ms.
If that doesn’t sound like much, consider that 1.5ms actually make up almost 10% of a 60Hz frame. Furthermore, we’re talking about 500 meshes here, and have only touched the design of static mesh data. If data-oriented design is applied throughout the whole engine, the performance benefits can be enormous.
And in case you wondered, of course the data-oriented design won. Let’s end this post by saying that you need to start caring about your data access patterns now, I mean it. So does Scott Meyers.
This sounds all really nice in theory, but for me there is one essential question unanswered: How to access these meshes from outside/user code and support also adding and removing of new mesh instances? Also if you want to support some kind of streaming you need this entity management. One may solve this problem by implementing some kind of “smart-handles” which store an index to address these “property-rows”.
In this implementation I also see a little bit of memory wastage, because you store four std:vectors for your properties. Here you can implement some kind of “property vector” template which stores arrays of properties like your bounding box and so on, combined with the basic std::vector functionality. So you can generalize the SoA principal.
Cheers,
Michael
Let me try to answer your questions one at a time.
1) The Mesh class shown is for static meshes, which are all compiled/baked in the content pipeline. Nothing dynamic there, the amount of memory needed is fixed, allocated once, and all the data is directly loaded into memory. So there isn’t any functionality for adding/removing meshes in this class.
For dynamically created meshes (spawning entities in scripts, for example), I use an entirely different data structure, because you need to ensure that memory doesn’t get fragmented during game-play. You can either use a pool allocator for that (using implicit free-lists and other tricks), or just use an array, and swap the last element with the one to be removed, so everything stays contiguous in memory. That”s what I use, because the copy is done only once, and copying a few bytes doesn’t hurt much, but has the nice property of keeping everything next to another in memory. Of course you need to call destructors manually, and make use of placement new, but that can be handled easily.
2) Streaming is handled in an entirely different way. Streaming tends to fragment memory, hence its best to set aside some amount of memory which is used for streamed resources, and defragment that memory on-the-fly. That can be done by either implementing a smart defragmentation scheme, or by cuting up all streaming resources into pieces of 8/16/32 KB. The latter makes defragmentation a no-brainer (all that needs to be done is fill holes with a simple memcpy), and is what Naughty Dog uses for streaming on the PS3, from what I know.
3) You’re absolutely right about using some kind of handle or smart-pointer. In Molecule, I don’t return raw-pointers to resources, but use another level of indirection instead, which makes it possible to track dangling pointers and other problems I’ve encountered in the past. Especially dangling pointers led to a lot of crashes in completely unrelated code, because memory was overwritten someplace else.
Furthermore, Molecule supports hot-reloading for every kind of resources in the running game, and handles make it far easier to support something like that.
I didn’t want to clutter the example with details like handles, and so used raw-pointers instead.
4) I’ve used std::vectors in the example only, again to make the code clearer for most of the readers, I hope. You’re right that this wastes memory, but that is something I wouldn’t really be concerned about much (it’s a few bytes, but it can sum up). What for me is even more important is that a std::vector can grow, which is something I absolutely don’t want to happen anywhere in my run-time code.
I don’t use any STL container in the run-time code – no std::string, no std::vector, no std::map. One of the reasons is given above, but the other important thing people tend to forget about is that you cannot move them in memory, and you can’t directly memcpy into them, eventhough it might work on a particular STL implementation, but those are different across platforms. So a blob or other data structure is the way to go. Using plain old arrays in the example above would be a perfectly valid approach.
I hope that clears things up, please let me know if there’s further questions!
Wow thanks for clarifying these points so in detail!
1-2) So you are separating the management of static, dynamic and streamed meshes resources? The entirely static mesh management you only need to support a other kind of game genre aside of the big ones which needs streaming or I’m missing smth?
In our engine we combine both approaches (static world and streaming) by using paging. So if you don’t need a huge world you only create one “world page” and thats it. To avoid the memory fragmentation we also use free-lists and pools in each subsystem which hold the actual components.
Could you please explain the idea of splitting up resources into chunks a bit more?
4) I think it really depends on the situation. I also strictly banned most of the STL containers like lists and so on, but I think there are situations where you won’t be able to pre-estimate the size of e.g. your mesh pools. Again there comes the example of “streaming of world pages” into my mind.
Certainly one could only work with fixed size arrays driven by a config file and change this from game to game, but this isn’t very comfortable for me.
I also recently switched to the EASTL which have some nice hiperf containers on board (like the fixed containers).
1-2) Yes, they are separate. But they still can live in memory next to each other, the only difference is that their allocation scheme differs. Molecule’s memory system (explained in other posts) makes it really easy to e.g. say:
HeapArea meshHeap(64*1024*1024);
LinearArena staticMeshArena(meshHeap, whatEverSizeYouWant);
PoolArena dynamicMeshArena(meshHeap.GetMemory() + whatEverSizeYouWant);
And then you just allocate from whichever arena you want. Which allocators/arenas are used is completely up to the user, you just hand those to the engine classes. So you could use pools for both if you want, or allocate in an entirely different way if you need.
About streaming, IMO only the *really* big games (GTA, RDR, Uncharted, and the likes) need a purely streamed world. And there’s many of reasons to still have static meshes in the engine, like e.g. things which are turned on/off during game-play, which could live in memory, be loaded once, and rendered at a later point in time. Furthermore, the decision of either using a purely streamed world, or throwing in a loading-screen in-between levels is up to the user (engine clients).
4) You’re right, there are situations where you cannot pre-estimate the size of some allocation. Yet many times, you can work around that by using a slightly different allocator. Let me try to explain this with a simple example:
Assume our game needs to load some assets which it needs all the time (menu screens, GUI textures, whatever). These have application lifetime. Furthermore, the game is strictly separated into levels, so each level loads its resources, and frees them whenever the player leaves the level. These resources have level-lifetime.
One possible allocation strategy would be to have two linear allocators, each of them using a fixed size, e.g. having two LinearArena allocators. The problem with that approach is that as soon as you need more memory in either of the two, they need to be adjusted.
However, if you were to use a stack-based allocator instead, you could define it’s size once, and both application and level-lifetime allocations use that allocator. Then it doesn’t matter so much which one of the two allocations need more memory, as long as they fit into the fixed budget – and especially in console games, you *have* to deal with fixed budgets, otherwise you’ll hunt down fragmentation and other kinds of memory problems at the end of the project. Been there, done that. Never gonna do it again.
Even if you cannot estimate a fixed size for some allocations, you need to define an upper bound, e.g. you cannot just say “stream everything” without having defined how much memory there is to spare for the whole streaming system. So define an upper bound first, and carefully think about the allocation scheme second. That has worked really well for me, but your mileage may vary.
I cannot state that enough, but if you *really* think about memory allocations carefully, you very seldomly need a completely dynamic and general-purpose memory allocator. IIRC, all Insomniac Games have clearly defined memory budgets in which certain allocations must fit – and these games are by no means small games.
In addition, keep in mind that every console has extra memory for development which you can always use as a fallback for failed memory allocations (in conjunction with a big, fat error message), which makes it more practical to keep allocations fixed-size, and adjust their size every now and then in the config-file. I would suggest using a similar scheme on PC as well (I do).
About splitting up resources into chunks: The idea is to split *all* resources into equally-sized chunks, e.g. 512kB. This means that you can use a simple pool-allocator, and never worry about memory fragmentation when loading/unloading resources in the game.
However, this means that all resource data needs to be laid out in a manner that permits division into such equally-sized chunks. This in turn means that you cannot have any resource which spans more than 512kB of contiguous space in memory, which could be hard to achieve if you’re not careful.
The benefit of such a scheme is a) not having to worry about fragmentation and b) exceptionally fast streaming if the chunk-size is a multiple of the OS’s I/O size/media sector size.
Personally, having to divide everything into 512kB chunks sounds like a risky drawback to me, but Naughty Dog did Uncharted that way, so…
First thank you for the in-deep explanation of this heap management strategy and I think I’ll definitively improve ours in this way. I’ve read a similar strategy in the book “Game Engine Architecture” from Jason Gregory.
The benefit of using a config file to specify memory budgets explicitly is to also support consoles and IMHO the PC would also be more happy with such a memory layout.
Using resource chunks to optimally use these allocators is a great idea, but I’m also a bit scary with this approach. But let’s try it out!
You’re welcome!
Neither the linear, stack nor double-ended stack allocators are something entirely new, games already used these allocation strategies back in the olden days of PS1/PS2/Xbox/etc. But it’s nice that Jason Gregory explains them in his book – my first contact with those allocators was when I started working on my first console title. Coming from the PC, I’ve never heard about those allocators before
. And yes, the PC *is* more happy with such a memory layout.
You can find more details about the resource chunks scheme in Jason’s book, by the way.
While I’m certainly not disputing the points here – doing things in groups is better than not doing them in groups, virtual function calls can hurt your performance, etc. – let me get this straight.
You got a 1.5 ms boost for 500 meshes of 3 submeshes each, on PC. Assuming a 2 GHz core, it’s 3M cycles for 1500 submeshes, or 2K cycles of overhead for each submesh.
The actual overhead that I can see from the provided code is:
- virtual function call overhead (indirect call, basically – cache misses in this case disappear after the first call, and even branch prediction may work)
- additional cache misses – i.e. for ‘nothing is visible’ case we have ~1 cache miss for each submesh in the original case (64b cache line) and ~1/2 cache misses for each submesh in the optimized case
- some function call overhead, maybe some setup overhead in frustum culling function (this is unlikely though)
I don’t see how these combined would account for 2K cycles of overhead for each submesh. Even on a console this translates to 3K cycles of overhead per submesh which is something like 5 cache misses – I can’t see these either; in fact, for 3 submeshes per mesh the data cache miss count OTOH does not seem too different over 500 meshes in these two cases.
Is the actual code very different from the examples here, or am I missing something?
A very good and interesting question!
One of the differences between the two given examples is that the first one stores pointers to sub-meshes, while the later stores them contiguously (a std::vector of SubMesh* in the first case, and a std::vector of SubMesh in the second case).
For the experiments I did, each SubMesh* was allocated independently, not necessarily living next to each other in memory. This causes a lot more cache misses, and I admit that it is not completely clear from my post.
Furthermore, the virtual function call overhead only disappears if we only ever deal with Mesh*, but in a design like the given one, most of the time you will hold an array to a certain number of IRenderables, and render them in order, causing a lot more cache misses in turn. This could be somewhat alleviated by sorting the array by type, but I’ve never seen anybody do this when dealing with base class pointers only. To clear that up, in the profiling I did, I’ve actually used more classes than just Mesh and SubMesh, which caused a lot more virtual function overhead – maybe that wasn’t stated clearly enough.
Thanks for asking, I’m sorry I didn’t explain those two things more thoroughly in the post.
Question, why are you using 4 separate vectors to store the bouncing boxes, materials, sub-meshes and visibility flags for the Mesh, would you not have cache-misses, since the data is dynamically stored and not stored linearly, or am I mistaken? i.e. Instead of:
std::vector m_boundingBoxes;
std::vector m_materials;
std::vector m_subMeshes;
std::vector m_visibility;
Wouldn’t this be better?
struct ImplMeshData
{
float* boundingBoxes; // vector?
bool isVisible;
Material* material;
SubMesh subMesh;
// padding, if required
};
std::vector m_data;
Having them in separate vectors (or preferably a contiguous block of memory) only pulls in the data you actually need, so you have less cache misses than with the usual AoS (array-of-structures) approach.
If you only need the bounding boxes for culling, and store all the meshes’ data in a vector like in your example, you’re fetching all the other data (isVisible, material, subMesh) from memory, even though you don’t require it. Remember that a CPU always reads in cache-lines, so you pull in lots of unused data.
On the other hand, if you store them separately, you only pull in the data you need for your specific use case. Culling would only touch the bounding boxes and the visibility flags, nothing more. Rendering would only touch materials and submeshes. That’s the difference between a AoS and SoA (structure-of-arrays) approach.
Note that in practice I would use arrays and allocate the memory for all of them contiguously, but I didn’t want to clutter the post with such details, but keep concentrating on the main points.
Could you explain a little more about how your data oriented Mesh class might interoperate with a custom allocator?
for example, your Mesh class has a number of std::vector’s. In production code what kind of allocator would they use.
If they used a linear allocator, wouldn’t that eat up a lot of memory the moment the vector needed to grow? Or do you use staticly sized arrays in real production code?
do you use one (linear) allocator for all your static meshes per level or do you use different allocators for the different data in a mesh?
In production code I don’t use vectors, but my own container class(es). In this case a resizable (but not automatically growing) array. The static meshes are all allocated using the same allocator, assuming that those meshes stay during the whole level. If so, then they use the level allocator which is either a linear or stack-like allocator.
Generally speaking, I tend to group allocations based on their lifetime and frequency. For statically sized level allocations I would use the same allocator for all things that need to be in the level at all times. Dynamically allocated assets (spawning units, dynamic meshes, particles, etc.) would use a different allocator.
More general allocator strategies (when to use which allocator) is discussed in more detail in my slides from the Game Connection Paris master class, which are also available on this blog if you’re interested.
So you would not use a vector like array for dynamic (very much growing) arrays?
Instead use a list of pointers to objects instead and allocate those in a pool allocator for example?
in your example ‘std::vector m_boundingBoxes’ doesn’t really grow at all, it’s loaded once and never grown afterwards, right?
> So you would not use a vector like array for dynamic (very much growing) arrays?
> Instead use a list of pointers to objects instead and allocate those in a pool allocator for example?
vectors mostly have the disadvantage that you temporarily need more memory just for growing. E.g. if you have N elements and grow the vector, you temporarily need space for 3N (N+2N) elements, eventhough you only use 2N (assuming a vector grows by doubling its size). There’s other alternatives which could be better suited, e.g. making use of the virtual memory system which means you only ever need 2N but never 3N memory.
If you don’t need the elements to be contiguous from start to finish, a growing pool allocator is also a good alternative. It really depends on the use case: frequency of insertion, deletion, and type of memory access. If you want O(1) insertion, deletion, and linear, contiguous memory access, deleting elements by swapping them with the last element of the array is a useful trick. This works like a charm if you handle those arrays internally and never return references to the outside world. If you want to do that, you could always use an ID or handle-based system on top for external references.
But I digress… generally speaking there’s lots of alternatives.
> in your example ‘std::vector m_boundingBoxes’ doesn’t really grow at all, it’s loaded once and never grown afterwards, right?
Exactly. Don’t use a std::vector if there’s no need to grow. I’ve seen far too many cases where people use a std::vector, don’t call resize() or reserve(), and just use push_back() eventhough they know how many elements the vector will contain. Please don’t do that
.