Friday, March 20, 2015

Building your Core library (Part 4) - Containers

Just about any software you write needs some way to store and parse data, game engines being no exception. The most common (and easiest) use of containers comes from the C++ Standard Template Library (STL) which provides easy access to containers such as vectors, lists, queues, maps, etc. However, game engines rarely rely on the containers in this library for multiple reasons:

  • You lose direct control over the container's performance
    • Game engines are performance-critical, and any time you lose the ability to optimize something, it can impact your framerate. Oftentimes the containers in STL are fairly generalized, whereas the equivalent container in a game engine can be written for a more specific use case and tuned for better performance.
  • How memory is allocated
    • By default, STL containers will use the operating system's memory allocator to allocate/deallocate memory. While you can supply custom allocators, they're a hassle to work with, are unable to carry state information, and force your code to have template creep. Game engines often need to have very specific control over how memory is used, sometimes even forbidding deallocations by game code altogether. This need for control usually rules out any use of the STL memory allocators. 
  • Underlying container algorithms
    • From an implementation viewpoint, STL is just an API. How it's implemented underneath is totally up to the specific toolchain you're using (GCC, Clang, Visual Studio, etc.). Therefore, on one platform an std::map may be a red-black tree and on another it may not. Even if it is a red-black tree, that doesn't necessarily mean it's a good idea for your use case. Maybe your data is fairly small (< 100 elements). In that case, a red-black tree can be pretty overkill for storing your data, as the overhead of processing it can be more expensive than just searching some linear list. There are of course other searchable STL containers (such as std::set) but the point is, you can't control the underlying implementation algorithm to be specific for your needs.
  • Cache locality
    • Properly exploiting cache locality can be a huge win for a game engine's performance. Certain algorithmic complexity requirements of STL operations may force the underlying implementations to store data in a way that can ruin cache locality (for example, constant-time deletion in an std::map). This kind of storage can be a nice win for lots of insertions/deletions into a container but oftentimes in a game there's more querying of the containers than there is container modifications.
  • Serialization
    • In a future post we'll get into the intricacies of serializing game objects. The thing to keep in mind here is that any data stored in a game engine's containers can get automatically serialized for the user without them having to do anything. This can be much simpler than the user having to manually serialize data located in an STL container.
  • Container naming
    • This point may seem a little contrived, but oftentimes things like vector or map in a game engine refer to a mathematical vector or a map designating some region of a game level (or a texture map). This kind of naming can be confusing for any programmers not familiar with STL. Giving containers obvious names (such as array instead of vector) can go a long way toward removing that confusion.
These are just a few points to use in arguing against the use of STL in any performance-critical code. I'm not saying that STL isn't great or that you shouldn't use it, just that these kinds of things should be considered before using it all over a game engine. Personally I'm a big fan of how easy STL is to use, especially because many programmers understand it and can fairly easily pick up code where it's in use.

Now that we've gotten through all of the reasons Qi doesn't use STL, let's get into the custom containers currently in the engine.


Array


Analogous to STL's vector class is the Qi Array. This container is designed for fast iteration by exploiting cache locality, always storing all of the elements in a contiguous region of memory. There is no upper bound to the memory usage of Array as there would be with a static C++ array. To achieve this, whenever a new element is pushed onto the array a check is performed to see if the array is already full. If it is, a new memory allocation occurs that is double the size of the previous array. All elements currently in the array are copied to the new one and the previous allocation is deleted. Any future push operations will make use of this new memory until the array is again full.

If you know the number of elements you'll need ahead of time, Array exposes functionality to manually set the size of the internal allocation so that you can avoid this possibly expensive re-allocation and copy step. I recommend always doing this when possible in order to optimize your memory use and avoid the potential memory fragmentation that can come from repeated allocations.
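To make the doubling strategy concrete, here's a minimal standalone sketch of geometric growth for an array of ints. This is not Qi's actual code (Qi's Array is templated, uses placement new, and allocates through its own memory system); the names Reserve and PushBack here are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Minimal sketch of a dynamic array with doubling growth.
struct IntArray
{
    int      *elements = nullptr;
    uint32_t  count    = 0;
    uint32_t  capacity = 0;

    // Grow the backing allocation to at least 'newCapacity' elements.
    void Reserve(uint32_t newCapacity)
    {
        if (newCapacity <= capacity)
            return;

        int *newElements = new int[newCapacity];
        if (elements)
        {
            // Copy the existing elements and free the old allocation.
            std::memcpy(newElements, elements, count * sizeof(int));
            delete[] elements;
        }
        elements = newElements;
        capacity = newCapacity;
    }

    void PushBack(int value)
    {
        if (count == capacity)
        {
            // Array is full: double the allocation (start at 1 when empty).
            Reserve(capacity == 0 ? 1 : capacity * 2);
        }
        elements[count++] = value;
    }

    ~IntArray() { delete[] elements; }
};
```

Calling Reserve up front with the known element count, as recommended above, skips every intermediate allocate-and-copy step.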

Currently, Array supports the following functionality:
  • PushBack - Add a new element to the end of the array
  • Resize -  Allocate a specifically sized array
  • GetSize - Query the current size of the array
  • Clear - Deallocate the internal array
  • Sort - Sort the array data in either ascending or descending order
  • Indexing an element
  • Array copying and assignment
More functionality will be added as it becomes necessary (such as automatic serialization) but this class is designed to be simple and lightweight so I don't expect too much more will be added here over the course of development.

Internally, Array is a simple pointer which gets dynamically allocated/deallocated through Qi's memory system (which is still currently under development and will be written about in a future post):
template<class T>
class Array
{
    private:
        T      *m_elements;      // Internal array memory.
        uint32  m_back;          // Next location to insert in the array.
        uint32  m_count;         // Number of elements in the array.
        uint32  m_allocatedSize; // Current capacity of the allocation.
};

PushBack is the most complicated part of Array:
template<class T>
bool Array<T>::PushBack(const T &value)
{
    if (m_count >= m_allocatedSize)
    {
        Reallocate();
    }
    
    // Get the address of the next insertion location.
    void *buffer = (void *)&m_elements[m_back];

    // Copy 'value' into the memory referenced by 'buffer' by using
    // T's copy constructor.
    new (buffer) T(value);

    ++m_count;
    ++m_back;
    
    return true;
}
The interesting thing to note here is the use of placement new. This is a different use of the new keyword where no memory is actually allocated (as that's already been done) but the constructor (here a copy constructor) is run over the memory in order to properly initialize it. In this case, the specific memory being initialized is pointed to by buffer and placement new is used to initialize it to the passed-in value variable. This is necessary because the compiler will not let you write something as simple as m_elements[m_back] = value if T is a class which contains constant data members. Constant members cannot be assigned, but they can be copy-constructed. The use of placement new in this scenario gets around this issue.
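To see the const-member problem in isolation, here's a small standalone example (the Item type and EmplaceCopy helper are hypothetical, just for illustration):

```cpp
#include <cassert>
#include <new>

// A type with a const data member: it can be copy-constructed,
// but it can never be assigned to after construction.
struct Item
{
    const int id;
};

// Copy 'value' into raw, pre-allocated storage the way PushBack does.
Item *EmplaceCopy(void *buffer, const Item &value)
{
    // Item *item = (Item *)buffer;
    // *item = value;              // Would not compile: 'id' is const.
    return new (buffer) Item(value); // OK: runs the copy constructor.
}
```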

The latest source for Array is located here.


LocklessQueue


Qi will make heavy use of multithreading in order to take advantage of the multi-core processors available on current machines. One of the common ways for different threads to work together is a job system in which each thread grabs a job to work on and processes it, possibly generating new jobs. One commonly used container to store jobs in this manner is a queue. Why a queue? Generally because it works in a FIFO manner so jobs that were generated first will be processed first. Additionally a queue can also be easily changed into a priority queue if you want your job system to use job priorities within the same queue. 

One large downside to using a standard queue in this fashion, however, is thread safety. Since the job system use case has multiple threads pushing and popping jobs, you will run into lots of threading issues unless your queue is threadsafe. One easy way to achieve that is to put a mutex inside of your queue which ensures that only one thread at a time is interacting with it. While this will work, it can quickly become a performance bottleneck for your job system because every thread will be waiting on every other thread while interacting with the queue. To fix this, we can turn to the idea of a lockless queue. A good introduction to lock-free programming can be found here.
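For contrast, the mutex-based approach described above looks roughly like this (a sketch wrapping std::queue, not Qi code); every thread serializes on the one mutex, which is exactly the bottleneck a lockless queue avoids:

```cpp
#include <mutex>
#include <queue>

// A thread-safe queue built the "easy" way: guard every operation
// with a mutex. Correct, but only one thread can touch the queue
// at a time.
template<class T>
class LockedQueue
{
public:
    void Push(const T &value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(value);
    }

    // Returns false if the queue was empty.
    bool Pop(T &out)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_queue.empty())
            return false;
        out = m_queue.front();
        m_queue.pop();
        return true;
    }

private:
    std::mutex    m_mutex;
    std::queue<T> m_queue;
};
```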

Qi implements the class LocklessQueue, which operates very similarly to a standard queue, the difference being that it is threadsafe in a lock-free manner. The code was adapted from this article and extends it with a bit more functionality and features from C++11. Just like any queue, it needs to support the following operations:
  • Push - Add a new element to the end of the queue
  • Pop - Remove the element at the front of the queue
  • GetSize - Determine the size of the queue
  • Clear - Remove all elements of the queue
In order to avoid ongoing dynamic memory allocation, LocklessQueue uses a single fixed-size array internally to represent the queue. The benefits of this are a single dynamic memory allocation/deallocation, avoidance of the ABA problem, and elements that stay contiguous in memory for good cache behavior. The downsides are slightly more complicated bookkeeping for the current head and tail of the queue, as well as a hard upper bound on the queue size. During development, you'll have to play with the queue size in order to get it right for your game.
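The head/tail bookkeeping over a fixed array amounts to ring-buffer index arithmetic. A single-threaded sketch of the idea (hypothetical helper names; LocklessQueue layers atomic compare-and-swap on top of this):

```cpp
#include <cassert>
#include <cstdint>

// Ring-buffer indexing over a fixed-capacity array. The read and
// write indices grow monotonically; modulo maps them to physical
// slots. Unsigned subtraction keeps the count correct even if the
// indices eventually wrap around.
const uint32_t kCapacity = 8;

uint32_t CountOf(uint32_t writeIndex, uint32_t readIndex)
{
    return writeIndex - readIndex; // Number of queued elements.
}

uint32_t SlotOf(uint32_t index)
{
    return index % kCapacity; // Physical slot in the backing array.
}

bool IsFull(uint32_t writeIndex, uint32_t readIndex)
{
    return CountOf(writeIndex, readIndex) == kCapacity;
}
```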

The header for LocklessQueue is very simple:
#include <atomic>
template<class T>
class LocklessQueue
{
    private:
        Array<T> m_queue;
        std::atomic<uint32> m_writeIndex; // Current tail of the queue.
        std::atomic<uint32> m_readIndex;  // Current head of the queue.
    ...
};
Note the use of std::atomic. This is a special type defined in the C++11 header atomic which allows you to declare variables that are free from data races. In other words, if one thread is reading and another thread is writing to the same atomic variable, the behavior is well-defined.

On top of atomic types, std::atomic also exposes the compare_exchange_weak() and compare_exchange_strong() member functions (often called compare-and-swap, or CAS), which are at the heart of lockless programming. The basic idea is to attempt to swap a value (in this case the current read/write index of the queue). If you can't because some other thread did it before you, try again until you succeed. A compare-exchange is a single atomic operation, so you're guaranteed that concurrent calls on the same atomic variable won't have any threading issues.
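A typical CAS retry loop looks like the following standalone sketch (the ClaimNextIndex name is hypothetical; this is the core pattern behind a lock-free Push, not Qi's exact code):

```cpp
#include <atomic>
#include <cstdint>

// Claim the next write index: read the current value, try to swap
// in current + 1, and retry if another thread changed it first.
uint32_t ClaimNextIndex(std::atomic<uint32_t> &writeIndex)
{
    uint32_t current = writeIndex.load();

    // On failure, compare_exchange_weak reloads 'current' with the
    // latest value, so the loop simply retries with fresh data.
    while (!writeIndex.compare_exchange_weak(current, current + 1))
    {
    }

    return current; // The slot this thread now exclusively owns.
}
```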

The latest source for LocklessQueue is available here.

--------

That's it for the current containers in Qi. As time goes on I expect to add a few more (notably a hash table), but these two containers will form the heart of much of the data storage in the engine.

Until next time!
