Memory management or less often shoot yourself in the foot

Hello, Habr! In this article I will try to tell what memory management in programs / applications is from the point of view of an application programmer. This is not an exhaustive guide or manual, but simply an overview of existing problems and some approaches to solving them.







Why is this necessary? A program is a sequence of data processing instructions (in the most general case). This data needs to be stored , loaded , transferred , etc. in some way. All these operations do not occur instantly, therefore, they directly affect the speed of your final application. The ability to optimally manage data in the process of work will allow you to create very non-trivial and very resource-demanding programs.







Note: the bulk of the material is presented with examples from games / game engines (since this topic is more interesting for me personally), but most of the material can be applied to writing servers, user applications, graphics packages, etc.













It is impossible to keep everything in mind. But if you didn’t manage to load it, you’ll get soap







Right off the bat



It so happened in the industry that large AAA game projects are developed primarily on engines written using C ++. One of the features of this language is the need for manual memory management. Java / C # etc. They can boast of garbage collection (GarbageCollection / GC) - the ability to create objects and still not free up used memory by hand. This process simplifies and speeds up development, but it can also cause some problems: a periodically triggered garbage collector can kill all soft-real time and add unpleasant freezes to the game.







Yes, in projects like "Minecraft" the GC may not be noticeable, as they are generally not demanding on the resources of the computer, however, games such as "Red Dead Redemption 2", "God of War", "Last of Us" work "almost" at the peak of system performance and therefore need not only large the amount of resources, but also in their competent distribution.







In addition, when working in an environment with automatic memory allocation and garbage collection, you may encounter a lack of flexibility in managing resources. It is no secret that Java hides all the implementation details and aspects of its work under the hood, so at the output you have only the installed interface for interacting with system resources, but it may not be enough for solving some problems. For example, starting an algorithm with a non-constant number of memory allocations in each frame (this can be a search for paths for AI, checking visibility, animation, etc.) inevitably leads to a catastrophic drop in performance.







How allocations in the code look



Before continuing the discussion, I would like to show how memory operations in C / C ++ work directly with a couple of examples. In general, the standard and simplest interface for allocating process memory is represented by the following operations:







//        size  void* malloc(size_t size); //      p void free(void* p);
      
      





Here you can add about additional functions that allow you to allocate a aligned piece of memory:







 // C11  -     , * alignment void* aligned_alloc(size_t size, size_t alignment); // Posix  -       //        address (*address = allocated_mem_p) int posix_memalign(void** address, size_t alignment, size_t size);
      
      





Please note that on different platforms different function standards can be supported, available, for example on macOS, and not available on win.







Looking ahead, specially aligned memory areas may be needed both for getting into the processor cache lines and for calculations using an extended set of registers ( SSE , MMX , AVX , etc.).







An example of a “toy” program that allocates memory and prints buffer values, interpreting them as signed integers:







 /* main.cpp */ #include <cstdio> #include <cstdlib> int main(int argc, char** argv) { const int N = 10; int* buffer = (int*) malloc(sizeof(int) * N); for(int i = 0; i < N; i++) { printf("%i ", buffer[i]); } free(buffer); return 0; }
      
      





On macOS 10.14, this program can be built and run with the following set of commands:







 $ clang++ main.cpp -o main $ ./main
      
      





Note: hereafter, I don’t really want to cover C ++ operations such as new / delete, since they are more likely to be used to construct / destroy objects directly, but they use the usual operations with memory like malloc / free.







Memory problems



There are several problems that arise when working with the computer's RAM. All of them, one way or another, are caused not only by the features of the OS and software, but also by the architecture of the iron on which all this stuff works.







1. Amount of memory



Unfortunately, memory is physically limited. On the PlayStation 4, this is 8 GiB GDDR5, 3.5 GiB of which the operating system reserves for its needs . Virtual memory and page swapping will not help much, since swapping pages to disk is a very slow operation (within a fixed N frames per second, if we talk about games).







It is also worth noting the limited " budget " - some artificial limitation on the amount of memory used, created in order to run the application on several platforms. If you are creating a game for a mobile platform and want to support not just one, but a whole line of devices, you will have to limit your appetites for the sake of providing a wider market. This can be achieved both by simply limiting the consumption of RAM, and by the ability to configure this restriction depending on the gadget on which the game actually starts.







2. Fragmentation



An unpleasant effect that appears during the process of multiple allocations of pieces of memory of various sizes. As a result, you get an address space fragmented into many separate parts. It will not work to combine these parts into single blocks of a larger size, since part of the memory is occupied, and we cannot freely move it.









Fragmentation by the example of sequential allocations and releases of memory blocks







As a result: we may have enough free memory quantitatively, but not qualitatively. And the next request, say, “allocate space for the audio track”, the allocator will not be able to satisfy, because there is simply no single piece of memory of this size.







3. CPU cache





Computer memory hierarchy







The cache of a modern processor is an intermediate link that connects the main memory (RAM) and the processor registers directly. It so happened that read / write access to memory is a very slow operation (if we talk about the number of CPU clock cycles required to execute). Therefore, there is some cache hierarchy (L1, L2, L3, etc.), which allows, as it were, "according to some prediction" to load data from RAM, or slowly push it into a slower memory.







Placing objects of the same type in a row in memory allows you to "significantly" speed up the process of processing them (if processing occurs sequentially), since in this case it is easier to predict what data will be needed next. And by "significant" is meant performance gains at times. This was repeatedly mentioned by the developers of the Unity engine in their reports at the GDC .







4. Multi-threading



Providing secure access to shared memory in a multi-threaded environment is one of the main problems that you will have to solve when creating your own game engine / game / any other application that uses multiple threads to achieve greater performance. Modern computers are arranged in a very non-trivial way. We have both a complex cache structure and several calculator cores. All this, if used improperly, can lead to situations when the shared data of your process will be damaged as a result of several threads (if they simultaneously try to work with this data without access control). In the simplest case, it will look like this:



I do not want to delve into the topic of multithreaded programming, since many of its aspects go very far beyond the scope of the article or even the whole book.







5. Malloc / free



Allocation / release operations do not occur instantly. On modern operating systems, if we are talking about Windows / Linux / MacOS, they are implemented well and work quickly in most situations . But potentially this is a very time-consuming operation. Not only is this a system call, but depending on the implementation, it may take a while to find a suitable piece of memory (First Fit, Best fit, etc.) or to find a place to insert and / or merge the freed area.







In addition, freshly allocated memory may not actually be mapped onto real physical pages, which may also take some time on first access.







These are implementation details, but what about applicability? Malloc / new have no idea where you are, how and why you called them. They allocate memory (in the worst case) of 1 KiB and 100 MiB equally ... equally bad. Directly, the use strategy is left to either the programmer or the one who implemented the runtime of your program.







6. Memory corruption



As the wiki says , this is one of the most unpredictable errors, manifests itself only during the work of the program, and is most often caused directly by errors in the writing of this program. But what is this problem? Fortunately (or unfortunately), it is not related to the corruption of your computer. Rather, it displays a situation where you are trying to work with memory that does not belong to you . I will explain now:







  1. This may be an attempt to read / write to a portion of unallocated memory.
  2. Going beyond the boundaries of the memory block provided to you. This problem is a kind of special case of problem (1), but it is worse because the system will tell you that you went beyond the boundaries only when you leave the page displayed for you. That is, potentially, this problem is very difficult to catch, because the OS is able to respond only if you leave the limits of the virtual pages displayed to you. You can spoil the process memory and get a very strange error from the place from which it was not expected at all.
  3. Releasing an already freed (sounds strange) or not yet allocated memory
  4. etc.


In C / C ++, where there is pointer arithmetic, you will come across this one or two times. However, Java Runtime will have to sweat pretty hard to get this kind of error (I haven’t tried it myself, but I think that this is possible, otherwise life would be too simple).







7. Memory leaks



It is a special case of a more general problem that occurs in many programming languages. The standard C / C ++ library provides access to OS resources. It can be files, sockets, memory, etc. After use, the resource must be correctly closed and

the memory occupied by him should be freed. And speaking specifically about the release of memory - accumulated leaks as a result of the program can lead to an "out of memory" error when the OS will not be able to satisfy the next request for allocation. Often, the developer simply forgets to free the used memory for one reason or another.







Here it’s worth adding about the correct closing and release of resources on the GPU, because the early drivers did not allow to resume working with the video card if the previous session was not completed correctly. Only rebooting the system could solve this problem, which is very doubtful - forcing the user to reboot the system after your application is running.







8. Dangling pointer



A dangling pointer is some kind of jargon that describes a situation where a pointer refers to an invalid value. A similar situation can easily arise when using classic C-style pointers in a C / C ++ program. Suppose you allocated memory, saved the address to it in the p pointer, and then freed the memory (see the code example):







 //   void* p = malloc(size); // ...  -    //   free(p); //    p? // *p == ?
      
      





The pointer stores some value, which we can interpret as the address of the memory block. It so happened that we cannot say whether this memory block is valid or not. Only a programmer, based on certain conventions, can operate on a pointer. Starting with C ++ 11, a number of additional “smart pointers” pointers were introduced into the standard library, which allow, in some respects, weakening control of resources by the programmer by using additional meta-information inside themselves (more on this later).







As a partial solution, you can use the special value of the pointer, which will signal us that there is nothing at this address. In C, the NULL macro is used as the value of this value, and in C ++, the nullptr language keyword is used. The solution is partial, because:







  1. The pointer value must be set manually, so the programmer may simply forget to do it.
  2. nullptr or just 0x0 is included in the set of values ​​accepted by the pointer, which is not good when the special state of an object is expressed through its usual state. This is some kind of legacy, and by agreement, the OS will not allocate to you a piece of memory whose address starts with 0x0.


Sample code with null:







 //  -  p free(p); p = nullptr; //   p == nullptr   ,       
      
      





You can automate this process to some extent:







 void _free(void* &p) { free(p); p = nullptr; } //  -  p _free(p); //   p == nullptr,     //   
      
      





9. Type of memory



RAM is an ordinary general-purpose random access memory, access to which through the central bus have all the cores of your processor and peripheral devices. Its volume varies, but most often we are talking about N gigabytes, where N is 1,2,4,8,16 and so on. Calls malloc / free seek to place the block of memory you want right in the computer's RAM.







VRAM (video memory) - video memory, supplied with the video card / video accelerator of your PC. It, as a rule, is smaller than RAM (about 1.2.4 GiB), but it has a high speed. The distribution of this type of memory is handled by the video card driver, and most often you do not have direct access to it.







There is no such separation on the PlayStation 4, and all RAM is represented by a single 8 gigabytes on GDDR5. Therefore, all the data for both the processor and the video accelerator are nearby.







Good resource management in the game engine includes proper memory allocation both in Main RAM and on the VRAM side. Here you may encounter duplication , when the same data will be there and there, or with excessive transfer of data from RAM to VRAM and vice versa.







As an illustration to all the problems voiced : you can look at aspects of the device computers on the example of the PlayStation 4 architecture (Fig.). Here is the central processor, 8 cores, L1 and L2 level caches, data buses, RAM, graphics accelerator, etc. For a complete and detailed description, see Jason Gregory's "Game Engine Architecture" .









PlayStation 4 Architecture







General Approaches



There is no universal solution. But there is a set of some points on which you should focus if you are going to implement manually allocation and memory management in your application. This includes containers and specialized allocators, memory allocation strategies, system / game design, resource managers, and more.







Types of allocators



The use of special memory allocators is based on the following idea: you know what size, at what moments of work and in what place you will need pieces of memory. Therefore, you can allocate the necessary memory, somehow structure it and use / reuse it. This is the general idea / concept of using special allocators. What they are (of course, not all) can be seen further:







  1. Linear allocator

    Represents a contiguous address space buffer. In the course of work, it allows you to allocate sections of memory of arbitrary size (such that they fit in a buffer). But you can free all the allocated memory only 1 time. That is, an arbitrary piece of memory cannot be freed - it will remain as if occupied until the entire buffer is marked as clean. This design provides the allocation and release of O (1), which gives a guarantee of speed under any conditions.



    Typical use-case: in the process of updating the state of the process (each frame in the game) you can use LinearAllocator to allocate tmp buffers for any technical needs: input processing, working with strings, parsing ConsoleManager commands in debug mode, etc.







  2. Stack allocator

    Modification of a linear allocator. Allows you to free memory in the reverse order of allocation, in other words, behaves like a regular stack according to the LIFO principle. It can be very useful for carrying out loaded mathematical calculations (hierarchy of transformations), for implementing the work of the scripting subsystem, for any calculations where the indicated procedure for freeing memory is known in advance.



    The simplicity of the design provides O (1) memory allocation and freeing speed.







  3. Pool allocator

    Allows you to allocate memory blocks of the same size. It can be implemented as a buffer of continuous address space, divided into blocks of a predetermined size. These blocks can form a linked list. And we always know which block to give in the next allocation. This meta information can be stored in the blocks themselves, which imposes a restriction on the minimum block size (sizeof (void *)). In reality, this is not critical.



    Since all blocks are of the same size, it doesn’t matter to us which block to return, and therefore all the allocation / deallocation operations can be performed in O (1).







  4. Frame allocator

    Linear allocator but only with reference to the current frame - allows you to do tmp memory allocation and then automatically release everything when changing the frame. It should be singled out separately, since this is some global and unique entity within the framework of the runtime game, and therefore it can be made of a very impressive size, say a couple of dozen MiB, which will be very useful when loading resources and processing them.







  5. Double frame allocator

    It is a double frame allocator, but with some features. It allows you to allocate memory in the current frame, and use it in both the current and the next frame. That is, the memory that you allocated in frame N will be freed only after N + 1 frame. This is accomplished by switching the active frame to highlight at the end of each frame.



    But this type of allocator, like the previous one, imposes a number of restrictions on the lifetime of objects created in the memory allocated to it. Therefore, you should be aware that at the end of the frame, the data simply becomes invalid, and repeated access to them can cause serious problems.







  6. Static allocator

    This type of allocator allocates memory from a buffer, obtained, for example, at the program launch stage, or captured on the stack in a function frame. By type, it can be absolutely any allocator: linear, pool, stack. Why is it called static ? The size of the captured memory buffer should be known at the stage of compilation of the program. This imposes a significant limitation: the amount of memory available to this allocator cannot be changed during operation. But what are the benefits? The buffer used will be automatically captured and then freed (either upon completion of work or upon exit from the function). This does not load the heap, saves you from fragmentation, allows you to quickly allocate memory in place.

    You can look at the example of code using this allocator, if you need to break the string into substrings and do something with them:



    It can also be noted that the use of memory from the stack in theory is much more efficient, because stack the frame of the current function with a high probability will already be in the processor cache.









All these allocators somehow solve the problems with fragmentation, with a lack of memory, with the speed of receiving and releasing blocks of the required size, with the lifetime of objects and the memory they occupy.







It should also be noted that the right approach to interface design will allow you to create a kind of hierarchy of allocators when, for example: pool allocates memory from frame alloc, and frame alloc in turn allocates memory from linear alloc. A similar structure can be continued further, adapting to your tasks and needs.













I see a similar interface for creating hierarchies as follows:







 class IAllocator { public: virtual void* alloc(size_t size) = 0; virtual void* alloc(size_t size, size_t alignment) = 0; virtual void free (void* &p) = 0; }
      
      





malloc/free , . , , . / , .









Smart pointer — C++ ++11 ( boost, ). -, , - , . .







? :







  1. (/)


:







  1. Unique pointer

    1 ( ).

    unique pointer , . , .. 1 / .

    uniquePtr1 uniquePtr2, uniquePtr1 , . 1 .







  2. Shared pointer

    (reference counting). , , . , , , .



    . -, , . . -, - .







  3. Weak pointer

    . , . What does it mean? shared pointer. , shared pointer , . , shared pointer weak pointer. , (shared) , weak pointer shared pointer. — weak pointer , , , .



    shared, weak pointer meta-data . - , .. , O(N) overhead , N — - . , . , . .









: . , shared pointer, , ( ) - - - . . meta-info , , . Example:







 /*     */ /*   ,  shared pointer */ Array<TSharedPtr<Object>> objects; objects.add(newShared<Object>(...)); ... objects.add(newShared<Object>(...));
      
      





 /*      (   meta-info    ) */ Array<Object> objects; objects.emplace(...); ... objects.emplace(...);
      
      





. . About it further.







Unique id



, . (id/identificator), , , -. :









  1. , id. , , , id.


  2. , ( , )


  3. id , , id.


  4. . , id, .


: id, , id, .







id , (Vulkan, OpenGL), (Godot, CryEngine). EntityID CryEngine .







, id : . , ( ), , .







 /*    */ class ID { uint32 index; uint32 generation; }
      
      





 /*  - /  */ class ObjectManager { public: ID create(...); void destroy(ID); void update(ID id, ...); private: Array<uint32> generations; Array<Objects> objects; }
      
      





ID , ID . :







 generation = generations[id.index]; if (generation == id.generation) then /*    */ else /*  ,     */
      
      





id generation 1 id ids.







Containers



C++ , . std, , . :









? memory corruption. / , , , , .









, , . , , / .









, , . , ( ) . , malloc/free , , .







? , (/ ), , , . , , , .









ryEngine Sandbox:







, Unreal, Unity, CryEngine ., , . , , , — , .







Pre-allocating



, / .







: malloc/free . , "run out of memory", . . , (, , .).







. . , - . , malloc/free, : , , .









. : , , , .. .







: , , , . open-source , , . , , — malloc/free.









GDC CD Project Red , , "The Witcher: Blood and Wine" () . , , , , .







Naughty Dog , "Uncharted 4: A Thief's End" , (, ) .







Conclusion



, , , . , . / , , - .. , (, ).












All Articles