Creating expressive smart pointers for remote memory in C++

Hello, Habr!



Today we are publishing a translation of an interesting study on working with memory and pointers in C++. The material is a bit academic, but it should clearly interest readers of the books by Galowitz and Williams.






In graduate school I work on building distributed data structures, so an abstraction for remote pointers is essential for keeping my code clean and tidy. In this article I will explain why smart pointers are needed, describe how I wrote remote pointer objects for my C++ library, and show how I made them behave exactly like regular C++ pointers with the help of remote reference objects. I will then explain where this abstraction breaks down, simply because my pointer (so far) cannot do everything that ordinary pointers can. I hope the article will interest readers who develop high-level abstractions.



Low Level APIs



When working with distributed computers or network hardware, you often get read and write access to a region of memory through a C API. One example is MPI's API for one-sided communication. This API exposes functions that give direct read and write access to the memory of other nodes in a distributed cluster. Here is what it looks like, slightly simplified.



// Read `size` bytes into `dst` from the shared segment of `target_node`, starting at `offset`.
void remote_read(void* dst, int target_node, int offset, int size);
// Write `size` bytes from `src` into the shared segment of `target_node` at `offset`.
void remote_write(const void* src, int target_node, int offset, int size);





remote_read reads a certain number of bytes from the shared memory segment of the target node, starting at the indicated offset, and remote_write writes a certain number of bytes to it.



These APIs are great because they give us the essential primitives we need to implement programs that run on a cluster of computers. They are also very fast and closely reflect what the hardware offers: remote direct memory access (RDMA). On modern supercomputer networks such as Cray Aries and Mellanox EDR, read/write latency can be expected to stay within 1-2 μs. This is achievable because the network interface card (NIC) can read and write RAM directly, without waiting for the remote CPU to wake up and respond to your network request.



However, such APIs are not a great target for application programming. Even with an API as simple as the one above, it is trivially easy to clobber data by accident, because individual objects stored in memory have no names of their own: there is only one large contiguous buffer. The interface is also untyped, so you lose another valuable safeguard: the compiler complaining when you write a value of the wrong type to the wrong place. Your code will simply be wrong, and the resulting errors will be mysterious and catastrophic. Matters are worse still because the real APIs are somewhat more complicated than this, and it is quite easy to accidentally swap two or more parameters.
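As a purely illustrative (and hypothetical) example of how little the compiler helps here, both calls below compile, even though the second swaps the node and offset arguments:

// Hypothetical misuse of the raw API sketched above: nothing is checked
// by the compiler, because every argument is just an int or a pointer.
double value = 3.14;
remote_write(&value, /* target_node = */ 4, /* offset = */ 8, sizeof(value));  // intended
remote_write(&value, 8, 4, sizeof(value));  // node and offset swapped; still compiles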



Remote Pointers



Pointers are an important and necessary level of abstraction for building high-level programming tools. Using pointers directly can be awkward and error-prone, but they are the fundamental building blocks of code. Data structures and even C++ references often use pointers under the hood.



Assuming an API like the one described above, a unique memory location is identified by two "coordinates": (1) a rank, or process ID, and (2) an offset into the shared memory segment owned by the process with that rank. We can go further and wrap these in a proper struct.



template <typename T>
struct remote_ptr {
  size_t rank_;
  size_t offset_;
};





At this stage we can already design an API for reading from and writing to remote pointers, and it will be safer than the one we started with.



template <typename T>
T rget(const remote_ptr<T> src) {
  T rv;
  remote_read(&rv, src.rank_, src.offset_, sizeof(T));
  return rv;
}

template <typename T>
void rput(remote_ptr<T> dst, const T& src) {
  remote_write(&src, dst.rank_, dst.offset_, sizeof(T));
}





Block transfers look very similar, and I omit them here for brevity (a sketch appears a little further below). To read and write values, we can now write code like this:



remote_ptr<int> ptr = ...;

int rval = rget(ptr);
rval++;
rput(ptr, rval);





This is already better than the original API, because we are working with typed objects. It is now much harder to read or write a value of the wrong type, or to write only part of an object.
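Returning to the block transfers that were omitted above: here is a minimal sketch of what they might look like, assuming the same remote_read and remote_write primitives; the exact signatures are my own illustration.

// A sketch of block transfers: read or write `count` contiguous objects of
// type T starting at the remote pointer.
template <typename T>
void rget(T* dst, remote_ptr<T> src, size_t count) {
  remote_read(dst, src.rank_, src.offset_, sizeof(T) * count);
}

template <typename T>
void rput(remote_ptr<T> dst, const T* src, size_t count) {
  remote_write(src, dst.rank_, dst.offset_, sizeof(T) * count);
}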



Pointer Arithmetic



Pointer arithmetic is the essential technique that lets a programmer work with collections of values in memory, and if we are writing a distributed-memory program, we presumably intend to operate on large collections of values.

What does it mean to increment or decrement a remote pointer by one? The simplest option is to treat remote pointer arithmetic like ordinary pointer arithmetic: p + 1 simply points to the next sizeof(T)-aligned location after p in the shared segment of the original rank.



Although this is not the only possible definition of remote pointer arithmetic, it is the one most commonly adopted today, and remote pointers of this kind appear in libraries such as UPC++, DASH, and BCL. The Unified Parallel C (UPC) language, however, which has left a rich legacy in the high-performance computing (HPC) community, has a more elaborate definition of pointer arithmetic [1].



Implementing pointer arithmetic this way is simple; it only involves changing the pointer's offset.



template <typename T>
remote_ptr<T> remote_ptr<T>::operator+(std::ptrdiff_t diff) {
  size_t new_offset = offset_ + sizeof(T)*diff;
  return remote_ptr<T>{rank_, new_offset};
}





This gives us the ability to access arrays of data in distributed memory. For example, we could have each process in an SPMD program read or write its own element of the array that the remote pointer refers to [2].



void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    rput(ptr + my_rank(), my_rank());
  }
}





It is also easy to implement the other operators, giving full support for everything ordinary pointer arithmetic allows.
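For illustration, a few of the remaining operators might look like this; this is a sketch under the same assumptions as above, with the in-class declarations omitted just as they are for operator+.

// A sketch of a few more operators; the rest follow the same pattern.
template <typename T>
remote_ptr<T> remote_ptr<T>::operator-(std::ptrdiff_t diff) {
  return remote_ptr<T>{rank_, offset_ - sizeof(T)*diff};
}

template <typename T>
std::ptrdiff_t remote_ptr<T>::operator-(const remote_ptr<T>& other) {
  // Only meaningful for two pointers into the same rank's segment.
  return (static_cast<std::ptrdiff_t>(offset_) -
          static_cast<std::ptrdiff_t>(other.offset_)) /
         static_cast<std::ptrdiff_t>(sizeof(T));
}

template <typename T>
remote_ptr<T>& remote_ptr<T>::operator++() {
  offset_ += sizeof(T);
  return *this;
}

template <typename T>
bool remote_ptr<T>::operator==(const remote_ptr<T>& other) const {
  return rank_ == other.rank_ && offset_ == other.offset_;
}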



Choosing a nullptr value



For regular pointers, the special value nullptr, or NULL, is usually just a #define for 0x0, since that memory location is unlikely to be used. For our remote pointers we can either reserve a particular pointer value as nullptr, thereby making that memory location unusable, or add a separate Boolean member that indicates whether the pointer is null. While sacrificing one memory location is not ideal, adding even a single Boolean would, with most compilers, double the size of the remote pointer from 128 to 256 bits in order to preserve alignment, which is especially undesirable. In my library I chose {0, 0}, that is, offset 0 on rank 0, as the nullptr value.
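A minimal sketch of how the {0, 0} convention might surface in the pointer's interface follows; these particular member functions are my own illustration, not necessarily what the library provides.

// Sketch: the pointer is "null" exactly when both rank and offset are zero.
// (std::nullptr_t comes from <cstddef>.)
template <typename T>
remote_ptr<T>::operator bool() const {
  return !(rank_ == 0 && offset_ == 0);
}

template <typename T>
bool remote_ptr<T>::operator==(std::nullptr_t) const {
  return rank_ == 0 && offset_ == 0;
}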



Other choices of nullptr might work just as well. Also, some programming environments, such as UPC, implement narrow pointers that fit into 64 bits, so that they can be used in atomic compare-and-swap operations. A narrow pointer forces a compromise: either the offset or the rank identifier has to fit into 32 bits or fewer, which limits scalability.
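To make the trade-off concrete, here is a hypothetical sketch of such a narrow pointer, with the rank squeezed into 16 bits and the offset into 48; the particular split is arbitrary and purely illustrative.

#include <cstdint>

// Hypothetical 64-bit "narrow" remote pointer: it fits in a single machine word
// (so it can be used with atomic compare-and-swap), but the rank is limited to
// 16 bits and the offset to 48 bits.
template <typename T>
struct narrow_remote_ptr {
  std::uint64_t bits_;

  std::uint64_t rank() const   { return bits_ >> 48; }
  std::uint64_t offset() const { return bits_ & ((std::uint64_t(1) << 48) - 1); }
};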



Remote References



In languages like Python, the bracket operator is syntactic sugar for calls to __setitem__ and __getitem__, depending on whether the object is being read or written. In C++, operator[] does not know which value category the expression has, or whether the returned value is about to be read or written. To deal with this, C++ data structures return references to the memory held by the container, which can then be read or written. The operator[] implementation for std::vector might look something like this.



T& operator[](size_t idx) {
  return data_[idx];
}





The important point is that we return a value of type T&, a raw C++ reference that can be written through, rather than a value of type T, which would merely be a copy of the original data.



In our case we cannot return a raw C++ reference, because the memory lives on another node and is not mapped into our virtual address space. We can, however, create our own custom reference objects.

A reference is an object that wraps a pointer and does two important things: it can be converted to a value of type T, and it can be assigned a value of type T. So for a remote reference we just need an implicit conversion operator that reads the value, and an assignment operator that writes it.



template <typename T>
struct remote_ref {
  remote_ptr<T> ptr_;

  operator T() const {
    return rget(ptr_);
  }

  remote_ref& operator=(const T& value) {
    rput(ptr_, value);
    return *this;
  }
};





Now we can give our remote pointer powerful new capabilities: it can be dereferenced just like an ordinary pointer.



template <typename T>
remote_ref<T> remote_ptr<T>::operator*() {
  return remote_ref<T>{*this};
}

template <typename T>
remote_ref<T> remote_ptr<T>::operator[](ptrdiff_t idx) {
  return remote_ref<T>{*this + idx};
}





With this, the full picture is in place: remote pointers can be used just like normal ones. We can rewrite the simple program from above.



void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    ptr[my_rank()] = my_rank();
  }
}





Of course, our new pointer API also lets us write more complex programs, for example a function that performs a tree-based parallel reduction [3]. Implementations that use our remote pointer class are safer and cleaner than what you would typically get with the C API described above.



Runtime costs (or the lack thereof!)



But what does such a high-level abstraction cost us? On every memory access we call a dereference method, which returns an intermediate object wrapping the pointer, and then we call a conversion or assignment operator on that intermediate object. What does that cost at runtime?



It turns out that if the pointer and reference classes are designed carefully, the abstraction has no runtime overhead: modern C++ compilers aggressively inline away the intermediate objects and method calls. To see what the abstraction costs, we can compile a simple example program and inspect the generated assembly to find out which objects and methods actually exist at runtime. For the tree-based reduction example mentioned here, compiled with the remote pointer and reference classes, modern compilers boil everything down to a handful of remote_read and remote_write calls [4]. No class methods are called, and no reference objects exist at runtime.



Interaction with data structure libraries



Experienced C++ programmers will recall that the C++ standard says STL containers must support custom C++ allocators. Allocators let you allocate memory that can then be referred to through pointer types of your own making. Does that mean you can simply write a "remote allocator", plug it in, and store data in remote memory with STL containers?



Unfortunately not. Presumably for performance reasons, the C++ standard no longer requires support for custom reference types, and most C++ standard library implementations do not support them. For example, with GCC's libstdc++ you can use custom pointer types, but only ordinary C++ references, which rules out using STL containers in remote memory. Some high-level C++ template libraries, such as Agency, do support custom pointer and reference types and ship their own implementations of some STL data structures that genuinely work with remote reference types. That gives the programmer more creative freedom in designing allocator, pointer, and reference types, along with a collection of data structures that can automatically be used with them.
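To make the idea concrete, a hypothetical "remote allocator" might be declared along these lines; the allocate/deallocate bodies are stubs, since how remote memory is actually reserved depends on the runtime, and as noted above standard containers are not required to honor the custom pointer and reference types.

#include <cstddef>

// Hypothetical sketch of an allocator whose pointer type is remote_ptr<T>.
template <typename T>
struct remote_allocator {
  using value_type = T;
  using pointer    = remote_ptr<T>;
  using reference  = remote_ref<T>;

  pointer allocate(std::size_t n) {
    // ... ask the runtime for n*sizeof(T) bytes in some rank's shared segment ...
    return pointer{/* rank = */ 0, /* offset = */ 0};
  }

  void deallocate(pointer p, std::size_t n) {
    // ... return the memory to the runtime ...
  }
};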



Broader context



This article touches on a number of broader problems that are still unresolved.





Notes



[1] In UPC, pointers have a phase that determines which rank the pointer will point to after being incremented by one. Phases let distributed arrays, with a variety of distribution patterns, be encapsulated inside pointers. These features are very powerful, but they can seem magical to a novice user. Although some UPC veterans prefer this approach, a more reasonable object-oriented approach is to write a simple remote pointer class first and then provide data distribution through data structures designed specifically for that purpose.
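To give a rough idea of what a phase involves, here is a simplified, purely illustrative sketch of a block-cyclic "phased" pointer; real UPC pointers-to-shared are considerably more involved.

// Simplified sketch of a block-cyclic phased pointer: elements are dealt out
// round-robin in blocks of `block_` per rank, and the pointer keeps a global
// element index from which the rank and local offset are derived.
template <typename T>
struct phased_ptr {
  size_t index_;   // global index into the distributed array
  size_t block_;   // elements per block
  size_t nranks_;  // number of ranks the array spans
  size_t base_;    // byte offset of the array in each rank's shared segment

  phased_ptr operator+(std::ptrdiff_t diff) const {
    return {index_ + diff, block_, nranks_, base_};
  }

  size_t rank() const { return (index_ / block_) % nranks_; }

  size_t offset() const {
    size_t local_block = index_ / (block_ * nranks_);
    return base_ + sizeof(T) * (local_block * block_ + index_ % block_);
  }
};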



[2] Most HPC applications are written in the SPMD style, which stands for "single program, multiple data". SPMD programming environments provide a function or variable my_rank() that gives the executing process a unique rank or ID, which it can then use to branch off from the main program.



[3] Here is a simple tree-based reduction written in SPMD style using the remote pointer class. The code is adapted from a program originally written by my colleague Andrew Belt.



template <typename T>
T parallel_sum(remote_ptr<T> a, size_t len) {
  size_t k = len;
  do {
    k = (k + 1) / 2;
    if (my_rank() < k && my_rank() + k < len) {
      // Read both operands, add them, and write the result back through the
      // remote reference (which supports conversion to T and assignment from T).
      a[my_rank()] = a[my_rank()] + a[my_rank() + k];
    }
    len = k;
    barrier();
  } while (k > 1);
  return a[0];
}





[4] The compiled result of the above code can be found here.


