Hello, Habr!
Today we are publishing a translation of an interesting study on working with memory and pointers in C++. The material is somewhat academic, but it should clearly interest readers of the books by
Galowitz and
Williams.
In graduate school, I work on distributed data structures, so an abstraction representing a remote pointer is essential for keeping my code clean and tidy. In this article, I will explain why smart remote pointers are needed, show how I implemented remote pointer objects for my C++ library, and make them behave exactly like regular C++ pointers by using remote reference objects. I will then explain where this abstraction breaks down, for the simple reason that my pointer does not (yet) handle everything ordinary pointers can do. I hope this article will interest readers who work on high-level abstractions.
Low Level APIs
When working with distributed computers or with network hardware, you often get read and write access to a region of memory through a C API. One example of this kind is the
MPI API for one-sided communication. This API exposes functions that read from and write directly to the memory of other nodes in a distributed cluster. Slightly simplified, it looks like this.
void remote_read(void* dst, int target_node, int offset, int size);
void remote_write(void* src, int target_node, int offset, int size);
remote_read
reads a certain number of bytes from the indicated
offset into the shared memory segment of the target node, and
remote_write
writes a certain number of bytes to it.
These APIs are great because they give us access to important primitives for implementing programs that run on a cluster of computers. They are also very good because they are genuinely fast and accurately reflect what the hardware offers: remote direct memory access (RDMA). Modern supercomputer networks, such as
Cray Aries and
Mellanox EDR, deliver read/write latencies of about 1-2 μs. This is achievable because the network interface card (NIC) can read and write RAM directly, without waiting for the remote CPU to wake up and respond to the network request.
However, such APIs are not so pleasant for application programming. Even with APIs as simple as the ones above, it is easy to accidentally corrupt data, since there is no separate name for each object stored in memory, only one large contiguous buffer. In addition, the interface is untyped, which deprives you of another tangible aid: a compiler error when you write a value of the wrong type to the wrong place. Your code will simply be wrong, and the errors will be mysterious and catastrophic. The situation is worse still because in reality
these APIs are a little more complicated, and it is quite possible to accidentally swap two or more of their parameters.
Remote Pointers
Pointers are an important and necessary level of abstraction for building high-level programming tools. Using pointers directly can be difficult and bug-prone, but pointers are the fundamental building blocks of code. Data structures and even C++ references often use pointers under the hood.
Assuming an API like the one described above, a unique location in memory is identified by two "coordinates": (1) the
rank, or process ID, and (2) the offset into the shared portion of remote memory held by the process with that rank. We can go further and wrap these in a full-fledged struct.
template <typename T>
struct remote_ptr {
  size_t rank_;
  size_t offset_;
};
At this stage, it is already possible to design an API for reading and writing through remote pointers, and this API will be safer than the one we started with.
template <typename T>
T rget(const remote_ptr<T> src) {
  T rv;
  remote_read(&rv, src.rank_, src.offset_, sizeof(T));
  return rv;
}

template <typename T>
void rput(remote_ptr<T> dst, const T& src) {
  // cast needed because the low-level C API takes an untyped buffer
  remote_write((void*) &src, dst.rank_, dst.offset_, sizeof(T));
}
Block transfers look very similar, and here I omit them for brevity. Now, for reading and writing values, you can write the following code:
remote_ptr<int> ptr = ...;
int rval = rget(ptr);
rval++;
rput(ptr, rval);
It is already better than the original API, since here we work with typed objects. Now it’s not so easy to write or read a value of the wrong type or write only a part of an object.
Pointer Arithmetic
Pointer arithmetic is the most important technique that lets a programmer manage collections of values in memory; and if we are writing a distributed-memory program, we are presumably going to operate on large collections of values.
What does it mean to increment or decrement a remote pointer? The simplest option is to treat remote pointer arithmetic like ordinary pointer arithmetic: p + 1 simply points to the next
sizeof(T)
-aligned location after p in the shared segment of the original rank.
Although this is not the only possible definition of remote pointer arithmetic, it is the one most commonly adopted these days, and remote pointers of this kind appear in libraries such as
UPC++,
DASH, and BCL. However, the
Unified Parallel C (UPC) language, which has left a rich legacy in the high-performance computing (HPC) community, defines a more elaborate pointer arithmetic [1].
Implementing pointer arithmetic this way is simple: it only requires changing the pointer's offset.
template <typename T>
remote_ptr<T> remote_ptr<T>::operator+(std::ptrdiff_t diff) {
  size_t new_offset = offset_ + sizeof(T)*diff;
  return remote_ptr<T>{rank_, new_offset};
}
This gives us the ability to access arrays of data in distributed memory. For example, we could have each process in an SPMD program write to or read from its own element of the array that the remote pointer points to [2].
void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    rput(ptr + my_rank(), my_rank());
  }
}
It is also easy to implement the other operators, providing the complete set of arithmetic operations supported by ordinary pointer arithmetic.
Choosing a nullptr value
For regular pointers, there is the null value
NULL
, which is usually a
#define
for 0x0, since that location in memory is unlikely to be used. In our remote pointer scheme, we can either reserve a specific pointer value as
nullptr
, thereby making that location in memory unusable, or add a special boolean member that indicates whether the pointer is null. While sacrificing one memory location is not ideal, adding even a single boolean will, in the eyes of most compilers, double the size of the remote pointer from 128 to 256 bits to preserve alignment, which is especially undesirable. In my library, I chose
{0, 0}
, that is, offset 0 at rank 0, as the
nullptr
value.
Other choices of
nullptr
may work just as well. In addition, some programming environments, such as UPC, implement narrow pointers that fit into 64 bits, so they can be used in atomic compare-and-exchange operations. A narrow pointer involves a compromise: either the offset or the rank identifier must fit into 32 bits or fewer, which limits scalability.
Remote References
In languages like Python, the bracket operator is syntactic sugar for calls to
__setitem__
and
__getitem__
, depending on whether you are reading the object or writing to it. In C++,
operator[]
does not distinguish which of
the value categories the object belongs to, or whether the returned value is about to be read or written. To solve this problem, C++ data structures return references to the memory held by the container, which can then be written or read. An
operator[]
implementation for
std::vector
might look something like this.
T& operator[](size_t idx) {
  return data_[idx];
}
The essential point here is that we return an entity of type
T&
, a raw C++ reference that can be written through, rather than an entity of type
T
, which merely holds a copy of the stored value.
In our case, we cannot return a raw C++ reference, since we are referring to memory located on another node and not mapped into our virtual address space. However, we can create our own custom reference objects.
A reference is an object that wraps a pointer, and it performs two important functions: it can be converted to a value of type
T
, and a value of type
T
can be assigned to it. So, for a remote reference, we just need to implement an implicit conversion operator that reads the value, and an assignment operator that writes to it.
template <typename T>
struct remote_ref {
  remote_ptr<T> ptr_;

  operator T() const {
    return rget(ptr_);
  }

  remote_ref& operator=(const T& value) {
    rput(ptr_, value);
    return *this;
  }
};
Thus, we can enrich our remote pointer with powerful new features: it can now be dereferenced exactly like an ordinary pointer.
template <typename T>
remote_ref<T> remote_ptr<T>::operator*() {
  return remote_ref<T>{*this};
}

template <typename T>
remote_ref<T> remote_ptr<T>::operator[](ptrdiff_t idx) {
  return remote_ref<T>{*this + idx};
}
So now we have the full picture of how remote pointers can be used like normal ones. We can rewrite the simple program from above.
void write_array(remote_ptr<int> ptr, size_t len) {
  if (my_rank() < len) {
    ptr[my_rank()] = my_rank();
  }
}
Of course, our new pointer API allows us to write more complex programs, for example, a function for performing parallel reduction based on a tree [3]. Implementations using our remote pointer class are safer and cleaner than those typically obtained using the C API described above.
Runtime costs (or the lack thereof!)
But what does such a high-level abstraction cost us? Every memory access calls a dereference method, returns an intermediate object wrapping the pointer, and then invokes a conversion or assignment operator on that intermediate object. What does all of this cost at runtime?
It turns out that if the pointer and reference classes are designed carefully, this abstraction has zero runtime overhead: modern C++ compilers dispose of the intermediate objects and method calls through aggressive inlining. To evaluate what the abstraction costs, we can compile a simple example program and inspect the generated assembly to see which objects and methods actually exist at runtime. For the tree-based reduction example mentioned here, compiled with the remote pointer and reference classes, modern compilers boil everything down to a handful of calls to
remote_read
and
remote_write
[4]. No class methods are called, and no reference objects exist at runtime.
Interaction with data structure libraries
Experienced C++ programmers will recall that the C++ standard template library specifies that STL containers must support
custom C++ allocators. Allocators allocate memory, and that memory can then be referred to through the pointer types we have built. Does this mean you can simply create a "remote allocator," plug it in, and store data in remote memory using STL containers?
Unfortunately not. Presumably for performance reasons, the C++ standard no longer requires support for custom reference types, and most implementations of the C++ standard library really do not support them. For example, with GCC's libstdc++ you can use custom pointer types, but only normal C++ references, which rules out keeping STL containers in remote memory. Some high-level C++ template libraries, such as
Agency, do use custom pointer and reference types and ship their own implementations of some STL data structures that genuinely work with remote reference types. This gives the programmer more creative freedom in designing allocator, pointer, and reference types, along with a collection of data structures that automatically work with them.
Broader context
In this article, we have touched on a number of broader problems that remain unresolved.
- Memory allocation. Now that we can refer to objects in remote memory, how do we reserve or allocate that remote memory?
- Support for objects. What about storing objects of types more complicated than int in remote memory? Is clean support for complex types possible? Can simple types be supported at the same time without paying for serialization?
- Designing distributed data structures. Now that we have these abstractions, what data structures and applications can we build with them? What abstractions should be used for distributing data?
Notes
[1] In UPC, pointers have a phase, which determines which rank the pointer will point to after being incremented. Thanks to phases, distributed arrays can be encapsulated in pointers, with a wide variety of distribution patterns. These features are very powerful, but they can seem magical to a novice user. Although some UPC experts do prefer this approach, a more reasonable object-oriented approach is to write a simple remote pointer class first and then provide data distribution through data structures designed specifically for it.
[2] Most applications in HPC are written in the
SPMD style, which stands for "single program, multiple data." SPMD APIs offer a function or variable
my_rank()
that gives the process executing the program a unique rank, or ID, on which the program's control flow can then branch.
[3] Here is a simple tree-based reduction written in the SPMD style using the remote pointer class. The code is adapted from a program originally written by my colleague
Andrew Belt.
template <typename T>
T parallel_sum(remote_ptr<T> a, size_t len) {
  size_t k = len;
  do {
    k = (k + 1) / 2;
    if (my_rank() < k && my_rank() + k < len) {
      a[my_rank()] += a[my_rank() + k];
    }
    len = k;
    barrier();
  } while (k > 1);
  return a[0];
}
[4] The compiled result of the above code
can be found here.