Fancy Linux system calls

ls /usr/share/man/man2/







What does a programmer see when they start working with C? They see fopen, printf, scanf, and many other functions. They also see things like open and mmap. Why single those out? Because, unlike the first group, these two functions are system calls when running on the Linux kernel. (Strictly speaking, not quite: a system call can almost never be invoked as an ordinary function, so libc contains wrappers that repack the arguments and sometimes, as with open, substitute a newer, more general system call for the old one.) In general, unlike the thousands of library functions available on a typical GNU/Linux system, the kernel interface has a fairly small number of entry points, on the order of several hundred. Not every entry into the kernel is an explicit call, though: what looks like a crash to user space (for example, touching a page that is not there) is a normal mode of operation for the kernel.







In this article I will share some facts that I find interesting. There will be no futexes or other (probably) boring implementation details. It is mostly about the things that made me react with "Wait, you can do that?!"







First, some remarks on the introduction above. Some system calls have an alternative interface: a function in a shared object called the vDSO, which the kernel maps into the process. There are only a few such functions (around four, though the exact number apparently depends on the kernel version and architecture). These are things like time and gettimeofday, which are called often and can be implemented without switching to kernel context.
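To make the vDSO a little less abstract, here is a minimal sketch that asks where it is mapped. It assumes glibc's getauxval and the AT_SYSINFO_EHDR entry of the auxiliary vector; the helper names vdso_base, looks_like_elf, and print_vdso_info are mine and purely illustrative.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/auxv.h>

/* Address at which the kernel mapped the vDSO into this process,
 * or NULL if the auxiliary vector has no AT_SYSINFO_EHDR entry. */
const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is an ordinary ELF image, so it starts with "\177ELF". */
int looks_like_elf(const void *image)
{
    return memcmp(image, "\177ELF", 4) == 0;
}

/* Prints where the vDSO lives; returns 0 if one was found, -1 otherwise. */
int print_vdso_info(void)
{
    const void *base = vdso_base();
    if (base == NULL) {
        puts("no vDSO mapped into this process");
        return -1;
    }
    assert(looks_like_elf(base));
    printf("vDSO ELF image mapped at %p\n", base);
    return 0;
}
```

Calling print_vdso_info() from main should report an address that also shows up as the [vdso] line in /proc/self/maps. Which functions the image exports depends on the kernel and architecture, so the sketch does not try to list them.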







Second, SIGSEGV does not always end in a process crash, but we will talk about that when we get to userfaultfd.







DISCLAIMER: Keep in mind that by using most of the features presented here you tie your program to Linux. That is fine if you are adding an optional optimization for a specific kind of system, or an extra feature that would otherwise simply not exist. Otherwise, I recommend thinking about a cross-platform fallback.







General issues



To begin with, how do you debug all this? With strace, of course! Since the set of system calls is limited and strace knows most of them "by sight", it does not merely report that "pointer 0x12345678 was passed" but decodes the structure being passed in either direction. If your strace is recent enough, the -k option makes it print a call stack for each system call.







It looks something like this:

    $ strace -k sleep 1
    execve("/bin/sleep", ["sleep", "1"], 0x7ffe9f9cce30 /* 60 vars */) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(execve+0xb) [0xe601b]
     > /usr/bin/strace(+0x0) [0xa279c]
     > /usr/bin/strace(+0x0) [0xa41d2]
     > /usr/bin/strace(+0x0) [0x7090b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /usr/bin/strace(+0x0) [0x7112a]
    brk(NULL) = 0x558936ded000
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x20b) [0x1ccdb]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1cd2) [0x1b872]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    arch_prctl(0x3001 /* ARCH_??? */, 0x7fff593c0070) = -1 EINVAL (Invalid argument)
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e25) [0x1b9c5]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x10cb) [0x1db9b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c12]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x1238) [0x1dd08]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_debug_state+0x73a) [0x11d4a]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_exception_free+0x908) [0x189c8]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa362]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    fstat(3, {st_mode=S_IFREG|0644, st_size=254851, ...}) = 0
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x1009) [0x1dad9]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_debug_state+0x761) [0x11d71]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_exception_free+0x908) [0x189c8]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa362]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    mmap(NULL, 254851, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fc49621c000
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x1426) [0x1def6]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_debug_state+0x79d) [0x11dad]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_exception_free+0x908) [0x189c8]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa362]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    close(3) = 0
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x10fb) [0x1dbcb]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_debug_state+0x780) [0x11d90]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_exception_free+0x908) [0x189c8]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa362]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x1238) [0x1dd08]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x7d40]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa3a8]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360r\2\0\0\0\0\0"..., 832) = 832
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_error+0x12f8) [0x1ddc8]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x7d79]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0xa3a8]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x41b5) [0xeb35]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_catch_exception+0x65) [0x1ca85]
     > /lib/x86_64-linux-gnu/ld-2.30.so(_dl_rtld_di_serinfo+0x4603) [0xef83]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x3c55]
     > /lib/x86_64-linux-gnu/ld-2.30.so(__get_cpu_features+0x1e7b) [0x1ba1b]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x203c]
     > /lib/x86_64-linux-gnu/ld-2.30.so() [0x1108]
    ...
    brk(NULL) = 0x558936ded000
     > /lib/x86_64-linux-gnu/libc-2.30.so(brk+0xb) [0x11755b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__sbrk+0x67) [0x117617]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__default_morecore+0xd) [0x9fd3d]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x2725) [0x9a745]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x3943) [0x9b963]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x3b2b) [0x9bb4b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x4d9e) [0x9cdbe]
     > /lib/x86_64-linux-gnu/libc-2.30.so(textdomain+0x740) [0x3be70]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1d35) [0x35515]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    brk(0x558936e0e000) = 0x558936e0e000
     > /lib/x86_64-linux-gnu/libc-2.30.so(brk+0xb) [0x11755b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__sbrk+0x91) [0x117641]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__default_morecore+0xd) [0x9fd3d]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x2725) [0x9a745]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x3943) [0x9b963]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x3b2b) [0x9bb4b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(thrd_yield+0x4d9e) [0x9cdbe]
     > /lib/x86_64-linux-gnu/libc-2.30.so(textdomain+0x740) [0x3be70]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1d35) [0x35515]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
     > /lib/x86_64-linux-gnu/libc-2.30.so(__open64_nocancel+0x4c) [0x11679c]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1ce9) [0x354c9]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    fstat(3, {st_mode=S_IFREG|0644, st_size=8994080, ...}) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(__fxstat64+0x19) [0x1107b9]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1e33) [0x35613]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    mmap(NULL, 8994080, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fc495795000
     > /lib/x86_64-linux-gnu/libc-2.30.so(mmap64+0x26) [0x11baf6]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1e5d) [0x3563d]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    close(3) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(__close_nocancel+0xb) [0x1165bb]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x1eab) [0x3568b]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0xbdf) [0x343bf]
     > /lib/x86_64-linux-gnu/libc-2.30.so(setlocale+0x215) [0x339f5]
     > /bin/sleep() [0x25f0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    nanosleep({tv_sec=1, tv_nsec=0}, NULL) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(nanosleep+0x17) [0xe5d17]
     > /bin/sleep() [0x5827]
     > /bin/sleep() [0x5600]
     > /bin/sleep() [0x27b0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xf3) [0x271e3]
     > /bin/sleep() [0x287e]
    close(1) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(__close_nocancel+0xb) [0x1165bb]
     > /lib/x86_64-linux-gnu/libc-2.30.so(_IO_file_close_it+0x70) [0x92fc0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(fclose+0x166) [0x85006]
     > /bin/sleep() [0x5881]
     > /bin/sleep() [0x2d27]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_secure_getenv+0x127) [0x49ba7]
     > /lib/x86_64-linux-gnu/libc-2.30.so(exit+0x20) [0x49d60]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xfa) [0x271ea]
     > /bin/sleep() [0x287e]
    close(2) = 0
     > /lib/x86_64-linux-gnu/libc-2.30.so(__close_nocancel+0xb) [0x1165bb]
     > /lib/x86_64-linux-gnu/libc-2.30.so(_IO_file_close_it+0x70) [0x92fc0]
     > /lib/x86_64-linux-gnu/libc-2.30.so(fclose+0x166) [0x85006]
     > /bin/sleep() [0x5881]
     > /bin/sleep() [0x2d4d]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_secure_getenv+0x127) [0x49ba7]
     > /lib/x86_64-linux-gnu/libc-2.30.so(exit+0x20) [0x49d60]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xfa) [0x271ea]
     > /bin/sleep() [0x287e]
    exit_group(0) = ?
    +++ exited with 0 +++
     > /lib/x86_64-linux-gnu/libc-2.30.so(_exit+0x36) [0xe5fe6]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_secure_getenv+0x242) [0x49cc2]
     > /lib/x86_64-linux-gnu/libc-2.30.so(exit+0x20) [0x49d60]
     > /lib/x86_64-linux-gnu/libc-2.30.so(__libc_start_main+0xfa) [0x271ea]
     > /bin/sleep(+0x0) [0x287e]





Admittedly, source file names and line numbers are not shown here. addr2line can help with that (provided the debug information is there at all, of course).







A second question: some system calls have no wrapper in libc. In that case you can use the generic wrapper called syscall:







  syscall(SYS_kcmp, getpid(), getpid(), KCMP_FILE, 1, fd)
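As a smaller, self-contained illustration of the same wrapper, here is a sketch that issues two well-known system calls through syscall. The helper names raw_getpid and raw_close are hypothetical; the point is only that syscall follows the usual libc convention of returning -1 and setting errno on failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* getpid() issued through the generic wrapper instead of the libc function. */
pid_t raw_getpid(void)
{
    long pid = syscall(SYS_getpid);
    assert(pid > 0); /* getpid cannot fail */
    return (pid_t)pid;
}

/* close() through syscall(). Returns 0 on success or the errno value
 * that the wrapper stored on failure. */
int raw_close(int fd)
{
    if (syscall(SYS_close, fd) == -1)
        return errno;
    return 0;
}
```

For example, raw_close(-1) reports EBADF, exactly as close(-1) would. For calls that do have a libc wrapper there is rarely a reason to bypass it; the sketch only shows the mechanics.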
      
      





A file is a very strange thing ...



System calls are not only a way to ask the kernel to access hardware on behalf of a process. They are also a universal API that every library in the system understands. So if a library does not support the functionality you need, it may well start working automatically once you ask the kernel the right way. In addition, part of a process's "settings" is inherited across execve, so you can often avoid elaborate hacks by simply setting up the right state before launching the process (something like "why shovel stderr into a file by hand when you can just open the file and make it FD 2 for the child process?").
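The stderr remark can be sketched in a few lines. This is a minimal illustration, not a robust process launcher; run_with_stderr_to is a hypothetical helper name, and the child reports any setup failure simply by exiting with status 127.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] (looked up in PATH) with its stderr redirected to log_path.
 * Returns the child's exit status, or -1 if it could not be started
 * or did not exit normally. */
int run_with_stderr_to(const char *log_path, char *const argv[])
{
    assert(log_path != NULL && argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: shape the state that the new program will inherit. */
        int fd = open(log_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd == -1 || dup2(fd, STDERR_FILENO) == -1)
            _exit(127);
        if (fd != STDERR_FILENO)
            close(fd);
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The launched program needs no cooperation at all: whatever it writes to descriptor 2 ends up in the file, because the descriptor table survives execve.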







Once I needed to read a sequence of network packets from a file. At some point the number of hacks exceeded all reasonable limits, and I decided that libpcap could hardly be more complicated than what I had written; besides, it is the standard, and there are commonly used tools for opening such files. It turned out that reading dumps with libpcap is about as hard as reading files with fopen: you open the dump with pcap_(f)open_offline and pull packets out with pcap_next_ex. That's it! Well, you should also close the dump when you are done...







But here is the catch: libpcap does not seem to be able to read from memory. Maybe it can if you dig deep enough, but for our "lab exercise" let's pretend that it can't.







So, a model example: on stdin we expect some sequence of bytes followed by a dump aligned to 4 bytes. I realize that buffered input and something like ungetc would do here (since libpcap wants a FILE * anyway), but in the general case we might, say, be decompressing the data on the fly, or the library might work directly with read / write.







Solution 1: memfd_create



The memfd_create system call creates a "completely anonymous" file descriptor. The file lives in memory and exists as long as at least one descriptor for it is open. In the simplest case you obtain such a descriptor, write the data to it with write, rewind it with lseek, and tell libc about it with fdopen:







    int fd = memfd_create("pcap-dump-contents", 0);
    write(fd, buf, length);
    lseek(fd, 0, SEEK_SET);
    FILE *file = fdopen(fd, "r");
      
      





The name passed as the first argument shows up in the symbolic link under /proc/<PID>/fd:







    $ ls -l /proc/31747/fd
    total 0
    lr-x------ 1 trosinenko trosinenko 64  10 13:12 0 -> /path/to/128test.pcap
    lrwx------ 1 trosinenko trosinenko 64  10 13:12 1 -> /dev/pts/17
    lrwx------ 1 trosinenko trosinenko 64  10 13:12 2 -> /dev/pts/17
    lrwx------ 1 trosinenko trosinenko 64  10 13:12 23 -> '/home/trosinenko/.cache/appstream-cache-AH3OA0.mdb (deleted)'
    lrwx------ 1 trosinenko trosinenko 64  10 13:12 3 -> '/memfd:pcap-dump-contents (deleted)'
    lrwx------ 1 trosinenko trosinenko 64  10 13:12 57 -> 'socket:[41036]'
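The same thing can be checked from inside the process. The sketch below creates a memfd and reads its own /proc/self/fd link back with readlink; memfd_from_buffer and fd_link_target are illustrative helper names, and the sketch assumes /proc is mounted and a glibc recent enough to declare memfd_create.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Creates an in-memory file holding a copy of buf, rewound to the start.
 * Returns the descriptor, or -1 on failure. */
int memfd_from_buffer(const char *name, const void *buf, size_t len)
{
    int fd = memfd_create(name, MFD_CLOEXEC);
    if (fd == -1)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Stores the target of /proc/self/fd/<fd> in out as a NUL-terminated string.
 * Returns 0 on success, -1 on failure. */
int fd_link_target(int fd, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    char path[64];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    ssize_t n = readlink(path, out, out_size - 1);
    if (n == -1)
        return -1;
    out[n] = '\0'; /* readlink does not terminate the string */
    return 0;
}
```

The name is only a debugging label: several memfds may share the same name without conflict, and the link target cannot be used to open the file by path.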
      
      





Solution 2: open with the O_TMPFILE flag



On Linux, starting with some kernel version (3.11, according to the open(2) man page), you can pass O_TMPFILE when creating a file and give a directory name instead of a file name. As a result the file, as one literary character put it (roughly), seems to be there and yet it isn't... I don't know whether the data actually reaches the disk; that probably depends on the file system (which, by the way, has to support this mode). The file still disappears when the last descriptor is closed, but it can be attached to the directory tree with linkat:







    int fd = open(".", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
    assert(fd != -1);
    assert(write(fd, buffer + offset, len - offset) == len - offset);
    assert(lseek(fd, 0, SEEK_SET) == 0);
    const char *link_to = getenv("LINK_TO");
    if (link_to != NULL) {
        char path[128];
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        linkat(AT_FDCWD, path, AT_FDCWD, link_to, AT_SYMLINK_FOLLOW);
    }
      
      





Besides sparing you from inventing a file name, this lets you fill in the file, set its permissions, and so on, and only then atomically link it into the directory tree.
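That "prepare first, name last" flow might look roughly like the sketch below. publish_file is a hypothetical helper; it assumes /proc is mounted (for the /proc/self/fd trick) and that the target file system supports O_TMPFILE, and it reports failures as plain errno values.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates dir/name with the given contents and mode in such a way that no
 * other process can ever see a half-written file. Returns 0 on success or
 * an errno value (e.g. EOPNOTSUPP without O_TMPFILE support, EEXIST if the
 * name is already taken). */
int publish_file(const char *dir, const char *name,
                 const void *data, size_t len, mode_t mode)
{
    assert(dir != NULL && name != NULL);
    int err = 0;

    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return errno;

    /* An unnamed file on the same file system as dir. */
    int fd = open(dir, O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        err = errno;
        close(dirfd);
        return err;
    }

    char proc_path[64];
    snprintf(proc_path, sizeof proc_path, "/proc/self/fd/%d", fd);

    ssize_t written = write(fd, data, len);      /* 1. fill in the contents */
    if (written == -1)
        err = errno;
    else if ((size_t)written != len)
        err = EIO;                               /* short write */
    else if (fchmod(fd, mode) == -1)             /* 2. set final permissions */
        err = errno;
    else if (linkat(AT_FDCWD, proc_path, dirfd, name, AT_SYMLINK_FOLLOW) == -1)
        err = errno;                             /* 3. only now give it a name */

    close(fd);
    close(dirfd);
    return err;
}
```

Note that linkat refuses to replace an existing name, so this flow suits "create if absent" better than "overwrite in place"; for the latter, the classic temporary-file-plus-rename approach still applies.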







Example (for both approaches)
    #define _GNU_SOURCE

    #ifdef NDEBUG
    // The assert() calls below have side effects, so never compile them out
    #  undef NDEBUG
    #endif

    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    #include <pcap.h>

    // Magic number at the start of a PCAP dump
    static const uint32_t pcap_mgc = 0xA1B2C3D4;

    char buffer[1 << 20];

    int main()
    {
        int len = read(0, buffer, sizeof(buffer));
        assert(len >= 0);

        // Skip whatever precedes the dump: look for pcap_mgc
        // at offsets that are multiples of 4 bytes
        int offset = -1;
        for (int i = 0; i + 4 <= len; i += 4) {
            uint32_t word;
            memcpy(&word, buffer + i, sizeof(word));
            if (word == pcap_mgc) {
                offset = i;
                break;
            }
        }
        if (offset >= 0) {
            printf("Found PCAP dump at offset %d\n", offset);
        } else {
            fprintf(stderr, "No PCAP dump found.\n");
            exit(1);
        }

        // Put the dump somewhere libpcap can read it from;
        // flip the #if to switch between the two approaches
    #if 0
        int fd = memfd_create("pcap-dump-contents", 0);
        assert(fd != -1);
        assert(write(fd, buffer + offset, len - offset) == len - offset);
        assert(lseek(fd, 0, SEEK_SET) == 0);
    #else
        int fd = open(".", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
        assert(fd != -1);
        assert(write(fd, buffer + offset, len - offset) == len - offset);
        assert(lseek(fd, 0, SEEK_SET) == 0);
        const char *link_to = getenv("LINK_TO");
        if (link_to != NULL) {
            char path[128];
            snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
            linkat(AT_FDCWD, path, AT_FDCWD, link_to, AT_SYMLINK_FOLLOW);
        }
    #endif

        // Pause here so that /proc/PID/fd/ can be inspected;
        // the process continues once it receives SIGCONT
        raise(SIGSTOP);

        FILE *file = fdopen(fd, "r");
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *dump = pcap_fopen_offline(file, errbuf);
        assert(dump != NULL);

        struct pcap_pkthdr *hdr;
        const u_char *data;
        while (pcap_next_ex(dump, &hdr, &data) == 1) {
            printf("Read packet: full length = %u bytes, available %u bytes.\n",
                   hdr->len, hdr->caplen);
        }
        return 0;
    }
      
      





    $ fallocate -l 128 zero128
    $ cat zero128 test.pcap > 128test.pcap
    $ ./memfd < 128test.pcap
    Found PCAP dump at offset 128
    Read packet: full length = 105 bytes, available 105 bytes.
    Read packet: full length = 105 bytes, available 105 bytes.
    Read packet: full length = 66 bytes, available 66 bytes.
    Read packet: full length = 385 bytes, available 385 bytes.
    Read packet: full length = 66 bytes, available 66 bytes.
    ...
      
      





userfaultfd: handling memory errors in userspace



It is hardly news that on UNIX-like systems a file descriptor can refer to just about anything. On Linux, for example, it can be a socket, a pipe, an eventfd, or even a reference to an eBPF program. But perhaps this example will still surprise you. At the beginning of the article I said that page faults are routine for the kernel: swap, copy-on-write, and so on. When a user process "misses", it gets a SIGSEGV. As far as I know, returning from the handler of a kernel-generated SIGSEGV is undefined behavior; nevertheless, there is the GNU libsigsegv library, which generalizes the handling of memory access errors across platforms, even Windows (CAUTION: it is GPL-licensed; if you are not prepared to distribute your program on those terms, do not use libsigsegv). Not so long ago a fully documented mechanism appeared in Linux, called userfaultfd: using the system call of the same name, you open a file descriptor and then drive it by reading and writing special command structures.
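Before moving on to userfaultfd, here is roughly what the older handler-based approach looks like in its most naive form. This is a sketch of the general idea only, not libsigsegv's API: it relies on behavior that works on Linux in practice but that POSIX does not promise (mprotect is not on the list of async-signal-safe functions, and returning from the handler is formally undefined, as noted above). The names on_segv and touch_protected_page are made up for the illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count;
static long page_size;

/* Makes the faulting page accessible and returns; the CPU then retries
 * the instruction that faulted. */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    uintptr_t page = (uintptr_t)info->si_addr & ~((uintptr_t)page_size - 1);
    fault_count++;
    if (mprotect((void *)page, (size_t)page_size, PROT_READ | PROT_WRITE) == -1)
        _exit(1); /* not a page we can fix: give up instead of looping forever */
}

/* Maps one inaccessible page, writes to it, and returns how many faults
 * the handler had to resolve, or -1 if the setup failed. */
int touch_protected_page(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    page_size = sysconf(_SC_PAGESIZE);
    void *mem = mmap(NULL, (size_t)page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    if (sigaction(SIGSEGV, &sa, &old) == -1) {
        munmap(mem, (size_t)page_size);
        return -1;
    }

    volatile char *page = mem;
    fault_count = 0;
    page[0] = 42;              /* faults, the handler fixes the page, the store is retried */
    assert(page[0] == 42);     /* no further fault: the page is now readable */

    sigaction(SIGSEGV, &old, NULL);
    munmap(mem, (size_t)page_size);
    return (int)fault_count;
}
```

Compared with this, userfaultfd moves the decision out of signal context into an ordinary thread, which is what makes it comfortable to use from real programs.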







With such a file descriptor you can mark a range of virtual addresses in your process. After that, on the first access to each marked page, the accessing thread goes to sleep, and a read from the file descriptor returns information about what happened. The handler then fills in a response structure with a pointer to the data that should be used to initialize the "problem" page; the kernel initializes the page and wakes up the thread that faulted. The assumption is that there is a separate thread whose job is to read commands from the descriptor and issue responses. Generally speaking, other information can be obtained through userfaultfd as well, for example some notifications about changes to the process's virtual memory map.







Usage example
    #define _GNU_SOURCE

    #ifdef NDEBUG
    // The assert() calls below have side effects, so never compile them out
    #  undef NDEBUG
    #endif

    #include <linux/userfaultfd.h>
    #include <sys/syscall.h>
    #include <sys/mman.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Strictly speaking, this should be obtained through sysconf...
    #define PAGE_SIZE 4096
    #define PAGE_MASK (PAGE_SIZE - 1)

    static void *thread_fn(void *arg)
    {
        int uffd = (intptr_t)arg;
        struct uffd_msg msg;
        // Huge pages are not taken into account here...
        uint8_t *replacement_page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(replacement_page != MAP_FAILED);
        while (1) {
            assert(read(uffd, &msg, sizeof msg) > 0);
            // Only page-fault events are handled; anything else is ignored
            if (msg.event == UFFD_EVENT_PAGEFAULT) {
                uintptr_t addr = msg.arg.pagefault.address;
                fprintf(stderr, "Fault: addr = 0x%lx\n", (unsigned long)addr);
                // Round the address down to the start of its page
                uint8_t *page_addr = (uint8_t *)(addr & ~(uintptr_t)PAGE_MASK);
                // Fill the replacement page with a recognizable pattern
                memset(replacement_page, 0xAB, PAGE_SIZE);
                // Ask the kernel to copy it into the missing page
                struct uffdio_copy copy;
                copy.src = (uintptr_t)replacement_page;
                copy.dst = (uintptr_t)page_addr;
                copy.mode = 0; // default mode: wake the faulting thread afterwards
                copy.copy = 0; // output field: the kernel stores the result here
                copy.len = PAGE_SIZE;
                assert(ioctl(uffd, UFFDIO_COPY, &copy) != -1);
            }
        }
    }

    static int init_userfaultfd(void)
    {
        // Open the descriptor
        int uffd = syscall(__NR_userfaultfd, 0);
        assert(uffd != -1);
        // API handshake: required before any other ioctl on the descriptor
        struct uffdio_api api;
        api.api = UFFD_API;
        api.features = 0;
        assert(ioctl(uffd, UFFDIO_API, &api) != -1);
        fprintf(stderr, "UFFD open\n");
        // Start the handler thread
        pthread_t thread;
        memset(&thread, 0, sizeof(thread));
        // The descriptor travels to the thread inside the void * argument
        pthread_create(&thread, 0, thread_fn, (void *)(intptr_t)uffd);
        return uffd;
    }

    static void register_region(int uffd, void *aligned_addr, size_t size)
    {
        struct uffdio_register reg;
        memset(&reg, 0, sizeof reg);
        reg.range.start = (uintptr_t)aligned_addr;
        reg.range.len = size;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        assert(ioctl(uffd, UFFDIO_REGISTER, &reg) != -1);
    }

    int main()
    {
        void *addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(addr != MAP_FAILED);
        int uffd = init_userfaultfd();
        register_region(uffd, addr, PAGE_SIZE);
        fprintf(stderr, "Before reading\n");
        fprintf(stderr, "Data at %p: %x\n", addr, *(volatile int *)addr);
        return 0;
    }
      
      





    $ ./userfaultfd
    UFFD open
    Before reading
    Fault: addr = 0x7f46f40d5000
    Data at 0x7f46f40d5000: abababab
      
      





“The key question of mathematics: is it all the same?” ©



What if you need to find out whether a given file descriptor refers to stdin? It would seem that if (fd == 0) ... does the job, and that's it. Well, OK...







    #define _GNU_SOURCE

    #include <unistd.h>
    #include <stdio.h>

    int main()
    {
        int fd = dup(0);
        printf("stdin is fd %d, too\n", fd);
        if (fd == 0)
            printf("stdin");
        else
            printf("not stdin");
        return 0;
    }
      
      





    $ gcc kcmp.c -o kcmp
    $ ./kcmp
    stdin is fd 3, too
    not stdin
      
      





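Oops... The file is the same one, but the descriptors are different aliases for it. CRIU (Checkpoint/Restore In Userspace) comes to the rescue. Since it works from user space, it needed some help from the kernel, and one of the resulting system calls is kcmp: it takes two PIDs, the type of resource to compare, and two resource identifiers, one for each process: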







    #define _GNU_SOURCE

    #include <linux/kcmp.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    int main()
    {
        int fd = dup(0);
        printf("stdin is fd %d, too\n", fd);
        int pid = getpid();
        if (syscall(SYS_kcmp, pid, pid, KCMP_FILE /* not KCMP_FILES! */, 0 /* stdin fd */, fd) == 0)
            printf("stdin\n");
        else
            printf("not stdin\n");
        if (syscall(SYS_kcmp, pid, pid, KCMP_FILE, 1 /* stdout fd */, fd) == 0)
            printf("stdout\n");
        else
            printf("not stdout\n");
        return 0;
    }
      
      





    $ ./kcmp
    stdin is fd 3, too
    stdin
    stdout
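What KCMP_FILE compares is the kernel object behind the descriptor (the "open file description"), not the file on disk. One portable way to get a feel for that distinction, without kcmp, is to check whether two descriptors share a file offset. The sketch below is only an illustration for seekable files, and shares_offset is a hypothetical helper, not a replacement for kcmp.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if moving the offset through fd_a is visible through fd_b
 * (both refer to one open file description), 0 if it is not, and -1 if
 * either descriptor is not seekable. The offset of fd_a is restored. */
int shares_offset(int fd_a, int fd_b)
{
    assert(fd_a >= 0 && fd_b >= 0);

    off_t saved_a = lseek(fd_a, 0, SEEK_CUR);
    off_t saved_b = lseek(fd_b, 0, SEEK_CUR);
    if (saved_a == -1 || saved_b == -1)
        return -1;

    /* Move fd_a to a position where fd_b certainly is not right now. */
    off_t probe = saved_b + 1;
    if (lseek(fd_a, probe, SEEK_SET) != probe)
        return -1;
    int shared = (lseek(fd_b, 0, SEEK_CUR) == probe);

    /* If the offset is shared, this restores fd_b as well. */
    lseek(fd_a, saved_a, SEEK_SET);
    return shared;
}
```

A descriptor obtained with dup shares the offset with the original, while a second open of the same path does not, even though both name the same file. That is the same distinction the kcmp example draws between the duplicated descriptor and an unrelated one.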
      
      





Great! But wait, it matches stdout as well...







    $ ls -l /proc/self/fd
    total 0
    lrwx------ 1 trosinenko trosinenko 64  10 14:45 0 -> /dev/pts/17
    lrwx------ 1 trosinenko trosinenko 64  10 14:45 1 -> /dev/pts/17
    lrwx------ 1 trosinenko trosinenko 64  10 14:45 2 -> /dev/pts/17
    lrwx------ 1 trosinenko trosinenko 64  10 14:45 23 -> '/home/trosinenko/.cache/appstream-cache-AH3OA0.mdb (deleted)'
    lr-x------ 1 trosinenko trosinenko 64  10 14:45 3 -> /proc/17265/fd
    lrwx------ 1 trosinenko trosinenko 64  10 14:45 57 -> 'socket:[41036]'
      
      





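It turns out that descriptors 0, 1, and 2 all refer to the same terminal, /dev/pts/17. Apparently it was opened once and then duplicated, and bash handed the descriptors down to ls as they were!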







    $ ./kcmp < kcmp.c
    stdin is fd 3, too
    stdin
    not stdout
      
      























oldolduname ...







