Linux capabilities in a nutshell

A translation of the article was prepared specifically for students of the Linux Administrator course.




Capabilities are being used more and more, thanks in large part to systemd, Docker, and orchestrators such as Kubernetes. However, I find the documentation somewhat hard to understand, and some parts of the capabilities implementation confused me, so I decided to share my current knowledge in this short article.







The most important reference on capabilities is the capabilities(7) man page, but it is not well suited as a first introduction.



Process capabilities



The rights of ordinary users are very limited, while the rights of the "root" user are very extensive. However, processes running as "root" often do not need all root privileges.



To reduce root's privileges, POSIX capabilities provide a way to limit the groups of privileged system operations that a process and its descendants are allowed to perform. In effect, they split all "root" rights into a set of separate privileges. The idea of capabilities was described in 1997 in a draft of POSIX 1003.1e.



On Linux, each process (task) has five 64-bit numbers (sets) containing capability bits (before Linux 2.6.25 they were 32-bit), which can be viewed in /proc/<pid>/status.



    CapInh: 00000000000004c0
    CapPrm: 00000000000004c0
    CapEff: 00000000000004c0
    CapBnd: 00000000000004c0
    CapAmb: 0000000000000000
      
      





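These numbers (shown here in hexadecimal notation) are bitmaps that represent the capability sets. Here are their full names:

    CapInh = inheritable set
    CapPrm = permitted set
    CapEff = effective set
    CapBnd = bounding set
    CapAmb = ambient set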





If a task asks for a privileged operation (for example, binding to a port below 1024), the kernel checks the task's effective set for CAP_NET_BIND_SERVICE. If the bit is set, the operation continues. Otherwise, the operation is rejected with an error, usually EPERM (operation not permitted). These CAP_ constants are defined in the kernel source code and are numbered sequentially, so CAP_NET_BIND_SERVICE, equal to 10, means bit 1 << 10 = 0x400 (this is the hexadecimal digit "4" in my previous example).
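The bit arithmetic described above can be sketched in Python. This is only an illustration: the helper name decode_caps is made up for this example, and the table lists just the first 14 capability numbers from the kernel's linux/capability.h; any higher bit is reported by number.

```python
# Capability numbers as defined in linux/capability.h (first 14 only;
# see capabilities(7) for the complete list).
CAP_NAMES = [
    "CAP_CHOWN",             # 0
    "CAP_DAC_OVERRIDE",      # 1
    "CAP_DAC_READ_SEARCH",   # 2
    "CAP_FOWNER",            # 3
    "CAP_FSETID",            # 4
    "CAP_KILL",              # 5
    "CAP_SETGID",            # 6
    "CAP_SETUID",            # 7
    "CAP_SETPCAP",           # 8
    "CAP_LINUX_IMMUTABLE",   # 9
    "CAP_NET_BIND_SERVICE",  # 10
    "CAP_NET_BROADCAST",     # 11
    "CAP_NET_ADMIN",         # 12
    "CAP_NET_RAW",           # 13
]


def decode_caps(hex_mask):
    """Return the names of the bits set in a Cap* value from /proc/<pid>/status."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits remain
        if mask & (1 << bit):
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("bit %d" % bit)  # not in our short table
        bit += 1
    return names


print(decode_caps("00000000000004c0"))
# ['CAP_SETGID', 'CAP_SETUID', 'CAP_NET_BIND_SERVICE']
```

So 0x4c0 is 0x400 + 0x80 + 0x40, that is, bits 10, 7 and 6.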



A complete human-readable list of the capabilities currently defined can be found in the current capabilities(7) man page (the list here is for reference only).



In addition, there is the libcap library, which simplifies managing and checking capabilities. Besides the library API, the package includes the capsh utility, which, among other things, can show your current capabilities.



    # capsh --print
    Current: = cap_setgid,cap_setuid,cap_net_bind_service+eip
    Bounding set = cap_setgid,cap_setuid,cap_net_bind_service
    Ambient set =
    Securebits: 00/0x0/1'b0
     secure-noroot: no (unlocked)
     secure-no-suid-fixup: no (unlocked)
     secure-keep-caps: no (unlocked)
     secure-no-ambient-raise: no (unlocked)
    uid=0(root)
    gid=0(root)
    groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
      
      





There are some confusing points here:







When a new program is started through execve(2), the capabilities of the resulting process are computed using the formula given in capabilities(7):



    P'(ambient)     = (file is privileged) ? 0 : P(ambient)
    P'(permitted)   = (P(inheritable) & F(inheritable)) |
                      (F(permitted) & P(bounding)) | P'(ambient)
    P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)
    P'(inheritable) = P(inheritable)    [i.e., unchanged]
    P'(bounding)    = P(bounding)       [i.e., unchanged]

    where:
        P()   denotes the value of a thread capability set before the execve(2)
        P'()  denotes the value of a thread capability set after the execve(2)
        F()   denotes a file capability set
      
      







These rules describe the operations performed on each bit in all capability sets (ambient / permitted / effective / inheritable / bounding). Standard C syntax is used (& for bitwise AND, | for bitwise OR). P is the process before it calls execve(2), and P' is the same process after the call. F denotes the so-called "file capabilities" of the file executed via execve.
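To make the formula easier to follow, here is a minimal Python sketch that applies it to integer bitmasks. The names ProcCaps, FileCaps and execve_transform are invented for this illustration, and the sketch deliberately ignores the special cases for root and set-uid programs described in capabilities(7).

```python
from dataclasses import dataclass

CAP_NET_BIND_SERVICE = 1 << 10  # capability number 10


@dataclass
class ProcCaps:
    """Thread capability sets, each stored as an integer bitmask."""
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass
class FileCaps:
    """File capability sets; 'effective' is a single flag, not a set."""
    inheritable: int = 0
    permitted: int = 0
    effective: bool = False
    privileged: bool = False  # file has capabilities or is set-uid/set-gid


def execve_transform(p, f):
    """Apply the capabilities(7) execve formula (non-root case only)."""
    ambient = 0 if f.privileged else p.ambient
    permitted = ((p.inheritable & f.inheritable)
                 | (f.permitted & p.bounding)
                 | ambient)
    effective = permitted if f.effective else ambient
    return ProcCaps(
        inheritable=p.inheritable,  # unchanged
        permitted=permitted,
        effective=effective,
        bounding=p.bounding,        # unchanged
        ambient=ambient,
    )


# A process with NET_BIND_SERVICE in its inheritable and bounding sets
# executes a file marked cap_net_bind_service+ie.
before = ProcCaps(inheritable=CAP_NET_BIND_SERVICE,
                  bounding=CAP_NET_BIND_SERVICE)
binary = FileCaps(inheritable=CAP_NET_BIND_SERVICE,
                  effective=True, privileged=True)
after = execve_transform(before, binary)
print(after.permitted == CAP_NET_BIND_SERVICE,
      after.effective == CAP_NET_BIND_SERVICE)
# True True
```

In this example the capability ends up in the permitted, inheritable and effective sets after the exec. With the file's effective flag cleared, it would land in the permitted set only.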



In addition, a process can programmatically change its inheritable, permitted, and effective sets with libcap at any time according to the following rules:







File capabilities



Sometimes a user with a limited set of rights needs to run a file that requires more privileges. Previously, this was achieved by setting the setuid bit (chmod +s ./executable) on the binary. Such a file, if it is owned by root, runs with full root rights when executed by any user.



But this mechanism grants the file too many privileges, so POSIX capabilities introduced a concept called "file capabilities". They are stored as an extended file attribute named "security.capability", so you need a file system that supports extended attributes (ext*, XFS, ReiserFS, Btrfs, overlay2, ...). Changing this attribute requires the CAP_SETFCAP capability (in the permitted set of the process).



    $ getfattr -m - -d `which ping`
    # file: usr/bin/ping
    security.capability=0sAQAAAgAgAAAAAAAAAAAAAAAAAAA=

    $ getcap `which ping`
    /usr/bin/ping = cap_net_raw+ep
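The attribute value printed by getfattr can be decoded by hand to see how it relates to the getcap output. The sketch below assumes the revision 2 layout of struct vfs_cap_data from linux/capability.h (a 32-bit magic/flags word followed by two pairs of 32-bit permitted/inheritable words, all little-endian); the function name parse_security_capability is made up for this example, and other revisions are simply rejected.

```python
import base64
import struct

# Constants from linux/capability.h
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def parse_security_capability(getfattr_value):
    """Decode a revision 2 security.capability value as printed by getfattr."""
    if not getfattr_value.startswith("0s"):
        raise ValueError("expected a base64 value with the '0s' prefix")
    raw = base64.b64decode(getfattr_value[2:])
    if len(raw) < 20:
        raise ValueError("value too short for a revision 2 attribute")
    # magic_etc, then (permitted, inheritable) for the low and high 32 bits
    magic, perm_lo, inh_lo, perm_hi, inh_hi = struct.unpack("<5I", raw[:20])
    if (magic & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_2:
        raise ValueError("only revision 2 is handled in this sketch")
    return {
        "permitted": perm_lo | (perm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
    }


caps = parse_security_capability("0sAQAAAgAgAAAAAAAAAAAAAAAAAAA=")
print(hex(caps["permitted"]), hex(caps["inheritable"]), caps["effective"])
# 0x2000 0x0 True
```

0x2000 is bit 13, which is CAP_NET_RAW. Together with the effective flag and the empty inheritable set, this corresponds to cap_net_raw+ep.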
      
      







Special cases and comments



Of course, in reality things are not that simple, and the capabilities(7) man page describes several special cases. Probably the most important of them are:







So…



If the official nginx container, ingress-nginx, or your own container stops or restarts with the error:



bind() to 0.0.0.0:80 failed (13: Permission denied)







... this means that there was an attempt to listen on port 80 as an unprivileged (non-0) user, and CAP_NET_BIND_SERVICE was not in the process's effective capability set. To obtain it, you must use xattrs and set (using setcap) at least cap_net_bind_service+ie on the nginx binary. This file capability will be combined with the inheritable set (specified together with the bounding set via the pod's securityContext/capabilities/add/NET_BIND_SERVICE) and will also be placed in the permitted set. The result is cap_net_bind_service+pie.



This all works as long as securityContext/allowPrivilegeEscalation is set to true and the docker/rkt storage driver (see the Docker documentation) supports xattrs.



If nginx were capability-aware, cap_net_bind_service+i would be enough. It could then use libcap to raise the capability from the permitted set to the effective set, ending up with cap_net_bind_service+pie.



Besides using xattrs, the only way to get cap_net_bind_service in a non-root container is to let Docker set ambient capabilities. But as of April 2019, this has not yet been implemented.



Code examples



Here is sample code using libcap to add CAP_NET_BIND_SERVICE to the effective capability set. It requires cap_net_bind_service+p on the binary file.








