Capabilities are being used more and more, thanks in large part to systemd, Docker, and orchestrators such as Kubernetes. However, I find the documentation somewhat hard to follow, and some parts of the capabilities implementation confused me, so I decided to share my current knowledge in this short article.
The most important reference on capabilities is the capabilities(7) man page, but it is not well suited to a first introduction.
Process capabilities
The rights of ordinary users are very limited, while the rights of the “root” user are very extensive. Yet processes running as “root” often do not need all root privileges.
To reduce root privileges, POSIX capabilities provide a way to limit the groups of privileged system operations that a process and its descendants are allowed to perform. In effect, they split all “root” rights into a set of separate privileges. The idea of capabilities was described in 1997 in a draft of POSIX 1003.1e.
On Linux, each process (task) has five 64-bit numbers (sets) containing capability bits (before Linux 2.6.25 they were 32-bit), which can be viewed in /proc/<pid>/status.
CapInh: 00000000000004c0
CapPrm: 00000000000004c0
CapEff: 00000000000004c0
CapBnd: 00000000000004c0
CapAmb: 0000000000000000
These numbers (shown here in hexadecimal notation) are bitmaps that represent the capability sets. Here are their full names:
- Inheritable - capabilities that descendants can inherit.
- Permitted - capabilities that the task is allowed to use.
- Effective - the capabilities currently in effect.
- Bounding - before Linux 2.6.25, the bounding set was a system-wide attribute shared by all threads, describing a set beyond which capabilities could not be expanded. It is now a per-task set and only takes part in the execve logic; details below.
- Ambient (since Linux 4.3) - added to make it easier to grant capabilities to a non-root user without using setuid or file capabilities (details later).
If a task requests a privileged operation (for example, binding to ports < 1024), the kernel checks the task's effective set for CAP_NET_BIND_SERVICE. If the bit is set, the operation continues. Otherwise, the operation is rejected, usually with EPERM (operation not permitted). These CAP_* constants are defined in the kernel source code and are numbered sequentially, so CAP_NET_BIND_SERVICE, equal to 10, means bit 1 << 10 = 0x400 (this is the hexadecimal digit “4” in my previous example).
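The bit arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names (decode_caps, parse_status) are made up for this article, and the name table is deliberately partial (capabilities 0 to 13, as numbered in the kernel headers), so higher bits are printed by number.

```python
# Partial table of capability numbers as defined in the kernel headers.
# Only the first 14 are listed here; extend it from capabilities(7) as needed.
CAP_NAMES = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 2: "CAP_DAC_READ_SEARCH",
    3: "CAP_FOWNER", 4: "CAP_FSETID", 5: "CAP_KILL",
    6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    9: "CAP_LINUX_IMMUTABLE", 10: "CAP_NET_BIND_SERVICE",
    11: "CAP_NET_BROADCAST", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
}


def decode_caps(hex_mask):
    """Turn a hexadecimal bitmap such as '00000000000004c0' into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:                      # stop once no higher bits remain
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"bit {bit}"))
        bit += 1
    return names


def parse_status(text):
    """Extract the Cap* lines from the contents of /proc/<pid>/status."""
    sets = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = decode_caps(value.strip())
    return sets


# The values from the example above.
SAMPLE = """\
CapInh: 00000000000004c0
CapPrm: 00000000000004c0
CapEff: 00000000000004c0
CapBnd: 00000000000004c0
CapAmb: 0000000000000000
"""

for name, caps in parse_status(SAMPLE).items():
    print(name, caps)
```

For the example values, 0x4c0 = 0x400 + 0x80 + 0x40, that is bits 10, 7 and 6. On a Linux machine, the same function could be fed the contents of /proc/self/status instead of SAMPLE.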
A complete human-readable list of the currently defined capabilities can be found in the current capabilities(7) man page (the list here is for reference only).
In addition, there is the libcap library, which simplifies managing and checking capabilities. Besides the library API, the package includes the capsh utility, which, among other things, lets you display your capabilities.
# capsh --print
Current: = cap_setgid,cap_setuid,cap_net_bind_service+eip
Bounding set = cap_setgid,cap_setuid,cap_net_bind_service
Ambient set =
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
There are some confusing points here:
- Current - displays the effective, inheritable and permitted capabilities of the capsh process in the cap_to_text(3) format. In this format, capabilities are listed as clauses of the form “capability[,capability…]+(e|i|p)”, where “e” means effective, “i” inheritable, and “p” permitted. The list is not separated by the “,” symbol, as you might have guessed (cap_setgid+eip, cap_setuid+eip). A comma separates capabilities within one clause. The clauses themselves are separated by spaces. Another example with two clauses would be “= cap_sys_chroot+ep cap_net_bind_service+eip”. Likewise, the two clauses “= cap_net_bind_service+e cap_net_bind_service+ip” encode the same value as the single clause “cap_net_bind_service+eip”.
- Bounding set / Ambient set. To add to the confusion, these two lines contain only the list of capabilities in these sets, separated by commas. The cap_to_text format is not used here, because these lines do not describe the permitted, effective and inheritable sets, but only a single (bounding / ambient) set.
- Securebits: displays the task's securebits flags in decimal / hexadecimal / Verilog format (yes, everyone expects it here, since obviously every system administrator programs their own FPGAs and ASICs). The state of each securebit follows. The actual flags are defined as SECBIT_* in securebits.h and are also described in capabilities(7).
- This utility does not display the “NoNewPrivs” state, which can be viewed in /proc/<pid>/status. It is mentioned only in prctl(2), although it directly affects capabilities when used together with file capabilities (more on this below). NoNewPrivs is described as follows: “With no_new_privs set to 1, execve(2) promises not to grant privileges to do anything that could not have been done without the execve(2) call (for example, rendering the set-user-ID and set-group-ID mode bits, and file capabilities non-functional). Once set, the no_new_privs attribute cannot be unset. The setting of this attribute is inherited by children created by fork(2) and clone(2), and preserved across execve(2).” Kubernetes sets this flag to 1 when allowPrivilegeEscalation is false in the pod securityContext.
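To make the clause format from the “Current” line concrete, here is a small Python sketch that parses a simplified subset of it. It is an assumption-laden illustration, not a reimplementation of libcap: parse_cap_text is a hypothetical helper, and it only understands a bare “=” (clear everything) and clauses of the form name[,name...]+flags. The real format accepted by cap_from_text(3) has more operators.

```python
def parse_cap_text(text):
    """Parse a simplified subset of the cap_to_text(3) format.

    Returns a dict with the effective ("e"), inheritable ("i") and
    permitted ("p") sets as Python sets of capability names.
    """
    sets = {"e": set(), "i": set(), "p": set()}
    for clause in text.split():              # clauses are separated by spaces
        if clause == "=":                    # a bare "=" clears all three sets
            for members in sets.values():
                members.clear()
            continue
        names, sep, flags = clause.partition("+")
        if not sep or not flags or any(flag not in sets for flag in flags):
            raise ValueError(f"unsupported clause in this sketch: {clause!r}")
        for name in names.split(","):        # commas separate names within a clause
            for flag in flags:
                sets[flag].add(name)
    return sets


one_clause = parse_cap_text("cap_net_bind_service+eip")
two_clauses = parse_cap_text("= cap_net_bind_service+e cap_net_bind_service+ip")
print(one_clause == two_clauses)

current = parse_cap_text("= cap_setgid,cap_setuid,cap_net_bind_service+eip")
print(sorted(current["e"]))
```

The first print shows that the two spellings mentioned above describe the same sets; the second lists the effective set from the capsh output.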
When a new program is started through execve(2), the capabilities of the process are transformed using the formula given in capabilities(7):
P'(ambient) = (file is privileged) ? 0 : P(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & P(bounding)) | P'(ambient)
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable) [i.e., unchanged]
P'(bounding) = P(bounding) [i.e., unchanged]
where:
P() denotes the value of a thread capability set before the execve(2)
P'() denotes the value of a thread capability set after the execve(2)
F() denotes a file capability set
These rules describe the operations performed on each bit in all capability sets (ambient / permitted / effective / inheritable / bounding). Standard C syntax is used (& for bitwise AND, | for bitwise OR). P' is the process after execve(2). P is the process calling execve(2), before the call. F denotes the so-called “file capabilities” of the file being executed.
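The formula from capabilities(7) translates almost literally into code. The Python sketch below models the sets as integers; the class and function names (ThreadCaps, FileCaps, execve_transform) and the ALL_CAPS value are assumptions made for this illustration, and the special handling of UID 0 and of setuid binaries is deliberately left out.

```python
from dataclasses import dataclass

CAP_NET_RAW = 1 << 13          # capability number 13
ALL_CAPS = (1 << 41) - 1       # assumed "full" bounding set for this sketch


@dataclass(frozen=True)
class ThreadCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    bounding: int = ALL_CAPS
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False    # F(effective) is a single bit, not a set
    privileged: bool = False   # file has capabilities or is setuid/setgid


def execve_transform(p: ThreadCaps, f: FileCaps) -> ThreadCaps:
    """Apply the execve(2) capability formula to thread sets p and file sets f."""
    ambient = 0 if f.privileged else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective else ambient
    return ThreadCaps(
        permitted=permitted,
        inheritable=p.inheritable,   # unchanged
        effective=effective,
        bounding=p.bounding,         # unchanged
        ambient=ambient,
    )


# An unprivileged shell (no capabilities) executing a file with cap_net_raw+ep.
shell = ThreadCaps()
ping = FileCaps(permitted=CAP_NET_RAW, effective=True, privileged=True)
after = execve_transform(shell, ping)
print(f"permitted={after.permitted:#x} effective={after.effective:#x}")
# prints: permitted=0x2000 effective=0x2000
```

Changing the inputs (for example, clearing CAP_NET_RAW from the bounding set, or putting it into the ambient set and executing a file without capabilities) is a quick way to check how each term of the formula contributes.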
In addition, a process can change its inheritable, permitted, and effective sets programmatically with libcap at any time, according to the following rules:
- If the caller does not have CAP_SETPCAP, the new inheritable set must be a subset of P(inheritable) | P(permitted)
- (Since Linux 2.6.25) The new inheritable set must be a subset of P(inheritable) | P(bounding)
- The new permitted set must be a subset of P(permitted)
- The new effective set must be a subset of the new permitted set
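As a rough illustration, the checks can be written as subset tests on bitmasks. The sketch below follows the rules as worded in capabilities(7): the new inheritable set is compared against the combination (union) of the existing sets, and the new effective set against the new permitted set. The names Caps and capset_violations are hypothetical; the kernel performs these checks itself inside capset(2).

```python
from typing import List, NamedTuple

CAP_SETPCAP = 1 << 8
CAP_NET_BIND_SERVICE = 1 << 10
CAP_NET_RAW = 1 << 13


class Caps(NamedTuple):
    effective: int
    permitted: int
    inheritable: int


def is_subset(a: int, b: int) -> bool:
    """True if every bit set in a is also set in b."""
    return (a & ~b) == 0


def capset_violations(old: Caps, new: Caps, bounding: int) -> List[str]:
    """Return the rules a requested change would break (empty list: allowed)."""
    problems = []
    has_setpcap = bool(old.effective & CAP_SETPCAP)
    if not has_setpcap and not is_subset(new.inheritable, old.inheritable | old.permitted):
        problems.append("inheritable exceeds old inheritable | permitted")
    if not is_subset(new.inheritable, old.inheritable | bounding):
        problems.append("inheritable exceeds old inheritable | bounding")
    if not is_subset(new.permitted, old.permitted):
        problems.append("permitted exceeds old permitted")
    if not is_subset(new.effective, new.permitted):
        problems.append("effective exceeds new permitted")
    return problems


# A process that holds CAP_NET_BIND_SERVICE only in its permitted set.
old = Caps(effective=0, permitted=CAP_NET_BIND_SERVICE, inheritable=0)

# Raising it into the effective set is allowed.
print(capset_violations(old, old._replace(effective=CAP_NET_BIND_SERVICE), bounding=CAP_NET_BIND_SERVICE))

# Adding CAP_NET_RAW to the permitted set is not.
print(capset_violations(old, old._replace(permitted=CAP_NET_BIND_SERVICE | CAP_NET_RAW), bounding=CAP_NET_BIND_SERVICE))
```

The first call returns an empty list, the second reports the permitted-set rule.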
File capabilities
Sometimes a user with a limited set of rights needs to run a file that requires more privileges. Previously, this was achieved by setting the setuid bit (chmod +s ./executable) on a binary file. Such a file, if it belongs to root, has full root rights when executed by any user.
But this mechanism grants the file far too many privileges, so POSIX capabilities introduced a concept called “file capabilities”. They are stored as an extended file attribute called “security.capability”, so you need a file system that supports extended attributes (ext*, XFS, ReiserFS, Btrfs, overlay2, ...). Changing this attribute requires the CAP_SETFCAP capability (in the process's permitted set).
$ getfattr -m - -d `which ping`
# file: usr/bin/ping
security.capability=0sAQAAAgAgAAAAAAAAAAAAAAAAAAA=
$ getcap `which ping`
/usr/bin/ping = cap_net_raw+ep
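The base64 blob printed by getfattr (the “0s” prefix marks base64 encoding) can be decoded by hand to see where getcap gets its answer. The Python sketch below assumes the revision 2 on-disk layout of the attribute (a 32-bit little-endian magic word followed by two pairs of 32-bit permitted/inheritable words); decode_security_capability is a hypothetical helper and it rejects any other revision.

```python
import base64
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001

CAP_NET_RAW = 1 << 13


def decode_security_capability(getfattr_value):
    """Decode a security.capability value as printed by getfattr (0s... form)."""
    if not getfattr_value.startswith("0s"):
        raise ValueError("expected a base64 value with the '0s' prefix")
    raw = base64.b64decode(getfattr_value[2:])
    if len(raw) != 20:
        raise ValueError("this sketch only handles the 20-byte revision 2 layout")
    # magic_etc, then data[0].permitted, data[0].inheritable,
    # data[1].permitted, data[1].inheritable (all little-endian 32-bit words)
    magic, perm_lo, inh_lo, perm_hi, inh_hi = struct.unpack("<5I", raw)
    if magic & VFS_CAP_REVISION_MASK != VFS_CAP_REVISION_2:
        raise ValueError("unsupported capability revision")
    return {
        "permitted": perm_lo | (perm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
    }


# The value from the getfattr output above.
caps = decode_security_capability("0sAQAAAgAgAAAAAAAAAAAAAAAAAAA=")
print(caps)
print(caps["permitted"] == CAP_NET_RAW)
```

For the ping example, the permitted word is 0x2000 (bit 13, CAP_NET_RAW), the inheritable word is empty, and the effective flag is set, which matches the cap_net_raw+ep reported by getcap.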
Special cases and comments
Of course, in reality things are not that simple, and the capabilities(7) man page describes several special cases. Probably the most important of them are:
- The setuid bit and file capabilities are ignored if NoNewPrivs is set, the file system is mounted with nosuid, or the process calling execve is being traced by ptrace. File capabilities are also ignored when the kernel is booted with the no_file_caps option.
- A “capability-dumb” file is a binary that was converted from a setuid file to a file with file capabilities without changing its source code. Such files are often produced by setting +ep capabilities on them, for example “setcap cap_net_bind_service+ep ./binary”. The important part is “e” - effective. After execve, these capabilities are added to both the permitted and the effective sets, so the executable is ready to perform the privileged operation right away. In contrast, a “capability-smart” file that uses libcap or similar functionality can call cap_set_proc(3) (or capset) to set the “effective” or “inheritable” bits at any time, provided the capability is already in the “permitted” set. Therefore “setcap cap_net_bind_service+p ./binary” is enough for a “smart” file, since it can raise the necessary capabilities in its effective set itself before invoking the privileged operation. See the sample code.
- setuid-root files continue to work, granting all root privileges when run by a non-root user. But if they have file capabilities set, only those capabilities are granted. You can also create a setuid file with an empty capability set, which makes it run as a user with UID 0 without any capabilities. There are special cases for the root user when running a setuid-root file and when various securebits flags are set (see the man page).
- The bounding set masks permitted capabilities, but not inheritable ones. Recall that P'(permitted) = F(permitted) & P(bounding). If a thread has a capability in its inheritable set that is not in its bounding set, it can still gain that capability in its permitted set by executing a file that has the capability in its inheritable set - P'(permitted) = P(inheritable) & F(inheritable).
- Executing a program that changes the UID or GID through the set-user-ID or set-group-ID bits, or executing a program that has any file capabilities set, clears the ambient set. Capabilities are added to the ambient set with the PR_CAP_AMBIENT prctl. They must already be present in both the permitted and the inheritable sets of the process.
- If a process with a UID other than 0 performs an execve(2), all capabilities present in its permitted and effective sets are cleared (unless file or ambient capabilities grant them again, as the execve formula describes).
- If SECBIT_KEEP_CAPS (or the broader SECBIT_NO_SETUID_FIXUP) is not set, changing the UID from 0 to nonzero clears all capabilities from the permitted, effective, and ambient sets.
So…
If the official nginx container, ingress-nginx, or your own container stops or restarts with the error:
bind() to 0.0.0.0:80 failed (13: Permission denied)
... it means that there was an attempt to listen on port 80 as an unprivileged (non-zero UID) user, and CAP_NET_BIND_SERVICE was not in the current effective set. To obtain this capability, you must use xattr and set (using setcap) at least cap_net_bind_service+ie on the nginx binary. This file capability will be combined with the inheritable set (specified together with the bounding set through the pod securityContext / capabilities / add / NET_BIND_SERVICE) and will also be placed in the permitted set. The result is cap_net_bind_service+pie.
This all works as long as securityContext / allowPrivilegeEscalation is set to true and the docker / rkt storage driver (see the Docker documentation) supports xattrs.
If nginx were capability-smart, cap_net_bind_service+i would be enough. It could then use libcap to raise the capability from the permitted set to the effective set, ending up with cap_net_bind_service+pie.
Besides using xattr, the only way to get cap_net_bind_service in a non-root container is to let Docker set ambient capabilities. But as of April 2019, this has not yet been implemented.
Code examples
Here is sample code that uses libcap to add CAP_NET_BIND_SERVICE to the effective set. It requires cap_net_bind_service+p on the binary file.