eXpress Data Path (XDP) lets you process traffic arbitrarily on Linux network interfaces before packets reach the kernel network stack. XDP is used for DDoS protection (Cloudflare), sophisticated filtering, and statistics collection (Netflix). XDP programs are executed by the eBPF virtual machine, so both their code and the kernel functions available to them are restricted, depending on the type of filter.
This article is meant to fill the gaps in the many existing XDP materials. First, they provide ready-made code that sidesteps the peculiarities of XDP: it is either already prepared for verification or too simple to cause problems. When you then try to write your own code from scratch, you have no idea what to do about typical errors. Second, they do not cover ways to test XDP locally without VMs or hardware, even though these have their own pitfalls. The text is intended for programmers familiar with networking and Linux who are interested in XDP and eBPF.
In this part, we will examine in detail how an XDP filter is built and how to test it, then write a simple version of the well-known SYN cookie mechanism at the packet processing level. For now we will not build a "white list" of verified clients, keep counters, or manage the filter: logs are enough.
We will write in C - this is not fashionable, but practical. All code is available on GitHub via the link at the end and is divided into commits according to the stages described in the article.
Disclaimer. Over the course of the article, a mini-solution for repelling DDoS attacks will be developed, because this is a realistic task for XDP and it is my field. However, the main goal is to understand the technology; this is not a guide to building ready-made protection. The tutorial code is not optimized and omits some nuances.
XDP at a Glance
I will only outline key points in order not to duplicate the documentation and existing articles.
So, the filter code is loaded into the kernel. Incoming packets are passed to the filter. The filter must then make a decision: pass the packet on to the kernel (XDP_PASS), drop the packet (XDP_DROP), or send it back out (XDP_TX). The filter can modify the packet, which is especially relevant for XDP_TX. It can also abort the program (XDP_ABORTED) and drop the packet, but this is the equivalent of assert(0), meant for debugging.
The eBPF (extended Berkeley Packet Filter) virtual machine is deliberately simple so that the kernel can verify that the code does not loop forever and does not corrupt memory it does not own. The combined restrictions and checks are:
- Loops (backward jumps) are forbidden.
- There is a stack for data, but no functions (all C functions must be inlined).
- Memory access outside the stack and the packet buffer is prohibited.
- Code size is limited, but in practice this rarely matters.
- Only calls to special kernel functions (eBPF helpers) are allowed.
Building and installing a filter looks like this:
- The source code (for example, kernel.c) is compiled into an object file (kernel.o) for the eBPF virtual machine architecture. As of October 2019, compiling to eBPF is supported by Clang and promised for GCC 10.1.
- If the object code refers to kernel structures (for example, tables and counters), zeros stand in for their IDs, so such code cannot be executed as is. Before loading it into the kernel, these zeros must be replaced with the IDs of specific objects created through kernel calls (that is, the code must be linked). You can do this with external utilities, or you can write a program that links and loads a specific filter.
- The kernel verifies the loaded program. It checks that there are no loops and no accesses outside the packet and the stack. If the verifier cannot prove that the code is correct, the program is rejected, so you have to know how to please it.
- After successful verification, the kernel compiles the eBPF object code into machine code for the system architecture (just-in-time).
- The program is attached to an interface and starts processing packets.
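To make the verification step more tangible, here is a minimal user-space sketch in C of the kind of bounds check the verifier expects to see before any packet access. It is not eBPF code and does not reproduce the verifier itself; the struct buf type and the checked_at() helper are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the packet bounds an XDP program receives:
 * data points at the first byte, data_end one past the last byte. */
struct buf {
    const uint8_t *data;
    const uint8_t *data_end;
};

/* Return a pointer to `len` bytes at offset `off`, or NULL if any of
 * those bytes would lie outside [data, data_end). */
static const uint8_t *checked_at(const struct buf *b, size_t off, size_t len)
{
    size_t size = (size_t)(b->data_end - b->data);

    if (off > size || len > size - off)
        return NULL; /* a real filter would return an XDP action here */
    return b->data + off;
}
```

A real filter makes the same kind of comparison against the end of the packet before dereferencing a header; the verifier rejects the program if some execution path reaches the access without it.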
Since XDP runs in the kernel, debugging relies on trace logs and on the packets themselves that the program filters or generates. On the other hand, eBPF keeps loaded code safe for the system, so you can experiment with XDP right on your local Linux machine.
Environment preparation
Build
Clang cannot emit object code for the eBPF architecture directly, so the process consists of two steps:
- Compile the C code into LLVM bitcode (clang -emit-llvm).
- Convert the bitcode into eBPF object code (llc -march=bpf -filetype=obj).
When writing a filter, a couple of files with helper functions and macros from the kernel selftests come in handy. It is important that they match the kernel version (KVER). Download them into helpers/:
export KVER=v5.3.7
export BASE=https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/plain/tools/testing/selftests/bpf
wget -P helpers --content-disposition "${BASE}/bpf_helpers.h?h=${KVER}" "${BASE}/bpf_endian.h?h=${KVER}"
unset KVER BASE
      Makefile for Arch Linux (kernel 5.3.7):
CLANG ?= clang
LLC ?= llc

KDIR ?= /lib/modules/$(shell uname -r)/build
ARCH ?= $(subst x86_64,x86,$(shell uname -m))

CFLAGS = \
    -Ihelpers \
    \
    -I$(KDIR)/include \
    -I$(KDIR)/include/uapi \
    -I$(KDIR)/include/generated/uapi \
    -I$(KDIR)/arch/$(ARCH)/include \
    -I$(KDIR)/arch/$(ARCH)/include/generated \
    -I$(KDIR)/arch/$(ARCH)/include/uapi \
    -I$(KDIR)/arch/$(ARCH)/include/generated/uapi \
    -D__KERNEL__ \
    \
    -fno-stack-protector -O2 -g

xdp_%.o: xdp_%.c Makefile
	$(CLANG) -c -emit-llvm $(CFLAGS) $< -o - | \
	$(LLC) -march=bpf -filetype=obj -o $@

.PHONY: all clean

all: xdp_filter.o

clean:
	rm -f ./*.o
KDIR contains the path to the kernel headers, and ARCH is the system architecture. Paths and tools may vary slightly between distributions.
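An example of the differences on another distribution (kernel headers 4.19.0-6, llc-7):
# different tool name
CLANG ?= clang
LLC ?= llc-7

# different headers directory
KDIR ?= /usr/src/linux-headers-$(shell uname -r)
ARCH ?= $(subst x86_64,x86,$(shell uname -m))

# additional -I directories
CFLAGS = \
    -Ihelpers \
    \
    -I/usr/src/linux-headers-4.19.0-6-common/include \
    -I/usr/src/linux-headers-4.19.0-6-common/arch/$(ARCH)/include \
    # the rest is unchanged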
CFLAGS lists a directory with the auxiliary headers and several directories with kernel headers. The __KERNEL__ symbol means that the UAPI (userspace API) headers are processed as for kernel code, since the filter runs in the kernel.
Stack protection can be disabled (-fno-stack-protector), because the eBPF verifier checks that stack accesses stay in bounds anyway. Optimization should be enabled right away, because the size of eBPF bytecode is limited.
Let's start with a filter that passes all packets and does nothing else:
#include <uapi/linux/bpf.h>
#include <bpf_helpers.h>

SEC("prog")
int xdp_main(struct xdp_md* ctx) {
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
The make command builds xdp_filter.o. Now, where do we test it?
Test stand
The stand should include two interfaces: one with the filter attached and one from which packets are sent. Both must be full-fledged Linux devices with their own IP addresses, so that we can check how regular applications work with our filter.
Devices of the veth (virtual Ethernet) type suit us: they come as a pair of virtual network interfaces "connected" directly to each other. You can create them like this (in this section, all ip commands are run as root):
ip link add xdp-remote type veth peer name xdp-local
Here xdp-remote and xdp-local are device names. The filter will be attached to xdp-local (192.0.2.1/24), and incoming traffic will be sent from xdp-remote (192.0.2.2/24). However, there is a problem: both interfaces are on the same machine, and Linux will not send traffic to one of them through the other. You can work around this with tricky iptables rules, but they would have to modify the packets, which is inconvenient when debugging. It is better to use network namespaces (hereinafter netns).
The network namespace contains a set of interfaces, routing tables, and NetFilter rules, isolated from similar objects in other netns. Each process runs in a namespace, and only objects of this netns are accessible to it. By default, the system has a single network namespace for all objects, so you can work on Linux and not know about netns.
Create a new namespace, xdp-test, and move xdp-remote into it:
ip netns add xdp-test
ip link set dev xdp-remote netns xdp-test
A process running in xdp-test will then not "see" xdp-local (it stays in the default netns), and when it sends a packet to 192.0.2.1 the packet will go out through xdp-remote, because that is the only interface in 192.0.2.0/24 available to the process. The same works in the opposite direction.
When an interface is moved between netns, it goes down and loses its address. To configure an interface inside a netns, run ip ... in that namespace via ip netns exec:
ip netns exec xdp-test \
    ip address add 192.0.2.2/24 dev xdp-remote
ip netns exec xdp-test \
    ip link set xdp-remote up
As you can see, this is no different from configuring xdp-local in the default namespace:
ip address add 192.0.2.1/24 dev xdp-local
ip link set xdp-local up
If you run tcpdump -tnevi xdp-local, you can see that packets sent from xdp-test are delivered to this interface:
ip netns exec xdp-test ping 192.0.2.1
It is convenient to run a shell in xdp-test. The repository has a script that automates work with the stand: for example, sudo ./stand up sets the stand up and sudo ./stand down tears it down.
Trace
The filter is attached to the device as follows:
ip -force link set dev xdp-local xdp object xdp_filter.o verbose
The -force option is needed to attach a new program when another one is already attached. "No news is good news" does not apply to this command: the output is voluminous in any case. The verbose keyword is optional, but with it the command prints a report from the code verifier along with an assembly listing:
Verifier analysis:

0: (b7) r0 = 2
1: (95) exit
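The r0 = 2 in the listing is the program's return value. As a user-space sketch, assuming the action values of enum xdp_action from the kernel's linux/bpf.h (mirrored here by hand rather than included), a small hypothetical helper xdp_action_name() maps the number back to a name:

```c
#include <stddef.h>

/* Hand-copied mirror of enum xdp_action from <linux/bpf.h>;
 * the MODEL_ prefix marks these as illustration-only names. */
enum xdp_action_model {
    MODEL_XDP_ABORTED = 0,
    MODEL_XDP_DROP = 1,
    MODEL_XDP_PASS = 2,
    MODEL_XDP_TX = 3,
    MODEL_XDP_REDIRECT = 4,
};

/* Hypothetical helper: translate the value left in r0 into a name. */
static const char *xdp_action_name(unsigned int r0)
{
    static const char *const names[] = {
        [MODEL_XDP_ABORTED] = "XDP_ABORTED",
        [MODEL_XDP_DROP] = "XDP_DROP",
        [MODEL_XDP_PASS] = "XDP_PASS",
        [MODEL_XDP_TX] = "XDP_TX",
        [MODEL_XDP_REDIRECT] = "XDP_REDIRECT",
    };

    if (r0 >= sizeof names / sizeof names[0])
        return "unknown";
    return names[r0];
}
```

With this mapping, the two-instruction listing reads as "return XDP_PASS", which is exactly what the source of the filter says.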
Detach the program from the interface:
ip link set dev xdp-local xdp off
In the script, these are the sudo ./stand attach and sudo ./stand detach commands.
With the filter attached, you can check that ping still works, but is the program actually running? Let's add logging. The bpf_trace_printk() function is similar to printf(), but it supports at most three arguments besides the format string and a limited set of format specifiers. The bpf_printk() macro simplifies the call.
 SEC("prog")
 int xdp_main(struct xdp_md* ctx) {
+    bpf_printk("got packet: %p\n", ctx);
     return XDP_PASS;
 }
The output goes to the kernel trace channel, which you need to enable:
echo -n 1 | sudo tee /sys/kernel/debug/tracing/options/trace_printk
View the message stream:
cat /sys/kernel/debug/tracing/trace_pipe
Both of these commands are run by sudo ./stand log.
Ping should now trigger the following messages in it:
<...>-110930 [004] ..s1 78803.244967: 0: got packet: 00000000ac510377
If you look closely at the verifier's output, you will notice strange calculations:
0: (bf) r3 = r1
1: (18) r1 = 0xa7025203a7465
3: (7b) *(u64 *)(r10 -8) = r1
4: (18) r1 = 0x6b63617020746f67
6: (7b) *(u64 *)(r10 -16) = r1
7: (bf) r1 = r10
8: (07) r1 += -16
9: (b7) r2 = 16
10: (85) call bpf_trace_printk#6
<...>
The fact is that eBPF programs do not have a data section, so the only way to encode a format string is with the immediate arguments of the VM commands:
$ python -c "import binascii; print(bytes(reversed(binascii.unhexlify('0a7025203a74656b63617020746f67'))))"
b'got packet: %p\n'
      For this reason, the debug output greatly inflates the resulting code.
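The immediates in the listing can also be reproduced by hand. The sketch below packs the format string into little-endian 64-bit words, the way the compiled code stores it on the stack; pack_le64() is a hypothetical helper written for this illustration, and a little-endian layout is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack up to 8 bytes of `s`, starting at offset `off`, into a 64-bit
 * value with the first byte in the least significant position.
 * Bytes past `len` are treated as zero padding. */
static uint64_t pack_le64(const char *s, size_t len, size_t off)
{
    uint64_t v = 0;

    for (size_t i = 0; i < 8 && off + i < len; i++)
        v |= (uint64_t)(unsigned char)s[off + i] << (8 * i);
    return v;
}
```

For the 16-byte string "got packet: %p\n" (terminating zero included), the word at offset 0 comes out as 0x6b63617020746f67 and the word at offset 8 as 0xa7025203a7465, matching instructions 4 and 1 of the listing.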
Sending packets from XDP
Let's change the filter so that it sends all incoming packets back. This is incorrect from a networking point of view, since the addresses in the headers would have to be changed, but for now what matters is that it works at all.
     bpf_printk("got packet: %p\n", ctx);
-    return XDP_PASS;
+    return XDP_TX;
 }
Run tcpdump on xdp-remote. It should show identical outgoing and incoming ICMP Echo Requests and stop showing ICMP Echo Replies. But it does not. It turns out that for XDP_TX to work in a program on xdp-local, the peer interface xdp-remote must also have a program attached, even an empty one, and the interface must be up.
By the way, the perf events mechanism, which uses the same virtual machine, lets you trace a packet's path through the kernel; in other words, eBPF is used to debug eBPF.
You have to make good out of evil, because there is nothing more to make of it.
$ sudo perf trace --call-graph dwarf -e 'xdp:*'
   0.000 ping/123455 xdp:xdp_bulk_tx:ifindex=19 action=TX sent=0 drops=1 err=-6
                                     veth_xdp_flush_bq ([veth])
                                     veth_xdp_flush_bq ([veth])
                                     veth_poll ([veth])
                                     <...>
What is code 6?
$ errno 6
ENXIO 6 No such device or address
The function veth_xdp_flush_bq() receives the error code from veth_xdp_xmit(); searching there for ENXIO, we find a comment.
Put the minimal filter (XDP_PASS) back into a file named xdp_dummy.c, add it to the Makefile, and attach it to xdp-remote:
ip netns exec xdp-test \
    ip link set dev xdp-remote xdp object xdp_dummy.o
Now tcpdump shows what is expected:
62:57:8e:70:44:64 > 26:0e:25:37:8f:96, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 13762, offset 0, flags [DF], proto ICMP (1), length 84)
    192.0.2.2 > 192.0.2.1: ICMP echo request, id 46966, seq 1, length 64
62:57:8e:70:44:64 > 26:0e:25:37:8f:96, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 13762, offset 0, flags [DF], proto ICMP (1), length 84)
    192.0.2.2 > 192.0.2.1: ICMP echo request, id 46966, seq 1, length 64
If only ARP is shown instead, remove the filters (sudo ./stand detach does this), start ping, then attach the filters again and retry. The problem is that the XDP_TX filter affects ARP as well, and if the stack in the xdp-test namespace has managed to "forget" the MAC address of 192.0.2.1, it will not be able to resolve this IP.
Problem statement
Let's move on to the stated task: implementing the SYN cookie mechanism in XDP.
SYN flood remains a popular DDoS attack to this day. It works as follows. When a connection is being established (TCP handshake), the server receives a SYN, allocates resources for the future connection, responds with a SYNACK packet, and waits for an ACK. The attacker simply sends SYN packets from spoofed addresses at a rate of thousands per second from each host of a botnet with thousands of hosts. The server has to allocate resources as soon as a packet arrives but releases them only after a long timeout; as a result, memory or limits are exhausted, new connections are not accepted, and the service becomes unavailable.
If the server does not allocate resources on SYN but only responds with a SYNACK, how can it tell that an ACK arriving later belongs to a SYN that was never stored? After all, an attacker can also generate fake ACKs. The essence of the SYN cookie is to encode the connection parameters in seqnum as a hash of the addresses, ports, and a changing salt. If the ACK arrives before the salt changes, the server can compute the hash again and compare it with acknum. The attacker cannot forge acknum, since the salt includes a secret, and will not have time to brute-force it because of the limited bandwidth.
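The idea can be sketched in user-space C. This is only an illustration under loud assumptions: the hash is 32-bit FNV-1a, chosen here for brevity (it is not a cryptographic hash and not what the Linux kernel uses), and toy_cookie() and toy_ack_matches() are hypothetical names. The comparison uses cookie + 1 because, in the TCP handshake, the client acknowledges the server's seqnum plus one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mix one 32-bit word into an FNV-1a state, least significant byte first. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t word)
{
    for (int i = 0; i < 4; i++) {
        h ^= (word >> (8 * i)) & 0xffu;
        h *= 16777619u; /* 32-bit FNV prime */
    }
    return h;
}

/* Toy cookie: a hash of addresses, ports and a secret, changing salt.
 * The server would send this value as seqnum in its SYNACK. */
static uint32_t toy_cookie(uint32_t saddr, uint32_t daddr,
                           uint16_t sport, uint16_t dport, uint32_t salt)
{
    uint32_t h = 2166136261u; /* 32-bit FNV offset basis */

    h = fnv1a_mix(h, saddr);
    h = fnv1a_mix(h, daddr);
    h = fnv1a_mix(h, ((uint32_t)sport << 16) | dport);
    h = fnv1a_mix(h, salt);
    return h;
}

/* An ACK is accepted if its acknum equals the recomputed cookie plus one
 * (arithmetic wraps modulo 2^32, as sequence numbers do). */
static bool toy_ack_matches(uint32_t acknum, uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport, uint32_t salt)
{
    return acknum == toy_cookie(saddr, daddr, sport, dport, salt) + 1u;
}
```

Nothing is stored between the SYN and the ACK: the server recomputes the cookie from the ACK's own addresses and ports, and an ACK that arrives after the salt has changed simply fails the comparison.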
SYN cookies have long been implemented in the Linux kernel and can even be enabled automatically if SYNs arrive too fast and in bulk.
TCP provides data transfer as a stream of bytes; for example, HTTP requests are sent over TCP. The stream is transmitted piecewise in packets. All TCP packets carry boolean flags and 32-bit sequence numbers:
- The combination of flags determines the role of a particular packet. The SYN flag means that this is the sender's first packet in the connection. The ACK flag means that the sender has received all of the connection's data up to byte acknum. A packet can have several flags and is named after their combination, for example, a SYNACK packet.
- The sequence number (seqnum) defines the offset in the data stream of the first byte transmitted in this packet. For example, if the first packet with X bytes of data had this number equal to N, the next packet with new data will have N + X. At the start of the connection, each side picks this number arbitrarily.
- The acknowledgment number (acknum) is the same kind of offset as seqnum, but it refers not to the byte being transmitted but to the first byte from the recipient that the sender has not yet seen.
At the start of a connection, the parties must agree on seqnum and acknum. The client sends a SYN packet with its seqnum = X. The server responds with a SYNACK packet, in which it writes its own seqnum = Y and sets acknum = X + 1. The client responds to the SYNACK with an ACK packet, where seqnum = X + 1 and acknum = Y + 1. After that, the actual data transfer begins.
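The exchange can be written down as data. The following user-space sketch fills in the flags and numbers of the three handshake packets for given X and Y; the struct names and make_handshake() are hypothetical, while the flag bit values are the standard ones from the TCP header.

```c
#include <stdint.h>

/* Standard TCP header flag bits (only the two used in the handshake). */
enum {
    TCP_FLAG_SYN = 0x02,
    TCP_FLAG_ACK = 0x10,
};

/* One handshake packet reduced to the fields discussed in the text. */
struct tcp_step {
    uint8_t flags;
    uint32_t seqnum;
    uint32_t acknum;
};

struct tcp_handshake {
    struct tcp_step syn;    /* client -> server */
    struct tcp_step synack; /* server -> client */
    struct tcp_step ack;    /* client -> server */
};

/* x is the client's initial seqnum, y is the server's.
 * Unsigned arithmetic wraps modulo 2^32, as sequence numbers do. */
static struct tcp_handshake make_handshake(uint32_t x, uint32_t y)
{
    struct tcp_handshake h;

    h.syn = (struct tcp_step){ TCP_FLAG_SYN, x, 0 };
    h.synack = (struct tcp_step){ TCP_FLAG_SYN | TCP_FLAG_ACK, y, x + 1u };
    h.ack = (struct tcp_step){ TCP_FLAG_ACK, x + 1u, y + 1u };
    return h;
}
```

For example, with X = 1000 and Y = 5000 the SYNACK carries acknum 1001 and the final ACK carries seqnum 1001 and acknum 5001.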
If the peer does not acknowledge receipt of a packet, TCP resends it after a timeout.
SYN cookies have their costs. First, if a SYNACK or ACK is lost, the client has to wait for retransmission, so connection setup slows down. Second, the SYN packet, and only it, carries a number of options that affect how the connection operates afterwards. A server that does not remember incoming SYN packets thereby ignores these options, and the client will not send them again in later packets. TCP can still work in this case, but the quality of the connection degrades, at least at the initial stage.
In terms of packets, the XDP program should do the following:
- respond to a SYN with a SYNACK carrying the cookie;
- respond to an ACK with an RST (drop the connection);
- drop all other packets.
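Pseudocode of the algorithm, including packet parsing:
If this is not Ethernet,
    pass the packet.
If this is not IPv4,
    pass the packet.
If the address is in the table of verified clients,        (*)
    decrement the counter of remaining checks,
    pass the packet.
If this is not TCP,
    drop the packet.        (**)
If this is a SYN,
    respond with a SYN-ACK carrying the cookie.
If this is an ACK,
    if acknum does not contain the cookie,
        drop the packet.
    Add the address to the table with N remaining checks.        (*)
    Respond with RST.        (**)
Otherwise, drop the packet.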
A single (*) marks the points where system state has to be managed. At the first stage we can do without them and simply implement the TCP handshake with a SYN cookie generated as seqnum.
At the points marked (**), as long as we have no table, we will pass the packet.
TCP handshake implementation
Parsing the packet and verifying the code
We will need the network header structures: Ethernet (uapi/linux/if_ether.h), IPv4 (uapi/linux/ip.h), and TCP (uapi/linux/tcp.h). I could not include the last one because of errors related to atomic64_t, so I had to copy the necessary definitions into the code.
All functions that are factored out in C for readability must be inlined at the call site, since the eBPF verifier in the kernel prohibits backward jumps, which in practice means loops and function calls.
#define INTERNAL static __attribute__((always_inline))
The LOG() macro disables printing in the release build.
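The macro's definition is not shown in the text, so the following is only a hypothetical user-space analogue of such a switch, not the author's code. It assumes a made-up XDP_SKETCH_DEBUG symbol and uses fprintf() where a filter would call bpf_printk().

```c
#include <stdio.h>

/* Hypothetical build switch: compile with -DXDP_SKETCH_DEBUG to get logs. */
#ifdef XDP_SKETCH_DEBUG
#define LOG(...) fprintf(stderr, __VA_ARGS__)
#else
/* Release build: the call disappears and its arguments are never evaluated. */
#define LOG(...) do { } while (0)
#endif
```

Because the release variant expands to an empty statement, no format string ends up in the program, which matters given how much the debug output inflates the generated code.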
The program is a pipeline of functions. Each one receives a packet in which the header of the corresponding layer has already been located; for example, process_ether() expects ether to be filled in. Depending on its analysis of the fields, a function can hand the packet over to the next layer up. The result of a function is an XDP action. For now, the SYN and ACK handlers pass all packets.
struct Packet {
    struct xdp_md* ctx;

    struct ethhdr* ether;
    struct iphdr* ip;
    struct tcphdr* tcp;
};

INTERNAL int process_tcp_syn(struct Packet* packet) { return XDP_PASS; }
INTERNAL int process_tcp_ack(struct Packet* packet) { return XDP_PASS; }
INTERNAL int process_tcp(struct Packet* packet) { ... }
INTERNAL int process_ip(struct Packet* packet) { ... }

INTERNAL int
process_ether(struct Packet* packet) {
    struct ethhdr* ether = packet->ether;

    LOG("Ether(proto=0x%x)", bpf_ntohs(ether->h_proto));

    /* h_proto is in network byte order, so convert the constant */
    if (ether->h_proto != bpf_htons(ETH_P_IP)) {
        return XDP_PASS;
    }

    // B
    struct iphdr* ip = (struct iphdr*)(ether + 1);
    if ((void*)(ip + 1) > (void*)packet->ctx->data_end) {
        return XDP_DROP; /* malformed packet */
    }

    packet->ip = ip;
    return process_ip(packet);
}

SEC("prog")
int xdp_main(struct xdp_md* ctx) {
    struct Packet packet;
    packet.ctx = ctx;

    // A
    struct ethhdr* ether = (struct ethhdr*)(void*)ctx->data;
    if ((void*)(ether + 1) > (void*)ctx->data_end) {
        return XDP_PASS;
    }

    packet.ether = ether;
    return process_ether(&packet);
}
Note the checks marked A and B. If you comment out A, the program still compiles, but loading it fails with a verification error:
Verifier analysis:

<...>
11: (7b) *(u64 *)(r10 -48) = r1
12: (71) r3 = *(u8 *)(r7 +13)
invalid access to packet, off=13 size=1, R7(id=0,off=0,r=0)
R7 offset is outside of the packet
processed 11 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

Error fetching program/map!
The key line is invalid access to packet, off=13 size=1, R7(id=0,off=0,r=0): there are execution paths on which the byte at offset 13 from the start of the buffer lies outside the packet. It is hard to tell from the listing which source line is meant, but there is an instruction number (12), and a disassembler can show the corresponding lines of source code:
llvm-objdump -S xdp_filter.o | less
In this case, it points to the line
LOG("Ether(proto=0x%x)", bpf_ntohs(ether->h_proto));
which makes it clear that the problem is with ether. If only it were always that easy.
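The idea behind checks A and B can be tried outside the kernel. The sketch below is a userspace illustration, not the article's code: struct fake_ctx and read_ether_proto() are hypothetical names, and the context structure only imitates the data and data_end boundaries of struct xdp_md.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xdp_md: just the two packet boundaries. */
struct fake_ctx {
    const uint8_t* data;
    const uint8_t* data_end;
};

enum { ETHER_HEADER_SIZE = 14, ETHER_PROTO_OFFSET = 12 };

/* Returns the EtherType in host byte order, or -1 if the header is truncated. */
static int read_ether_proto(const struct fake_ctx* ctx) {
    const uint8_t* ether = ctx->data;

    /* Same shape as check A: establish that the whole header lies inside
     * the packet before touching any of its bytes. */
    if (ether + ETHER_HEADER_SIZE > ctx->data_end) {
        return -1;
    }

    /* h_proto occupies offsets 12 and 13; offset 13 is the byte
     * named in the verifier message above. */
    return (ether[ETHER_PROTO_OFFSET] << 8) | ether[ETHER_PROTO_OFFSET + 1];
}
```

In userspace a missing check merely risks a crash at run time; the verifier, by contrast, refuses to load a program in which any path reaches the read without the comparison against data_end.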
Reply to SYN
The goal at this stage is to build a correct SYNACK packet with a fixed seqnum, which will later be replaced by a SYN cookie. All changes are made in process_tcp_syn() and its surroundings.
Packet validation
Oddly enough, the most remarkable line here is this one, or rather the comment on it:
/* Required to verify checksum calculation */
const void* data_end = (const void*)ctx->data_end;
The first version of the code was written against kernel 5.1, whose verifier treated data_end and (const void*)ctx->data_end differently. By the time this article was written, kernel 5.3.1 no longer had this problem. Perhaps the compiler accessed the local variable differently from the field. The moral: when expressions are deeply nested, simplifying the code can help.
Next come routine length checks to satisfy the verifier; MAX_CSUM_BYTES is explained below.
const u32 ip_len = ip->ihl * 4;
if ((void*)ip + ip_len > data_end) {
    return XDP_DROP; /* malformed packet */
}
if (ip_len > MAX_CSUM_BYTES) {
    return XDP_ABORTED; /* implementation limitation */
}

const u32 tcp_len = tcp->doff * 4;
if ((void*)tcp + tcp_len > (void*)ctx->data_end) {
    return XDP_DROP; /* malformed packet */
}
if (tcp_len > MAX_CSUM_BYTES) {
    return XDP_ABORTED; /* implementation limitation */
}
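The length fields themselves can be explored in a plain C program. The sketch below is an illustration rather than the article's code: ipv4_header_len() is a hypothetical helper that derives the header size from the first byte of an IPv4 header and, as an extra precaution not present in the listing above, also rejects values below the 20-byte minimum.

```c
#include <stddef.h>
#include <stdint.h>

enum { IPV4_MIN_HEADER = 20 };

/* Hypothetical helper: header size in bytes, or 0 if the packet is malformed.
 * first_byte is the first byte of the IPv4 header: version in the high
 * nibble, IHL (header length in 32-bit words) in the low nibble. */
static size_t ipv4_header_len(uint8_t first_byte, size_t bytes_available) {
    const size_t len = (size_t)(first_byte & 0x0f) * 4;

    if (len < IPV4_MIN_HEADER) {
        return 0; /* IHL below 5 cannot hold the fixed part of the header */
    }
    if (len > bytes_available) {
        return 0; /* header claims more bytes than the packet contains */
    }
    return len; /* at most 15 * 4 = 60 bytes */
}
```

The TCP data offset field works the same way, so a 4-bit field bounds both header sizes at 60 bytes.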
Turning the packet around
Fill in seqnum and acknum, set ACK (SYN is already set):
const u32 cookie = 42;
tcp->ack_seq = bpf_htonl(bpf_ntohl(tcp->seq) + 1);
tcp->seq = bpf_htonl(cookie);
tcp->ack = 1;
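The arithmetic on the sequence numbers can be checked in userspace. In this sketch, struct seq_pair and synack_numbers() are hypothetical names introduced for illustration, and the standard htonl()/ntohl() functions play the role of bpf_htonl()/bpf_ntohl().

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical container for the two fields, both in network byte order. */
struct seq_pair {
    uint32_t seq;
    uint32_t ack_seq;
};

/* Given the client's SYN sequence number (network order) and our cookie
 * (host order), compute the numbers to place into the SYNACK. */
static struct seq_pair synack_numbers(uint32_t client_seq_be, uint32_t cookie) {
    struct seq_pair out;

    /* The SYN flag consumes one sequence number, so we acknowledge seq + 1.
     * uint32_t arithmetic wraps modulo 2^32, as TCP sequence numbers do. */
    out.ack_seq = htonl(ntohl(client_seq_be) + 1);
    out.seq = htonl(cookie);
    return out;
}
```

When the fields are rewritten in place, as in the listing above, the order matters: the client's seq has to be read before it is overwritten with the cookie.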
Swap the TCP ports, the IP addresses and the MAC addresses. The standard library is not available in an XDP program, so memcpy() is a macro that hides a Clang intrinsic.
const u16 temp_port = tcp->source;
tcp->source = tcp->dest;
tcp->dest = temp_port;

const u32 temp_ip = ip->saddr;
ip->saddr = ip->daddr;
ip->daddr = temp_ip;

struct ethhdr temp_ether = *ether;
memcpy(ether->h_dest, temp_ether.h_source, ETH_ALEN);
memcpy(ether->h_source, temp_ether.h_dest, ETH_ALEN);
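The compiler builtin can be tried without any kernel headers. The sketch below assumes GCC or Clang; struct eth_header, MAC_LEN and swap_mac() are hypothetical stand-ins for struct ethhdr, ETH_ALEN and the inline code above. The builtin is called directly here to avoid clashing with libc's own memcpy() declaration, whereas the XDP program would hide it behind a macro.

```c
#include <stdint.h>

#define MAC_LEN 6 /* same value as ETH_ALEN */

/* Minimal stand-in for struct ethhdr. */
struct eth_header {
    uint8_t h_dest[MAC_LEN];
    uint8_t h_source[MAC_LEN];
    uint16_t h_proto;
};

/* Swap source and destination MAC addresses in place. */
static void swap_mac(struct eth_header* ether) {
    uint8_t temp[MAC_LEN];

    /* With a constant size the compiler expands the builtin into a few
     * loads and stores, so no library call is emitted. */
    __builtin_memcpy(temp, ether->h_dest, MAC_LEN);
    __builtin_memcpy(ether->h_dest, ether->h_source, MAC_LEN);
    __builtin_memcpy(ether->h_source, temp, MAC_LEN);
}
```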
Checksum recalculation
The IPv4 and TCP checksums require summing all 16-bit words in the headers, and the header sizes are stored in the headers themselves, so they are unknown at compile time. This is a problem because the verifier will not accept an ordinary loop with a variable bound. However, the header sizes are limited: up to 64 bytes each (in fact at most 60, since the 4-bit length fields count 32-bit words). So one can write a loop with a fixed number of iterations that may exit early.
Note that RFC 1624 describes how to recalculate a checksum incrementally when only certain words of the packet change. However, the method is not universal, and the implementation would be harder to maintain.
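For comparison, the incremental update from RFC 1624 (equation 3, HC' = ~(~HC + ~m + m')) could be sketched as follows. This is not part of the article's filter; fold16() and csum_update16() are names chosen for this illustration.

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum) {
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* RFC 1624, equation 3: HC' = ~(~HC + ~m + m').
 * old_check is the current checksum field; old_word and new_word are the
 * 16-bit word before and after the change, all in the same byte order. */
static uint16_t csum_update16(uint16_t old_check, uint16_t old_word,
                              uint16_t new_word) {
    uint32_t sum = (uint16_t)~old_check;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold16(sum);
}
```

With the values from the RFC's own example, csum_update16(0xdd2f, 0x5555, 0x3285) yields 0x0000. One call is needed per changed 16-bit word, which is why the approach gets unwieldy once ports, addresses, sequence numbers and flags all change at the same time.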
Checksum calculation function:
#define MAX_CSUM_WORDS 32
#define MAX_CSUM_BYTES (MAX_CSUM_WORDS * 2)

INTERNAL u32
sum16(const void* data, u32 size, const void* data_end) {
    u32 s = 0;
#pragma unroll
    for (u32 i = 0; i < MAX_CSUM_WORDS; i++) {
        if (2*i >= size) {
            return s; /* normal exit */
        }
        if (data + 2*i + 1 + 1 > data_end) {
            return 0; /* should be unreachable */
        }
        s += ((const u16*)data)[i];
    }
    return s;
}
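Although size is checked by the calling code, the second exit condition is still needed so that the verifier can prove that the loop never reads beyond the packet.
A simpler version handles 32-bit words: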
INTERNAL u32 sum16_32(u32 v) { return (v >> 16) + (v & 0xffff); }
Recalculating the checksums and sending the packet back:
ip->check = 0;
ip->check = carry(sum16(ip, ip_len, data_end));

u32 tcp_csum = 0;
tcp_csum += sum16_32(ip->saddr);
tcp_csum += sum16_32(ip->daddr);
tcp_csum += 0x0600;
tcp_csum += tcp_len << 8;
tcp->check = 0;
tcp_csum += sum16(tcp, tcp_len, data_end);
tcp->check = carry(tcp_csum);

return XDP_TX;
The carry() function turns a 32-bit sum of 16-bit words into a checksum, as described in RFC 791.
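A possible shape of carry(), together with the TCP pseudo-header terms, can be tried in userspace. This is a sketch under assumptions, not the article's code: the body of carry() is a common way to fold and invert a one's complement sum, sum_words() uses an ordinary loop because no verifier is involved, and pseudo_header_sum() and tcp_fill_checksum() are hypothetical helpers.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCP_PROTOCOL_NUMBER 6

/* Sum 16-bit words exactly as they lie in memory; size is assumed even. */
static uint32_t sum_words(const void* data, size_t size) {
    uint32_t s = 0;
    for (size_t i = 0; i + 1 < size; i += 2) {
        uint16_t word;
        memcpy(&word, (const uint8_t*)data + i, sizeof(word));
        s += word;
    }
    return s;
}

/* Fold the carries back into the low 16 bits, then invert. */
static uint16_t carry(uint32_t csum) {
    csum = (csum & 0xffff) + (csum >> 16);
    csum = (csum & 0xffff) + (csum >> 16);
    return (uint16_t)~csum;
}

/* Pseudo-header: source address, destination address, a zero byte plus the
 * protocol number, and the TCP length. Addresses are in network order. */
static uint32_t pseudo_header_sum(uint32_t saddr_be, uint32_t daddr_be,
                                  uint16_t tcp_len) {
    uint32_t s = 0;
    s += (saddr_be >> 16) + (saddr_be & 0xffff);
    s += (daddr_be >> 16) + (daddr_be & 0xffff);
    s += htons(TCP_PROTOCOL_NUMBER);
    s += htons(tcp_len);
    return s;
}

/* Fill the checksum field (offset 16) of a TCP segment without payload. */
static void tcp_fill_checksum(uint8_t* tcp, uint16_t tcp_len,
                              uint32_t saddr_be, uint32_t daddr_be) {
    uint16_t check = 0;
    memcpy(tcp + 16, &check, sizeof(check)); /* field is zero while summing */
    check = carry(pseudo_header_sum(saddr_be, daddr_be, tcp_len) +
                  sum_words(tcp, tcp_len));
    memcpy(tcp + 16, &check, sizeof(check));
}
```

On a little-endian host htons(6) equals 0x0600 and, for lengths below 256, htons(tcp_len) equals tcp_len << 8, which matches the constants in the listing above. A handy self-check is that summing a header together with its finished checksum and passing the result through carry() gives zero.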
Checking the TCP handshake
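The filter correctly establishes a connection with netcat and passes the final ACK, to which Linux replies with an RST packet: the network stack never received the SYN (it was turned into a SYNACK and sent back), so from the point of view of the OS a packet arrived that does not belong to any open connection.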
$ sudo ip netns exec xdp-test nc -nv 192.0.2.1 6666
192.0.2.1 6666: Connection reset by peer
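It is important to test with full-fledged applications and to watch tcpdump on xdp-remote, because hping3, for example, does not react to incorrect checksums.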
SYN cookie
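From the XDP point of view, the check itself is trivial. The calculation algorithm is primitive and most likely vulnerable to a sophisticated attacker. The Linux kernel, for example, uses the cryptographic SipHash, but implementing it for XDP is clearly beyond the scope of this article.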
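New TODOs, related to external interaction:
- The XDP program cannot keep cookie_seed (the secret part of the salt) in a global variable; it needs storage in the kernel whose value is periodically refreshed from a reliable generator.
- When the SYN cookie in an ACK packet matches, instead of printing a message the filter should remember the IP of the verified client, so that later packets from it are passed.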
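How a cookie is issued on SYN and checked on the final ACK can be sketched in userspace C. Everything here is hypothetical: toy_cookie() is an arbitrary non-cryptographic placeholder rather than the article's function, and struct four_tuple and cookie_matches() are names invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical connection identifier; all fields in host byte order. */
struct four_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy, NON-cryptographic mix of the tuple, the client's initial sequence
 * number and a secret seed. A real filter would want a keyed hash. */
static uint32_t toy_cookie(const struct four_tuple* t, uint32_t client_seq,
                           uint32_t seed) {
    uint32_t h = seed;
    h = (h ^ t->saddr) * 0x9e3779b1u;
    h = (h ^ t->daddr) * 0x9e3779b1u;
    h = (h ^ (((uint32_t)t->sport << 16) | t->dport)) * 0x9e3779b1u;
    h = (h ^ client_seq) * 0x9e3779b1u;
    return h ^ (h >> 15);
}

/* On the final ACK of the handshake the client sends seq = ISN + 1 and
 * ack_seq = cookie + 1, so both inputs can be recovered from the packet
 * and no per-connection state has to be stored. */
static int cookie_matches(const struct four_tuple* t, uint32_t ack_pkt_seq,
                          uint32_t ack_pkt_ack_seq, uint32_t seed) {
    const uint32_t expected = toy_cookie(t, ack_pkt_seq - 1, seed);
    return ack_pkt_ack_seq - 1 == expected;
}
```

An ACK whose acknowledgment number was not derived from the current seed fails the comparison, which is the case exercised by the ACK flood test further on.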
Checking with a legitimate client:
$ sudo ip netns exec xdp-test nc -nv 192.0.2.1 6666
192.0.2.1 6666: Connection reset by peer
The logs show that the check passes (flags=0x2 is SYN, flags=0x10 is ACK):
Ether(proto=0x800)
  IP(src=0x20e6e11a dst=0x20e6e11e proto=6)
    TCP(sport=50836 dport=6666 flags=0x2)
Ether(proto=0x800)
  IP(src=0xfe2cb11a dst=0xfe2cb11e proto=6)
    TCP(sport=50836 dport=6666 flags=0x10)
      cookie matches for client 20200c0
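As long as there is no list of verified IPs, there is no protection against SYN flood itself, but here is the reaction to an ACK flood launched with this command: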
sudo ip netns exec xdp-test hping3 --flood -A -s 1111 -p 2222 192.0.2.1
Log entries:
Ether(proto=0x800)
  IP(src=0x15bd11a dst=0x15bd11e proto=6)
    TCP(sport=3236 dport=2222 flags=0x10)
      cookie mismatch
Conclusion
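Sometimes eBPF in general and XDP in particular are presented as a tool for an advanced administrator rather than as a development platform. Indeed, XDP is a tool for intervening in the kernel's packet processing, not an alternative to the kernel stack like DPDK and other kernel bypass options. On the other hand, XDP makes it possible to implement fairly complex logic which, moreover, is easy to update without pausing traffic processing. The verifier does not cause major problems; personally, I would not mind having one for parts of userspace code.
In the second part, if the topic proves interesting, we will complete the table of verified clients and connection teardown, add counters, and write a userspace utility to manage the filter.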
References: