You may have been following the development of the extended Berkeley Packet Filter (eBPF) in the kernel community since 3.15, or you may still associate the Berkeley Packet Filter with the work Van Jacobson did in 1992. You may have used BPF for years with tcpdump, or you may have started to plumb it into your data plane already! This blog aims to describe, at a very high level, the key developments from a performance networking point of view, and why this is now becoming as important to the network operator, sysadmin and enterprise solution provider as it has been, since its inception, to the large scale data center operator.
BPF or eBPF—What is the difference?
The Virtual Machine
Fundamentally eBPF is still BPF: it is a small virtual machine which runs programs injected from user space and attached to specific hooks in the kernel. It can classify network packets and act upon them. For years it has been used on Linux to filter packets and avoid expensive copies to user space, for example with tcpdump. However, the scope of the virtual machine has changed beyond recognition over the last few years.
Figure 1. A comparison of the cBPF vs. eBPF machines
Classic BPF (cBPF), the legacy version, consisted of a 32-bit wide accumulator, a 32-bit wide ‘X’ register which could also be used within instructions, and 16 32-bit registers which are used as a scratch memory store. This obviously led to some key restrictions. As the name suggests, the classic Berkeley Packet Filter was mostly limited to (stateless) packet filtering. Anything more complex would be completed within other subsystems.
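To make the shape of the classic machine concrete, here is a minimal sketch of a cBPF interpreter in Python. It is a toy model, not the kernel implementation: it handles only five opcodes, the function name run_cbpf is made up for this illustration, and the sample program is modelled on the kind of filter tcpdump generates to match IPv4 frames. The opcode numbers follow the classic BPF encoding.

```python
import struct

# Opcode values from the classic BPF encoding (class | size | mode).
BPF_LDH_ABS = 0x28  # A = 16-bit big-endian value at packet[k]
BPF_LD_MEM = 0x60   # A = M[k]
BPF_ST = 0x02       # M[k] = A
BPF_JEQ_K = 0x15    # if A == k skip jt instructions, else skip jf
BPF_RET_K = 0x06    # return k (bytes to keep, 0 means drop)


def run_cbpf(program, packet):
    """Run a list of (code, jt, jf, k) tuples against raw packet bytes."""
    a = 0            # 32-bit accumulator
    x = 0            # 32-bit X register (not used by this tiny subset)
    mem = [0] * 16   # the 16 scratch memory slots
    pc = 0
    while pc < len(program):
        code, jt, jf, k = program[pc]
        pc += 1
        if code == BPF_LDH_ABS:
            if k + 2 > len(packet):
                return 0  # load past the end of the packet: drop
            a = struct.unpack_from("!H", packet, k)[0]
        elif code == BPF_LD_MEM:
            a = mem[k]
        elif code == BPF_ST:
            mem[k] = a
        elif code == BPF_JEQ_K:
            pc += jt if a == k else jf  # jumps are forward-only offsets
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("opcode not modelled: %#x" % code)
    return 0


# "Accept IPv4, drop everything else": load the EtherType at offset 12
# and compare it with 0x0800.
IPV4_FILTER = [
    (BPF_LDH_ABS, 0, 0, 12),
    (BPF_JEQ_K, 0, 1, 0x0800),
    (BPF_RET_K, 0, 0, 0x40000),
    (BPF_RET_K, 0, 0, 0),
]

ipv4_frame = bytes(12) + b"\x08\x00" + bytes(20)
arp_frame = bytes(12) + b"\x08\x06" + bytes(28)
print(run_cbpf(IPV4_FILTER, ipv4_frame))  # non-zero: accept
print(run_cbpf(IPV4_FILTER, arp_frame))   # zero: drop
```

Note how little state there is: one accumulator, one index register and sixteen scratch words, none of which survive from one packet to the next. That is why classic BPF stays stateless.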
eBPF significantly widened the set of use cases for BPF, through an expanded set of registers and instructions, the addition of maps (key/value stores whose capacity is chosen by the user when the map is created, rather than fixed by the VM), a 512 byte stack, more complex lookups, helper functions callable from inside the programs, and the possibility to chain several programs. Stateful processing is now possible, as are dynamic interactions with user space programs. As a result of this improved flexibility, the level of classification and the range of possible interactions for packets processed with eBPF have been drastically expanded.
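As a rough illustration of what maps make possible, the sketch below models a per-source packet counter in Python. It is an analogy only: ToyHashMap and count_packet are hypothetical names, and a real eBPF program would use the kernel's map helpers from C. The point is that the counter lives in the map, outside the program, so it persists between packets and can be read from user space.

```python
class ToyHashMap:
    """Toy stand-in for an eBPF hash map with a fixed capacity."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # chosen when the map is created
        self._data = {}

    def lookup(self, key):
        """Return the stored value, or None if the key is absent."""
        return self._data.get(key)

    def update(self, key, value):
        """Insert or overwrite; refuse new keys once the map is full."""
        if key not in self._data and len(self._data) >= self.max_entries:
            return False
        self._data[key] = value
        return True


def count_packet(counters, src_ip):
    """Per-packet logic: bump the counter for this source address."""
    seen = counters.lookup(src_ip)
    total = 1 if seen is None else seen + 1
    if not counters.update(src_ip, total):
        return None  # map full: this source cannot be tracked
    return total


counters = ToyHashMap(max_entries=2)
for src in ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    print(src, count_packet(counters, src))
```

The same pattern (look up, modify, write back) underlies rate limiters, connection tracking and load balancer state in real eBPF programs.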
But new features must not come at the expense of safety. To ensure proper exercise of the increased responsibility of the VM, the verifier implemented in the kernel has been revised and consolidated. This verifier checks for any loops within the code (which could run forever and hang the kernel) and any unsafe memory accesses. It rejects any program that does not meet the safety criteria. This step, performed on a live system each time a user tries to inject a program, is followed by the BPF bytecode being JITed into native assembly instructions for the chosen platform.
Figure 2. Compilation flow of an eBPF program on the host. Some supported CPU architectures are not displayed.
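The loop check mentioned above can be sketched as a search for back edges in the program's control flow graph. The Python below is a simplified model under stated assumptions: instructions are reduced to hypothetical (op, offset) pairs, where "ja" is an unconditional jump, "jcond" a conditional one, and jump targets are computed as index + 1 + offset. The kernel verifier does far more than this (it also tracks register and memory state), so treat this as an illustration of one idea only.

```python
def has_back_edge(program):
    """Return True if any jump can lead back to an instruction that is
    still being explored, i.e. the program contains a loop."""
    if not program:
        raise ValueError("empty program")

    def successors(i):
        op, off = program[i]
        if op == "exit":
            return []
        if op == "ja":                 # unconditional jump
            return [i + 1 + off]
        if op == "jcond":              # fall through, or jump
            return [i + 1, i + 1 + off]
        return [i + 1]                 # any other instruction

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = [UNSEEN] * len(program)
    state[0] = IN_PROGRESS
    stack = [(0, iter(successors(0)))]
    while stack:
        node, pending = stack[-1]
        nxt = next(pending, None)
        if nxt is None:                # all successors explored
            state[node] = DONE
            stack.pop()
            continue
        if not 0 <= nxt < len(program):
            raise ValueError("jump or fall-through out of range")
        if state[nxt] == IN_PROGRESS:  # edge to an ancestor: a loop
            return True
        if state[nxt] == UNSEEN:
            state[nxt] = IN_PROGRESS
            stack.append((nxt, iter(successors(nxt))))
    return False


straight_line = [("alu", 0), ("jcond", 1), ("alu", 0), ("exit", 0)]
looping = [("alu", 0), ("jcond", -2), ("exit", 0)]
print(has_back_edge(straight_line))
print(has_back_edge(looping))
```

Forward-only branches, as in the first program, are accepted; the second program jumps back to its first instruction and would be rejected by a check of this kind.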
Some functionality would be difficult to implement or optimize within the restrictions of eBPF. To cover it, there are many helpers designed to assist with operations such as map lookups or the generation of random numbers. My colleague, Quentin Monnet, is currently going through the process of getting all kernel helpers documented and has a patch set out for review.
The hooks - Where do the packets get classified?
The number of hooks for eBPF is proliferating due to its flexibility and usefulness. However, we will focus on those at the lower end of the datapath. The key difference here is that eBPF adds an additional hook in driver space. This hook is called eXpress DataPath, or XDP. It allows users to drop, reflect or redirect packets before an skb (socket buffer) metadata structure is added to the packet. This leads to a performance improvement of about 4-5X.
Figure 3. High-performance networking relevant eBPF hooks with comparative performance for simple use case
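To show the kind of decision an XDP program takes on the raw frame, before any skb exists, here is a small Python model. It is not an XDP program (those are written in C and loaded into the kernel); the function names and the "drop UDP to one port" policy are assumptions chosen for illustration. The verdict values mirror the XDP action codes defined in the kernel headers.

```python
import struct

# XDP action codes, as numbered in the kernel's enum xdp_action.
XDP_DROP = 1
XDP_PASS = 2
XDP_TX = 3  # "reflect": send the frame back out of the same port

ETH_HLEN = 14
ETH_P_IP = 0x0800
IPPROTO_UDP = 17


def xdp_verdict(frame, blocked_udp_port):
    """Drop IPv4/UDP frames aimed at one port; pass everything else.
    Every read is bounds-checked first, as the verifier would demand."""
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    ihl = (frame[ETH_HLEN] & 0x0F) * 4   # IPv4 header length in bytes
    if ihl < 20 or frame[ETH_HLEN + 9] != IPPROTO_UDP:
        return XDP_PASS
    udp_off = ETH_HLEN + ihl
    if len(frame) < udp_off + 8:
        return XDP_PASS
    (dport,) = struct.unpack_from("!H", frame, udp_off + 2)
    return XDP_DROP if dport == blocked_udp_port else XDP_PASS


def make_udp_frame(dport):
    """Build a minimal, mostly zeroed Ethernet/IPv4/UDP frame for tests."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45]) + bytes(8) + bytes([IPPROTO_UDP]) + bytes(10)
    udp = struct.pack("!HHHH", 1234, dport, 8, 0)
    return eth + ip + udp


print(xdp_verdict(make_udp_frame(53), blocked_udp_port=53))
print(xdp_verdict(make_udp_frame(80), blocked_udp_port=53))
```

Because the verdict depends only on bytes in the frame, a dropped packet never costs an skb allocation, which is where much of the XDP speedup comes from.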
Offloading eBPF to the NFP
Back in 4.9, my colleague and our kernel driver maintainer, Jakub Kicinski, added the Network Flow Processor (NFP) BPF JIT-compiler to the kernel, initially for cls_bpf (https://www.spinics.net/lists/netdev/msg379464.html). Since then Netronome has been working on improving the BPF infrastructure in the kernel and also in LLVM, which generates the bytecode (thanks to the work of Jiong Wang). Through the NFP JIT, we have managed to effectively modify the program flow as shown in the diagram below:
Figure 4. Compilation flow with NFP JIT included (some supported CPU architectures are not displayed)
The key reason this was possible is how well the BPF machine maps to our flow processing cores on the NFP. This means that the NFP-based Agilio CX SmartNIC, running at between 15 and 25W, can offload a significant amount of processing from the host. In the load balancing example below, the NFP processes as many packets as nearly 12 x86 cores from the host combined, a number physically impossible for the host to handle due to PCIe bandwidth restrictions (Cores used: Intel Xeon CPU E5-2630 v4 @ 2.20GHz).
Figure 5. Comparative performance of a sample load balancer on the NFP and x86 CPU (E5-2630 v4)
Performance is one of the main reasons why using BPF hardware offload is the correct technique to program your SmartNIC. But it is not the only one: let’s review some other incentives.
Flexibility: One of the key advantages BPF provides on the host is the ability to reload programs on-the-fly. This enables the dynamic replacement of programs in an operating data center. Code which would otherwise likely be out-of-tree kernel code, or inserted in some other less flexible subsystem, can now be easily loaded or unloaded. This provides significant advantages to the data center because fixing a bug no longer requires a system restart: instead, simply reloading an adjusted program will do.
This model can now be extended to offload as well. Users are able to dynamically load, unload and reload programs on the NFP while traffic is running. This dynamic rewriting of firmware at runtime provides a powerful tool to reactively use the NFP’s flexibility and performance.
Latency: Offloading eBPF significantly reduces latency, because packets do not have to cross the PCIe boundary. This can improve network hygiene for load balancing or for NAT use cases. Avoiding the PCIe boundary also brings a significant benefit in the DDoS prevention case, since PCIe could otherwise form the bottleneck under a well constructed DDoS attack.
Figure 6. Latency of offloaded XDP vs. XDP in the driver. Note the consistency in latency when using offload.
Interface to program your SmartNIC datapath: Being able to program a SmartNIC using eBPF makes it extremely easy to implement features such as rate limiting, packet filtering, bad actor mitigation or other features that traditional NICs would have to implement in silicon. These can be customized for the end user’s specific use case.
OK, this is all great, but how do I actually use it?
The first thing to do is to update your kernel to 4.16 or above. I would recommend 4.17 (development version as of this writing) or above to take advantage of as many features as possible. See the linked user guide for the features from the latest version, and for examples of how to use them.
My colleague, David Beckett, has done a couple of great videos showing how to use XDP. Find the one about the general use case here.
Without entering into the details here, it can also be noted that the tooling related to the eBPF workflow is under development, and has already been greatly improved compared with the legacy cBPF version. Users would now typically write programs in C, and compile them with the back-end offered by clang-LLVM into eBPF bytecode. Other languages, including Go, Rust or Lua, are available too. Support for the eBPF architecture was added to traditional tools: llvm-objdump can dump eBPF bytecode in a human-readable format, llvm-mc can be used as an eBPF assembler, strace can trace calls to the bpf() system call. Some work is still in progress for other tools: the binutils disassembler should support NFP microcode soon, and valgrind is about to get support for the system call. New tools are created as well: bpftool, in particular, is exceptionally useful for introspection and simple management of eBPF objects.
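As a rough sketch of how these tools chain together, the Python helper below builds the command lines for a typical compile, inspect, attach and introspect cycle. It only constructs the argument lists and does not run anything. The function name, the file name xdp_prog.c, the interface name eth0 and the ELF section name "xdp" are all assumptions for illustration; the section must match whatever the C source actually declares.

```python
def xdp_workflow(source, ifname, section="xdp", offload=False):
    """Return the commands, as argv lists, for one XDP build-and-load cycle."""
    obj = source.rsplit(".", 1)[0] + ".o"
    # "xdpoffload" asks iproute2 to offload the program to the NIC;
    # plain "xdp" lets the kernel pick the best mode available on the host.
    attach_mode = "xdpoffload" if offload else "xdp"
    return [
        # Compile C to eBPF bytecode with the clang/LLVM BPF back-end.
        ["clang", "-O2", "-target", "bpf", "-c", source, "-o", obj],
        # Dump the generated bytecode in human-readable form.
        ["llvm-objdump", "-S", obj],
        # Attach the program to the network interface.
        ["ip", "link", "set", "dev", ifname, attach_mode,
         "obj", obj, "sec", section],
        # List the eBPF programs currently loaded on the system.
        ["bpftool", "prog", "show"],
    ]


for argv in xdp_workflow("xdp_prog.c", "eth0", offload=True):
    print(" ".join(argv))
```

To execute the steps, each list could be handed to subprocess.run(argv, check=True) with root privileges, on a kernel and iproute2 recent enough to support the chosen mode.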
A key question which is still outstanding at this point for the enterprise sysadmin or IT architect is how this applies to the end user who has a setup which has been perfected and maintained for years, and which is based upon iptables. The risk of changing this setup is, obviously, hard to countenance: layers of orchestration would have to be modified, new APIs would have to be built, etc. To solve this problem, enter the proposed bpfilter project. As Quentin wrote earlier this year:
“This technology is a proposal for a new eBPF-based back-end for the iptables firewall in Linux. As of this writing, it is in a very early stage: it was submitted as a RFC (Request for Comments) on the Linux netdev mailing list around mid-February 2018 by David Miller (maintainer of the networking system), Alexei Starovoitov and Daniel Borkmann (maintainers of the BPF parts in the kernel). So, keep in mind that all details that follow could change, or could not ever reach the kernel at all!
Technically, the iptables binary used to configure the firewall would be left untouched, while the xtables part in the kernel could be transparently replaced by a new set of commands that would require the BPF sub-system to translate the firewalling rules into an eBPF program. This program could then be attached to one of the network-related kernel hooks, such as on the traffic control interface (TC) or at the driver level (XDP). Rule translation would occur in a new kind of kernel module that would be something between traditional modules and a normal ELF user space binary. Running in a special thread with full privilege but no direct access to the kernel, thus providing less attack surface, this special kind of module would be able to communicate directly with the BPF sub-system (mostly through system calls). And at the same time, it would remain very easy to use standard user space tools to develop, debug or even fuzz it! Besides this new module object, the benefits from the bpfilter approach could be numerous. Increased security is expected, thanks to the eBPF verifier. Reusing the BPF sub-system could possibly make maintenance of this component easier than for the legacy xtables and could possibly provide later integration with other components of the kernel that also rely on BPF. And of course, leveraging just-in-time (JIT) compiling, or possibly hardware offload of the program would enable a drastic improvement in performance!”
bpfilter is being developed as a solution to this problem. It allows the end user to move seamlessly to this new, high performance paradigm. Just see below: running a simple series of iptables rules on eight CPU cores with the legacy iptables (netfilter) back-end, with the newer nftables and with bpfilter on the host, and then with bpfilter offloaded to the SmartNIC, clearly shows where performance lies.
Figure 7. Performance comparison of bpfilter vs older iptables implementations
An example video of how to implement this from David can be found here
So there we have it. What is being produced within the kernel community as it stands is a massively powerful shift in networking. eBPF is a powerful tool that brings programmability to the kernel. It can deal with congestion control (TCP-BPF), tracing (kprobes, tracepoints) and high-performance networking (XDP, cls_bpf). Other use cases are likely to appear, as a result of its success among the community. Beyond this, the transition extends all the way to the end user, who will soon be able to seamlessly leave the old iptables back-end in favor of a newer and much more efficient XDP-based back-end, using the same tools as today. In particular, this will allow for straightforward hardware offload, and provide the necessary flexibility as users move to networks of 10G and above.
Thanks for reading! Any insights belong to my colleagues, any errors are mine… Feel free to drop me an email if you have any further questions at email@example.com and cc firstname.lastname@example.org.
Other useful links
Netdev 1.2, eBPF/XDP hardware offload to SmartNICs, Jakub Kicinski & Nic Viljoen: Video
Netdev 2.2, Comprehensive XDP offload - handling the edge cases, Jakub Kicinski & Nic Viljoen: Video
Quentin’s personal blog contains a great set of links towards additional eBPF and XDP resources.