
Ever Deeper with BPF – An Update on Hardware Offload Support

By Quentin Monnet | Nov 07, 2018
Netronome’s Agilio CX and FX SmartNICs are capable of running BPF programs, directly offloaded from the Linux kernel. This feature has been supported for a couple of years now. Several presentations of BPF hardware offload have been made by the BPF engineering team at conferences such as Netdev 1.2, Netdev 2.2 and FOSDEM 2018, or in webinars such as our recent ones on the BPF just-in-time (JIT) compiler, offload internals, and tooling.

As time passes, we keep improving BPF support on our cards… and we keep attending conferences. Lots of them! Recently, we had a talk accepted for the Networking track of the upcoming Linux Plumbers Conference (and several at the BPF Microconference, which is featured at LPC). Nic will talk about the use of BPF as an abstraction for high-performance switching. The event will take place in November in Vancouver, and Netronome is proud to be a Gold Sponsor. Interestingly, answering the call for papers for this track was a good occasion for the whole team to take a step back and ponder what we had achieved over the last few months regarding BPF hardware offload. So, what have we achieved?
BPF Offload Framework Architecture
Here is a brief reminder of the hardware offload architecture for BPF. Please note that throughout this document, “BPF” denotes the extended version “eBPF” (and not the legacy “cBPF”).
  • First, BPF bytecode is generated by the user, usually by compiling a program written in C (for example) with clang and the LLVM BPF backend.
  • This program is then loaded into the kernel with a call to the bpf() system call. For networking programs, this call is typically made by tools like tc filter add or ip link. When the program is destined to be offloaded, an additional interface index is passed to the system call to indicate which device it will be pushed onto (see the sketch after this list).
  • The program is verified in the kernel. This is to make sure that it safely terminates and does not have any security issues.
  • Once verified, the program may be JIT compiled to native instructions in order to achieve better performance at runtime. Programs offloaded to the hardware are always compiled into instructions that the relevant device will be able to run.
  • The program lives in kernel memory and can be attached to one or several hooks. Hooks exist for a variety of use cases. There are several for network packet processing: one on sockets, one at the Traffic Control (TC) interface, and another one, the eXpress Data Path (XDP), at the driver level. For networking, the request to attach the program is usually performed by tc or ip, during the same invocation, right after loading the program into the kernel.
  • In the case of offload, a number of specific callbacks from the device driver are called by the kernel, each at a different stage. They are used to:
    • Create offloaded maps
    • Prepare verification of the BPF program on the driver side
    • Check individual BPF instructions (in addition to the checks already performed by the kernel verifier)
    • Translate (JIT-compile) the program into instructions for the hardware device
    • Offload the program, i.e. push it to the card when the user decides to attach it, and bind it to a network interface on the host
    • Delete the program when the user does not need it anymore
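
To make the first steps of this flow concrete, here is a minimal sketch of an XDP program and of the commands that could be used to compile and offload it. The interface name is hypothetical, and the bpf_helpers.h header (which provides the SEC() macro) is assumed to be the one shipped with the kernel sources under tools/testing/selftests/bpf:

/* minimal_xdp.c - a minimal XDP program that could be offloaded.
 *
 * Compile (assumed toolchain):
 *     clang -O2 -target bpf -c minimal_xdp.c -o minimal_xdp.o
 * Offload with iproute2 (interface name is hypothetical):
 *     ip -force link set dev ens4np0 xdpoffload obj minimal_xdp.o sec xdp
 */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC() macro, from the kernel sources */

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
        /* Let every packet continue to the host network stack. */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";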

[Figure: BPF hardware offload architecture diagram]

We have been working on cleaning up the code of this architecture and bringing improvements to this model. In particular, once the program has been pushed to the physical device, we have made it possible to share it between several network device interfaces, which means that the program and its maps (see next section) can be used to process packets arriving on several different ports of the card, enabling additional use cases.

Another important feature supported by the kernel is the ability to run two distinct BPF programs simultaneously: one offloaded to the hardware, and one on the standard XDP hook (running on the host, at the driver level). This way, incoming traffic is first processed by the program offloaded to the hardware, where we get the best performance and no host CPU load. Then, only for the packets that are forwarded to the host, the second program is executed; it can process the packet further and use all BPF features supported by the kernel, including the few that the NFP does not support yet.
Map Offload
Since last year’s presentations, we have gained support for BPF maps in offloaded programs. In order to maintain good performance at runtime, the maps are offloaded to the SmartNIC as well, just like the BPF programs. They are stored in the biggest memory bank on the card (2 GB), but we are working on creating a cache on smaller banks that would be closer to the microengines, for even faster access. Even though they live on the hardware, BPF maps can still be accessed from the Linux host, so the following operations are possible:
  • Reading values from the map from user space (for debugging or for collecting statistics).
  • Writing values from user space (for setting parameters read at runtime by the BPF program, e.g., a list of IP addresses to block).
  • Reading values from the offloaded BPF program (for all use cases requiring table lookups).
  • Writing values from the offloaded BPF program. At this time, normal writes require taking locks on the offloaded maps, so we offer an alternative atomic write operation, which is much faster and extremely useful, for instance, for quickly incrementing counters (see the sketch below).


Array maps and hash maps are currently supported. Maps are essential to BPF, as they enable a wide variety of applications, and we are proud to support them.
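
As an illustration, here is a minimal sketch of a packet-counting XDP program using an array map and the atomic add operation mentioned above. Map and function names are illustrative; nothing special is needed in the C source for the map to be offloaded, since that is decided by the interface index passed at load time. The bpf_helpers.h header is again the one from the kernel sources.

/* count_pkts.c - count processed packets in an offloadable array map. */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC(), struct bpf_map_def, helper stubs */

struct bpf_map_def SEC("maps") rx_cnt = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 1,
};

SEC("xdp")
int count_pkts(struct xdp_md *ctx)
{
        __u32 key = 0;
        __u64 *value;

        value = bpf_map_lookup_elem(&rx_cnt, &key);
        if (value)
                /* Compiled to the BPF atomic add instruction (BPF_XADD),
                 * i.e. the fast atomic write path described above. */
                __sync_fetch_and_add(value, 1);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";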

Helpers and Perf Events Support
BPF helper functions are a set of white-listed functions, defined in the kernel, that BPF programs can call at runtime. They provide an interface to some kernel components (mostly for tracing programs), to BPF maps, or to packet size manipulation, or they implement frequently required operations. We do not support all of those helpers, as they need to be implemented in the firmware itself, and many of them would not make sense for networking programs running on the NIC. We do have, of course, the ones needed to look up, add or update, and delete elements from BPF maps.

One helper that we did not have until recently, though, was the one required for dumping perf events. It comes with a special kind of BPF map, in which the user registers the so-called “perf events” they want to trace. The BPF program can then use this map to stream data from specific counters to user space, which is especially useful for statistics, debugging, sampling, or indeed streaming whatever data a user space application needs to collect from the processing occurring at the XDP level. So, we now have support for this feature. The perf event maps are not offloaded like array or hash maps; instead, they remain on the host. Offloaded programs are nonetheless capable of sending data through these maps to user space as expected. And it works great!

With perf events support, and with our latest helper, which makes it possible to adjust the size of the packet at its end (tail), the list of helper functions we currently support is as follows:
  • bpf_map_lookup_elem() 
  • bpf_map_update_elem() 
  • bpf_map_delete_elem() 
  • bpf_get_prandom_u32() 
  • bpf_perf_event_output() 
  • bpf_xdp_adjust_head() 
  • bpf_xdp_adjust_tail()

An example usage of perf events is available with the “xdpdump” demonstration that we published not long ago.
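
For reference, here is a minimal sketch of how a program can use bpf_perf_event_output() to stream a small record to user space, similar in spirit to what the xdpdump demo does. Struct and map names are illustrative, and a user space reader is assumed to poll the perf buffer:

/* sample_meta.c - stream per-packet metadata to user space via perf events. */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC(), struct bpf_map_def, helper stubs */

struct bpf_map_def SEC("maps") events = {
        .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size    = sizeof(int),
        .value_size  = sizeof(__u32),
        .max_entries = 64,      /* should cover the number of host CPUs */
};

struct event {
        __u32 pkt_len;
};

SEC("xdp")
int sample_meta(struct xdp_md *ctx)
{
        struct event ev = {
                .pkt_len = ctx->data_end - ctx->data,
        };

        /* Emit the record to user space through the perf event map;
         * the map itself stays on the host, as explained above. */
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              &ev, sizeof(ev));

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";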

BPF-to-BPF Function Calls

BPF-to-BPF function calls were added to the kernel last December. These are traditional function calls, by contrast with calls to kernel helpers (which are not implemented with BPF instructions but compiled as part of the kernel) or with “tail calls” (where execution jumps into a second program and never comes back to the first). The user can define several non-inlined functions in their C source file and compile them into BPF. Since we still do not have loops in BPF, this is especially useful to reuse a sequence of instructions without having to duplicate the whole sequence each time: just call your function repeatedly instead. Support for offloading BPF-to-BPF calls has just been upstreamed to the development branch of the kernel.
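
From the program author’s point of view, a BPF-to-BPF call is simply a call to a function that clang does not inline. Here is a minimal sketch (function names are illustrative, and a sufficiently recent LLVM is assumed):

/* subprog.c - a main XDP program calling a separate, non-inlined BPF function. */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC() macro */

/* Compiled as a distinct BPF function, reached with a real call instruction. */
static __attribute__((noinline)) int pkt_is_short(struct xdp_md *ctx)
{
        return ctx->data_end - ctx->data < 60;
}

SEC("xdp")
int drop_short(struct xdp_md *ctx)
{
        if (pkt_is_short(ctx))
                return XDP_DROP;

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";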

Under the Hood

Even if they do not translate into new features, a number of improvements have made their way into the internals of the driver. Think of them as “under the hood” optimizations or low-level improvements. Our JIT compiler now supports logic and arithmetic indirect shifts, one of the very few kinds of BPF instructions it had not been able to process thus far. Multiplication was also improved, although it works on 8-bit or 16-bit operands only.

Many other technical optimizations have been delivered, often consisting of detecting a particular sequence of instructions and replacing it with something more efficient. The objectives are:

  • To reduce the number of instructions in the final program, either for better performance or for supporting a higher total number of instructions.
  • To gain performance by favoring instructions that are faster to execute for the hardware.
  • To optimize resource usage, such as the amount of memory used from the stack.

Offload-Specific Features

For most BPF programs, we try very hard to ensure that the offload is transparent for the user, and that the behavior is as close as possible to what would be obtained if the program were run on the host, save for the performance gain, obviously. And yet, at times it is interesting to account for the difference of context, and to add something that would not make sense on the host, but that a BPF program can very easily do when running on the NIC. What I have in mind is programmable Receive-Side Scaling (RSS). This is a common feature on recent NICs (though not usually implemented with BPF, of course), consisting of distributing incoming traffic between host CPUs through the use of multiple queues. With BPF offload, this was trivial to implement: we simply allowed the program to set an RX queue metadata field, which is read-only for programs running in the kernel, so that it can indicate to which queue the processed packet should be sent. And that is it. If you are interested, check out the programmable RSS demonstration application we recently published.
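
Here is a minimal sketch of what such a program could look like, assuming (as described above) that the queue is selected by writing the rx_queue_index field of the XDP context, and using an arbitrary queue count and a random spread for simplicity:

/* rss.c - programmable RSS: pick an RX queue for each packet. */
#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC() macro, bpf_get_prandom_u32() stub */

#define NUM_QUEUES 8            /* illustrative queue count */

SEC("xdp")
int pick_queue(struct xdp_md *ctx)
{
        /* A real program would typically hash on packet header fields;
         * a random distribution keeps the sketch short. */
        ctx->rx_queue_index = bpf_get_prandom_u32() % NUM_QUEUES;

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";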

Improved Tooling and Resources

Along with the cards, Netronome’s NFP driver and the kernel, we continue to improve the tooling and the available resources so as to make offloading BPF programs the smoothest experience possible…even for people who are not familiar with the technology in the first place!

We keep maintaining and improving bpftool, the go-to utility for everything related to BPF introspection and simple management. The tool has received numerous contributions from the community, and getting information about the structure of keys and values in BPF maps has never been easier now that BTF (BPF Type Format) is supported! This mechanism makes it possible to embed a description of the structure of some BPF objects, so that it can be retrieved later for better introspection and debugging. For example, compare the following plain hexadecimal dump of a map entry:

# bpftool map dump id 1337
key:
0a 00 00 00 00 00 00 00
value:
07 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00
03 02 01 4d 15 00 00 00

With BTF, the same entry can now be dumped with the actual structure that was used for keys and values when the map was created by the user, as in the following example:


# bpftool map dump id 1337
[{
       "key": 10,
       "value": {
           "": {
               "value": 7,
               "ifindex": 1,
               "mac": [0, 0, 21, 73, 1, 2, 3]
           }
       }
    }
]

Of course, this works with offloaded programs too, making it much easier to understand what is loaded in the kernel. Still on the topic of introspection, we pushed the code required for disassembling the NFP instructions used by our cards to the binutils-dev library, so bpftool will soon be able to natively dump the programs JIT-compiled for our SmartNICs. And last but not least, bpftool is also getting easier to install, since it is now packaged for Fedora (starting with Fedora 28), placing it just one dnf install bpftool away from your system! There is no package for Debian/Ubuntu as of this writing, but we are working on that, too.

But bpftool is not the only thing that improved. Netronome recently launched a brand new support website! Here are a number of items that you can find on the site under the Agilio eBPF Software section:

  • The Agilio BPF firmware required for running offloaded BPF programs on the CX and FX SmartNICs
  • A .deb package with a statically-linked binary for bpftool, if you want to quickly install it on Debian or Ubuntu
  • Documentation about BPF offload, including our eBPF Offload Getting Started Guide
  • Additional resources, such as a number of video tutorials with step-by-step instructions on how to offload programs
And this is just the beginning! We intend to add more, including new demonstration applications alongside the ones we recently published on GitHub.
More to Come
And of course, we keep extending BPF offload support itself! We have a number of things on the roadmap.

Support for BTF, for example, is on the list. We want users to have a great experience with offloads, and we know that getting information about the structure of map entries can be of great assistance when administering a system.

We are also working on a powerful mechanism that was recently introduced in Linux: AF_XDP. This new family of network sockets can be used to drive packets from the hardware to user space with extremely low overhead. It requires support from the hardware, which we definitely intend to provide, and we are exploring how to make it work best with offloaded programs.

We work on more bpftool improvements and features. We contribute to documenting BPF. We prepare more presentations about BPF. We are pushing support for the bpf() system call in valgrind, to chase memory leaks in software using BPF programs. We work on 32-bit support, in order to have LLVM produce BPF programs that are even more efficient when run on the SmartNICs. We work on the control flow graph in the kernel verifier, so that one day we can have loops in BPF, including for offloaded programs. We work on BPF tail call support. We work on all networking aspects of BPF. We love BPF. And we do our best so that you can love it, too!