Open Source Packet Filtering: BPF at FOSDEM’19

By Quentin Monnet | Feb 13, 2019

Time flies! A year has already passed since our coverage of FOSDEM 2018, and the 2019 edition is over as well. The event took place as usual in Brussels, on the first weekend of February. In spite of the snow that fell on the Belgian capital for most of Saturday, more than 8,000 people were expected to attend what might be the biggest Open Source event in Europe. For the fourth time, one of the numerous “devrooms” held in parallel was dedicated to SDN. It hosted no fewer than twenty quality talks about topics like DPDK, VPP, continuous integration and tests in SDN, open drivers and other fast networking solutions.

Oh, and about BPF, too.

BPF has again drawn a lot of attention, although we do notice some evolution in the way the topic is presented. In this post, I do not intend to provide full coverage of what I saw at FOSDEM; instead, I would like to comment on three aspects. First, even without a full report, a brief overview of the BPF talks at FOSDEM seems in order. Second, I would like to describe what is, from my point of view, the current state of BPF presentations at the latest conferences I attended (meaning there will be some notes from LPC as well). Lastly, I would like to transcribe and summarize my talk in the SDN devroom at FOSDEM about the different packet filtering mechanisms on Linux, and how new proposals are making these frameworks converge to some degree. Spoiler: it will introduce an interesting work in progress that is supposed to turn filtering rules from tc, ethtool and iptables into C programs ready to be compiled to BPF. Are you ready for some BPF love?

BPF in FOSDEM Presentations

Just to avoid confusion, let me remind you that “BPF” in this post should be read as “eBPF”, the “extended” version of the Berkeley Packet Filter: the new version that allows for fast network processing (not only filtering) and kernel monitoring in Linux, as opposed to the legacy “cBPF” version. With that said, BPF was a recurring topic at FOSDEM, particularly in the SDN devroom.

One of the most interesting talks was the update about XDP presented by Jesper Brouer and Magnus Karlsson, “XDP as a building block for other FOSS projects.” As the title suggests, the focus was not so much on introducing the basics of BPF and XDP, although we got some impressive performance numbers. Instead, it was about how to use XDP, how to integrate it into new projects, and how to extend it and add new features. The second part of the talk described how AF_XDP works and how it can be used to drive packets to user space software such as DPDK or Suricata. Everything you need to be convinced to run XDP on your machines!

This was followed by an update on Cilium by Michal Rostecki, an active contributor to the project. Over the last year this framework, leveraging BPF to provide API-level networking and security rules, reached version 1.0 and beyond, and gained support for a number of other network or container-related technologies such as Istio, Cri-o, Containerd, Prometheus, Clustermesh, etcd, Kube-router, Cassandra, Memcached, Flannel and others! Cilium is undoubtedly one of the leading BPF users, and it is paving the way for the many others to come.

The presentation after Cilium was about Merging System and Network Monitoring with BPF. Luca Deri and Samuele Sabella presented their work, including the libebpfflow library used by ntop (network monitoring software) to correlate information retrieved by BPF programs about both the system and its traffic, in order to display an accurate set of metrics on active flows, including in containerized environments. This work, they said, should be available in the next version of ntopng in the spring.

Then came my talk, but we will come back to this later.

The last SDN devroom talk on BPF was about Oko: Open vSwitch Extensions with BPF. Developed and presented by Paul Chaignon, these extensions change the Open vSwitch data path so that it uses programs generated by a user space JIT compiler (based on uBPF) to process packets. For the three use cases considered (packet filtering, stateful firewall and TCP analysis), this is much faster than the alternative setups, vhost-user or a separate process with shared memory; please refer to the slides for details. BPF even brings acceleration to user space!

It is worth noting that all talks in the SDN devroom involving BPF attracted a lot of attendees. The room was packed, with people standing in the aisles, and more were queuing at the entrance, unfortunately unable to make their way into the talk. Let’s hope we get a bigger room next year.

Finally, a couple of other presentations about BPF focused on the tracing aspects. There was at least one about kubectl-trace for monitoring Kubernetes (containers and Kubernetes were hot at FOSDEM this year again), another one about distributed Kubernetes performance analysis, and one oriented towards the use of bcc to trace the kernel. I did not attend all of them, and since they are not about networking anyway, I will not go further into the details. But as you can see, BPF really was in many tracks this year!


BPF as of Early 2019

From the number of BPF talks, and the length of the queues for entering the devrooms, it is quite obvious that the topic is receiving more and more attention from the public every year. But it is also interesting to look at the actual content of these talks, or even at the way those presentations are organized.

In November 2018, the whole BPF team at Netronome attended the Linux Plumbers Conference, which gathered many of the core Linux developers in Vancouver. We even presented at the networking track and the BPF microconference. Sadly, we did not report on it on this blog (although be sure to check the update on BPF offload we published just before the event), but here is one observation from that event. If previous years were about discovering BPF and XDP, then slowly putting them to the test of real-world deployments, 2018 marked a turning point. Things were no longer about suppositions; they were about actual, empirical feedback. Big companies such as Cloudflare or Facebook have been running XDP in production for more than a year, and they can assert that BPF is as fast as expected, flexible enough to address real data center problems, and easy enough to deploy at scale. And as companies use BPF more and more, push it to its limits and find new potential use cases, the number of technical improvements discussed at conferences is booming: bounded loops, better performance with AF_XDP, integration with Open vSwitch, with kTLS... We are definitely in a phase of acceleration for BPF development.

A similar feeling floated in the SDN devroom at FOSDEM. It was not so much about adding features to BPF: the talks were addressed to an audience of hackers, not of core kernel contributors. But in contrast with the early years, presentations were less about explaining what BPF is and what can be achieved with it. Many people know this by now. Instead, they focused on how to deploy BPF to solve issues, and on what precisely to expect from BPF in practical deployments. People no longer want to understand what the BPF acronym means. Now, they want to dive in and make the transition. As proof of this, speakers no longer thought it necessary to introduce BPF in detail in their presentations. The 64-bit format of the instructions? The significance of the operands? The support for maps, helper functions, tail calls? None of it needed introducing again. As of early 2019, BPF is becoming well known!

Unifying Network Filtering Rules for the Linux Kernel with eBPF

Yes, BPF is becoming well known, and more and more users want to benefit from the flexibility and the high performance it can offer. But these users have often been using other packet processing methods for years, and understandably they would like the transition to occur at the lowest possible cost.

As a matter of fact, this desire for an easy transition is leading to an increasing number of proposals trying to “unify,” somehow, the different filtering frameworks available on Linux. This is, in substance, what I presented this year at FOSDEM. I focused on simple filters, like Access Control Lists (ACLs), for when you just want to drop some specific flows. I started with a refresher on the several filtering mechanisms historically available on Linux. They are the following:

  • Netfilter, the firewall and NAT subsystem in Linux, is often the first to come to mind when we speak of filtering, and of dropping flows. The front-end tools include in particular iptables (for the older parts) and nft (for nf_tables, the successor to iptables, which has yet to see widespread adoption).
  • The Traffic Control (TC) framework is mostly used on egress traffic, to provide traffic shaping, scheduling, policing of the flows. It relies on “queueing disciplines” (qdiscs), possibly applied to different “classes” of traffic, to reach this objective. But packets are dispatched into those classes by one of the many available filters (“basic” (ematch), “flower,” “u32,” also BPF now), which can also be applied to ingress packets, thus providing another way to implement filters.
  • When the hardware supports it, the ethtool utility can be used to set up hardware filters on the NIC. This is called “receive network flow classification,” and is used to perform flow steering on the incoming traffic. But it is also possible to use it to drop packets at a very early stage.
  • The last mechanism is not really used to filter packets out of the stack, but I wanted to speak about the pcap-filter expressions used, for example, with tcpdump. They are also used to filter packets in their own way, even if in that case matching patterns are not used to drop packets but to copy them to user space (where they are dumped to the user, in the case of tcpdump).

With this list, my point is to show that we have several mechanisms with dedicated features, but also with some overlap regarding basic filtering capabilities. Of course, each of them has its own unique syntax. So for dropping IP(v4) HTTP packets sent to the server (or for dumping them, in the case of tcpdump), we have as many expressions for the rule as we have tools available.

Here is the same rule expressed with each of these tools:
    # iptables -A INPUT -i eth0 \
    -p tcp --dport 80 -j DROP
    # nft add rule ip filter input iif eth0 \
    tcp dport 80 drop
    # tcpdump -i eth0 \
    ip and tcp dst port 80
    # tc filter add dev eth0 ingress protocol ip flower \
    ip_proto tcp dst_port 80 action drop
    # ethtool -U eth0 \
    flow-type tcp4 dst-port 80 action -1

This does not make it easy to move from one framework to another if the need arises. Then, of course, we have BPF, which can be attached to one of the networking hooks: sockets, TC (ingress), XDP, or even offloaded to the hardware in the case of Netronome's Agilio SmartNICs. Here is a (simplified) diagram of the location of the hooks for each framework in the kernel:

[Figure: simplified diagram of the location of the hooks for each framework in the kernel]

Note that on sockets, this can be legacy BPF (this is the case with tcpdump filters, compiled from pcap-filter expressions into cBPF) as well as eBPF. So, we have BPF now, and it comes with a number of advantages. First, as a programmable virtual machine, it is more expressive and more versatile than any of the previously mentioned filtering methods. Then of course, thanks to the JIT compilation of the instructions, the driver-level XDP hook, and the possibilities in terms of hardware offload, it is blazingly fast, especially when compared with the higher-level hooks in the kernel stack.

But for all the speed it provides, BPF does not offer the simplicity of the older tools. Where one command line was enough to add a simple rule with iptables, a whole C program usually has to be written to perform even the most basic operations. This can make it harder for system administrators who maintain their own set of rules, expressed in the syntax of one of the older frameworks, to actually make the transition, and so harder for users to reach the full performance offered by XDP.
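
To give an idea of what “a whole C program” means for the rule used above (drop IPv4 TCP packets sent to port 80), here is a minimal user space sketch of the parsing logic such a program would need. It cannot be loaded as-is: a real XDP program would receive a `struct xdp_md *` context and rely on kernel headers, whereas the `filter_http()` function, the constants and the local verdict enum below are assumptions made to keep the sketch self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of two XDP verdicts; the values match the kernel's enum xdp_action. */
enum verdict { VERDICT_DROP = 1, VERDICT_PASS = 2 };

#define ETH_HDR_LEN     14      /* Ethernet header without VLAN tag */
#define ETHERTYPE_IPV4  0x0800
#define PROTO_TCP       6
#define HTTP_PORT       80

/* Read a 16-bit big-endian (network order) field. */
static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical entry point: returns a verdict for one raw Ethernet frame. */
int filter_http(const uint8_t *data, size_t len)
{
    const uint8_t *ip;
    size_t ihl;

    /* Every access is preceded by a bounds check, as the BPF verifier demands. */
    if (len < ETH_HDR_LEN + 20)
        return VERDICT_PASS;
    if (load_be16(data + 12) != ETHERTYPE_IPV4)
        return VERDICT_PASS;

    ip = data + ETH_HDR_LEN;
    ihl = (size_t)(ip[0] & 0x0f) * 4;       /* IPv4 header length in bytes */
    if ((ip[0] >> 4) != 4 || ihl < 20)
        return VERDICT_PASS;
    if (ip[9] != PROTO_TCP)
        return VERDICT_PASS;
    if (load_be16(ip + 6) & 0x1fff)         /* non-first fragment: no TCP header */
        return VERDICT_PASS;

    /* Need the first four bytes of TCP: source and destination ports. */
    if (len < ETH_HDR_LEN + ihl + 4)
        return VERDICT_PASS;
    if (load_be16(ip + ihl + 2) == HTTP_PORT)
        return VERDICT_DROP;

    return VERDICT_PASS;
}
```

In an actual XDP program, the length checks would compare pointers against the `data_end` field of the context instead of using a `len` argument, but the structure would stay the same: about thirty lines of C for what iptables expresses in one command.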

In response, solutions are being proposed. They result from this desire to make things more reusable, or from an effort to reduce the complexity that those multiple frameworks impose on developers. Among those efforts to bring some convergence between the different models, there are three that I (briefly) presented.

An Intermediate Representation in the Kernel: The flow_rule Infrastructure

The first one is mostly meant to lighten the load for developers. It does not involve BPF but remains very much on topic. It is a work in progress by Pablo Neira Ayuso: a series of Requests For Comments (RFCs) sent to the Linux networking development mailing list to propose a “flow_rule” infrastructure that would provide an Intermediate Representation (IR) for different kinds of ACLs for hardware offloads. Based on the current Linux flow dissector infrastructure and on TC actions, it would be usable from different front-ends such as the one for hardware filters, Netfilter, or TC. The rules from those front-ends would be translated to the proposed IR, and only this representation would then be forwarded to the driver.

As a result, the driver would only need to be able to parse one format of rules (the IR), making things easier. Additionally, it decouples the front-end from the back-end, hiding TC implementation details from the driver and thus making it easier to add features to TC in the future. The proposal is currently under review. Incidentally, it turns out that the author submitted a seventh version of the set a few minutes after my FOSDEM talk. It might be merged by the time this text is published.
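
The principle can be illustrated with a small user space sketch. Everything in it is hypothetical: the `ir_rule` structure, the two front-end rule formats and the `driver_encode()` function are made up for this post and are not the kernel's actual flow_rule API. The only point is that once both front-ends translate to the same representation, the driver has a single format to parse.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical intermediate representation: one match and one action. */
enum ir_action { IR_ACCEPT = 0, IR_DROP = 1 };

struct ir_rule {
    uint8_t ip_proto;        /* 6 for TCP, 17 for UDP */
    uint16_t dst_port;       /* host byte order */
    enum ir_action action;
};

/* Hypothetical front-end formats, loosely inspired by tc flower and ethtool. */
struct tc_rule { uint8_t ip_proto; uint16_t dst_port; bool drop; };
struct ethtool_rule { bool is_tcp4; uint16_t dst_port; int action; };

/* Front-end 1: a tc-like rule maps directly to the IR. */
struct ir_rule ir_from_tc(const struct tc_rule *r)
{
    struct ir_rule ir = { r->ip_proto, r->dst_port,
                          r->drop ? IR_DROP : IR_ACCEPT };
    return ir;
}

/* Front-end 2: an ethtool-like rule; action -1 means drop, as with "ethtool -U". */
struct ir_rule ir_from_ethtool(const struct ethtool_rule *r)
{
    struct ir_rule ir = { r->is_tcp4 ? 6 : 17, r->dst_port,
                          r->action == -1 ? IR_DROP : IR_ACCEPT };
    return ir;
}

/* The "driver" only ever sees the IR, and packs it into a made-up hardware word. */
uint32_t driver_encode(const struct ir_rule *ir)
{
    return ((uint32_t)ir->action << 24) |
           ((uint32_t)ir->ip_proto << 16) |
           ir->dst_port;
}
```

With this layout, a tc-like rule and an ethtool-like rule that both drop TCP traffic to port 80 end up as the same value in the driver, which never needs to know which front-end the rule came from.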

A New Back-End for iptables: bpfilter

Another, unrelated effort to unify elements is bpfilter. Envisioned as a replacement for Netfilter's back-end, this new BPF-based framework will take iptables (and then possibly nft) rules from user space and communicate them to the bpfilter kernel module. This module will pass them to a special executable running in user space (the bpfilter “User Mode Helper” (UMH)), which will translate the rules into ... a BPF program, of course! And this program will then be attached in the kernel.

This project will make it possible to reuse existing iptables rules and to run them as superfast BPF programs while leaving the current iptables binary untouched. For more details, I invite you to read our blog on the topic.

As of this writing, bpfilter is partly merged in the kernel. The module made it to Linux 4.18. However, it is not complete, in the sense that all the code for actually translating the rules into BPF programs, although sent along with the initial RFC, is missing for now. But it should land eventually, and iptables rules will then be transparently accelerated at no cost!

A User Space Library to Convert Rules: libkefir

Lastly, this presentation was an opportunity to introduce a library I have recently started to work on. Called “libkefir,” as in KErnel FIltering Rules, it is meant to be used in projects that need to convert rules from one of the various formats described earlier into BPF programs. So we could imagine having a simple tool that would take a set of rules for tc flower, or for ethtool hardware filters, or for iptables, or written as pcap-filter expressions, and that would compile the whole set into a single BPF program. This is not so far from the principles behind bpfilter, but there are some differences. The library is user space only, and it only interacts with the kernel in order to load programs through the bpf() system call. But loading BPF programs is not the main objective here (we have libbpf for that anyway). Instead, the functions exposed by the library offer a way to dump a BPF-compatible C program, not just BPF bytecode, so that users can take these C sources and hack them as much as they want. Compiling and injecting the program is trivial anyway, once we have a correct C source program.
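
To make the idea more concrete, here is a minimal sketch of a rule being turned into C sources rather than into bytecode. It is not the library's interface: `struct simple_rule` and `dump_c_filter()` are hypothetical names, the rule only covers a TCP destination port over IPv4, and the generated XDP program is merely an assumption about what such sources could look like (it assumes the `bpf_helpers.h` and `bpf_endian.h` headers shipped with libbpf).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical rule: act on IPv4 TCP packets sent to dst_port. */
struct simple_rule {
    uint16_t dst_port;
    int drop;               /* non-zero: drop matching packets */
};

/*
 * Hypothetical helper: write C sources for an XDP filter into buf.
 * Returns the length of the generated text, or -1 if buf is too small.
 */
int dump_c_filter(const struct simple_rule *rule, char *buf, size_t size)
{
    int n = snprintf(buf, size,
        "#include <linux/bpf.h>\n"
        "#include <linux/if_ether.h>\n"
        "#include <linux/in.h>\n"
        "#include <linux/ip.h>\n"
        "#include <linux/tcp.h>\n"
        "#include <bpf/bpf_endian.h>\n"
        "#include <bpf/bpf_helpers.h>\n"
        "\n"
        "SEC(\"xdp\")\n"
        "int filter(struct xdp_md *ctx)\n"
        "{\n"
        "    void *data = (void *)(long)ctx->data;\n"
        "    void *data_end = (void *)(long)ctx->data_end;\n"
        "    struct ethhdr *eth = data;\n"
        "    struct iphdr *ip = (void *)(eth + 1);\n"
        "    struct tcphdr *tcp = (void *)(ip + 1);\n"
        "\n"
        "    /* Sketch only: assumes no VLAN tag and no IPv4 options. */\n"
        "    if ((void *)(tcp + 1) > data_end)\n"
        "        return XDP_PASS;\n"
        "    if (eth->h_proto != bpf_htons(ETH_P_IP) ||\n"
        "        ip->ihl != 5 || ip->protocol != IPPROTO_TCP)\n"
        "        return XDP_PASS;\n"
        "    if (tcp->dest == bpf_htons(%u))\n"
        "        return %s;\n"
        "    return XDP_PASS;\n"
        "}\n"
        "\n"
        "char _license[] SEC(\"license\") = \"GPL\";\n",
        (unsigned int)rule->dst_port,
        rule->drop ? "XDP_DROP" : "XDP_PASS");

    /* snprintf() reports the length it needed; anything >= size was truncated. */
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

The text produced this way can be written to a file, edited by hand, and then compiled with clang's BPF target like any other BPF program, which is the main difference with a tool that would only emit bytecode.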

We hope that this library, and tools built on it, will make it much easier to test (and adopt) BPF, and that users will find it a handy way to convert what they already know (the rules in those older formats) to what they really want: BPF performance!

Although we intended to publish an early version of this work before FOSDEM, it turned out the library was definitely not ready for publication at that time. But we are working on it, so stay tuned!

Other Hints of Convergence

In regard to convergence of the filtering models, there are a few more projects that are worth mentioning, although I will not dive into the details. P4, for example, can be considered as another abstraction for unifying the representation of switches (and as such, of packet filters). Interestingly, P4 programs can be compiled into XDP programs, too.

DPDK, the main kernel-bypass solution for processing packets in user space, uses yet another set of filtering mechanisms (the rte_flow APIs), one of which is actually based on BPF. Yes, DPDK now implements its own BPF JIT compiler! This highlights how versatile BPF is in regard to filtering capabilities. It also demonstrates that BPF is now sufficiently widespread and understood to attract not only developers familiar with XDP but also DPDK users.

Open vSwitch is a well-known virtual switch, and as such, it is able to drop flows as well. Does it rely on BPF? Not yet. But there is an ongoing effort to create an alternative datapath based on BPF for the software. It is possible that we get rules expressed as Open vSwitch flows translated to BPF programs. To be fair, this may not happen: the authors of this work may decide to keep a datapath closer to the actual Open vSwitch engine and to drive packets with AF_XDP for improved performance, instead of reimplementing all the flows in BPF.

Another interesting lead is that at the lowest level of the kernel, it is also possible to consider eBPF as a heterogeneous processing ABI, as presented by my Netronome colleague, Jakub Kicinski, at the latest Linux Plumbers Conference. This means that BPF could be used as a suitable representation for handling network-processing programs inside the kernel, and possibly offloading them to the hardware. In contrast with the current status of the kernel, the proposals of this talk include decoupling the different JIT compilers from the host architecture (e.g., using a RISC-V JIT on an x86 host). This would ensure that programs can be compiled in the kernel not only for the current system but also for other targets (such as SmartNICs, without having to do all the work in the driver for that SmartNIC).

The very last item I wanted to mention was bpftrace. This is a tool for tracing and monitoring on Linux, based on BPF. It is not related to networking, but what fascinates me is that it relies on a Domain-Specific Language (DSL) in order to hide all the complexity of BPF programs from the end user, and to allow for easy injection of programs. TC flower filters, iptables rules... Could they be seen as the DSL of BPF for network filtering? I like to think so.

Convergence Towards an Easy Transition

A number of tools and ideas have been proposed to help with the filtering mechanisms on Linux, making them converge somehow towards an intermediate representation, which is often BPF-based. Will this help users make the transition? This is clearly the impression we got at the latest conferences we attended. If you are still hesitating about BPF, be sure of two things: hardware offloads have a bright future ahead, and BPF will have an increasingly important role to play in them (and in filtering in general). BPF on your infrastructure is definitely a safe investment for the future!