
Optimizing BPF: Smaller Programs for Greater Performance

By Quentin Monnet | Jan 14, 2020

For decades, the evolution of computing performance was governed by Moore’s law, which stated that every two years the size of transistors would decrease and their number in dense integrated circuits would double. Then physics caught up, and the industry started to add processing cores to compensate for the slowdown of that curve (or, to be more accurate, for the end of frequency scaling). Of course, using adequate hardware to offload specific tasks from the host computer also provides drastic improvements in terms of computing power – this is why we offer our many-core Agilio SmartNICs for accelerating network processing. But are transistors the only thing we should try to shrink if we want to go faster? Even constrained by processor clock frequencies, we can take shortcuts to reach the finish line: if we cannot run faster, then let’s simply run less! With fewer instructions to execute, programs complete sooner. So, how can we make programs smaller?

This post is actually an update from BPF Wonderland. Development on Linux, and on BPF in particular, moves fast, and there are new features we would like to cover since our last update back in February. It so happens that several of those features tend to minimize the size of the BPF programs we run on the SmartNICs, which is essential for performance, as we have seen, but also for flexibility, because offloading BPF programs to the hardware comes with additional constraints in terms of available memory. Smaller programs are easier to fit in the code stores used by the NFP microengines. Let’s follow the White Rabbit down the hole and see how we can shrink those programs. No mushrooms involved.

The first major change brought to the BPF infrastructure is the support for 32-bit architectures. This is not to be confused with JIT-compilers for 32-bit architectures, several of which (nfp, ARM32, x86_32) have been in the kernel for some years now. Here we are dealing with the generation of programs relying exclusively (or mostly) on 32-bit operands, which makes them better adapted to 32-bit architectures and allows us to get rid of so-called “zero-extension” instruction sequences. To understand what this is about, we have to look at how the eBPF ISA (instruction set architecture) handles its registers. eBPF works with ten 64-bit registers (plus one used as a stack pointer), but many operations only involve 32-bit long “sub-registers”, which are the low 32 bits of those registers. This is the case for 32-bit arithmetic operations, for example, and for 8, 16 or 32-bit loads into registers. Support for 32-bit operands for jump operations has also been added to the verifier recently. But when a low sub-register is written, the ISA specifies that the corresponding high 32-bit sub-register (the other half of the register) has to be cleared (set to zero) for consistency reasons. This is a useful constraint with little performance impact on 64-bit architectures, which can work with both 64-bit and 32-bit long (sub-)registers and update the whole register in a single operation. But for 32-bit architectures, this clearing may not be needed if the program only works with 32-bit sub-registers and the high sub-registers present no risk of interfering with the operations. For example, let us have a look at the following BPF snippet:


01: # r0 initialized with random data
02: r1 = r0
03: r0 = 0x100000000ULL
04: w0 += w1 # wX represents the lower 32 bits of rX
05: r0 >>= 32
06: return w0

Here the high 32 bits of register r0 are cleared after the 32-bit addition: the value returned is 0 and not 1. Clearing those bits is necessary, as we read the full 64 bits of the register when we shift it right on line 05 to produce the value returned at the end. This clearing usually translates, at JIT-compilation time, into the value in r0 being shifted by 32 bits to the left, then 32 bits back to the right (a logical shift, so the high bits come back as zeroes). But if instead we had the following lines:

…
05: w2 = w0
06: # do something with r2
07: r0 = 0x100000000ULL
08: return w0

Then in that case we do not read the high bits of r0 before overwriting them a few lines later, and there is no need to clear them. For these patterns, the clearing constraint may result in additional operations being added just to set the high sub-registers to zero. Where the nfp JIT-compiler is concerned, we estimated that such “zero-extension” sequences can represent up to 40% of the generated code! Obviously, we had to do something about that before the Queen of Hearts noticed.
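
To make the cost concrete, here is a rough sketch, in the same pseudo-assembly as the snippets above, of the clearing sequence a JIT-compiler may emit after a 32-bit operation when it cannot prove that the high bits are never read:

w0 += w1     # 32-bit operation on the low sub-register
r0 <<= 32    # zero-extension sequence: shift the value up...
r0 >>= 32    # ...then logically back down, leaving the high 32 bits cleared

When nearly every 32-bit operation drags such a pair along, the overhead adds up quickly, hence the figure above.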

Adding support for 32-bit BPF programs and getting rid of the zero-extension sequences required two kinds of changes. First, we had to make sure that BPF programs would use 32-bit operands wherever possible when the target is a 32-bit architecture. To that end, we taught LLVM to build BPF programs by relying principally on the low 32-bit sub-registers. This has been possible since LLVM v7, by passing the option -mattr=+alu32 to llc when compiling BPF programs. Second, we had to improve the kernel verifier to teach it to recognize the sequences where zero-extension instructions used to be added but are not actually necessary. The verifier tracks the state of BPF registers and is now able to mark the sequences where zero-extension is mandatory. For 32-bit host JIT-compilers (or even for some 64-bit architectures that may need it), the verifier then directly inserts the corresponding zero-extension instructions wherever required, and JIT-compilers do not even have to take care of it anymore. Zero-extension instructions are not added by the verifier in the case of hardware offload, because the JIT-compiler usually does a little more work and needs the initial program unchanged to perform all its optimizations. But it becomes trivial for the JIT-compiler in the nfp driver to omit clearing the high 32 bits of the registers if the sequence has not been marked as requiring zero-extension. So if you compile BPF programs to offload them to the hardware, compile with the -mattr=+alu32 option: programs will be smaller and faster!
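
For reference, here is one way to pass that option: have clang emit LLVM bitcode, then let llc perform the BPF code generation. This is a sketch of a typical two-step pipeline (prog.c and prog.o are placeholder file names):

$ clang -O2 -emit-llvm -c prog.c -o - | \
      llc -march=bpf -mattr=+alu32 -filetype=obj -o prog.o

As mentioned above, the -mattr=+alu32 option is available from LLVM v7 onwards.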

Another essential change was introduced in the kernel recently. Starting with Linux 5.3, BPF programs can use bounded loops! What are “bounded” loops, one may ask? They are regular loops over a sequence of instructions, with a bound on the number of iterations. This means we must know for certain that there is a finite number of iterations after which the loop terminates, and more importantly, the verifier must be able to deduce this termination from the BPF code. This is of course to avoid infinite loops – nobody wants their kernel to be stuck on tea time with the Hatter forever. In practice, this works well for all loops of the kind for (i = 0; i < n; i++) {} where n does not change while the loop runs and i is not modified inside the loop body. Previously, it was possible to use such loops in the C code, on the condition of annotating them with the #pragma unroll directive, which would ask clang to unroll them at compilation time. So when the feature landed, some people thought that it was a minor change, and that it would mostly allow us to remove the pragma directive. Far from it! Unrolling at compilation time means that the compiler must know in advance how many times the code will loop, and then “unrolls” it by duplicating the instructions from the loop body that many times. By contrast, bounded loops also have a constraint on the number of iterations, but the bound no longer needs to be known at compilation time. For example, when processing network packets with BPF, bounded loops allow us to loop over each byte of the packet:

int xdp_prog(struct xdp_md *ctx)
{
  void *data_end = (void *)(long)ctx->data_end;
  void *data = (void *)(long)ctx->data;
  unsigned int i, len = data_end - data;

  for (i = 0; i < len; i++) {
    …
  }

  return XDP_PASS;
}

This example could not be implemented with unrolled loops, because we would need to know the packet length at compilation time, or at least to support indirect jumps (which eBPF does not have). But it is perfectly acceptable with bounded loops! And needless to say, having real loops instead of unrolling them also contributes noticeably to shrinking the code. This is the same idea as for BPF-to-BPF function calls, for which we added support to the nfp JIT-compiler last year: we avoid duplicating redundant instructions. But supporting loops in the nfp driver was much easier: because all the work happens in the verifier, which is now able to recognize and allow bounded loops, and because our driver works in cooperation with the kernel, we got support for free! Just as we keep benefiting from most improvements added to the BPF infrastructure, with no change required for hardware offload.
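
For comparison, here is a minimal sketch of what had to be done before bounded loops landed: the iteration count has to be a compile-time constant so that clang can unroll the body. The function name and the MAX_BYTES constant are hypothetical, chosen only to mirror the example above:

#define MAX_BYTES 64  /* compile-time constant: clang needs it to know how many copies of the body to emit */

int xdp_prog_unrolled(struct xdp_md *ctx)
{
  void *data_end = (void *)(long)ctx->data_end;
  void *data = (void *)(long)ctx->data;
  unsigned int i;

#pragma unroll
  for (i = 0; i < MAX_BYTES; i++) {
    if (data + i + 1 > data_end)  /* bounds check on every access, as required by the verifier */
      break;
    /* … process byte i … */
  }

  return XDP_PASS;
}

Every unrolled iteration is written out in the object file, so the program grows with the bound; with bounded loops, the loop body appears only once, whatever the packet length.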

Support for 32-bit architectures and bounded loops are the two most significant changes that were brought to BPF internals recently. Of course, we keep working hard to make the BPF experience as smooth and as fast as possible. Some new features do require an update in the nfp JIT-compiler or the firmware. Work in progress includes support for the BTF format used for embedding debug information with BPF programs, batching and accelerating map operations, and improvements to the tools (bpftool features, bug fixes and packaging for Ubuntu; the release of Libkefir). The firmware team is also working hard – did you hear that they open-sourced the CoreNIC firmware, including the BPF offload support component? We all labor hard to make programs small, and BPF hardware offload something big. Even though, I must admit, it will not bring us any closer to the answer to the March Hare’s question: Why is a raven like a writing desk?