Agilio Performance, Explained

By Netronome | Mar 25, 2016

In many of the discussions I have with potential Agilio users and customers, the following question frequently comes up: “What is different about the Agilio platform and architecture that allows it to outperform other server-based networking solutions?” In networking, and especially in stateful networking, thread capacity is king. Driving high packet rates in networking applications requires a high degree of parallelism through multi-threading.

A challenge with executing networking applications on general-purpose compute processors (e.g. multicore MIPS, x86, or Arm cores) is that they lack sufficient thread capacity within tolerable power and cost profiles. Due to this low thread capacity, these cores must rely on an on-chip cache hierarchy (L1, L2, L3 caches) to achieve best-case performance. As a result, these cores tend to be larger (in cost and power) and offer only one or two threads each. If the application is continuously operating on a large working set, as is the case for networking applications with exact match tables with many entries, then on-chip caches do not help. As packet rates increase, newly arriving packets keep evicting useful data from the on-chip caches, forcing the application to stall on the CPU while critical data is fetched from external memory (e.g. DDR3). This is where parallelism can help.
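To make the working-set argument concrete, here is a minimal sketch in Python that simulates a small LRU cache in front of a flow table. It is not a model of any real processor: the function name lru_hit_rate, the cache size, the flow counts, and the uniformly random traffic pattern are all assumptions chosen for illustration. Real traffic is usually more skewed, so actual hit rates will differ.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_flows, cache_entries, num_packets, seed=0):
    """Return the fraction of flow-table lookups served from an LRU cache.

    Assumption: each packet belongs to a flow picked uniformly at random.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    cache = OrderedDict()      # keys ordered from least to most recently used
    hits = 0
    for _ in range(num_packets):
        flow = rng.randrange(num_flows)
        if flow in cache:
            hits += 1
            cache.move_to_end(flow)        # mark as most recently used
        else:
            cache[flow] = True             # miss: fetch from "external memory"
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / num_packets


# Working set fits in the cache: almost every lookup hits.
small_table = lru_hit_rate(num_flows=100, cache_entries=1000, num_packets=20000)

# Working set far exceeds the cache: almost every lookup misses.
large_table = lru_hit_rate(num_flows=1_000_000, cache_entries=1000, num_packets=20000)

print(f"small table hit rate: {small_table:.3f}")
print(f"large table hit rate: {large_table:.3f}")
```

Under these assumptions, once the number of active flows is much larger than the cache, nearly every lookup goes to external memory, which is the stall described above.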

Netronome has implemented a large number of cores on its network flow processor (NFP), which drives the Agilio platform. Each RISC core on the NFP is 8-way threaded and is supported by a networking-optimized instruction set. The NFP cores are smaller in size, power, and cost, which allows an Agilio CX SmartNIC to support 48 cores (384 threads) while consuming less than 25W. This means that 384 packets can be processed in parallel, compared to only 24 packets in parallel on a 12-core dual-threaded general-purpose processor. The parallelism allows application execution to continue while networking data structures are being accessed. When a memory I/O operation is in progress, the NFP core can efficiently swap to another thread and continue execution, so no cycles are wasted waiting for the memory access to complete. This means the NFP cores are continuously operating on packets without downtime, whereas a general-purpose processor with a low thread count is often left waiting (a.k.a. stalling) each time memory is accessed. This is a very important architectural characteristic that contributes to Netronome’s leadership in high-performance server networking.
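The arithmetic and the latency-hiding idea above can be sketched in a few lines of Python. This is a simplified back-of-the-envelope model, not a description of the NFP's actual scheduler: the helper names and the cycle counts (50 cycles of compute and 250 cycles of memory wait per packet) are hypothetical, and thread switching is assumed to cost nothing.

```python
def packets_in_flight(cores, threads_per_core):
    """Number of packets that can be processed in parallel, one per thread."""
    return cores * threads_per_core


def core_utilization(threads, compute_cycles, memory_cycles):
    """Fraction of time a core spends doing useful work.

    Assumptions: each packet needs compute_cycles of work followed by
    memory_cycles of waiting, and the core switches to another ready
    thread at zero cost whenever the current thread waits on memory.
    """
    busy = threads * compute_cycles
    period = compute_cycles + memory_cycles
    return min(1.0, busy / period)


# Thread counts taken from the text.
print(packets_in_flight(48, 8))   # Agilio CX SmartNIC: 384
print(packets_in_flight(12, 2))   # 12-core dual-threaded processor: 24

# Hypothetical cycle counts, for illustration only.
for threads in (1, 2, 8):
    util = core_utilization(threads, compute_cycles=50, memory_cycles=250)
    print(f"{threads} thread(s): {util:.0%} busy")
```

With these illustrative numbers, a single-threaded core is busy about 17% of the time, a dual-threaded core about 33%, and an 8-way threaded core has enough ready threads to stay fully busy while memory accesses are outstanding.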

Many papers and articles have been written on the relationship between parallelism and networking. While this blog entry does not go into extreme detail on processor implementation and architecture, it aims to explain at a high level why Agilio SmartNICs show such high performance for networking applications like Open vSwitch (OVS) and Contrail vRouter compared to identical implementations on general-purpose compute platforms.

Robert Truesdell