An orange (NIC) and a pear (general-purpose CPU) cannot simply be slapped together to make an apple (SmartNIC). The same logic applies to combining a NIC chip and an FPGA chip to create a SmartNIC. The apple (or the SmartNIC) has unique characteristics that cannot be realized by combining two things that happen to be in your hands. An industry thought leader has given a name to these special and unique characteristics in silicon designed for an application like the SmartNIC.
John Hennessy, chairman of Alphabet and co-founder of MIPS Computer Systems, calls them “domain-specific architectures.” Hennessy argues that trying to scale modern server applications with general-purpose CPUs and multi-core processing is like beating a dead horse.
Applications or use cases – i.e., the “domain” – dictate the architectural requirements for a SmartNIC. Domain-specific accelerators used in a SmartNIC need to do only a few tasks, but they must do them extremely well. These tasks encompass a domain of applications, namely SDX. These applications and use cases call for effective use of parallelism in the processing architecture, more effective use of memory bandwidth, tolerance of memory latency, and a programming model that supports unique processing and memory access capabilities.
Ignoring domain-specific architectural requirements in a SmartNIC product, and throwing commonly available architectures at the problem (NIC chips, Arm processors, FPGAs), is tantamount to combining an orange and a pear and trying to claim it is now an apple. Here is why.
Efficient processing and fast access to flow, rule and buffer tables stored in memory are critical for all networking and security applications. With virtual switching and related security applications such as stateless ACLs or stateful firewalls, access to tables of rules, counters, micro-flows and sessions, and caching of those flows and sessions, is vital to performance at scale. Processing designed to maximize flow and session updates per second, or insertions of new entries per second, has a direct impact on the ability of applications to perform and scale. Similarly, in IPS/IDS applications, access to and processing of data in firewall and forwarding tables matter. With web/caching applications, the same applies to forwarding rules and to processing related to SSL and TCP sessions. Storage applications that work over the network share these networking and security requirements. In addition, other domain-specific acceleration features matter for storage, such as support for erasure coding, compression and deduplication.
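To make the table-access pattern described above concrete, here is a minimal software sketch of a micro-flow cache sitting in front of an ordered, stateless ACL rule list. It is an illustration only, not how any particular SmartNIC implements it: the names `FlowKey`, `Rule` and `FlowTable`, the 5-tuple key and the default-drop policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying one micro-flow."""
    src: str
    dst: str
    proto: int
    sport: int
    dport: int


@dataclass
class Rule:
    """Hypothetical stateless ACL rule; None means wildcard."""
    src_net: str
    dst_net: str
    proto: Optional[int]
    dport: Optional[int]
    action: str

    def matches(self, key: FlowKey) -> bool:
        return (ip_address(key.src) in ip_network(self.src_net)
                and ip_address(key.dst) in ip_network(self.dst_net)
                and self.proto in (None, key.proto)
                and self.dport in (None, key.dport))


class FlowTable:
    """Exact-match flow cache in front of a slow, ordered rule scan."""

    def __init__(self, rules, default_action="drop"):
        self.rules = rules
        self.default_action = default_action
        self.cache = {}      # FlowKey -> [action, packet_count]
        self.hits = 0
        self.misses = 0

    def classify(self, key: FlowKey) -> str:
        entry = self.cache.get(key)
        if entry is None:
            # Slow path: first packet of a flow walks the rule list,
            # then a new cache entry is inserted for the flow.
            self.misses += 1
            action = next((r.action for r in self.rules if r.matches(key)),
                          self.default_action)
            entry = self.cache[key] = [action, 0]
        else:
            # Fast path: one exact-match lookup per packet.
            self.hits += 1
        entry[1] += 1        # per-flow packet counter update
        return entry[0]
```

In this sketch every packet costs at least one table lookup and one counter update, and every new flow costs a rule scan plus an insertion. Those are the per-second rates the paragraph above identifies as the limits on scale.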
Most, if not all, domain-specific features require a multi-threaded, latency-tolerant processing and memory architecture, supported by a domain-specific programming model. For example, the choice between coherent and non-coherent memory access depends on how tasks (e.g., control plane vs. data plane) are divided between the general-purpose CPU and the domain-specific accelerator. Just as there is C for general-purpose CPUs and optimized C for parallel computing on GPUs in the graphics and machine learning domains, optimized C for parallel computing on domain-specific networking chips delivers much-needed application performance and computing efficiency. Alternatively, the even more domain-optimized, open and Linux-friendly P4 or eBPF programming methods can be used.
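A rough way to see why multi-threading provides latency tolerance is a simple back-of-the-envelope model: if each packet needs some compute time and then stalls on a table lookup in memory, a core stays busy only when enough other threads have work ready during the stall. The sketch below assumes that idealized model, with no contention and free thread switching. The function names and the 20 ns / 100 ns figures are illustrative assumptions, not measurements of any product.

```python
import math


def threads_to_hide_latency(compute_ns: float, memory_ns: float) -> int:
    """Threads needed so the core always has work while others wait on memory."""
    return math.ceil((compute_ns + memory_ns) / compute_ns)


def core_utilization(threads: int, compute_ns: float, memory_ns: float) -> float:
    """Fraction of time the core does useful work under the idealized model."""
    return min(1.0, threads * compute_ns / (compute_ns + memory_ns))


# Illustrative numbers only: 20 ns of compute per packet, 100 ns memory stall.
needed = threads_to_hide_latency(20, 100)
single = core_utilization(1, 20, 100)
full = core_utilization(needed, 20, 100)
```

Under these assumed numbers a single-threaded core would sit idle most of the time, while a handful of threads per core is enough to keep it busy. That is the intuition behind pairing many threads with latency-tolerant memory access.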
A SmartNIC built from a traditional NIC chip combined with general-purpose CPU cores (including multi-core CPUs) or an FPGA does not by default include these domain-specific architectural features. Nor does it support domain-specific programming requirements. Significant engineering investment and time are needed to create a domain-specific architecture and related products. Some may argue that FPGAs can be programmed to serve specific applications and so can be made domain specific; Microsoft, for example, uses FPGA-based SmartNICs in its Azure data centers. But there is a reason Hennessy does not prescribe FPGAs as the answer to the slowing of Moore’s Law. Implementing logic in an FPGA takes 20X the die size of domain-specific silicon. RTL programming and closing timing for complex logic require significant development effort, and only very large companies can afford the investment. And only very large cloud infrastructures, commanding high unit volumes and a diverse set of applications, can amortize the significantly higher costs of FPGAs. FPGAs are not a solution for most data centers.
As you consider SmartNICs for your data center applications, it is essential to verify these requirements with vendors. It is important not to get carried away by the speeds and feeds of the peripherals – such as the MAC, SerDes, PCIe and raw speed of access to external memory (e.g., DDR4 or HBM). These matter, much as high-performance tires matter on a race car, but paying attention to what is inside, namely the engine, is significantly more important when it comes to maximizing server productivity. That is where domain specificity, which Hennessy emphasizes as the way of the future, resides.