
Why Bare Metal Does Not Have To Be Nude

By Niel Viljoen | Oct 5, 2016

Oracle Chairman Larry Ellison made a splash recently during Oracle OpenWorld 2016 by announcing that Oracle will give Amazon a run for its money in public cloud provisioning. The core differentiator he mentioned is giving users access to “bare metal servers” that are far more performant than Amazon’s AWS offerings. This is achieved by “software that runs on our special network interface adapter card,” according to Ellison.

In practice, this means moving the hypervisor’s network processing overhead into the network interface card rather than consuming x86 cores on the server. Netronome’s Agilio SmartNICs and software are examples of products that enable this type of offload.

This highlights a dirty little secret that IaaS vendors have not been telling their customers (to be fair, they have not really been hiding it ☺): they have been “stealing” CPU cycles from customers to run the network infrastructure inside the server. In other words, when a customer buys a virtual CPU (vCPU) from an IaaS vendor, a vCPU provisioned using (n) CPU cores does not deliver the same performance as a real CPU with the same number of cores. Why does this happen?


Customer View of vCPUs Diagram 2

Physical Cores Stolen for Networking Infrastructure Diagram


The diagram above gives us a clue. At the bottom of the diagram is a server with a given number of cores (n). However, when those cores are sold to customers, there has to be a way to network them. In most IaaS environments this is done in software, using virtual switches (vSwitches) to create the networking infrastructure. The processing required to support this consumes an increasingly significant number of the cores as complexity and network speeds increase. As an example, we have published validated numbers showing up to 12 cores being consumed when performing 40Gb/s networking. The stark reality is that on a 24-core physical server (a typical industry standard), this leaves only half of the cores for customer workloads, so you would need to DOUBLE the number of servers to achieve a given task.
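To make the arithmetic concrete, here is a rough sketch of the server count calculation. The 24-core server and the 12 cores consumed by networking come from the numbers above; the 240-core workload and the function names are hypothetical, chosen only for illustration.

```python
import math


def usable_cores(physical_cores: int, networking_cores: int) -> int:
    """Cores left for customer workloads after the vSwitch takes its share."""
    if not 0 <= networking_cores <= physical_cores:
        raise ValueError("networking_cores must be between 0 and physical_cores")
    return physical_cores - networking_cores


def servers_needed(required_cores: int, physical_cores: int, networking_cores: int) -> int:
    """Number of servers needed to supply required_cores of customer compute."""
    usable = usable_cores(physical_cores, networking_cores)
    if usable == 0:
        raise ValueError("no cores left for customer workloads")
    return math.ceil(required_cores / usable)


# Hypothetical workload that needs 240 cores of real compute.
WORKLOAD_CORES = 240

# Networking offloaded: all 24 cores per server are available.
offloaded = servers_needed(WORKLOAD_CORES, physical_cores=24, networking_cores=0)

# Networking on the host: 12 of 24 cores are consumed at 40Gb/s.
on_host = servers_needed(WORKLOAD_CORES, physical_cores=24, networking_cores=12)

print(offloaded, on_host)  # prints: 10 20
```

With half of each server given over to networking, the same workload takes twice as many servers, which is the doubling described above.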

With the server at the top of the diagram, we show how the “lie” is then perpetrated. vCPUs are sold, but their connection to real CPUs is now tenuous because the vCPUs do not have the equivalent performance of the physical server CPUs.

Now enter Oracle. Their primary focus is on making available the true complement of processor cores to their customers. How do they do this?

The secret can be found in the humble network interface card installed in the server. As shown in the diagram below, all the processing for the network infrastructure is moved into a new class of network interface card, which many refer to as a SmartNIC (Netronome’s Agilio line of server adapters represents this class of product). SmartNICs typically have processing cores that are able to switch, process and secure traffic to the virtual switches. As a result, vCPUs can now perform at the same level as physical CPUs, which leads to significant performance gains and potential savings for customers. This is what Oracle and the industry term “bare metal servers.” BUT the story does not end there!


Customer View of vCPUs Diagram


Moving the virtual switching, security and monitoring into a SmartNIC, as in the figure above and as done with “bare metal servers,” does have downsides. These are particularly acute for carrier customers, smaller providers and enterprise customers that have to operate their own clouds (hybrid or private).

Why? A large company such as Oracle can afford to build and maintain all of its own software ($$$$$), but others cannot necessarily do the same. They need to rely on the ecosystem of vendors, open source projects and internal development to achieve their goals.

Some ecosystem examples include:

1. OpenStack: a rapidly growing toolset for managing clouds that relies on standard Linux bridges, iptables and Open vSwitch-based mechanisms to provision the cloud infrastructure.

2. Security policy management tools from OpenStack and others.

3. Analytics applications and interfaces, which are built for current vSwitches and will be difficult to port to the Oracle model.

4. Configuration tools such as Chef and Puppet, which, where they relate to network automation, are being built around standard hypervisors.

Is there a better way?

Yes. The simple answer is to keep the brains of the hypervisor and vSwitch running exactly as they are today on the server CPU, so that the ecosystem is not affected, but to move all of the complex processing into the SmartNIC. This leaves all the management applications mentioned above in place, yet delivers 95% of the benefits of the Oracle “bare metal server” model. The use of VMs, and the efficiency gained by not “stealing” CPUs, remain intact. This is also a more scalable model, considering that it does not require a veritable army of software developers to maintain and evolve!
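One way to picture this split is a toy model in which the vSwitch on the host keeps the policy and the SmartNIC only executes rules it has been handed. The sketch below is a simplified illustration under that assumption, not a description of how Agilio or any particular vSwitch is implemented; the class names, the two-field flow key and the "install a rule on first miss" behavior are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Simplified match key: (source VM, destination VM). Real switches match on far more.
FlowKey = Tuple[str, str]


@dataclass
class SmartNic:
    """Hypothetical stand-in for the NIC datapath: it only applies rules it is given."""
    flow_table: Dict[FlowKey, str] = field(default_factory=dict)
    packets_handled: int = 0

    def lookup(self, key: FlowKey) -> Optional[str]:
        action = self.flow_table.get(key)
        if action is not None:
            self.packets_handled += 1  # handled without touching a host core
        return action


@dataclass
class HostVSwitch:
    """Hypothetical stand-in for the vSwitch 'brains' that stay on the server CPU."""
    policy: Dict[FlowKey, str]
    nic: SmartNic
    packets_handled: int = 0

    def handle_miss(self, key: FlowKey) -> str:
        self.packets_handled += 1
        action = self.policy.get(key, "drop")  # assumed default: drop unknown flows
        self.nic.flow_table[key] = action      # push the decision down to the NIC
        return action


def forward(vswitch: HostVSwitch, key: FlowKey) -> str:
    """Try the NIC first; fall back to the host vSwitch only on a miss."""
    action = vswitch.nic.lookup(key)
    if action is None:
        action = vswitch.handle_miss(key)
    return action


nic = SmartNic()
vswitch = HostVSwitch(policy={("vm1", "vm2"): "allow"}, nic=nic)

for _ in range(1000):
    forward(vswitch, ("vm1", "vm2"))

print(vswitch.packets_handled, nic.packets_handled)  # prints: 1 999
```

In this toy model the management tools would still talk to `HostVSwitch`, exactly as they do today, while nearly all per-packet work lands on the NIC.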


Customer View of vCPUs Benefits Diagram


In other words, get BARE METAL performance without being NUDE. NUDE implies a server with little ability to use the ecosystem of standard open source software, and a vSwitch that makes the use of OpenStack and VMs impossible.

With Netronome Agilio products, get BARE METAL performance without having to go NUDE!