
InfiniBand or Ethernet? Which better suits AI networking fabric?

Generative artificial intelligence (AI) is rapidly growing in applications and popularity, and with it come AI infrastructure buildouts and demand for high-scale compute resources.

China’s hyperscalers are splashing billions of dollars on Nvidia gear to keep pace with Western hyperscalers, which are building larger and larger AI supercomputers to accommodate the rapidly growing datasets used to train their AI models.

The networking challenge

This massive buildout of AI infrastructure is creating the need for a high-performance networking fabric. As a result, inter-GPU connectivity has become a crucial element for the performance of AI workloads and efficiency of the AI infrastructure.

Though it accounts for less than 10% of the typical cost of a large AI compute cluster (the GPUs hold the lion’s share of the cost), an underperforming networking infrastructure can reduce the performance of the entire AI cluster, measured in job completion time (JCT), by tens of percent.

In her keynote at the Open Compute Project (OCP) Summit in late 2022, Alexis Bjorlin, Meta’s VP of infrastructure, highlighted the growing gap between compute capabilities and the surrounding network capabilities. She also shared striking figures on the percentage of compute time spent idle, waiting for the network to deliver the needed AI payloads. These compute resources are simply wasted while waiting, causing a longer JCT or requiring a larger (and much more costly) compute cluster to perform a given task on time.

This networking bottleneck is an unacceptable situation in which an expensive, strategic infrastructure (AI compute) is limited by a secondary, much less costly element: the network.

Effect on overall performance

To understand how the networking fabric affects the overall performance of an AI cluster, we need to take a look at how the AI training process works.

The training process is far too compute-intensive to run on a single compute element (a GPU or other AI processor), so it runs in parallel on multiple GPUs. The number of GPUs running the same job in a single cluster has grown from tens to hundreds, and lately to thousands and even tens of thousands.

The way this is achieved, as explained by Nidhi Chappell, Microsoft’s GM of Azure Generative AI and HPC platforms, is by partitioning the AI computation workload across all those GPUs. The workload runs in parallel, and in synchronization phases known as allreduce, information is shared, or synced, between the GPUs.

This information, transferred between GPUs, is carried over a designated networking fabric, often referred to as the back-end network, which connects all the GPUs in the cluster. Due to the traffic volume, this network connection is per GPU, not per server (a single server can accommodate up to eight GPUs).
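As a simplified illustration, the sketch below shows what such an allreduce phase can look like at the framework level, using PyTorch’s torch.distributed API over the NCCL backend. The function name sync_gradients and the model dimensions are invented for this example, and the script assumes a launcher such as torchrun sets the usual RANK, LOCAL_RANK and WORLD_SIZE environment variables.

```python
import os

import torch
import torch.distributed as dist


def sync_gradients(model: torch.nn.Module) -> None:
    """Average every gradient across all GPUs participating in the job."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Each GPU contributes its local gradient; the back-end network
            # carries this traffic until every rank holds the same sum.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher (e.g. torchrun)
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    data = torch.randn(64, 4096, device="cuda")

    loss = model(data).sum()
    loss.backward()          # compute local gradients on this GPU's shard of the data
    sync_gradients(model)    # allreduce phase: GPUs stall here if the network is slow

    dist.destroy_process_group()
```

Every all_reduce call in this loop is traffic on the back-end network; if that network is congested, the GPUs sit idle at exactly this point, which is the kind of idle time described above.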

AI networking options

There are several networking technologies that support this AI fabric infrastructure.

  1. InfiniBand — the dominant technology so far. InfiniBand was purpose-built for supercomputer connectivity and is, in practice, an Nvidia walled garden. While it provides adequate performance, it is best suited to an isolated infrastructure. According to Nvidia’s CEO, Jensen Huang, who described this market evolution in his keynote at Computex 2023, for generative AI to grow and become present in public datacenters and cloud infrastructure, a move toward an Ethernet-based fabric needs to occur.
  2. Ethernet — the de-facto global standard for any connectivity within the datacenter. The issue with Ethernet, though, is that it is, by nature, a lossy technology. This becomes dominant as the number of elements connected to an Ethernet network grows and, even more so, as the traffic utilization of the network exceeds 30%-50%. Under these conditions, congestion starts to occur in different parts of the network, and phenomena such as head-of-line (HOL) blocking and incast cause jitter (variation in latency) and frame/packet loss; a toy simulation of the incast effect appears right after this list. This means the AI job is delayed, causing a longer job completion time, and, in cases of severe packet loss, the job could be halted, forcing it to “rewind” to the last checkpoint or to restart altogether.
  3. DDC — to get an Ethernet-based solution that, like InfiniBand, is lossless, predictable and consistent in its performance, a different approach to Ethernet infrastructure is required. Such an approach was introduced, for a completely different use case, by the OCP when it accepted the Distributed Disaggregated Chassis (DDC) specification for high-scale networking.
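First, the toy incast simulation referenced in item 2. It is purely illustrative (all parameters are invented, and it does not model any specific switch): many senders answer the same request at the same moment, the shared output port’s buffer overflows, and the loss rate grows with the fan-in.

```python
import random


def incast_loss(num_senders: int, burst_frames: int, buffer_frames: int,
                drain_per_tick: int, event_prob: float,
                ticks: int = 100_000, seed: int = 0) -> float:
    """Fraction of frames dropped at one output port under synchronized bursts."""
    rng = random.Random(seed)
    queue = sent = dropped = 0
    for _ in range(ticks):
        if rng.random() < event_prob:
            # Incast: every sender answers the same request simultaneously.
            arrivals = num_senders * burst_frames
            sent += arrivals
            for _ in range(arrivals):
                if queue < buffer_frames:
                    queue += 1       # frame buffered
                else:
                    dropped += 1     # buffer full: frame lost
        queue = max(0, queue - drain_per_tick)   # port drains at line rate
    return dropped / sent if sent else 0.0


for fan_in in (2, 4, 8, 16):
    loss = incast_loss(num_senders=fan_in, burst_frames=8,
                       buffer_frames=64, drain_per_tick=4, event_prob=0.05)
    print(f"{fan_in} senders: {loss:.1%} of frames lost")
```

The details differ in a real fabric, but the qualitative behavior is the same: the higher the fan-in and the fuller the buffers, the more frames are lost and the longer the job takes.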

In Ethernet DDC, the external interfaces are Ethernet, but the internal (backplane) connectivity is a cell-based, scheduled fabric that is lossless and predictable and is distributed across multiple white boxes.

In this architecture, there are two types of white boxes: the NCP (network cloud packet forwarder, also referred to as DCP) and the NCF (network cloud fabric, or DCF). Unlike in a chassis-based solution, NCP white boxes are not bound by a physical or mechanical enclosure; they can be deployed across multiple racks in the data center, each acting as a top-of-rack switch that connects the servers within its rack. Inter-rack connectivity runs over the connections between the NCPs and the NCFs, which are cell-based and hence lossless and predictable.
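The toy sketch below (a simplified illustration, not DriveNets code; the cell format, sizes and function names are all invented) shows the basic idea behind a cell-based fabric: the ingress NCP segments a packet into fixed-size cells, sprays them evenly across all NCFs so no single fabric link is overloaded, and the egress NCP reassembles the packet in order.

```python
from dataclasses import dataclass
from itertools import cycle

CELL_SIZE = 256  # bytes per cell (illustrative value)


@dataclass
class Cell:
    packet_id: int
    seq: int        # position of this cell within the packet
    total: int      # total number of cells in the packet
    payload: bytes


def ingress_ncp(packet_id: int, packet: bytes) -> list[Cell]:
    """Segment a packet into fixed-size cells at the ingress NCP."""
    chunks = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    return [Cell(packet_id, seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def spray_over_fabric(cells: list[Cell], num_ncfs: int) -> dict[int, list[Cell]]:
    """Distribute cells round-robin across all NCFs, regardless of flow size."""
    links: dict[int, list[Cell]] = {i: [] for i in range(num_ncfs)}
    for cell, ncf in zip(cells, cycle(range(num_ncfs))):
        links[ncf].append(cell)
    return links


def egress_ncp(links: dict[int, list[Cell]]) -> bytes:
    """Collect cells from all fabric links and restore the original order."""
    arrived = [cell for link in links.values() for cell in link]
    arrived.sort(key=lambda c: c.seq)
    return b"".join(c.payload for c in arrived)


packet = bytes(range(256)) * 9                 # a 2,304-byte payload
cells = ingress_ncp(packet_id=1, packet=packet)
links = spray_over_fabric(cells, num_ncfs=4)
assert egress_ncp(links) == packet             # reassembled intact at the egress NCP
```

Because every flow is broken into cells and spread across all NCFs, the fabric load stays balanced, which is part of what keeps its behavior predictable.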

The connectivity scheme is illustrated in the following drawing:

[Figure: the DDC connectivity scheme, with NCPs connected through NCFs. Source: DriveNets]

It is important to understand that in such an architecture, the entire NCP and NCF constellation acts as a single (very large) Ethernet entity. A connection from any GPU to any other GPU therefore traverses a single Ethernet hop, whereas in a physically similar Clos architecture, in which all nodes are Ethernet switches, there can be up to five (and, in some cases, seven) Ethernet hops between GPUs.

The internal fabric and the software running on top of it include mechanisms for lossless connectivity, such as virtual output queues (VOQs) held at the ingress port, which prevent HOL blocking, as well as wire-speed failover mechanisms that ensure continuous fabric availability for the AI workloads.
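To see why VOQs matter, here is a minimal sketch of a single, invented scheduling cycle (not any real scheduler): with one ingress FIFO, a single congested output blocks every frame queued behind the head of the line, while per-output VOQs let traffic to uncongested outputs proceed.

```python
from collections import deque

# Frames waiting at one ingress port, as (destination output, frame id) pairs.
frames = [("out1", "A"), ("out2", "B"), ("out2", "C"), ("out3", "D")]
busy_outputs = {"out1"}  # out1 is congested during this scheduling cycle

# Single ingress FIFO: the head frame targets the busy output, so nothing is sent.
fifo = deque(frames)
sent_fifo = []
while fifo and fifo[0][0] not in busy_outputs:
    sent_fifo.append(fifo.popleft())
print("FIFO sent this cycle:", sent_fifo)      # [] -- head-of-line blocking

# VOQs: one queue per destination output; only the queue for out1 has to wait.
voqs: dict[str, deque] = {}
for out, frame in frames:
    voqs.setdefault(out, deque()).append((out, frame))
sent_voq = [voqs[out].popleft() for out in voqs if out not in busy_outputs]
print("VOQ sent this cycle:", sent_voq)        # one frame each toward out2 and out3
```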

This architecture can be viewed as a virtual chassis that scales to tens of thousands of high-speed (up to 800Gbps) Ethernet ports, connecting the entire AI cluster, as illustrated below:

[Figure: the DDC virtual chassis connecting the entire AI cluster. Source: DriveNets]

No doubt, things will further develop and evolve. But DriveNets will most likely evolve with them, as its solutions already demonstrate higher AI performance than other Ethernet solutions, and the company is already working with industry pioneers who are building the largest Ethernet infrastructures.

The editorial staff had no role in this post's creation.