DriveNets accidentally solved one of the biggest emerging issues in the cloud, unveiling new software that will allow hyperscalers and enterprises alike to connect as many as 32,000 GPUs with good old Ethernet technology to run massive artificial intelligence (AI) workloads. Whoopsie?
Today, AI workloads are typically connected using either NVIDIA’s InfiniBand or Ethernet technology. But there are problems with each.
650 Group co-founder and analyst Alan Weckel told Silverlinings "there's nothing directly wrong with InfiniBand." Indeed, it offers the massive scale needed for AI workloads. But, Weckel said, "there is only one major vendor, NVIDIA, and this creates vendor lock-in."
Meanwhile, Ethernet is prevalent throughout the data center, but the traditional leaf-and-spine architecture it uses wasn't designed to support the massive scale AI workloads will require.
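To put that scale limit in rough numbers: in a classic non-blocking two-tier leaf-and-spine fabric built from radix-R switches, host capacity tops out around R²/2. The sketch below is back-of-envelope math under illustrative assumptions (a non-blocking split, no oversubscription), not any specific vendor's design:

```python
# Rough host capacity of a non-blocking two-tier leaf-spine fabric.
# Illustrative only; real designs add oversubscription or extra tiers.
def two_tier_hosts(radix: int) -> int:
    down = radix // 2   # host-facing ports per leaf (non-blocking split)
    leaves = radix      # a radix-R spine can connect at most R leaves
    return leaves * down

print(two_tier_hosts(64))   # 64-port switches -> 2048 hosts
print(two_tier_hosts(128))  # 128-port switches -> 8192 hosts
```

Even with 128-port switches, a flat two-tier design lands in the high thousands of ports, well short of a 32,000-GPU cluster; scaling further means adding tiers or distributing the fabric.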
That’s where DriveNets’ new Network Cloud-AI solution comes in. The technology is based on the Open Compute Project’s Distributed Disaggregated Chassis (DDC) routing system, which essentially updates the traditional model to create a distributed leaf-and-spine architecture.
Inbar Lasser-Raab, DriveNets Chief Marketing and Product Officer, told Silverlinings the company initially adopted the DDC model to create its Network Cloud software. That software delivers a high-scale routing system used by the likes of AT&T in its core network. As of January, DriveNets said its Network Cloud was carrying just over half of AT&T's core traffic.
Over the past nine months or so, Lasser-Raab said, DriveNets realized the same distributed architecture used for Network Cloud could be applied to data center networking for AI clusters. Hence, Network Cloud-AI was born.
Network Cloud-AI eliminates the packet loss inherent in traditional Ethernet, delivering zero packet loss. It also ensures maximum utilization of GPU resources by distributing traffic equally across the fabric, reducing downtime. Together, these improvements contribute to a 10% to 30% improvement in job completion times, which, when you're talking about super massive AI clusters, can translate to tens or hundreds of millions of dollars in savings. And all of it comes in an open package that allows hyperscalers to mix and match white boxes, network interface cards and ASICs from different manufacturers.
These headline features have already caught the eye of hyperscalers. Lasser-Raab said “all of them” are already in trials, but didn’t name specific companies.
Early trials have focused on groups of several thousand GPUs. Run Almog, Head of Product Strategy at DriveNets, said all the companies he’s spoken with recently about AI clusters are looking to deploy clusters with numbers of nodes in the “high thousands. Not 16[,000], not 20[,000], but 10,000, 8,000 nodes.” He added hyperscalers are ultimately looking to boost that figure as high as 32,000 once Broadcom unveils its new Jericho 3 switch system on chip (SoC).
Almog noted hyperscalers are so eager to deploy the solution that some are even looking at starting deployments using Broadcom's existing Jericho 2c+ SoC.
Asked about deployment timelines, Lasser-Raab said “For the most part, 2024 will be the year of AI infrastructure buildouts.”
DriveNets' Network Cloud-AI launch comes amid increased interest in AI applications, which is expected to drive spending in the coming years. IDC has forecast that global spending on AI software, hardware and services for AI-centric systems will hit $154 billion this year and nearly double to over $300 billion by 2026.
And according to Weckel’s 650 Group, spending on AI networking is expected to rise from $2 billion in 2022 to $10.5 billion in 2027. Of the latter figure, approximately $4 billion will be spent on InfiniBand, with the remaining $6.5 billion going toward some kind of Ethernet solution.
"The $6.5B is the addressable market for DriveNets. It was less than $500M in 2022, with significant growth and an emerging category and a lot of opportunity for new vendors," Weckel told Silverlinings. He noted DriveNets will likely be facing off with switch vendors Arista, Cisco and Juniper, as well as NVIDIA's Ethernet switch ASICs. But he added "there isn't really an incumbent advantage in AI because it is a new and unique network."