Not Your Standard Ethernet: The Industry-Wide Revolution to Dethrone InfiniBand

Standard Ethernet can't handle AI. Early attempts were brittle patches on a flawed foundation. Now, a massive industry alliance is re-architecting it from the ground up. This is the story of the Ultra Ethernet Consortium's revolution.

In our last post, we explored the reign of InfiniBand, the undisputed king of high-performance networking. We saw how its lossless design, native RDMA, and in-network computing capabilities made it the go-to choice for building AI superclusters. But we also saw the high walls of its garden—a proprietary, single-vendor ecosystem that presents a massive strategic risk and economic burden for the industry's largest players.

The market has been crying out for an alternative. An open, cost-effective, multi-vendor fabric with the performance to rival the king.

The obvious candidate? Ethernet. It's everywhere. It's cheap. It's the lingua franca of the digital world. But standard, off-the-shelf Ethernet is fundamentally unfit for the brutal demands of large-scale AI. This isn't a story about plugging in a different cable. This is the story of a great transformation—of the entire industry coming together to fundamentally re-architect Ethernet for the AI era.

ℹ️ This post is part of a series. You can see the complete series here.

The First Attempt: The Promise and Peril of RoCE

The first major effort to bring high performance to the Ethernet world was a clever idea called RoCE (RDMA over Converged Ethernet). The premise was simple and seductive: let's get the low-latency benefits of RDMA, but run it over the massive, familiar, and economically efficient Ethernet ecosystem.

On paper, it was the best of both worlds. And in small, controlled environments, it worked! RoCE could deliver dramatic performance improvements over standard TCP/IP networking. Many cloud providers and enterprises adopted it as a "good enough" solution.

But a fundamental flaw lurked beneath the surface. RoCE takes a transport that expects a lossless fabric (RDMA, as inherited from InfiniBand) and runs it on top of a network that, by design, drops packets under congestion (Ethernet). To make this work, engineers had to bolt on a series of complex mechanisms to prevent the packet loss that would cripple RDMA performance.

The two primary tools were:

  1. Priority Flow Control (PFC): A mechanism that lets a switch send a per-priority "PAUSE" frame to its upstream neighbor when its buffers fill up, stopping that traffic class before packets are dropped.
  2. Explicit Congestion Notification (ECN): A way for switches to mark packets to signal impending congestion. In RoCE deployments the marks are typically acted on by a congestion control algorithm called DCQCN, which throttles the sender's rate.
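To make the division of labor between the switch and the sender concrete, here is a minimal Python sketch of the ECN side. It is a simplification under stated assumptions: the function names, the queue thresholds, and the gain g are illustrative values rather than anything from a vendor configuration, and the rate-recovery stages of DCQCN are left out entirely.

```python
def ecn_mark_probability(queue_kb, k_min=100, k_max=400, p_max=0.2):
    """Switch side: RED-style marking curve (thresholds are illustrative).

    Below k_min nothing is marked, above k_max everything is marked,
    and in between the probability ramps linearly up to p_max.
    """
    if queue_kb <= k_min:
        return 0.0
    if queue_kb >= k_max:
        return 1.0
    return p_max * (queue_kb - k_min) / (k_max - k_min)


def sender_react(rate_gbps, alpha, congestion_notified, g=1 / 16):
    """Sender side: simplified DCQCN-style reaction, rate recovery omitted.

    alpha is the sender's running estimate of how congested the path is.
    """
    if congestion_notified:
        # Cut the rate in proportion to the congestion estimate,
        # then raise the estimate.
        rate_gbps = rate_gbps * (1 - alpha / 2)
        alpha = (1 - g) * alpha + g
    else:
        # No notification in this interval: let the estimate decay.
        alpha = (1 - g) * alpha
    return rate_gbps, alpha
```

Even in this stripped-down form there are four knobs (k_min, k_max, p_max, g) that interact with buffer sizes and link speeds, which hints at why tuning becomes hard at scale.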

The problem is that these are patches, not native features. At scale, they started to break down in spectacular fashion. PFC, the pause mechanism, turned out to be a blunt instrument. A single congested link could trigger a pause, which could then cascade backward through the network, creating a "PFC storm" that brought huge sections of the fabric to a grinding halt.
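The backward cascade is easy to reproduce in a toy model. The sketch below assumes a single chain of switches feeding one slow receiver, with buffers counted in packets and made-up XOFF/XON thresholds. Real PFC works per priority class on byte thresholds, so treat this as an illustration of the dynamic rather than a model of any real switch.

```python
def simulate_pfc_chain(num_switches=4, xoff=8, xon=4, inject=2, drain=1, ticks=40):
    """Toy PFC model: source -> sw0 -> sw1 -> ... -> last switch -> slow receiver.

    Returns, for each tick, the list of switches that are currently
    pausing their upstream neighbor. All parameters are hypothetical.
    """
    queues = [0] * num_switches
    pausing = [False] * num_switches  # pausing[i]: switch i has sent PAUSE upstream
    history = []
    for _ in range(ticks):
        # The last hop drains slowly into the congested receiver.
        queues[-1] -= min(queues[-1], drain)
        # Forward hop by hop, downstream first, unless the next hop said PAUSE.
        for i in range(num_switches - 2, -1, -1):
            if not pausing[i + 1]:
                moved = min(queues[i], inject)
                queues[i] -= moved
                queues[i + 1] += moved
        # The source keeps injecting unless the first switch has paused it.
        if not pausing[0]:
            queues[0] += inject
        # Each switch asserts PAUSE at xoff and releases it at xon.
        for i in range(num_switches):
            if queues[i] >= xoff:
                pausing[i] = True
            elif queues[i] <= xon:
                pausing[i] = False
        history.append([i for i in range(num_switches) if pausing[i]])
    return history


for tick, paused in enumerate(simulate_pfc_chain(), start=1):
    if paused:
        print(f"tick {tick:2d}: switches pausing upstream = {paused}")
```

The first pause appears at the switch next to the slow receiver, then works its way upstream hop by hop until the source itself is stalled. In a real fabric, every other flow sharing those paused links and that priority class is stalled along with it.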

Tuning the DCQCN algorithm proved to be a black art. In a widely cited 2024 paper, engineers from Meta detailed their multi-year effort to get RoCE to work reliably in their AI clusters. They described substantial engineering work on routing, load balancing, and congestion control, including how hard it was to tune DCQCN for training traffic. The lesson many in the industry drew was that RoCE can be made to work at hyperscale, but only with heavy co-design and operational effort, which makes it a shaky foundation for the long term.

RoCE was a necessary and important evolutionary step. It taught the industry valuable lessons. But its shortcomings made one thing clear: to truly challenge InfiniBand, Ethernet didn't need a patch; it needed a revolution.

A Revolution, Not an Evolution: The Ultra Ethernet Consortium

In response to RoCE's limitations and the strategic imperative to create an open alternative to InfiniBand, the Ultra Ethernet Consortium (UEC) was formed in 2023.

This is not just another standards body. The UEC represents an unprecedented industry-wide collaboration. It brings together fierce competitors under a single banner, all united by a common goal. The founding members are a who's who of the tech world:

  • Hyperscalers: Meta, Microsoft
  • Chip Giants: AMD, Intel, Broadcom
  • Systems Vendors: Arista, Cisco, Eviden (an Atos business), HPE

Their mission is not to create a "better RoCE." It is to define a new, open, and interoperable communication stack that is purpose-built for AI and HPC from the ground up.

The centerpiece of this revolution is the Ultra Ethernet Transport (UET) protocol. UET is a clean-slate redesign of the transport layer that fundamentally changes how Ethernet behaves. It's designed to natively solve the core problems of AI workloads—congestion, tail latency, and load balancing—that RoCE could only patch over.

UET introduces several groundbreaking concepts:

  • Advanced Multi-pathing: Traditional Ethernet fabrics balance load per flow (typically with ECMP hashing), which pins every packet of a flow to one path and lets a handful of large flows collide on the same link. UET instead "sprays" the packets of a single message across all available network paths, so the entire fabric is used far more evenly.
  • Flexible Delivery Order: It decouples packet order from message order. Packets can arrive at the destination out of order, and the NIC places each one at its correct position in the message. This removes the strict in-order delivery requirement behind the head-of-line blocking that plagues RoCE.
  • Modern Congestion Control: It uses a new congestion control mechanism designed to react to network conditions on microsecond rather than millisecond timescales, with the aim of heading off traffic jams before queues build up.
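The difference between per-flow hashing and spraying shows up even in a few lines of Python. In this sketch the CRC32 hash, the flow names, and the round-robin spraying policy are all stand-ins chosen for illustration. Real switches hash on packet header fields, and UET's actual path selection is more sophisticated than round-robin.

```python
import zlib


def ecmp_path(flow_id, num_paths):
    """Per-flow ECMP stand-in: every packet of a flow hashes to the same path."""
    return zlib.crc32(flow_id.encode()) % num_paths


def load_per_flow(flows, num_paths):
    """Packets per path when each flow is pinned to one hashed path."""
    load = [0] * num_paths
    for flow_id, packets in flows.items():
        load[ecmp_path(flow_id, num_paths)] += packets
    return load


def load_sprayed(flows, num_paths):
    """Packets per path when each flow's packets rotate across all paths."""
    load = [0] * num_paths
    for flow_id, packets in flows.items():
        start = ecmp_path(flow_id, num_paths)
        for seq in range(packets):
            load[(start + seq) % num_paths] += 1
    return load


# Four large, long-lived flows: the typical shape of AI training traffic.
flows = {f"gpu{i}->gpu{i + 4}": 1000 for i in range(4)}
print("per-flow:", load_per_flow(flows, num_paths=4))
print("sprayed: ", load_sprayed(flows, num_paths=4))
```

With only a few large flows, per-flow hashing can easily put two of them on the same path while another path sits idle. Spraying gives every path the same share regardless of how the hash falls, at the cost of packets arriving out of order.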

This is a paradigm shift. It moves RDMA over Ethernet from a connection-oriented, in-order transport to a message-oriented one that tolerates out-of-order delivery, which is exactly what AI workloads demand.
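One way to picture message-oriented, out-of-order delivery: if every packet carries its byte offset within the message, the receiver can write it straight into its final position and only cares about when the whole message is complete. The class below is a hypothetical sketch of that idea, not the UET wire format or any NIC's API.

```python
class MessageAssembler:
    """Places packets by offset, so arrival order does not matter."""

    def __init__(self, total_len):
        self.buffer = bytearray(total_len)
        self.missing = total_len
        self.seen = set()

    def on_packet(self, offset, payload):
        """Store one packet; return True once the whole message has arrived."""
        if offset not in self.seen:  # ignore duplicates
            self.seen.add(offset)
            self.buffer[offset:offset + len(payload)] = payload
            self.missing -= len(payload)
        return self.missing == 0


message = b"gradient shard 0017"
chunks = [(off, message[off:off + 4]) for off in range(0, len(message), 4)]

assembler = MessageAssembler(len(message))
for offset, payload in reversed(chunks):  # worst case: fully reversed arrival
    done = assembler.on_packet(offset, payload)

print(done, bytes(assembler.buffer) == message)
```

No packet ever waits for an earlier one to show up. An in-order transport, by contrast, would have to hold back or discard everything that arrived ahead of the first missing packet.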

The industry has placed its bet. The consensus is clear: the future of AI networking will be built on an open standard. The Great Transformation of Ethernet has begun.

Next up: Packet Spraying & Flexible Order: The UEC's Secret Weapons for AI Networking. We'll take a deeper technical dive into exactly how the UEC's innovations are designed to close the performance gap with InfiniBand for good.
