A hyperscaler Ethernet switch isn’t just a big router — it’s a physics experiment. Inside that box, terabits are flying through copper traces every second.

The silicon ASICs are tuned to keep GPU clusters fed, and that design philosophy is very different from the enterprise or general-purpose data center switches you’ve seen before.

Switch Pipelines: Formula 1 for Packets

Enterprise switches are built for versatility: ACLs, VXLAN overlays, QoS, telemetry, and security features. Hyperscaler switches don’t care about that breadth. They care about raw forwarding speed and deterministic latency.

  • Latency as low as 250 nanoseconds
  • Lossless Ethernet up to 51.2 Tbps (Broadcom Tomahawk Ultra), scaling toward 102.4 Tbps in the Tomahawk 6 generation
  • Pipeline stages stripped down for consistency and reliability at massive scale
  • Congestion handled with Priority Flow Control (PFC) and ECN, not giant buffers (marking sketch after this list)
  • Rich queue-level isolation and telemetry
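
To make the PFC/ECN point concrete, here's a minimal Python sketch of WRED-style ECN marking against queue depth. The thresholds and the marking curve are illustrative assumptions, not any vendor's actual defaults.

```python
import random

# Illustrative thresholds, in cells of on-chip buffer. Real values are
# per-queue tuning knobs set by the operator, not these made-up numbers.
K_MIN = 200    # below this queue depth: never mark
K_MAX = 1000   # above this queue depth: always mark
P_MAX = 0.2    # marking probability as depth approaches K_MAX

def ecn_mark(queue_depth_cells: int) -> bool:
    """WRED-style ECN marking: set the CE bit with a probability that ramps
    with queue depth, so RoCEv2 senders back off before the queue overflows
    and PFC has to pause the whole link."""
    if queue_depth_cells <= K_MIN:
        return False
    if queue_depth_cells >= K_MAX:
        return True
    p = P_MAX * (queue_depth_cells - K_MIN) / (K_MAX - K_MIN)
    return random.random() < p

# Marking probability rises as the queue fills:
for depth in (100, 400, 800, 1200):
    marked = sum(ecn_mark(depth) for _ in range(10_000))
    print(f"depth={depth:4d} cells -> marked {marked / 10_000:.1%} of packets")
```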

Packet Buffers: Small by Design

In an AI training cluster, GPUs generate dense east-west traffic patterns. Deep buffers create tail latency — the enemy of RoCEv2 traffic. Hyperscaler ASICs rely on on-chip buffers that are shallow compared to enterprise ASICs, combined with congestion signaling (a sizing sketch follows the list below).

  • Enough SRAM to handle bursts at 100+ Tbps
  • Lowest possible tail latency for RDMA over Converged Ethernet (RoCEv2)
  • Avoiding external DRAM — latency is too high, power is too costly
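
A quick back-of-the-envelope shows why "enough SRAM" still means only megabytes. The sketch below assumes a 51.2 Tbps chip, a 5-microsecond window before ECN/PFC feedback bites, and a quarter of the traffic caught in an incast burst; all three numbers are illustrative, not a datasheet.

```python
# Back-of-the-envelope buffer sizing for a shallow-buffer ASIC.
# All inputs are illustrative assumptions.

chip_throughput_tbps = 51.2    # aggregate forwarding capacity
reaction_window_us   = 5.0     # assumed time for ECN/PFC feedback to take effect
burst_fraction       = 0.25    # assumed share of traffic in an incast burst

bits_in_flight = (chip_throughput_tbps * 1e12) * (reaction_window_us * 1e-6)
buffer_bytes   = bits_in_flight * burst_fraction / 8

print(f"Burst to absorb: {buffer_bytes / 1e6:.0f} MB of on-chip SRAM")
# ~8 MB under these assumptions: squarely in the megabyte range of on-chip
# buffers, and nowhere near the gigabytes of external DRAM a deep-buffer
# design carries. That's the point -- signal congestion early, don't store it.
```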

SerDes & I/O Interfaces: The Arteries

The real scaling bottleneck isn’t the forwarding logic — it’s the SerDes.

  • Generational jumps: 56G → 112G → 224G PAM4
  • Total throughput = lane count × lane speed (worked example below)
  • Packaging and SerDes density drive the form factor wars: QSFP-DD, OSFP, and now co-packaged optics
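
Here's the lane-count-times-lane-speed arithmetic worked out in a few lines of Python. The lane counts are representative of recent 25.6T/51.2T/102.4T designs rather than any single chip's exact configuration.

```python
# Aggregate switch throughput = SerDes lane count x per-lane payload rate.
# Lane counts are representative, not a statement about one specific chip.

generations = [
    # (label, lanes, payload Gb/s per lane)
    ("56G-class SerDes era",  512,  50),   # 25.6 Tbps
    ("112G-class SerDes era", 512, 100),   # 51.2 Tbps
    ("224G-class SerDes era", 512, 200),   # 102.4 Tbps
]

for label, lanes, gbps in generations:
    total_tbps = lanes * gbps / 1000
    ports_800g = lanes * gbps // 800       # how many 800G front-panel ports that feeds
    print(f"{label:22s}: {lanes} x {gbps}G = {total_tbps:5.1f} Tbps "
          f"(~{ports_800g} x 800G ports)")
```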

PHYs: The Silent Workhorses

Hyperscaler PHYs aren’t just signal translators — they’re full DSP engines.

  • Today: 112G-class SerDes running 106.25 Gb/s PAM4 lanes, i.e. 100G of payload plus encoding and FEC overhead (calculation below)
  • Next: 224G PAM4
  • Heavy FEC + DSP logic to push 400G/800G over lossy, hyperscale distances
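
Where does 106.25G come from? It's the 100G payload rate multiplied by the standard Ethernet encoding and RS(544,514) FEC factors, as the short calculation below shows.

```python
# How 100G of Ethernet payload becomes a 106.25G PAM4 lane on the wire.
# These are the standard 802.3 encoding/FEC factors, not chip-specific values.

mac_rate_gbps = 100.0

pcs_64b66b = 66 / 64     # 64b/66b block encoding
transcode  = 257 / 264   # four 66b blocks transcoded into one 257b block
rs_fec     = 544 / 514   # RS(544,514) "KP4" forward error correction

line_rate = mac_rate_gbps * pcs_64b66b * transcode * rs_fec
baud_rate = line_rate / 2  # PAM4 carries 2 bits per symbol

print(f"Line rate : {line_rate:.2f} Gb/s per lane")   # 106.25
print(f"Baud rate : {baud_rate:.3f} GBd PAM4")        # 53.125
print(f"Overhead  : {(line_rate / mac_rate_gbps - 1) * 100:.2f}%")
```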

Optics: The Lifeline

Enterprise switches are happy with plug-and-play QSFP modules. Hyperscaler switches are built around optics.

  • OSFP and QSFP-DD for 400G/800G
  • Co-packaged optics = the optical engines moved onto the switch package, right next to the ASIC (the lasers themselves are often external, replaceable modules)
  • Thermal design is increasingly dictated by optics, not just silicon

Power & Thermal: Feeding the Beast

  • High-radix hyperscaler ASICs are power-hungry: hundreds of watts for the switch silicon alone
  • Optics burn 12–16 W each (for 800G OSFP). With 32–64 ports, optics can exceed the ASIC’s power draw (budget sketch below)
  • Cooling becomes a rack-level design problem, not just a box-level one
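
Here's that budget sketch. The ASIC figure is an assumed round number for a 51.2T-class device, not a datasheet value; the optics range is the 12–16 W per 800G OSFP quoted above.

```python
# Box-level power budget sketch. The ASIC wattage is an assumed round figure
# for a 51.2T-class device; the optics range matches the text above.

asic_watts = 500                     # assumed, order of magnitude only
ports      = 64                      # 64 x 800G = 51.2 Tbps of front panel
optic_watts_low, optic_watts_high = 12, 16

optics_low  = ports * optic_watts_low
optics_high = ports * optic_watts_high

print(f"ASIC   : ~{asic_watts} W (assumed)")
print(f"Optics : {optics_low}-{optics_high} W for {ports} x 800G OSFP")
print(f"Optics / ASIC ratio: {optics_low / asic_watts:.1f}x to {optics_high / asic_watts:.1f}x")
# Under these assumptions the pluggables draw 1.5x-2x the ASIC's power,
# which is why cooling becomes a rack-level problem.
```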

Memory Subsystems: SRAM on a Starvation Diet

  • On-chip SRAM is king: buffers, queues, and lookup tables
  • No external DRAM to avoid latency penalties
  • ECN-driven lossless queues keep RoCEv2 traffic flowing without massive packet stores
  • Content-Addressable Memories (CAM/TCAM): used for forwarding lookups, ACLs, and QoS classification (toy model after this list)
  • In hyperscaler silicon, TCAM depth can be limited to save power and die area, a sharp contrast with enterprise switches that prioritize deep, feature-rich ACL/policy tables
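
To see what a TCAM actually does, here's a tiny software model of a ternary match: each entry is a (value, mask, action) triple, "don't care" bits have a zero mask bit, and priority is entry order. Real TCAMs evaluate every entry in parallel in a single clock cycle; the Python below only captures the semantics.

```python
# Minimal software model of a TCAM (ternary CAM) lookup.
# An entry matches a key when (key & mask) == (value & mask); zero mask bits
# are "don't care". Hardware searches all rows in parallel; lowest index wins.

from typing import Optional

Entry = tuple[int, int, str]  # (value, mask, action)

tcam: list[Entry] = [
    # Illustrative 32-bit keys (think destination IPv4 addresses).
    # Longest prefixes first, because priority is entry order.
    (0x0A010000, 0xFFFF0000, "route-to-leaf-1"),  # 10.1.0.0/16
    (0x0A000000, 0xFF000000, "route-to-spine"),   # 10.0.0.0/8
    (0x00000000, 0x00000000, "default-drop"),     # match-all fallback
]

def tcam_lookup(key: int) -> Optional[str]:
    for value, mask, action in tcam:   # hardware does this in parallel
        if key & mask == value & mask:
            return action
    return None

print(tcam_lookup(0x0A010203))  # 10.1.2.3    -> route-to-leaf-1
print(tcam_lookup(0x0A630001))  # 10.99.0.1   -> route-to-spine
print(tcam_lookup(0xC0A80001))  # 192.168.0.1 -> default-drop
```

Every row of that table costs power and die area on every lookup, which is exactly why hyperscaler silicon keeps TCAM depth on a short leash.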

Telemetry & Observability: Every Packet Has a Trail

Hyperscaler ASICs are designed to expose what’s happening at nanosecond scale.

  • Queue- and port-level counters
  • ECN marking visibility
  • In-band telemetry to catch micro-congestion before it cascades into GPU job slowdowns (counter sketch below)
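
What does that visibility look like to an operator? A hypothetical sketch: take two snapshots of per-queue counters and turn them into an ECN mark rate. The counter field names here are invented for illustration; in practice they would come from the ASIC vendor's SDK or a gNMI/streaming-telemetry feed.

```python
# Hypothetical per-queue congestion counters. Field names are invented for
# illustration and do not correspond to any specific vendor SDK.

def congestion_signals(before: dict, after: dict) -> dict:
    """Turn two counter snapshots (taken an interval apart) into rates."""
    tx    = after["tx_packets"] - before["tx_packets"]
    marks = after["ecn_marked"] - before["ecn_marked"]
    pause = after["pfc_pause_rx"] - before["pfc_pause_rx"]
    return {
        "ecn_mark_rate": marks / tx if tx else 0.0,
        "pfc_pause_frames": pause,
    }

# Example snapshots for one RoCEv2 queue, one second apart (made-up numbers).
t0 = {"tx_packets": 10_000_000, "ecn_marked": 1_000,  "pfc_pause_rx": 0}
t1 = {"tx_packets": 14_000_000, "ecn_marked": 61_000, "pfc_pause_rx": 2}

sig = congestion_signals(t0, t1)
print(sig)  # {'ecn_mark_rate': 0.015, 'pfc_pause_frames': 2}

# A rising ECN mark rate is the early warning: the fabric is absorbing
# micro-congestion before it turns into PFC storms or visible GPU job slowdowns.
if sig["ecn_mark_rate"] > 0.01:
    print("congestion building: check for incast or ECMP imbalance")
```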

Topology: Islands and Highways

A single ASIC, no matter how powerful, doesn’t make a hyperscale cluster. The topology stitches thousands together.

  • Hyperscalers use Clos/leaf-spine fabrics: predictable latency, modular growth (sizing sketch after this list)
  • Inside a GPU training island, you need ultra-low latency and lossless transport
  • Across data center fabrics, you need scale, multi-tenancy, and operational simplicity
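
The "modular growth" claim is easy to quantify. Below is a sizing sketch for a two-tier leaf-spine fabric at 1:1 oversubscription, assuming identical switch radix everywhere and one uplink from every leaf to every spine; it is the textbook model, not any operator's actual build.

```python
# Two-tier leaf-spine (folded Clos) sizing at 1:1 oversubscription.
# Assumes every switch has the same radix and each leaf has exactly one
# uplink to every spine -- an illustrative model.

def leaf_spine_capacity(radix: int, port_gbps: int) -> dict:
    downlinks_per_leaf = radix // 2          # half the ports face endpoints
    uplinks_per_leaf   = radix - downlinks_per_leaf
    max_spines         = uplinks_per_leaf    # one uplink per spine
    max_leaves         = radix               # each spine port feeds one leaf
    endpoints          = max_leaves * downlinks_per_leaf
    return {
        "spines": max_spines,
        "leaves": max_leaves,
        "endpoints": endpoints,
        "bisection_tbps": endpoints * port_gbps / 2000,
    }

# Example: 64 x 800G switches (a 51.2T-class box)
print(leaf_spine_capacity(radix=64, port_gbps=800))
# -> 32 spines, 64 leaves, 2048 endpoints at 800G, ~819 Tbps of bisection.
# Need more endpoints? Add a third tier (pods joined by a super-spine)
# rather than a bigger box -- that's the modular growth.
```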

Those two sets of requirements are why Ethernet and InfiniBand coexist today.

  • NVSwitch/InfiniBand handle local GPU-to-GPU workloads
  • Ethernet (Tomahawk Ultra, etc.) scales across racks, pods, and regions
  • The real challenge: making these two worlds interoperate without breaking AI training jobs or operational tooling

The Other Half of the Puzzle

This isn’t the full blueprint of a hyperscaler ASIC. Some key topics I haven’t unpacked here:

  • Programmability: P4 vs fixed-function (Tofino vs Tomahawk)
  • Security & isolation: ACLs, segmentation, filtering stripped down or bolted on
  • Pipeline depth tuning
  • and a lot more…

Put it all together and hyperscaler switches become a balancing act between Ethernet scale and InfiniBand latency.

If you were designing the next hyperscaler switch — would you bet on Ethernet, InfiniBand, or something new?

Hemadri Mhaskar
