Combating Noisy Neighbors with Scorpio P-Series Fabric Switches

Wesley Yung, Director, Product Line Management

AI server designs face a problem that grows worse as GPU counts scale to meet the demands of AI workloads: noisy neighbors. Scorpio P-Series Fabric Switches, the industry’s first PCIe 6 fabric switches, are architected for mixed-traffic AI head-node connectivity (GPU-to-CPU/NIC/SSD). Let’s take a closer look at the noisy-neighbor problem and how Scorpio P-Series solves it.

The Challenge: Partitioning and “Noisy Neighbors”

PCIe switches provide connectivity between clusters of GPUs and other essential elements in a single server domain. In many cases, data flow within the server is made up of disparate connectivity islands of GPU, CPU, NIC, and storage that share the same underlying switch hardware.

In Figure 1 below, the general-purpose switch is configured into two partitions (synthetic mode and virtual switch partition mode), operating one device as two separate switches from the CPU’s perspective.

Figure 1: Connectivity islands sharing a single PCIe switch

Partitioning the switch this way is a logical separation only: both islands still share a single switch core. One challenge with this architecture is that the shared core is susceptible to packet contention between the two islands, which can degrade performance.

Figure 2: Impact of noisy neighbors between connectivity islands

In Figure 2 above, if the GPU on the left is streaming data from the NIC (blue), traffic from the SSD (orange) may compete for the same switch core arbitration logic. SSD packets already in flight can then delay packets moving between the NIC and the GPU.
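
The contention described above can be illustrated with a toy model. The sketch below is not how any real PCIe switch (Scorpio or otherwise) arbitrates; real devices use credit-based flow control and more elaborate arbitration schemes. It assumes a hypothetical shared arbiter that grants one packet per cycle and alternates round-robin between a NIC queue and an SSD queue. The function name `nic_completion_cycle` and its parameters are invented for illustration.

```python
from collections import deque


def nic_completion_cycle(nic_packets, ssd_packets):
    """Return the cycle on which the last NIC-to-GPU packet is granted.

    Assumed model (illustrative only): a single shared arbiter grants one
    packet per cycle, round-robin across the queues that have packets waiting.
    """
    queues = {
        "nic": deque(range(nic_packets)),  # NIC-to-GPU traffic (blue island)
        "ssd": deque(range(ssd_packets)),  # SSD traffic (orange island)
    }
    order = ["nic", "ssd"]
    turn = 0   # index of the queue that gets first chance this cycle
    cycle = 0

    while queues["nic"]:
        # Grant the next non-empty queue in round-robin order.
        for offset in range(len(order)):
            name = order[(turn + offset) % len(order)]
            if queues[name]:
                queues[name].popleft()
                turn = (turn + offset + 1) % len(order)
                break
        cycle += 1

    return cycle


if __name__ == "__main__":
    quiet = nic_completion_cycle(nic_packets=4, ssd_packets=0)
    noisy = nic_completion_cycle(nic_packets=4, ssd_packets=4)
    print(f"NIC flow alone finishes on cycle {quiet}")
    print(f"NIC flow with SSD neighbor finishes on cycle {noisy}")
```

In this model, four NIC packets complete in 4 cycles when the arbiter is otherwise idle, but take 7 cycles when four SSD packets contend for the same arbiter. The NIC traffic is unchanged; only the neighbor's activity differs.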

The Solution: Purpose-built Fabric Switches for AI Platforms

The same connectivity islands can be separated in hardware by deploying two smaller, purpose-built switches, such as the Scorpio P-Series Fabric Switches in Figure 3 below. From the host’s perspective, this design is functionally equivalent.

Figure 3: Scorpio P-Series enables connectivity islands that protect GPUs from noisy neighbors

In Figure 4 below, Scorpio P-Series Fabric Switches provide the same connectivity islands as a larger monolithic switch. With Scorpio P-Series Fabric Switches, the GPU/CPU/NIC connectivity island is connected through its own dedicated switch core and arbitration logic, which prevents traffic from one island from affecting the other.

Figure 4: Scorpio P-Series Fabric Switches operating independently to enable high bandwidth GPU/CPU/NIC/SSD data

The dedicated hardware gives each connectivity island completely independent traffic flows, delivering unparalleled performance with zero impact from the adjacent noisy neighbor.

To learn more about how to overcome AI server design challenges and deploy Scorpio P-Series Fabric Switch solutions into your platform designs, request the white paper: Migrating AI Server Designs to a Modular Scorpio Architecture.