Today's data centers house large numbers of servers, and each server contains an abundance of storage, specialized accelerators, and networking and communications infrastructure. Together these amount to tens of thousands of interconnected systems, and with the rise of hyperscalers and cloud service providers, the scale of data infrastructure is expected to keep growing in the years to come.
To get the most performance and uptime out of their data centers, Astera Labs’ customers are deploying its Aries Smart Retimers, which support both PCIe® 4.0 and 5.0, and effectively remove data center bottlenecks by extending physical reach by up to 3x with <10ns latency. Aries is also ideal for Compute Express Link™ (CXL™) applications where latency performance is even more critical.
These customers must also deploy robust fleet management capabilities to optimize data center performance and total infrastructure uptime – ensuring that all servers are running at peak level while predicting potential points of failure.
Going beyond channel reach extension, Astera Labs’ Aries Smart Retimers deliver deep diagnostics, enabling a powerful array of link health monitoring tools for data center server fleet management. The Aries Software Development Kit (SDK), deployed on the baseboard management controller (BMC), enables large-scale monitoring and resource optimization, allowing customers to gain detailed analytics from thousands of datapoints on how their links are performing in real time.
With Aries PCIe Smart Retimers deployed on servers, storage systems, accelerator trays, and other equipment, BMC applications can use real-time link health monitoring to inform resource allocation decisions during provisioning. As you can see in the example below, using data gathered in real time by the Aries SDK, customers can monitor link health and predict failure, enabling them to make key decisions toward maximizing their PCIe bandwidth.
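To make the idea of predicting failure from link telemetry more concrete, here is a minimal Python sketch. It does not use the Aries SDK: the `LinkSample` type, the `error_count` field, and the 10-errors-per-hour threshold are all hypothetical, chosen only to illustrate how a BMC application might turn periodic samples into an at-risk signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkSample:
    """One hypothetical telemetry sample for a PCIe link (not an Aries SDK type)."""
    timestamp_s: float  # sample time in seconds
    error_count: int    # cumulative link error count, assumed non-decreasing


def error_rate_per_hour(samples: List[LinkSample]) -> float:
    """Average error growth per hour between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    elapsed_s = last.timestamp_s - first.timestamp_s
    if elapsed_s <= 0:
        return 0.0
    return (last.error_count - first.error_count) * 3600.0 / elapsed_s


def is_at_risk(samples: List[LinkSample], max_errors_per_hour: float = 10.0) -> bool:
    """Flag a link whose error rate exceeds an assumed, illustrative threshold."""
    return error_rate_per_hour(samples) > max_errors_per_hour


# Example: 20 new errors in 30 minutes works out to 40 errors per hour.
noisy_link = [LinkSample(0.0, 0), LinkSample(1800.0, 20)]
quiet_link = [LinkSample(0.0, 5), LinkSample(3600.0, 6)]
```

In practice the threshold and the metrics sampled would depend on the platform and on what the monitoring software actually exposes; a simple rate check like this is only one of many possible heuristics.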
The Aries SDK also enables critical capabilities beyond fleet management. Customers can use the same software for automated validation, requiring less customer time on the bench. We’ll cover this topic in more detail in a future blog.
Any component placed in a critical PCIe 4.0/5.0 data path must deliver robust performance that can be monitored and recovered. When operating a large data center with tens of thousands of servers, storage boxes, and GPU/accelerator trays, it is imperative to know which resources are in a healthy state (and can therefore be dynamically assigned to customer workloads) and which resources require maintenance. This capability is critical to infrastructure efficiency and directly affects the make-or-break Total Cost of Ownership (TCO) calculation in data center operations.
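The split between assignable resources and resources needing maintenance can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Astera Labs' implementation: the `LinkState` values, the resource names, and the rule that a resource is assignable only when every one of its links reports healthy are all hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple


class LinkState(Enum):
    """Hypothetical per-link health states; not taken from any SDK."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def partition_resources(
    fleet: Dict[str, List[LinkState]],
) -> Tuple[List[str], List[str]]:
    """Split resources into (assignable, needs_maintenance).

    Assumed policy: a resource is assignable only if it reports at least
    one link and every reported link is HEALTHY.
    """
    assignable: List[str] = []
    needs_maintenance: List[str] = []
    for name, links in sorted(fleet.items()):
        if links and all(state is LinkState.HEALTHY for state in links):
            assignable.append(name)
        else:
            needs_maintenance.append(name)
    return assignable, needs_maintenance


# Illustrative fleet snapshot with made-up resource names.
fleet = {
    "server-01": [LinkState.HEALTHY, LinkState.HEALTHY],
    "gpu-tray-02": [LinkState.HEALTHY, LinkState.DEGRADED],
    "storage-03": [],  # no telemetry reported, treated as not assignable
}
assignable, needs_maintenance = partition_resources(fleet)
```

Treating a resource with no telemetry as unassignable is a conservative design choice for this sketch; an operator could just as reasonably route such resources to a separate "unknown" pool.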
Data collection is key to solving this resource optimization challenge. Request the Fleet Management Made Easy white paper to learn how the Aries SDK plays a key role in monitoring the critical PCIe infrastructure.