COSMOS SDK: Accelerating AI Infrastructure Time to Market

Michael Ocampo, Ecosystem Alliance Manager

COSMOS, our COnnectivity System Management and Optimization Software, was built to transform AI infrastructure operations through predictive analytics, proactive failure forecasting, and comprehensive fleet management that reduces downtime and maintenance costs across cloud-scale deployments. The COSMOS SDK, in particular, was created to manage and monitor connectivity across our Aries and Scorpio devices with a unified set of APIs. We’ve heard from our customers that validation bottlenecks are a significant obstacle to reducing time-to-market. With the COSMOS SDK, we aim to help hardware and firmware engineers address these validation challenges while keeping pace with the insatiable demand for more AI infrastructure capacity.

To gain access and begin using the COSMOS SDK across your system management platforms and validation environments, please contact us or reach out to your Astera Labs sales representative.

Addressing Critical Time-to-Market Challenges

As AI infrastructure demands continue to surge, original design manufacturers (ODMs) and original equipment manufacturers (OEMs) face mounting pressure to validate and deploy systems faster than ever. After hardware design-in, the validation stage becomes a bottleneck where electrical testing and protocol debugging can significantly delay time-to-market. Without mechanisms to test security at the device level, each CPU, GPU, or ASIC could be a point of failure, vulnerable to data corruption that compromises the entire system.

The COSMOS SDK directly addresses these pain points by providing user-friendly test and debug capabilities that accelerate validation workflows. The SDK supports device discovery, configuration, security attestation, firmware updates, scripting, and automation, making it a versatile solution for system designers and integrators. In AI infrastructure, where even minor network disruptions can result in significant financial losses, a software-defined approach is crucial for maintaining maximum uptime and system utilization. This programmable, adaptive approach enables real-time responses to performance degradation, automated remediation of connectivity problems, and centralized orchestration across complex distributed systems that would be impossible to manage manually at scale.

Comprehensive Infrastructure Visibility and Control

The COSMOS SDK empowers infrastructure teams to address validation bottlenecks and operational challenges by providing three primary benefits:

1. Identify Performance Bottlenecks Before They Impact Workloads
The COSMOS SDK provides comprehensive link telemetry across PCIe, CXL, and Ethernet connections, enabling teams to identify data center bottlenecks that directly impact AI workload performance. When connectivity degrades, for example when a GPU link falls from PCIe 6.0 to PCIe 5.0, the per-lane transfer rate drops from 64 GT/s to 32 GT/s, so link bandwidth is effectively halved and AI application performance can suffer significantly. The COSMOS SDK enables developers to build dashboards and visualize critical performance metrics, so they can address bottlenecks during hardware design and bring-up rather than discovering them further downstream in the system development process. Additionally, the SDK’s continuous monitoring capabilities extend into production environments, enabling real-time detection of link degradation and performance issues in deployed systems.


An example of Scorpio Smart Fabric Switch monitoring with PCIe 5 & 6 links, powered by COSMOS APIs
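To make the bandwidth arithmetic concrete, here is a minimal Python sketch of how a dashboard might flag links that trained below their maximum speed or width. It does not use the COSMOS SDK API: the `LinkStatus` record and its field names are hypothetical stand-ins for whatever link telemetry your platform exposes. The per-lane transfer rates are the standard PCIe generation rates, and the comparison ignores encoding overhead.

```python
from dataclasses import dataclass

# Raw per-lane transfer rate in GT/s for each PCIe generation.
PCIE_GT_PER_LANE = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0, 6: 64.0}


@dataclass
class LinkStatus:
    """Hypothetical telemetry record; not an actual COSMOS SDK type."""
    name: str
    max_gen: int     # highest PCIe generation the link supports
    max_width: int   # maximum lane count
    cur_gen: int     # negotiated PCIe generation
    cur_width: int   # negotiated lane count


def raw_rate(gen: int, width: int) -> float:
    """Aggregate raw transfer rate in GT/s (encoding overhead ignored)."""
    return PCIE_GT_PER_LANE[gen] * width


def bandwidth_fraction(link: LinkStatus) -> float:
    """Fraction of the link's maximum raw rate that is currently available."""
    return raw_rate(link.cur_gen, link.cur_width) / raw_rate(link.max_gen, link.max_width)


def degraded_links(links: list[LinkStatus]) -> list[LinkStatus]:
    """Return links running below their maximum speed or width."""
    return [link for link in links if bandwidth_fraction(link) < 1.0]


if __name__ == "__main__":
    links = [
        LinkStatus("gpu0", max_gen=6, max_width=16, cur_gen=6, cur_width=16),
        LinkStatus("gpu1", max_gen=6, max_width=16, cur_gen=5, cur_width=16),
    ]
    for link in degraded_links(links):
        print(f"{link.name}: {bandwidth_fraction(link):.0%} of maximum raw rate")
```

In this sketch, a x16 link that falls from PCIe 6.0 to PCIe 5.0 reports half of its maximum raw rate, matching the example above.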

2. Diagnose and Resolve Connectivity Issues Faster
When connectivity issues occur during hardware bring-up, validation, or production deployment, the COSMOS SDK provides the diagnostic tools needed to identify root causes. It delivers detailed error counters and device logs across the physical (PHY), data link, and transaction layers, providing engineers with both quantitative data on specific link issues and qualitative insights into their severity. Reliability, availability, and serviceability (RAS) testing capabilities include error injection, fault detection, and recovery validation, allowing teams to proactively verify system robustness and quickly complete root cause analysis. This multi-layered diagnostic approach allows teams to quickly determine whether AI connectivity is running optimally and provides the logs needed to resolve connectivity failures promptly, minimizing system downtime.


COSMOS SDK’s Pseudo-Random Bit Sequence (PRBS) test checks signal integrity, bit error rate, and overall electrical performance with Aries Smart Retimers and Scorpio Smart Fabric Switches
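For readers unfamiliar with PRBS testing, the idea can be illustrated in a few lines of Python. This is a software sketch only, not how the COSMOS SDK or the devices implement the test. In practice the pattern generator and checker run in hardware at line rate. The sketch assumes the common PRBS7 pattern (polynomial x^7 + x^6 + 1), and the function names are illustrative.

```python
def prbs7(nbits: int, seed: int = 0x02) -> list[int]:
    """Generate nbits of a PRBS7 sequence (x^7 + x^6 + 1) from a 7-bit LFSR."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, or the LFSR stays stuck at 0")
    bits = []
    for _ in range(nbits):
        # Feedback is the XOR of the two highest taps (bits 6 and 5).
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def bit_error_rate(sent: list[int], received: list[int]) -> float:
    """Fraction of positions where the received bit differs from the sent bit."""
    if len(sent) != len(received):
        raise ValueError("sequences must have the same length")
    if not sent:
        raise ValueError("sequences must not be empty")
    errors = sum(s != r for s, r in zip(sent, received))
    return errors / len(sent)


if __name__ == "__main__":
    sent = prbs7(127 * 100)
    received = list(sent)
    # Simulate a noisy channel by flipping three bits.
    for position in (10, 5000, 12000):
        received[position] ^= 1
    print(f"BER: {bit_error_rate(sent, received):.2e}")
```

Because the receiver can regenerate the same known sequence, every mismatch counts as a bit error, which is what makes PRBS patterns a convenient basis for measuring bit error rate on a link.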

3. Streamline Fleet-Wide Management
Once AI and cloud infrastructure has been deployed, the COSMOS SDK empowers platform management tools to act like a central nervous system. Key capabilities include firmware lifecycle management with robust version control, change tracking, and governance. It also enables teams to configure, monitor, and manage AI fabrics with complex topologies. Beyond traditional management functions, the COSMOS SDK enables firmware signing, key provisioning, secure management interfaces, and debugging. This comprehensive approach ensures consistency while reducing operational overhead and minimizing risk across large-scale deployments, whether during firmware updates, link health monitoring, error diagnostics, or topology management. That consistency is critical for maintaining reliability across tens of thousands of interconnected production systems.


COSMOS SDK supports platform APIs to enable rack-scale management of Aries Smart Retimers and Scorpio Smart Fabric Switches
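As a simple illustration of the firmware version tracking described above, the following Python sketch groups devices that are running firmware older than a fleet-wide target. It is a hypothetical example and does not use the COSMOS SDK API: the inventory is assumed to be a plain mapping of device ID to a dotted numeric version string, collected by whatever platform management tool you use.

```python
from collections import defaultdict


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def firmware_drift(inventory: dict[str, str], target: str) -> dict[str, list[str]]:
    """Group device IDs by firmware version for devices older than target."""
    target_version = parse_version(target)
    stale = defaultdict(list)
    for device_id, version in inventory.items():
        if parse_version(version) < target_version:
            stale[version].append(device_id)
    return dict(stale)


if __name__ == "__main__":
    # Hypothetical inventory; device names and versions are made up.
    inventory = {
        "retimer-0": "2.1.0",
        "retimer-1": "2.10.0",
        "switch-0": "2.1.0",
    }
    for version, devices in firmware_drift(inventory, "2.10.0").items():
        print(f"{version}: {', '.join(devices)} need an update")
```

Parsing versions into integer tuples avoids a common pitfall of plain string comparison, which would wrongly rank "2.10.0" below "2.9.0".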

Ready to Transform Your Infrastructure Development?

Leading ODMs and OEMs have already integrated the COSMOS SDK into their baseboard management controller (BMC) systems to enable advanced fleet management, link telemetry, and RAS capabilities. Contact your Astera Labs sales representative to learn more about these ODM and OEM partnerships and compatibility options for your specific infrastructure requirements. A product brief containing the complete feature list and implementation guidance is available in the Astera Resource Center.

About Michael Ocampo, Ecosystem Alliance Manager

Michael is an evangelist for open ecosystems that accelerate hybrid cloud, enterprise, and AI solutions. With over a decade in x86 system integration, IaaS, PaaS, and SaaS, he offers valuable customer insights to cloud and system architects designing high-speed connectivity solutions for AI training, inferencing, cloud, and edge computing infrastructure. At Astera Labs, he leads ecosystem alliances and owns the Cloud-Scale Interop Lab, driving seamless interoperability of hardware and software solutions to optimize the TCO and performance of infrastructure services.
