Unlocking Cloud Server Performance with CXL

Ahmad Danesh, Sr. Director, Product Management
Sandeep Dattaprasad, Sr. Product Manager

Our personal and professional lives are centered on creating and consuming information, and our expectations for real-time access to data continue to rise. We rely on services daily, from social media and online shopping to search engines and rideshares, and these services are made possible by data processed on large-scale cloud servers. With the amount of data created and consumed growing significantly every year – projected to reach 181 zettabytes (1 ZB = 1 trillion gigabytes) by 2025 (Figure 1) – Cloud Service Providers (CSPs) need to develop new architectures to store, analyze and deliver this growing volume of data.

Figure 1: Volume of data expected to be created and used through 2025 (Source: Statista, Worldwide Data Created, Aug 2022)

Removing Memory Bottlenecks to Process Data at Scale

Behind the applications we use daily are cloud servers that run Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) models. ResNet, BERT, GPT-3 and other models require hundreds of gigabytes of DRAM to store and compute hundreds of millions to billions of parameters for training and inference. The challenge CSPs face today is how to process the increasing volume and complexity of these models, which has been outpacing Moore’s Law and doubling every 3.4 months over the last 10 years (Figure 2).

Figure 2: ML complexity growing exponentially in the modern era, far outpacing Moore’s Law

Processing these large and complex datasets requires high-performance processors (CPUs, GPUs, and accelerators) with access to high-bandwidth, large-capacity memory. Server processors, including Intel® Xeon® and AMD EPYC™, continue to deliver higher core counts to address the need for more processing power. However, physical limitations have prevented memory from scaling with the increasing core counts: each DDR4 or DDR5 RDIMM has 288 pins, which limits the number of DDR channels that can be supported on a CPU package. The result is declining memory bandwidth per core (Figure 3) and lower system performance, which is especially damaging for AI/ML workloads; a back-of-the-envelope illustration follows Figure 3.

Figure 3: Increasing core counts with declining memory bandwidth per core (Source: Software Defined Memory: A Meta perspective, OCP Global Summit, 2021)
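To make the trend in Figure 3 concrete, here is a rough back-of-the-envelope sketch in Python of memory bandwidth per core when core counts grow faster than DDR channel counts. The core counts, channel counts, and DDR speeds below are illustrative assumptions, not the specifications of any particular CPU.

```python
# Illustrative back-of-the-envelope: memory bandwidth per core as core counts
# outpace the number of DDR channels a CPU package can support.
# All figures below are assumptions for illustration, not product specifications.

def channel_bw_gbs(transfer_rate_mts: int, bus_width_bits: int = 64) -> float:
    """Peak bandwidth of one DDR channel in GB/s."""
    return transfer_rate_mts * (bus_width_bits / 8) / 1000

# (name, cores, DDR channels, DDR transfer rate in MT/s) for hypothetical CPU generations
generations = [
    ("Gen A", 32, 8, 3200),    # assumed: 8 DDR4 channels
    ("Gen B", 64, 8, 3200),    # assumed: cores double, channel count does not
    ("Gen C", 128, 12, 4800),  # assumed: DDR5 speed bump only partially offsets core growth
]

for name, cores, channels, rate in generations:
    total_bw = channels * channel_bw_gbs(rate)
    print(f"{name}: {total_bw:.0f} GB/s total, {total_bw / cores:.2f} GB/s per core")
```

Even with a DDR5 speed increase in the last assumed generation, bandwidth per core remains well below the first generation because core counts grow much faster than channel counts.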

The industry has been working hard on a solution to increase memory bandwidth and unlock the performance required for next-generation data centers. Compute Express Link™ (CXL™) is an open standard developed to provide a high-speed, low-latency, cache-coherent interconnect for processors, accelerators, and memory expansion. Other standards, such as Gen-Z and OpenCAPI/OMI, enabled similar solutions but did not gain the same industry traction as CXL. More recently, it was announced that the assets and specifications of Gen-Z and OpenCAPI/OMI will be transferred to the CXL Consortium – making CXL the industry-adopted solution for serial-attached memory expansion and cache coherency in the data center.

When deploying servers at scale, it is critical to find efficiencies that deliver the required performance for different applications while reducing Total Cost of Ownership (TCO). In today’s server architectures, memory is directly connected to a CPU or GPU and cannot be shared with other CPUs or GPUs. In a common scenario referred to as memory stranding, all of a CPU’s cores are in use while memory attached to that CPU sits idle. With DRAM accounting for up to 50% of server cost and up to 25% of memory stranded at any given time, new heterogeneous architectures enabled by CXL can increase memory utilization and decrease TCO. These architectures can also improve system performance and dataflow efficiency between processors by sharing memory resources over the low-latency CXL interface, avoiding copying data back and forth over high-latency interconnects.
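As an illustration of how much value memory stranding can tie up, the sketch below applies the figures cited above (DRAM up to ~50% of server cost, up to ~25% of memory stranded) to a hypothetical fleet. The fleet size, per-server cost, and pooling recovery rate are assumptions chosen only to show the arithmetic.

```python
# Illustrative estimate of how much server spend is tied up in stranded DRAM,
# using the figures cited above (DRAM ~50% of server cost, ~25% of memory stranded).
# Fleet size, server cost, and pooling recovery rate are assumptions.

servers = 10_000                 # hypothetical fleet size
server_cost = 20_000             # assumed $ per server
dram_cost_fraction = 0.50        # DRAM up to ~50% of server cost
stranded_fraction = 0.25         # up to ~25% of memory stranded
pooling_recovery = 0.80          # assume pooling reclaims 80% of stranded memory

dram_spend = servers * server_cost * dram_cost_fraction
stranded_spend = dram_spend * stranded_fraction
recovered = stranded_spend * pooling_recovery

print(f"DRAM spend:            ${dram_spend / 1e6:.1f}M")
print(f"Stranded DRAM value:   ${stranded_spend / 1e6:.1f}M")
print(f"Recoverable via pool:  ${recovered / 1e6:.1f}M")
```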

What To Look for in a CXL Memory Controller

In today’s server architectures, multi-socket CPUs are often used simply to expand memory, which is an expensive solution when the additional CPU cores are not needed. CXL memory controllers provide a cost-effective, high-performance alternative for expanding memory bandwidth and capacity. All CXL memory controllers can enable memory expansion, but not all have the features required to deploy at scale, unlock the performance of complex AI and ML workloads, and reduce TCO. With CXL-enabled CPUs and GPUs from the leading vendors expected to launch in the data center soon, it is important to understand the key features and factors to consider when selecting a CXL memory controller for cloud-scale deployments.

Memory Expansion

Cloud servers have multiple CXL x16 interfaces available, and multiple CXL memory controllers are expected to be added to these servers depending on the application-specific workloads. To maximize performance at lower TCO, memory controllers with a x16 CXL interface are better suited than twice as many controllers with a x8 CXL interface. Since each workload has different capacity requirements, it is also important for CXL memory controllers to support existing and future DRAM capacities. With the right design, a CXL memory controller can support a x16 CXL interface, up to four DDRx RDIMMs, and up to 2TB of memory per controller.
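As a quick sanity check on those configuration numbers, the sketch below shows how a single controller can reach 2TB with four RDIMMs; the 512GB RDIMM capacity and the number of controllers per server are assumptions used only for the arithmetic.

```python
# Per-controller memory capacity: up to four DDRx RDIMMs per CXL controller.
# RDIMM capacity and controllers per server are assumptions for illustration.

rdimms_per_controller = 4
rdimm_capacity_gb = 512          # assumed high-capacity RDIMM

capacity_tb = rdimms_per_controller * rdimm_capacity_gb / 1024
print(f"Capacity per controller: {capacity_tb:.1f} TB")   # 2.0 TB

controllers_per_server = 4       # assumed, one per available x16 CXL interface
print(f"Expansion per server:    {controllers_per_server * capacity_tb:.1f} TB")
```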

Memory Pooling and Sharing

While memory expansion increases performance on its own, CXL memory controllers that support memory pooling enable heterogeneous architectures that eliminate memory stranding, reduce TCO and increase performance, and controllers that support memory sharing further optimize dataflow efficiency and increase performance.

Memory pooling can be accomplished in two ways based on the CXL specification. CXL 2.0 defined memory controllers with multiple logical devices (MLD) to enable memory pooling through a CXL switch, while CXL 3.0 formally defined multi-headed single logical device (MH-SLD) memory controllers to enable memory pooling without a switch. It is important to note, however, that CXL 1.1 and 2.0 CPUs can already use MH-SLD controllers for memory pooling and do not need to wait for CXL 3.0 CPUs to take advantage of this unique capability. Similarly, memory sharing can be enabled with multi-headed memory controllers or switch-attached memory controllers. The key advantage of multi-headed controllers over switch-attached controllers is performance, since the CXL switch adds latency to every access (a conceptual sketch follows Figure 4).

Figure 4: Memory Pooling Architectures enabled by CXL
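The sketch below is a purely conceptual model of the pooling idea shown in Figure 4: a multi-headed controller exposes slices of one DRAM pool to several hosts, so capacity can be reassigned rather than stranded. The class and method names are hypothetical and do not correspond to any real CXL driver or controller API.

```python
# Conceptual model of memory pooling with a multi-headed (MH-SLD) controller:
# one DRAM pool, multiple host-facing CXL ports ("heads"), capacity assigned on demand.
# Names and structure are hypothetical illustrations, not a real driver API.

class PooledMemoryController:
    def __init__(self, total_gb: int, num_heads: int):
        self.free_gb = total_gb
        self.allocations = {head: 0 for head in range(num_heads)}

    def allocate(self, head: int, size_gb: int) -> bool:
        """Assign capacity from the shared pool to one host-facing head."""
        if size_gb > self.free_gb:
            return False
        self.allocations[head] += size_gb
        self.free_gb -= size_gb
        return True

    def release(self, head: int, size_gb: int) -> None:
        """Return capacity to the pool so another host can use it."""
        size_gb = min(size_gb, self.allocations[head])
        self.allocations[head] -= size_gb
        self.free_gb += size_gb

pool = PooledMemoryController(total_gb=2048, num_heads=4)
pool.allocate(head=0, size_gb=512)   # host 0 needs extra memory
pool.allocate(head=1, size_gb=256)   # host 1 takes a smaller slice
pool.release(head=0, size_gb=512)    # host 0 finishes; capacity returns to the pool
print(pool.allocations, pool.free_gb)
```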

Reliability, Availability and Serviceability

All processors, from the low-power CPUs in smartphones and laptops to the server CPUs in data centers, rely on memory to store information for immediate processing. The scale at which memory is deployed in a data center requires server-grade Reliability, Availability and Serviceability (RAS) so that memory errors, material degradation, environmental effects, and manufacturing defects do not impact application performance, uptime, or user experience. Each CSP has unique requirements, so it is important for CXL memory controllers to offer not only server-grade RAS features but also customizable RAS features that let each CSP tailor the memory subsystem to its needs. Ensuring the highest RAS at scale also requires extensive telemetry features to collect real-time information about the memory subsystem and enable fleet management to monitor, diagnose and service equipment across the data center.

Security

Our private data, from medical and financial records to personal photos and videos, is stored in the cloud. Security breaches from external software and social-engineering attacks are in the news every day, with attackers gaining access to usernames and passwords. Within a data center, strong security is even more important because the data of millions of users is accessible. Several security mechanisms are required to protect against attacks within a data center, including the growing trend of Confidential Computing, which isolates and encrypts data in use by virtual machines and the underlying hardware. With user data stored in memory, end-to-end security features such as a secure Root of Trust (RoT), firmware authentication, and encryption are required in CXL memory controllers to protect user data and ensure robust operation of the controller.

Scale and Flexibility

Each generation of server CPUs has local DDR channels, and the industry is currently transitioning from DDR4 to DDR5 DRAM. Similar to storage tiering, which enabled significant data center optimizations across various storage media types, there is now a growing trend toward memory tiering, enabling CPUs to connect not only to standard JEDEC DDRx but also to emerging memory solutions such as persistent memory. With CXL memory controllers, any memory type can be supported to optimize performance and cost for application-specific workloads.
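As one concrete way software sees a memory tier, on a Linux host CXL-attached memory is typically surfaced as an additional, often CPU-less, NUMA node. The sketch below lists each NUMA node's capacity and CPU affinity from sysfs; it assumes a Linux system and is meant only as an illustration of how tiering software might discover far memory.

```python
# List NUMA nodes and their memory capacity from Linux sysfs.
# On a CXL-enabled Linux host, CXL-attached memory typically appears as an
# additional (often CPU-less) NUMA node that tiering software can target.

import glob
import re

for node_path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = node_path.rsplit("/", 1)[-1]
    with open(f"{node_path}/meminfo") as f:
        meminfo = f.read()
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", meminfo).group(1))
    with open(f"{node_path}/cpulist") as f:
        cpus = f.read().strip()
    kind = "CPU-less (possible CXL/far-memory tier)" if not cpus else f"CPUs {cpus}"
    print(f"{node}: {total_kb / 1024 / 1024:.1f} GB, {kind}")
```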

Data centers deploy various RDIMM capacities throughout their fleets today, and since DRAM accounts for up to 50% of server cost, supply chains have been optimized to decrease costs by leveraging memory from all the top memory vendors. Data center operators need the flexibility to deploy the same RDIMMs used elsewhere in their fleet and to upgrade to the latest RDIMM speeds and capacities available on the market for CXL-attached memory. CXL memory controllers that are purpose-built for data centers enable the use of existing and future RDIMMs, keeping supply chains uninterrupted, TCO low, and serviceability practical at scale. Some CXL memory controllers are instead designed for E3.S memory drives, which provide a fixed memory configuration and limited flexibility for cloud deployments.

Performance

While CXL enables memory expansion, it is important to take advantage of the full bandwidth of the CXL interface and to consider the latency impact of the controller. A CXL Gen5 x8 interface is bandwidth-matched to one channel of DDR5-5600 memory (a CXL Gen5 x16 interface to two channels of DDR5-5600), so lower DDR speeds will not fully utilize the available CXL bandwidth. It is important for CXL memory controllers to have a low-latency data path and support DDR5-5600 RDIMMs to maximize performance.
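The rough arithmetic behind that bandwidth matching is sketched below. It assumes PCIe Gen5 signaling (32 GT/s per lane with 128b/130b encoding), counts both link directions, and ignores CXL/PCIe protocol overheads, so the numbers are a back-of-the-envelope check rather than a precise comparison; with real protocol overheads and read/write mixes, an x8 link lands close to one DDR5-5600 channel.

```python
# Back-of-the-envelope bandwidth comparison between a CXL link and DDR5 channels.
# Assumes PCIe Gen5 signaling (32 GT/s per lane, 128b/130b encoding) and sums
# both link directions; protocol overheads are ignored.

def cxl_gen5_bw_gbs(lanes: int) -> float:
    per_lane = 32 * 128 / 130 / 8      # ~3.94 GB/s per lane, per direction
    return lanes * per_lane * 2        # both directions combined

def ddr5_channel_bw_gbs(transfer_rate_mts: int = 5600) -> float:
    return transfer_rate_mts * 8 / 1000   # 64-bit channel -> 8 bytes per transfer

print(f"CXL x8  (Gen5): {cxl_gen5_bw_gbs(8):.0f} GB/s raw")     # ~63 GB/s
print(f"CXL x16 (Gen5): {cxl_gen5_bw_gbs(16):.0f} GB/s raw")    # ~126 GB/s
print(f"1x DDR5-5600:   {ddr5_channel_bw_gbs():.1f} GB/s")      # 44.8 GB/s
print(f"2x DDR5-5600:   {2 * ddr5_channel_bw_gbs():.1f} GB/s")  # 89.6 GB/s
```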

Interoperability

With CXL being a new standard, it is imperative to ensure robust interoperability. Existing PCI-SIG compliance and upcoming CXL compliance workshops will enable vendors to test CXL memory controllers for compliance with the specifications, but cloud-scale deployments will require additional testing to ensure a robust ecosystem. CXL memory controller vendors will need cloud-scale interoperability programs and close partnerships with CPU, GPU, and memory vendors to perform rigorous testing beyond the compliance workshops.

Introducing Leo Memory Connectivity Platform

Leo Memory Connectivity Platform – the industry’s first solution to support memory expansion, memory pooling, and memory sharing using CXL 1.1 and 2.0 – is purpose-built to unlock the performance needed for AI/ML workloads and decrease TCO for cloud-scale deployments. While other solutions continue to show FPGA-based proofs of concept and promise to deliver CXL technology in the future, Astera Labs is proud to deliver fully functional SoCs with actual deployments in data center racks running real workloads today. The Leo CXL Memory Connectivity Platform includes a comprehensive portfolio of controllers and off-the-shelf hardware solutions for plug-and-play deployment by CSPs and OEMs.

Figure 5: Leo Smart Memory Controllers and Aurora A-Series Hardware Solutions

With server-grade customizable Reliability, Availability and Serviceability (RAS), end-to-end security, extensive fleet management capabilities and seamless interoperability with all major CPU, GPU and memory vendors, Astera Labs has quickly gained traction as the preferred vendor for CSPs and OEMs.

Purpose-Built for Cloud

Comprehensive portfolio of purpose-built SoCs and hardware solutions for cloud-scale deployment targeting workloads such as AI and ML

Customizable RAS & Security

Server-grade customizable RAS, end-to-end security features, and software tools to integrate with fleet-management services

Low-Latency DDR5 & Custom Memory

Flexible and scalable memory interface with low-latency data path to support JEDEC DDR5 and custom memory interfaces

Seamless Interoperability

Seamless interoperability with all major CPU, GPU and memory vendors, making it easy to manage, debug, and deploy at scale


Conclusion

As we create and consume more data and our expectations for real-time data access increase, cloud architectures need to evolve to meet the performance requirements of the services we use. Processing data at scale with compute-intensive AI and ML workloads has been hampered by memory bottlenecks that can now be addressed with CXL memory controllers. Additionally, CXL enables new heterogeneous architectures that optimize performance and TCO with memory pooling and sharing.

The Leo CXL Memory Connectivity Platform is the industry’s first solution to eliminate memory bottlenecks, increase performance and reduce TCO through memory expansion, pooling and sharing. Leo Controllers have the extensive features required for cloud-scale deployments, including server-grade RAS, end-to-end security, fleet management, a low-latency data path, DDRx and custom memory support, and seamless interoperability with all major CPU, GPU, and memory vendors.

Leo CXL Memory Connectivity Platform is offered in several product solutions:

  • Leo E-Series CXL Memory Controller SoC supporting memory expansion
  • Leo P-Series CXL Memory Controller SoC supporting memory expansion, memory pooling, and sharing
  • Aurora A-Series CXL Memory Hardware Solutions for plug-and-play deployment

Astera Labs is excited to be the first to enable CSPs and OEMs to address memory bottlenecks with fully functional SoCs running real workloads in the data center. With new advancements in performance and scale enabled by CXL 3.0, and the power, performance, cost and time-to-market advantages of UCIe, Astera Labs will continue to deliver innovative and purpose-built connectivity solutions to remove performance bottlenecks for the next generation of cloud servers.

To learn more about the class-defining Leo CXL Memory Connectivity Platform for CXL 1.1 and 2.0, and other purpose-built solutions to remove performance bottlenecks throughout the data center, please visit asteralabs.com/leo.