This week at Supercomputing 2025, the AI infrastructure community is converging to explore the next generation of technologies enabling breakthrough performance at scale. Among the most critical challenges being addressed is the memory bottleneck that increasingly constrains AI deployments.
AI workloads, particularly Large Language Model (LLM) applications, are fundamentally memory-bound. Traditional server architectures cannot keep pace with the exponential growth in memory capacity and bandwidth requirements.
The solution lies in Compute Express Link® (CXL®) technology—and specifically, in understanding how CXL memory expansion unlocks performance for the AI applications defining the next era of infrastructure.
Evolution of Modern AI Inference with RAG
As AI adoption continues to grow exponentially, the AI inference market is expected to expand from USD 106.15 billion in 2025 to USD 254.9 billion by 2030, a CAGR of 19.2%. [1] To deliver accurate and timely responses, inference requires a different technology infrastructure stack than the one used for training AI models. Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for customizing LLMs to specific users and use cases, and it can also help drive down the cost of serving AI inference. It integrates information retrieval with generative models to provide accurate, grounded, and context-rich responses.
RAG-based models draw on specific, custom data to apply up-to-date knowledge, generating more accurate, relevant, and trustworthy responses. They also require storing large datasets and embeddings in memory.
Memory Challenges in RAG
RAG systems must maintain massive vector databases in memory for fast retrieval. These databases store high-dimensional embeddings—typically 768 to 2048 dimensions per vector—for millions or billions of data points. As organizations scale their knowledge bases and handle concurrent user queries, memory capacity and bandwidth become the primary bottlenecks.
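To make the scale concrete, a quick back-of-the-envelope calculation shows how raw embedding storage alone grows with corpus size and dimensionality. The sketch below assumes float32 embeddings and a flat, uncompressed index; the corpus sizes are illustrative, and real indexes add graph or clustering overhead on top of this lower bound.

```python
# Rough lower bound on the raw memory footprint of a vector index, assuming
# float32 embeddings and a flat (uncompressed) index. Real systems add graph
# or clustering structures and metadata on top of this.

def vector_index_bytes(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    return num_vectors * dims * bytes_per_dim

for num_vectors in (10_000_000, 100_000_000, 1_000_000_000):
    for dims in (768, 1536):
        gib = vector_index_bytes(num_vectors, dims) / 2**30
        print(f"{num_vectors:>13,} vectors x {dims:>4} dims ~ {gib:,.0f} GiB")
```

Even at modest dimensionality, a billion-vector corpus runs into multiple terabytes of embeddings alone, well beyond what local DIMM slots can hold.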
These memory demands are compounded by the dynamic nature of RAG workloads. Memory requirements spike during data ingestion and search phases, with peak usage varying based on query volume and complexity. Simultaneous queries accessing large vector indexes require substantial memory bandwidth to maintain the low-latency responses users expect from AI applications.
Traditional server architectures face a fundamental constraint: GPU-attached memory such as High Bandwidth Memory (HBM) and CPU-attached memory such as DRAM are limited in capacity and cannot scale with the exponential growth of AI models and datasets due to space, thermal, and cost constraints. Storage devices take over once system memory is consumed, but slower storage tiers add significant latency and impact AI performance. This “memory wall” forces difficult tradeoffs between performance, cost, and scalability. Figure 2 below shows an example topology. The result is:
- Limited concurrent LLM instances at high AI throughput
- Limited throughput with NVMe cache
- High latency with NVMe cache
- Lower GPU utilization on average with high CPU overhead
How Leo CXL Smart Memory Controller Solves the RAG Memory Wall
Astera Labs’ Leo Smart Memory Controller fundamentally changes the economics and performance profile of RAG deployments in several critical ways:
- Elastic Capacity at Scale: With Leo CXL controllers supporting up to 2TB per controller, organizations can scale vector database capacity well beyond the constraints of local CPU DIMM slots. This enables storing larger knowledge bases entirely in-memory rather than swapping to slower storage tiers—a critical advantage when query performance directly impacts user experience.
- Tiered Memory Architecture: CXL enables intelligent memory tiering strategies where frequently accessed “hot” vectors reside in local DRAM while “warm” data lives in CXL-attached memory (a conceptual sketch follows this list). This tiering maintains near-native performance for common queries while dramatically expanding total capacity, enabling organizations to support larger knowledge bases and more concurrent queries with fewer AI appliances.
- Cost-Optimized Scaling: By strategically allocating memory resources across local CPU DRAM and CXL-attached DRAM, organizations can increase utilization of AI platforms and achieve the performance characteristics they need while reducing overall server costs. This becomes particularly valuable as RAG systems scale to support enterprise-wide deployments with thousands of concurrent users.
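As a conceptual illustration of the hot/warm tiering described above, the sketch below keeps a small “hot” set of frequently accessed vectors in a fast tier while the bulk of the index lives in a larger “warm” tier. In practice, CXL-attached memory is typically presented to the operating system as an additional NUMA node, and placement is handled by NUMA-aware allocators or kernel memory tiering rather than application code; the class, thresholds, and sizes here are hypothetical.

```python
import numpy as np

# Illustrative two-tier vector store: a small "hot" tier standing in for local
# DRAM and a large "warm" tier standing in for CXL-attached memory. In a real
# deployment, placement is handled by NUMA-aware allocation or kernel memory
# tiering; this sketch only shows the promotion logic conceptually.

class TieredVectorStore:
    def __init__(self, warm_vectors: np.ndarray, hot_capacity: int = 1024):
        self.warm = warm_vectors          # large tier (stand-in for CXL memory)
        self.hot = {}                     # small tier (stand-in for local DRAM)
        self.hot_capacity = hot_capacity
        self.access_counts = {}

    def get(self, idx: int) -> np.ndarray:
        if idx in self.hot:
            return self.hot[idx]
        vec = self.warm[idx]
        self.access_counts[idx] = self.access_counts.get(idx, 0) + 1
        # Promote frequently accessed vectors into the hot tier.
        if self.access_counts[idx] >= 3 and len(self.hot) < self.hot_capacity:
            self.hot[idx] = vec.copy()
        return vec

store = TieredVectorStore(np.random.rand(100_000, 768).astype(np.float32))
for _ in range(5):
    _ = store.get(42)                     # vector 42 gets promoted to the hot tier
```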
For organizations building production RAG systems, CXL provides a proven path to break through memory constraints that would otherwise limit the scale and responsiveness of AI-powered applications. With Leo CXL Smart Memory Controllers, we have demonstrated RAG producing faster and more accurate results, including:
- 3x concurrent LLM instances at higher AI throughput
- 3x increased throughput with CXL at higher user count
- 3x faster response time per instance with CXL
- Higher GPU utilization on average with low CPU overhead
Transforming LLM Inference with KV Cache
The efficiency and speed of an AI inference model’s responses depend on how many calculations it must perform. Models often repeat many of the same calculations, which slows things down. Key-Value (KV) caching speeds up this process by remembering important information from previous steps and avoiding recomputation, resulting in much faster and more efficient responses. To avoid recomputing attention values for previously processed tokens, LLM serving systems store key-value pairs in what’s called the “KV cache.” The memory requirements for this cache are staggeringly high.
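The sketch below is a minimal, framework-free illustration of the idea, assuming a single attention head with random placeholder weights: at each decode step, keys and values are computed only for the newest token and appended to the cache, so earlier tokens are never recomputed.

```python
import numpy as np

# Minimal single-head attention decode loop illustrating KV caching: instead of
# recomputing keys/values for every previous token at each step, we compute K
# and V only for the newest token and append them to a growing cache.
# Weights are random placeholders; no real model is implied.

d_model = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

k_cache, v_cache = [], []               # grows by one entry per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attend the newest token over all cached tokens."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)          # only the new token's K/V are computed
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)               # (t, d_model)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for t in range(8):                      # generate 8 tokens; the cache grows linearly
    out = decode_step(rng.standard_normal(d_model))
```

The cache grows linearly with context length and must be kept for every active request, which is exactly why its memory footprint dominates at scale.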
Memory Bottlenecks for KV Caching
KV cache consumes a significant amount of memory for modern LLMs—approximately 1MB of memory per token. [2] As context windows expand to 100K+ tokens—enabling analysis of multiple documents or long conversations—a single inference request can require hundreds of gigabytes of KV cache. For perspective: analyzing extended documents or maintaining long conversation histories can easily consume 1TB or more of memory just for KV cache storage. [2]
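The exact per-token footprint depends on the model architecture and precision. The back-of-the-envelope calculation below uses the standard sizing formula for transformer decoders; the example configuration is hypothetical and chosen only to land near the ~1MB-per-token figure cited above.

```python
# Back-of-the-envelope KV cache sizing. The formula (keys and values, at every
# layer, for every KV head) is standard for transformer decoders; the example
# configuration is hypothetical and chosen to land near ~1MB per token.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # 2x for keys and values; fp16 (2 bytes per element) by default.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(layers=48, kv_heads=40, head_dim=128)
print(f"per token:   {per_token / 2**20:.2f} MiB")                    # ~0.94 MiB

context_tokens = 128_000                                              # long-context request
print(f"per request: {context_tokens * per_token / 2**30:.0f} GiB")   # ~117 GiB
```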
Traditional memory scaling has several limitations, including:
- Inefficient compute (GPU & CPU) utilization
- CPU storage results in low performance due to high latency
- Overhead due to limited LLM instances per server
How Leo CXL Smart Memory Controller Solves KV Cache Memory Bottlenecks
Leo CXL memory provides a high-performance tier specifically suited for KV cache storage, with several breakthrough advantages:
- Efficient GPU Offloading: Based on real-world inference workloads, the CXL-CPU interconnect performs comparably to the CPU-GPU interconnect for data transfer. [3] This means KV cache can be efficiently offloaded from constrained GPU memory to more abundant CXL memory without the significant latency penalties that would otherwise make this approach impractical (a conceptual sketch of this offload pattern follows this list).
- Increased Throughput with Consistent Performance: By storing KV cache in CXL memory rather than recomputing it, serving systems can handle 30% larger batch sizes while maintaining the same performance targets for time-to-first-token. [2] Larger batch sizes directly translate to higher throughput and better GPU utilization—critical metrics for production deployments.
- Dramatic Infrastructure Cost Reduction: Production modeling demonstrates that storing KV cache on CXL memory can reduce GPU requirements by up to 87%, with 75% higher GPU utilization for the prefill stage compared to full KV recomputation. [2] This fundamentally changes the economics of LLM deployment, allowing organizations to serve more users with less infrastructure investment.
- Enhanced Concurrent Processing: Real-world implementations show CXL enables systems to support 2× more concurrent LLM instances per server while reducing CPU utilization per query by 40%. [2] This means cloud providers and enterprises can serve more users with the same infrastructure investment, directly improving return on investment for AI deployments.
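As referenced in the Efficient GPU Offloading item above, the sketch below shows the offload pattern conceptually, assuming PyTorch and a CUDA device: a paused request’s KV blocks are copied from GPU HBM into pinned host memory and restored when the session resumes. On a CXL-equipped host, the destination buffer would reside in the expanded memory tier (for example via NUMA binding), a placement step this sketch does not control explicitly; the tensor shape is hypothetical.

```python
import torch

# Conceptual KV cache offload: copy a paused request's KV blocks from GPU HBM
# into pinned host memory, then restore them when the conversation resumes.
# On a CXL-equipped host the destination buffer would live in the expanded
# memory tier (e.g., via NUMA binding); that placement step is not shown here.

def offload_kv(kv_gpu: torch.Tensor) -> torch.Tensor:
    # Pinned host memory keeps the device-to-host copy fast and asynchronous.
    kv_host = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype,
                          device="cpu", pin_memory=True)
    kv_host.copy_(kv_gpu, non_blocking=True)
    return kv_host

def restore_kv(kv_host: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    return kv_host.to(device, non_blocking=True)

if torch.cuda.is_available():
    # Hypothetical KV layout: (layers, K/V, tokens, kv_heads, head_dim), fp16.
    kv = torch.randn(48, 2, 1024, 8, 128, dtype=torch.float16, device="cuda")
    saved = offload_kv(kv)
    torch.cuda.synchronize()    # make sure the async copy has finished
    del kv                      # frees HBM for other requests
    kv = restore_kv(saved)      # reload when the session continues
```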
With Leo CXL Smart Memory Controllers, AI inference with KV cache shows significant performance improvements:
- 67% lower latency with CXL
- 75% higher GPU utilization on average with low CPU overhead
- 2x more concurrent LLM instances per server
The Path Forward: Cloud-Scale CXL Evaluation
What makes this moment particularly significant is that CXL is transitioning from research and proof-of-concept to cloud-scale deployment. Microsoft’s private preview of Leo CXL Smart Memory Controllers on Azure M-series VMs provides organizations with their first opportunity to evaluate CXL memory expansion capabilities for specific workloads in a cloud environment—validating the performance benefits demonstrated in research while advancing the technology toward broader adoption.
Leo controllers support CXL 2.0 with up to 2TB of memory capacity per controller, enabling server memory capacity to scale by more than 1.5×. [3] This capacity expansion directly addresses the memory bottlenecks that constrain RAG and KV cache performance in production deployments today.
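On Linux hosts where CXL-attached memory is presented as a CPU-less NUMA node (a common presentation, though a given platform or hypervisor may surface it differently), a quick way to confirm the expanded capacity is to list each NUMA node’s CPUs and memory from standard sysfs files, as in the sketch below.

```python
from pathlib import Path

# List NUMA nodes with their CPU sets and memory sizes. CXL-attached memory
# commonly appears as a node that has memory but no CPUs; the exact
# presentation depends on the platform, BIOS, and hypervisor.

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip() or "none"
    mem_kb = 0
    for line in (node / "meminfo").read_text().splitlines():
        if "MemTotal" in line:
            mem_kb = int(line.split()[-2])
    print(f"{node.name}: cpus={cpus}, mem={mem_kb / 2**20:.1f} GiB")
```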
Evaluation Criteria for Organizations:
For organizations evaluating CXL for their AI infrastructure:
- RAG Applications: Consider CXL for vector databases that exceed local DRAM capacity, particularly when serving high query volumes or maintaining multiple knowledge base variants. CXL enables scaling to larger knowledge bases while maintaining query performance that meets user expectations.
- LLM Inference: Evaluate CXL for KV cache storage when deploying models with long context windows (32K+ tokens) or when GPU memory constraints limit batch sizes and throughput. The ability to offload KV cache to CXL memory can dramatically improve GPU utilization and reduce infrastructure costs.
- Memory-Intensive Analytics: Explore CXL for in-memory databases and big data analytics where dataset sizes exceed traditional memory limits. The same principles that benefit RAG and LLM inference apply to any workload where memory capacity constrains performance.
Conclusion: Memory Innovation Enables AI Innovation
The collaboration between Astera Labs and Microsoft, along with the broader CXL ecosystem development on display at Supercomputing 2025, demonstrates that CXL is advancing rapidly toward addressing production AI workloads. As AI models grow larger and context windows expand, memory capacity and bandwidth increasingly determine what’s possible in production deployments. The memory wall is real, and it’s constraining the next generation of AI applications that enterprises want to build.
CXL technology—and specifically Astera Labs’ Leo CXL Smart Memory Controllers—provides a proven, production-ready path to break through these constraints. Whether you’re building RAG applications that need to search across vast knowledge bases, deploying LLMs that handle long-context inference, or running memory-intensive analytics at scale, Leo unlocks new levels of performance and cost efficiency.
To learn more about how Leo CXL Smart Memory Controllers can transform your AI infrastructure, explore our detailed technical specifications or discover our flexible CXL product suite.
Additional Resources:
- Research Paper: Tang, Y., et al. (2024). “Exploring CXL-based KV Cache Storage for LLM Serving” – Machine Learning for Systems Workshop at NeurIPS 2024
- OCP CXL Technical Discussions: Visit the Open Compute Project YouTube channel for CXL technical sessions and panel discussions
- AI Infrastructure Forum: Watch memory innovation sessions at AI Infrastructure Forum video library
- CXL Consortium Resources: Explore the latest CXL specifications and technical resources at https://computeexpresslink.org/
References
- MarketsandMarkets. (n.d.). AI inference market size, share & trends, 2025 to 2030. https://www.marketsandmarkets.com/Market-Reports/ai-inference-market-189921964.html
- Tang, Y., Cheng, R., Zhou, P., Liu, T., Liu, F., Tang, W., Bae, K., Chen, J., Xiang, W., & Shi, R. (2024). Exploring CXL-based KV cache storage for LLM serving [Paper presentation]. Machine Learning for Systems Workshop at NeurIPS 2024. https://mlforsystems.org/assets/papers/neurips2024/paper17.pdf
- Zhong, Y., Berger, D. S., Waldspurger, C., Wee, R., Agarwal, I., Agarwal, R., Hady, F., Kumar, K., Hill, M. D., Chowdhury, M., & Cidon, A. (2024, July 10–12). Managing memory tiers with CXL in virtualized environments [Paper presentation]. 18th USENIX Symposium on Operating Systems Design and Implementation, Santa Clara, CA, United States. https://www.usenix.org/system/files/osdi24-zhong-yuhong.pdf