Inference Tokenomics: How CXL Memory Expansion Improves AI Economics

Amit Golander, PhD, Memory & Storage Associate Vice President

In our last post, Breaking Through the Memory Wall, we explored how CXL memory expansion addresses fundamental constraints in RAG and KV cache management. In this post, we expand on how inference deployment has shifted to distributed architectures to meet the requirements of large-scale applications, and how CXL-attached memory improves capacity and performance while lowering TCO.

Memory Bottlenecks in Distributed Inference

As generative models have grown larger, inference architectures have evolved to distribute workloads across multiple GPUs—separating LLM context processing and generation phases across distinct GPUs to improve performance and minimize costly re-computation. But this scale exposes a fundamental constraint: during inference, LLMs build context encoded in KV caches that grow rapidly as token counts increase. This context quickly exhausts scarce GPU memory, running into the Memory Wall. Without efficient memory expansion, latency rises, costs soar, and the end-user experience suffers.
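To make that growth concrete, here is a back-of-the-envelope sizing sketch in Python. The model shape is an illustrative assumption (a Llama-2-70B-like configuration with grouped-query attention), not a measurement from any specific deployment:

```python
# Back-of-the-envelope KV cache sizing for a transformer decoder.
# Model dimensions below are illustrative assumptions, not vendor specs.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache: two tensors (K and V) per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

# Assumed Llama-2-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes(80, 8, 128, num_tokens=1)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")       # ~320 KiB

ctx = kv_cache_bytes(80, 8, 128, num_tokens=128 * 1024)
print(f"128K-token context: {ctx / 2**30:.0f} GiB")            # ~40 GiB
```

At roughly 40 GiB per 128K-token context under these assumptions, a handful of concurrent long contexts already exceeds the HBM of a single GPU.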

Memory expansion becomes especially crucial in two everyday use cases:

  1. offloading heavy contexts during input document processing, where models must digest and reason over substantial information; and
  2. maintaining context across multiple conversation turns as users, whether human or AI agent, pose follow-on questions or requests.

Without nearby memory to maintain inference context efficiently, the consequences are clear: recomputation that wastes hardware and power, inflated operational costs, and a compromised end-user experience.
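As a rough illustration of why that recomputation is so costly, the sketch below compares re-running prefill over a long prompt against simply reloading its KV cache from a nearby memory tier. The FLOPs estimate (about 2 × parameters × tokens for prefill) is standard, but the model size, GPU throughput, and link bandwidth are assumed round numbers, not benchmarks:

```python
# Rough comparison: recompute prefill vs. reload an offloaded KV cache.
# All numbers are illustrative assumptions, not benchmarks.

PARAMS = 70e9               # assumed 70B-parameter model
TOKENS = 128 * 1024         # 128K-token context to restore
GPU_FLOPS = 1e15            # assumed ~1 PFLOP/s of usable FP16 compute
KV_BYTES = 40 * 2**30       # ~40 GiB KV cache (see sizing sketch above)
LINK_BPS = 64e9             # assumed ~64 GB/s effective path to nearby memory

recompute_s = 2 * PARAMS * TOKENS / GPU_FLOPS   # ~2*N*T FLOPs for prefill
reload_s = KV_BYTES / LINK_BPS

print(f"Recompute prefill: {recompute_s:.1f} s of GPU time")   # ~18.4 s
print(f"Reload KV cache:   {reload_s:.1f} s of transfer")      # ~0.7 s
```

Under these assumptions, reloading is more than an order of magnitude cheaper than recomputing, and it frees the GPU to serve other requests during the transfer.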

Why KV Cache Offloading Drives Inference ROI

KV cache offloading addresses this challenge by expanding GPU memory into nearby, higher-capacity tiers. Traditional distributed inference applications offload KV cache to CPU DRAM or SSDs, but these tiers are limited in capacity or speed: CPU DRAM capacity is finite, and SSDs add latency that can degrade user experience and cap the number of inference requests each GPU can handle.
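Conceptually, the offload path is a tiering decision: keep hot KV blocks in HBM and spill colder ones to a larger tier. The sketch below is a minimal LRU spill policy in Python; the two-tier model and block budget are assumptions for illustration, not an Astera Labs or framework API:

```python
# Minimal sketch of an LRU spill policy for KV blocks across two tiers.
# Tier names and capacities are illustrative assumptions only.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_blocks: int):
        self.hbm = OrderedDict()   # block_id -> data, in LRU order (hot tier)
        self.cxl = {}              # spilled blocks (capacious cold tier)
        self.hbm_blocks = hbm_blocks

    def put(self, block_id, data):
        self.hbm[block_id] = data
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_blocks:      # over the HBM budget:
            victim, vdata = self.hbm.popitem(last=False)
            self.cxl[victim] = vdata                # spill the coldest block

    def get(self, block_id):
        if block_id in self.hbm:                    # hot hit in HBM
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        data = self.cxl.pop(block_id)               # cold hit: promote it back;
        self.put(block_id, data)                    # reloading beats recomputing
        return data
```

The closer and faster the cold tier, the cheaper each promotion, which is exactly where CXL-attached memory sits relative to SSDs.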

CXL-attached memory in a distributed inference architecture provides a superior alternative: higher performance than traditional storage while maintaining cost-effectiveness. Leo-based solutions deployed with Penguin Solutions demonstrate 75% higher GPU utilization and 2X inference throughput, directly translating into improved inference economics per deployed GPU.

Flexible Deployment Options

Astera Labs’ Leo CXL memory controllers provide flexible solutions to the KV cache challenge. Organizations can deploy Leo-based memory modules in either:

  • an intra-GPU server configuration, expanding memory via CPU-CXL connectivity within a GPU server, or
  • an inter-GPU server architecture using Leo in shared KV cache servers accessed via RDMA, as demonstrated in Penguin Solutions’ GTC booth (also see session EX82068).

CXL-enabled servers at Penguin Solutions booth 1031 demonstrate 3.6x memory expansion, accelerating KV cache for AI inference workloads and boosting GPU utilization.

The inter-GPU server model in particular lets infrastructure architects build a distributed KV cache tier that serves multiple GPU servers, providing both the capacity and performance needed for large-scale distributed inference without the latency penalties of traditional storage approaches.
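To make the two topologies concrete, here is a minimal sketch of how an orchestration layer might pick an offload target per request. The class, the capacity and bandwidth figures, and the selection rule are all hypothetical illustrations, not Astera Labs or Penguin Solutions APIs:

```python
# Hypothetical sketch of topology-aware KV offload target selection.
# Names and numbers are illustrative, not a real API or measured values.
from dataclasses import dataclass

@dataclass
class OffloadTarget:
    name: str
    capacity_gib: int
    est_bandwidth_gbs: float   # assumed effective bandwidth to the GPU server

# Intra-server: CXL expander reached through CPU load/store semantics.
local_cxl = OffloadTarget("intra-server CXL", 512, 64.0)
# Inter-server: shared KV cache server reached over RDMA.
remote_kv = OffloadTarget("shared KV server (RDMA)", 4096, 40.0)

def pick_target(kv_gib: float, local_free_gib: float) -> OffloadTarget:
    """Prefer the local tier while it has room; spill to the shared tier."""
    return local_cxl if kv_gib <= local_free_gib else remote_kv

print(pick_target(kv_gib=40, local_free_gib=300).name)  # intra-server CXL
print(pick_target(kv_gib=40, local_free_gib=8).name)    # shared KV server (RDMA)
```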

Open Ecosystem Integration

Leo’s architecture supports topology-aware deployments that scale with distributed inference workloads and integrates with existing inference frameworks, including vLLM, TensorRT-LLM, and SGLang, enabling deployment without application-level changes. This ecosystem compatibility means existing AI stacks can adopt Leo without costly migration projects.
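As one hedged example of what deployment without application-level changes can look like: vLLM already exposes a host-memory swap space for KV blocks via its swap_space parameter (sized in GiB). Because CXL-attached memory typically appears to the OS as an additional NUMA node of system RAM, that swap space can land on CXL capacity through NUMA placement alone; the model name below is an assumption for illustration:

```python
# Hedged sketch: vLLM's host-side KV swap space backed by CXL memory.
# swap_space is a real vLLM parameter (GiB of host memory for KV swapping);
# the model choice is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # assumed model for illustration
    swap_space=64,                     # 64 GiB of host-side KV swap space
)
outputs = llm.generate(["Summarize this document: ..."],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

If the CXL expander shows up as, say, NUMA node 2, launching the process under numactl --membind=2 steers those host allocations onto CXL-backed capacity without touching the application (node number assumed for your topology).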

While Leo’s AI inference benefits are clear, the technology delivers value across diverse applications: Microsoft Azure’s M-series VM instances running SAP HANA, for example, leverage Leo for memory expansion in traditional enterprise workloads.

Looking Forward

As distributed AI inference continues scaling across rack-scale architectures, Astera Labs remains focused on advancing CXL memory solutions that address the core challenge: keeping GPUs fully utilized while managing inference economics. Stay tuned to Astera Labs for future updates on scaling AI infrastructure without hitting the memory wall.


Visit us at GTC 2026: See Leo-based KV cache offload demonstrations at Penguin Solutions booth 1031. Learn how CXL memory expansion can optimize your inference deployment economics.

About Amit Golander, PhD, Memory & Storage Associate Vice President

Dr. Amit Golander has spent decades working on computer infrastructure—building, researching, and pushing the limits of high‑speed storage, compute, memory, and fabrics. He’s led global teams across three startups and four large companies, and has published roughly 90 papers and patents. In his spare time, Amit teaches AI Infrastructure and Storage and mentors M.Sc. students at Tel‑Aviv University.
