PCIe

Extending Our Connectivity Leadership: Industry’s First End-to-End PCIe over Optics Demo

June 11, 2024 by Lori Zielinski

By Jitendra Mohan, CEO and Co-Founder

The Generative AI revolution is reshaping all industries and redefining what’s possible in every aspect of our lives. Behind the scenes, the rapid pace of innovation is creating significant challenges for data center infrastructure, including:

Exploding demand for AI processing resources that must be interconnected across the data center due to the need for Large Language Models to simultaneously process massive multi-modal datasets (text, images, audio, and video).
A large number of platform architectures being deployed on a significantly accelerated, annual upgrade cadence due to the diversity and customization of Generative AI applications.
The need for maximum uptime and utilization of deployed AI infrastructure, driven by the intense financial pressure on Cloud providers to deliver significant ROI on massive capital expenditure.

The compute requirements of modern AI models can only be met by interconnecting thousands of GPUs or AI accelerators with a dedicated network/fabric purpose-built for AI workloads. This network, often referred to as the “backend” network, consists of a “scale-up” accelerator clustering fabric and a “scale-out” networking fabric. The scale-up fabric is often an any-to-any mesh interconnect optimized for maximum throughput and for tightly coupling the accelerators so they can rapidly exchange AI model training/inferencing data. Examples of scale-up interconnects include NVLink, Infinity Fabric, PCI Express® (PCIe®), and Ethernet, and these technologies are used to connect up to hundreds of accelerators. The PCIe interface is natively available on AI accelerators and GPUs, and some AI platforms also leverage PCIe or PCIe-based protocols for their scale-up fabrics. As AI clusters grow from 1-2 racks with tens of GPUs to large pods spanning multiple racks and hundreds of GPUs, interconnect length quickly becomes a limitation. At PCIe 5.0 data rates, Active Electrical Cables (AECs) spanning up to 7 meters are sufficient to interconnect a few racks. However, at higher data rates such as PCIe 6.x and PCIe 7.x, optical solutions are needed for GPU clusters spanning multiple racks.

We are excited to continue Astera Labs’ leadership in PCIe connectivity solutions by demonstrating end-to-end PCIe/CXL over optics for GPU clustering, which is lighting the path forward for scaling Generative AI infrastructure!

Addressing Everyday AI-interconnect Challenges

Since 2017, Astera Labs has been laser-focused on unleashing the full potential of AI and Cloud infrastructure through a consistent drumbeat of releasing first-to-market, highly innovative connectivity solutions. The foundation of our Intelligent Connectivity Platform is PCIe®, CXL®, and Ethernet semiconductor-based solutions, along with our COSMOS software suite of system management and optimization tools. This platform delivers a software-defined architecture that is both scalable and customizable.

Facing major challenges in Generative AI infrastructure build-outs, all major hyperscalers and AI platform providers utilize our Intelligent Connectivity Platform which is proven to:

Deliver reliable connectivity over distance and at scale, including chip-to-chip, box-to-box, and rack-to-rack links; we now extend this capability to row-to-row connectivity with PCIe over optics to accelerate the deployment of the largest GPU clusters, which must scale across the data center.
Accelerate time-to-deployment of vastly diverse AI platforms through our software-defined architecture and tremendous up-front investment in interoperability testing at cloud scale.
Enable unprecedented visibility into the ever-increasing number of connectivity links through deep diagnostics, telemetry, and fleet management that facilitate maximum uptime and system utilization of expensive AI infrastructure.

The product families that anchor our Intelligent Connectivity Platform include:

Aries PCIe/CXL Smart DSP Retimers are field-tested and widely deployed by all major hyperscalers and platform providers. Our third-generation Aries 6 Retimers double the per-lane data rate to 64GT/s and are now sampling. Our Aries PCIe/CXL Smart Cable Modules™ (SCM) deliver an industry-first 7-meter reach over Active Electrical Cables (AECs) for rack-to-rack PCIe connectivity.

Taurus Ethernet Smart Cable Modules (SCM) support Ethernet rates up to 100Gb/s per lane in various form factors that enable robust, thin, and flexible cables for Switch-to-Switch and Switch-to-Server connectivity applications.

Leo CXL Smart Memory Controllers are the industry’s first solution to support CXL memory expansion, pooling, and sharing. They are optimized to meet the growing computational needs of Generative AI workloads at low latency.

We have a rich history of introducing and providing solutions early in the technology cycle to maximize platform utilization. This includes our first-to-market PCIe and CXL solutions and the expansion of our comprehensive Cloud-Scale Interop Lab, which enables confidence in deploying advanced solutions at scale.

New Paradigm for Seamless AI Connectivity

As AI infrastructure build-outs grow beyond single racks and exceed the reach of traditional passive Direct Attach Cables (DACs), new connectivity solutions must be developed. Signal loss at higher speeds also limits the effectiveness of passive solutions, demanding new active cables with improved reach and routing to complement the passive offerings.

Our Aries PCIe/CXL SCM™ delivers a 7-meter reach over Active Electrical Cables (AECs), addressing the limitations of DACs. These cost-effective, low-latency AECs enable thin cabling that makes it easy to scale AI accelerator clusters beyond a single rack.

As data rates increase to PCIe 6.x (64GT/s), PCIe 7.x (128GT/s) and beyond, conventional passive and Active Electrical Cables will be limited to single racks. New solutions like PCIe over optics, including Active Optical Cables (AOCs), will play a larger role in rack-to-rack connectivity to maintain and grow these AI clusters.

Unleashing the Reach of Optical Connectivity for PCIe

Fiber-optic links have become the proven backbone of high-speed Ethernet connectivity, offering long-reach data connectivity across hyperscale data centers with lightweight, thin optical cables. These benefits can be applied to PCIe connectivity by developing new PCIe over optics solutions, including AOCs, that extend PCIe connectivity to clusters of racks with improved cable management compared to copper.

The application of PCIe/CXL over optics is often driven by low-latency requirements relative to Ethernet, such as cache-coherent memory transactions and parallel processing workloads between GPUs. These applications also demand comprehensive management of the link through the use of specialized software to ensure full protocol compliance and reliability.

Astera Labs provides field-proven, software-defined connectivity solutions developed over multiple generations of PCIe specifications to seamlessly integrate PCIe over optics. We have demonstrated this in end-to-end, fully compliant link connections representative of AI infrastructure deployment use cases. Our PCIe over optics demo connects a CPU head-node root complex to a target GPU and to a target remote disaggregated memory system. Demonstrating the first long-distance, fully compliant PCIe link over optics between multiple devices paves the way for new high-speed PCIe over optics products. Additionally, this solution leverages comprehensive diagnostics, telemetry, and fleet management features from our COSMOS software suite, which helps accelerate time to deployment and facilitates optimized infrastructure utilization.

In summary, Astera Labs continues to innovate and execute on new connectivity solutions that support the accelerated deployment of AI platforms needed to keep up with rapid advancements in next-generation Generative AI applications. Delivering solutions built on a software-defined architecture and our common COSMOS software suite enables flexible and reliable connectivity that spans chip-to-chip, box-to-box, rack-to-rack, and now, with the demonstration of PCIe over optics, row-to-row applications across the data center. This is valuable for hyperscalers and AI platform providers because integrating diagnostics and telemetry into infrastructure management is a significant investment, and that investment can be fully leveraged with our latest PCIe over optics technology. We’re excited to be the first to showcase a complete end-to-end optics demonstration with a PCIe 5.0 GPU and a CXL 2.0 memory expander connected to a root complex using PCIe-based optical modules.


The Long and Short of AI: Building Scalable Data Centers in the PCIe 6.x Era

May 17, 2024 by Joe Balich

By Abhishek Wadhwa, Senior Field Applications Engineer

The rise of artificial intelligence (AI) and Generative AI is transforming how we interact with technology. From healthcare to business efficiency and groundbreaking research, AI and Generative AI are making waves. These AI marvels rely on vast amounts of hardware and infrastructure to function. As a result, data centers are undergoing a revolution driven by the ever-growing demands of the workloads required to train AI and Generative AI models.

As shown on the left of Figure 1 below, early data center architectures consisted of racks of servers, each with a fixed amount of processing power and memory. As the workload requirements of data centers shifted, this rigid setup led to resource stranding, where some servers had idle resources while others were maxed out [1]. In the past decade, servers were partially disaggregated to optimize workloads. This allowed data center operators to separate compute, memory, and storage resources into individual building blocks, as shown in the center rack of Figure 1. These blocks are dynamically allocated to workloads, maximizing utilization and efficiency. Disaggregation empowered data centers to adapt and scale seamlessly with changing workloads. Now, the AI-based systems shown on the right of Figure 1 are being deployed to run AI workloads efficiently.

Figure 1: Evolution of rack architecture from blade systems to AI/ML systems

PCIe Connectivity in an AI System

PCI Express® (PCIe®) technology is the workhorse of data center communication, offering high-bandwidth, low-latency data exchange between CPUs, GPUs, and other components. Current AI architectures are deployed with PCIe 5.0, which runs at 32 GT/s per lane with a Nyquist frequency of 16 GHz. However, channel loss at 16 GHz is high and limits PCIe 5.0 signal reach to 10-12 inches on ultra-low-loss PCB material [2]. To ensure reliable communication in PCIe 5.0 architectures, system designers use Retimers to compensate for considerable channel insertion loss and extend signal reach.

This architectural design shift changed PCIe connectivity from short to long reach. Figure 2 below represents a typical AI system. These systems can vary in design, but they typically involve connecting a head node server directly via PCIe to one or more “Just a Bunch of GPUs” (JBOG)/AI Accelerator baseboards, each of which holds up to eight GPUs. The head node shown in Figure 2 is a two-socket server utilizing Aries 5 PCIe 5.0/CXL® Retimers for box-to-box server communication. A Host Interface Board (HIB) connects the head node with the JBOG. Another set of Aries 5 Retimers facilitates robust communication between the HIB and JBOG.

Figure 2: AI system

The Long Reach – Scaling up GPU Clusters

Scaling up AI model training requires distributing the workload across thousands of GPUs. This technique, called data parallelism, necessitates efficient communication between GPUs for exchanging data. While PCIe interconnectivity within a JBOG enables data exchange within a single box, power limitations come into play when scaling up the GPU cluster.

A single current-generation AI system can draw around 10 kW, which limits how many AI systems can be deployed in a rack with a typical power density of 15-20 kW [3]. To accommodate these power limitations, one approach distributes AI systems across multiple racks, spreading the power load more effectively (Figure 3). However, this approach creates a new challenge: efficiently connecting these distributed GPUs for scale-up data exchange.
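To make the rack arithmetic concrete, here is a minimal Python sketch using the figures above (about 10 kW per AI system, 15-20 kW per rack). The helper names and the eight-system deployment are hypothetical, and the model deliberately ignores cooling, networking, and power-delivery overhead.

```python
import math

def systems_per_rack(rack_budget_kw: float, system_kw: float) -> int:
    """Number of whole AI systems that fit within one rack's power budget."""
    return math.floor(rack_budget_kw / system_kw)

def racks_needed(num_systems: int, rack_budget_kw: float, system_kw: float) -> int:
    """Racks required to host num_systems without exceeding the per-rack budget."""
    per_rack = systems_per_rack(rack_budget_kw, system_kw)
    if per_rack == 0:
        raise ValueError("a single system exceeds the rack power budget")
    return math.ceil(num_systems / per_rack)

# ~10 kW per AI system and 15-20 kW per rack, as in the text;
# the 8-system deployment is a hypothetical example.
for rack_kw in (15, 20):
    print(f"{rack_kw} kW rack: {systems_per_rack(rack_kw, 10)} system(s) per rack, "
          f"{racks_needed(8, rack_kw, 10)} racks for 8 systems")
```

Even in the best case, only one or two systems fit per rack, so a cluster of any size inevitably spans many racks and needs cross-rack links.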

PCIe 5.0 connections over passive cables are limited to a reach of approximately three meters (Figure 3), and their larger bend radius further restricts cable routing within the rack. Aries PCIe/CXL Smart Cable Modules (SCM) with built-in Retimers extend the reach up to seven meters for PCIe 5.0. These thinner, more flexible cables have a smaller bend radius and also improve the rack’s airflow. The reach and flexibility enabled by Aries SCMs open up connections across multiple racks, tying GPU clusters and AI Accelerators together while reducing per-rack power density.

Figure 3: Multi-rack GPU clustering

The Short Reach – Doubling Bandwidth with PCIe 6.0

As the computation used by AI models doubles every six months, the bandwidth for their collective training process needs an upgrade too [4]. PCIe 6.0 offers double the bandwidth of PCIe 5.0, pushing data rates up to 64 GT/s per lane. This leap presents even greater signal integrity (SI) challenges than PCIe 5.0. Figure 4 shows the NRZ signal (left) used by PCIe 5.0 and the PAM-4 signal (right) used by PCIe 6.0. PCIe 6.0 retains the 16 GHz Nyquist frequency and encodes two bits per unit interval (UI) to double the data rate. PAM-4 has four voltage levels and three eyes, compared to two levels and one eye in NRZ signaling.
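A quick way to see why PCIe 6.0 keeps the same 16 GHz Nyquist frequency is to divide the data rate by the bits carried per unit interval, then halve the resulting symbol rate. The sketch below assumes ideal NRZ (1 bit per UI) and PAM-4 (2 bits per UI) signaling; the function name is illustrative.

```python
import math

def nyquist_ghz(data_rate_gtps: float, bits_per_ui: int) -> float:
    """Nyquist frequency (GHz): half the symbol rate, where symbol rate = data rate / bits per UI."""
    symbol_rate_gbaud = data_rate_gtps / bits_per_ui
    return symbol_rate_gbaud / 2

NRZ_BITS_PER_UI = 1                   # two levels -> 1 bit per unit interval
PAM4_BITS_PER_UI = int(math.log2(4))  # four levels -> 2 bits per unit interval

print(nyquist_ghz(32, NRZ_BITS_PER_UI))   # PCIe 5.0 NRZ: 16.0
print(nyquist_ghz(64, PAM4_BITS_PER_UI))  # PCIe 6.0 PAM-4: 16.0
```

Doubling the bits per UI exactly offsets doubling the data rate, which is why the channel sees the same Nyquist frequency while the receiver must resolve smaller eyes.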

Since the overall voltage swing remains fixed, each eye in the PAM-4 system offers only one-third of the voltage available in NRZ, as shown in Figure 4. This also makes the signal far more susceptible to any noise during data transmission. The smaller PAM-4 eyes degrade SNR by 9.5 dB compared to NRZ, further affecting SI [5]. The more complex signaling and higher data rate in PCIe 6.0 also make the link more sensitive to jitter, which can degrade signal quality and lead to errors. The system simulation jitter budget for common clock architectures was trimmed to 0.15 ps RMS in PCIe 6.0 from 0.25 ps RMS in PCIe 5.0.

Figure 4: PCIe 6.0 implements PAM4 signaling resulting in tougher signal integrity challenges
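The one-third eye height and the roughly 9.5 dB SNR penalty follow from simple arithmetic, sketched below under idealized assumptions: equally spaced levels, a fixed total swing, and the same noise on every eye.

```python
import math

def eye_fraction(levels: int) -> float:
    """Height of each eye as a fraction of the full swing: (levels - 1) eyes share it."""
    return 1 / (levels - 1)

def snr_penalty_db(levels: int) -> float:
    """Eye-amplitude penalty versus NRZ (2 levels), in dB of voltage ratio."""
    return 20 * math.log10(eye_fraction(2) / eye_fraction(levels))

print(f"PAM-4 eye height: {eye_fraction(4):.3f} of the NRZ eye")  # 0.333
print(f"PAM-4 SNR penalty: {snr_penalty_db(4):.2f} dB")          # 9.54
```

Real links add other PAM-4 impairments (level mismatch, nonlinearity), so this figure is a lower bound on the practical penalty rather than a complete budget.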

The increase in the PCIe 6.0 data rate also leads to a reduction in the channel loss budget, making it more difficult to maintain signal integrity over longer distances. The channel loss budget for PCIe 6.0 is 32 dB, compared to 36 dB for PCIe 5.0, forcing designers to be more diligent at managing signal loss in their systems. Board design complexity increases due to limited rise time and transition amplitudes, requiring complex equalization and clock recovery mechanisms. Much like the PCIe 5.0 server architecture shown above, the Aries 6 Retimer aids in overcoming vital channel loss hurdles, serving as a powerful ally in dealing with board design complexities.

Aries Retimers – Extending PCIe 6.0 Reach

Figure 5 shows PCIe 6.0 signal reach for a traditional add-in-card topology with and without a Retimer. Without a Retimer, in PCIe 6.0 topologies, system board reach is limited to 4.3 inches for low-loss PCB material and 7.1 inches for ultra-low-loss PCB materials [6]. With a Retimer, this reach is more than doubled to 9.9 inches and 15.1 inches for a low-loss and ultra-low-loss board material respectively.

Figure 5: Adding a Retimer to the system increases signal reach, enabling additional design form-factors
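As a rough planning aid, the reach figures quoted above can be turned into a lookup that flags when a trace likely needs a Retimer. This is a hypothetical sketch: `retimer_plan` is an illustrative helper that treats the quoted numbers as hard limits for an add-in-card topology, whereas a real design must be confirmed by channel simulation.

```python
# PCIe 6.0 system-board reach (inches) for an add-in-card topology,
# as quoted in the text from reference [6].
REACH_IN = {
    "low-loss":       {"no_retimer": 4.3, "retimer": 9.9},
    "ultra-low-loss": {"no_retimer": 7.1, "retimer": 15.1},
}

def retimer_plan(trace_inches: float, material: str) -> str:
    """Hypothetical helper: compare a trace length against the quoted reach limits."""
    reach = REACH_IN[material]
    if trace_inches <= reach["no_retimer"]:
        return "fits without a Retimer"
    if trace_inches <= reach["retimer"]:
        return "needs a Retimer"
    return "exceeds single-Retimer reach"

for material, reach in REACH_IN.items():
    gain = reach["retimer"] / reach["no_retimer"]
    print(f"{material}: {gain:.2f}x reach with a Retimer")  # 2.30x and 2.13x

print(retimer_plan(6.0, "low-loss"))        # needs a Retimer
print(retimer_plan(6.0, "ultra-low-loss"))  # fits without a Retimer
```

The ratios confirm the "more than doubled" claim for both board materials.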

Aries 6 Retimers are the gold standard solution for extending signal reach, offering the lowest-risk path to market. The third-generation Aries 6 represents the culmination of years of learnings from cloud-scale deployments with all major hyperscalers and AI platform providers. Aries 6 provides wide interoperability support, robust signal integrity, and SerDes and DSP optimized for demanding AI server channels. It also features enhanced diagnostics and telemetry via COSMOS (COnnectivity System Management and Optimization Software). Further enhancing its adaptability, Aries’ COSMOS and firmware eliminate the need for silicon re-spins, offering customizability for each platform. This makes Aries 6 the ideal choice for those seeking a reliable, efficient, and interoperable Retimer solution.

As next generation data centers disaggregate AI systems to support high power densities, connectivity becomes a critical bottleneck. While PCIe 6.0 technology offers double the bandwidth, it also introduces significant SI challenges. Overcoming these hurdles requires innovative solutions like Aries 6 Retimers and Aries Smart Cable Modules. As AI continues to evolve, ensuring efficient communication across distributed AI Accelerator baseboards will be crucial for unlocking its full potential. The future of data centers hinges on developing robust and scalable communication architectures that can keep pace with the ever-increasing demands of AI workloads.

References:

1. Lin, Y. Cheng, M. D. Andrade, L. Wosinska and J. Chen, “Disaggregated Data Centers: Challenges and Trade-offs,” in IEEE Communications Magazine, vol. 58, no. 2, pp. 20-26, February 2020, doi: 10.1109/MCOM.001.1900612.
2. https://www.asteralabs.com/smart-retimers/simulating-with-retimers-for-pcie-5-0/
3. https://www.electronicdesign.com/blogs/the-briefing/article/21278725/electronic-design-can-silicon-supply-enough-power-for-the-future-of-ai-silicon
4. https://www.visualcapitalist.com/cp/charted-history-exponential-growth-in-ai-computation
5. Intel’s AN 835: PAM4 Signaling Fundamentals
6. C. Morrison, “Design Considerations for PCIe 6.0 Retimers,” session T3S11, presented at PCI-SIG Conference.

The Generative AI Impact: Accelerating the Need for Intelligent Connectivity Solutions

May 30, 2023 by Lori Zielinski

We have entered the Age of Artificial Intelligence, and Generative AI is developing at a rapid pace and becoming integral to our lives. According to Bank of America analysts, “just as the iPhone led to an explosion in the use of smartphones and phone apps, ChatGPT-like technology is revolutionizing AI”.[1] Generative AI is changing every aspect of our lives, including education, healthcare, entertainment, customer service, legal analysis, software development, engineering, manufacturing, and more.

A key Generative AI performance breakthrough is the transformer: a machine learning architecture that better understands context. Transformers power generative models, from large language models (LLMs) such as OpenAI ChatGPT, Microsoft Bing/Copilot, and Google Bard to image generators such as OpenAI DALL-E, enabling human-like text interaction, artistic images, and expert-level personal assistants: capabilities that seemed futuristic just months ago.

The race to deploy and build on these tools is predicted to drive robust demand for processors, memory, and connectivity chips that are used for training and running these AI models. This represents a significant opportunity for savvy tech companies and investors. According to a May 2023 report by Next Move Strategy Consulting, the global AI chip market is expected to increase from $28.83B in 2022 to $304.9B in 2030 and grow at a CAGR of 29.0%.

Source: Next Move Strategy Consulting

AI Compute Outpacing Moore’s Law

To achieve strong performance, general-purpose LLMs require extensive pretraining on large datasets. For example, ChatGPT was reportedly trained on 570GB of text, a process that took 34 days even when running on 10,000 GPUs.[2] [3] As models become multi-modal (able to process text, imagery, and audio), training data sizes for general LLMs may increase further.

The computational power required to train large models on big datasets is rapidly outpacing Moore’s Law. While single-GPU performance is increasing by about 2x every 18-24 months (Moore’s Law plus architectural improvements), training dataset size and AI model size are growing by 10x and 35x, respectively, over the same time frame.[4] If training needs to be completed in a reasonable number of days, the power of dozens or even thousands of GPUs may be required.

Source: Sevilla, Jaime, et al. “Compute trends across three eras of machine learning.”
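To see how quickly the gap compounds, the sketch below assumes that training compute scales with model size times dataset size (the common C ≈ 6·N·D approximation for dense transformers) and that wall-clock training time is held fixed. Under those assumptions, and ignoring parallel-scaling inefficiencies, the growth rates above imply the following growth in GPU count per 18-24 month period; the helper name is illustrative.

```python
def gpu_count_growth(model_growth: float, data_growth: float, gpu_perf_growth: float) -> float:
    """Factor by which the GPU count must grow to keep training time constant."""
    # Assumption: training compute ~ parameters x training tokens (C ~ 6*N*D).
    compute_growth = model_growth * data_growth
    # Per-GPU speedup absorbs part of the growth; more GPUs must cover the rest.
    return compute_growth / gpu_perf_growth

# 35x model size, 10x dataset, 2x per-GPU performance per 18-24 months (from the text)
print(gpu_count_growth(35, 10, 2))  # 175.0
```

In practice, communication overhead means scaling is sub-linear, so the real GPU count (and interconnect load) grows even faster than this idealized estimate.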

Fortunately, transformer training lends itself to being parallelized as it can run on multiple GPUs. Using this parallel processing technique allows Generative AI models like ChatGPT to learn a great deal about our world in a matter of weeks.

Disaggregated parallel compute architectures, however, still require processing nodes to share vast amounts of data with each other in real time. This creates bottlenecks at the interconnects between processing and memory components. These connectivity bottlenecks become more challenging as hardware scales and the physical distance between processing, memory, and switching modules grows.

Overcoming Connectivity Bottlenecks

Astera Labs delivers semiconductor-based connectivity solutions purpose-built to unleash the full potential of intelligent data infrastructure at cloud-scale. To date, we’ve developed three class-defining, first-to-market product lines that deliver critical connectivity for high-value artificial intelligence and machine learning applications.

Our connectivity solutions enable disaggregated, heterogeneous Generative AI architectures that reduce cost and power compared to optical solutions. Our Aries Smart Retimers extend the reach for PCIe NICs, GPUs, FPGAs, and NVMe drives within servers. Our Leo Memory Connectivity Platform enables CXL-attached memory expansion, pooling, and sharing for cloud servers. And, our Taurus Smart Cable Modules are optimal for Ethernet connectivity for switch-to-switch and switch-to-server topologies. We’ve also included support for fleet management capabilities and health monitoring diagnostics to predict failures before they happen and to increase multi-tenant AI system performance and uptime, ease re-provisioning, and lower total cost of ownership (TCO).

Learn more about our intelligent connectivity solutions:

Aries PCIe®/CXL™ Smart Retimers enable generative AI applications by improving signal integrity, extending reach up to 3x and supporting high bandwidth communication between GPUs or accelerators required to manage the complex datasets involved in training the large language models.
Leo CXL Memory Connectivity Platform eliminates memory bandwidth and capacity bottlenecks by enabling up to 2TB more memory per CPU. Leo is optimized for meeting complex computational needs of generative AI workloads at low latency.
Taurus Ethernet Smart Cable Modules™ remove rack-level Ethernet bottlenecks by overcoming reach, signal integrity, and bandwidth utilization issues in 100G/Lane Ethernet Switch-to-Switch and Switch-to-Server applications. By enabling thinner cables and supporting gearboxing and breakouts, these modules are optimized for high-density, high-throughput AI/ML rack configurations.

Accelerating Time-to-Market

At Astera Labs, we understand that supporting plug-and-play interoperability is critical for deploying these new architectures at scale. In our industry-first Cloud-Scale Interop Lab, we rigorously test our portfolio for standards compliance and system-level interoperation with all major hosts, endpoints, and memory modules – so our customers can deploy with confidence and accelerate time-to-market.

Conclusion

The Age of AI is here, and Astera Labs’ intelligent connectivity solutions for data infrastructure can help you seize the booming Generative AI opportunity. Contact us to learn more.

References

[1] Artificial intelligence like ChatGPT is on the brink of an ‘iPhone moment’ thanks to ‘warp-speed’ development, Bank of America says, Fortune Magazine, Prakash, March 1, 2023

[2] ChatGPT and generative AI are booming, but the costs can be extraordinary, CNBC, Vanian/Leswing, March 13, 2023

[3] Update: ChatGPT runs 10k Nvidia training GPUs with potential for thousands more, FierceElectronics, Hamblin, February 11, 2023

[4] Sevilla, Jaime, et al. “Compute trends across three eras of machine learning.” 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022.

Cloud-Scale Infrastructure Fleet Management Made Easy with Aries Smart Retimers

February 28, 2023 by AsteraAdmin

Data centers today have a lot of servers, and within each server there is an abundance of storage, specialized accelerators, and networking/communications infrastructure. These represent tens of thousands of interconnected systems, and with the rise of hyperscalers and cloud service providers, the scale of data infrastructure is only expected to grow in the years to come.

To get the most performance and uptime out of their data centers, Astera Labs’ customers are deploying its Aries Smart Retimers, which support both PCIe® 4.0 and 5.0, and effectively remove data center bottlenecks by extending physical reach by up to 3x with <10ns latency. Aries is also ideal for Compute Express Link™ (CXL™) applications where latency performance is even more critical.

These customers must also deploy robust fleet management capabilities to optimize data center performance and total infrastructure uptime – ensuring that all servers are running at peak level while predicting potential points of failure.

Going beyond channel reach extension, Astera Labs’ Aries Smart Retimers deliver deep diagnostics, enabling a powerful array of link health monitoring tools for data center server fleet management. The Aries Software Development Kit (SDK), deployed on the baseboard management controller (BMC) enables large-scale monitoring and resource optimization, allowing customers to gain detailed analytics from thousands of datapoints on how their links are performing in real-time.

Real-time monitoring for resource optimization, predictive failure, and more

With Aries PCIe Smart Retimers deployed on servers, storage systems, accelerator trays, and other equipment, BMC applications can make use of real-time link health monitoring to impact resource allocation decisions during provisioning. As you can see in the example below, using data gathered in real-time by the Aries SDK, customers can monitor link health and predict failure, enabling them to make key decisions toward maximizing their PCIe bandwidth.
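The provisioning idea can be sketched as a simple classifier run on the BMC. Everything in this example is hypothetical: the `LinkSample` fields, the error threshold, and the health categories are illustrative and are not the Aries SDK API. It only shows how link telemetry might gate which resources are offered to customer workloads.

```python
from dataclasses import dataclass

@dataclass
class LinkSample:
    # Hypothetical telemetry record; field names are illustrative only
    # and are not taken from the Aries SDK.
    name: str
    max_speed_gtps: float
    cur_speed_gtps: float
    max_width: int
    cur_width: int
    correctable_errors_per_hour: float

def classify(link: LinkSample, error_threshold: float = 10.0) -> str:
    """Return 'healthy', 'degraded', or 'watch' for provisioning decisions."""
    if link.cur_speed_gtps < link.max_speed_gtps or link.cur_width < link.max_width:
        return "degraded"  # trained below capability: bandwidth is lost
    if link.correctable_errors_per_hour > error_threshold:
        return "watch"     # a rising error rate may precede a failure
    return "healthy"

fleet = [
    LinkSample("gpu0", 32.0, 32.0, 16, 16, 0.0),
    LinkSample("gpu1", 32.0, 16.0, 16, 16, 0.0),   # downtrained to 16 GT/s
    LinkSample("nic0", 32.0, 32.0, 16, 16, 42.0),  # elevated error count
]
assignable = [link.name for link in fleet if classify(link) == "healthy"]
print(assignable)  # ['gpu0']
```

Resources classified as "degraded" or "watch" would be held back for maintenance instead of being assigned to tenants.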

The Aries SDK also enables critical capabilities beyond fleet management. Customers can use the same software for automated validation, requiring less customer time on the bench. We’ll cover this topic in more detail in a future blog.

Learn more – download the Fleet Management Made Easy white paper today

Any component placed in a critical PCIe 4.0/5.0 data path must have robust performance which is monitorable and recoverable. When operating a large data center with tens of thousands of servers, storage boxes, and GPU/accelerator trays, it is imperative to know which resources are in a healthy state—and can therefore be dynamically assigned to customer workloads—and which resources require maintenance. This capability is critical to infrastructure efficiency and is impactful to the make-or-break Total Cost of Ownership (TCO) calculation in data center operations.

Data collection is key to solving this resource optimization challenge. Request the Fleet Management Made Easy white paper to learn how the Aries SDK plays a key role in monitoring the critical PCIe infrastructure.

PCI Express® 5.0 Architecture Channel Insertion Loss Budget

June 11, 2020 by AsteraAdmin

The upgrade from PCIe® 4.0 to PCIe 5.0 doubles the bandwidth from 16GT/s to 32GT/s, but the higher signaling frequency also suffers greater attenuation per unit distance, even though the PCIe 5.0 specification increases the total insertion loss budget to 36dB. After deducting the loss budget for the CPU package, AIC, and CEM connector, a mere 16dB remains for the system board. Within that remaining budget, engineers need to reserve a safety margin for board loss variations due to temperature and humidity.

By Liang Liu, Systems and Applications Engineer, Astera Labs – PCI-SIG® Member

Written by Astera Labs for PCI-SIG

PCI Express® (PCIe®) technology is the most important high-speed serial bus in servers. Due to its high bandwidth and low latency, PCI Express architecture is widely used in various server interconnect scenarios, such as:

Within a Server: CPU to GPU, CPU to Network Interface Card, CPU to Accelerator, CPU to SSD
Within a Rack: CPU to JBOG and JBOF through board-to-board connector or cable
Emerging GPUs-to-GPUs or Accelerators-to-Accelerators interconnects

At the same time, with the rapid development of heterogeneous computing, data throughput requirements in server systems keep rising. Two years after the release of the PCIe 4.0 specification, the PCIe 5.0 specification was officially released in May 2019. PCIe 5.0 technology retains the 128b/130b encoding scheme, while the data rate increases from 16 GT/s to 32 GT/s. In keeping with tradition, the PCIe 5.0 specification is backward compatible with earlier, lower-speed PCIe generations.
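Because 128b/130b encoding spends 2 of every 130 bits on the sync header, the usable per-lane rate is slightly below the raw transfer rate. The sketch below computes it for PCIe 4.0 and 5.0; it ignores packet and protocol overhead (TLP headers, DLLPs, flow control), so real throughput is lower still. The function name is illustrative.

```python
def effective_gbps_per_lane(rate_gtps: float, payload_bits: int = 128, block_bits: int = 130) -> float:
    """Usable bit rate per lane, per direction, after 128b/130b line encoding."""
    return rate_gtps * payload_bits / block_bits

for gen, rate in (("PCIe 4.0", 16), ("PCIe 5.0", 32)):
    lane_gbps = effective_gbps_per_lane(rate)
    x16_gbytes = lane_gbps * 16 / 8  # 16 lanes, 8 bits per byte
    print(f"{gen}: {lane_gbps:.2f} Gb/s per lane, ~{x16_gbytes:.1f} GB/s per direction (x16)")
```

This yields about 15.75 Gb/s per lane (~31.5 GB/s for x16) at 16 GT/s and about 31.51 Gb/s per lane (~63.0 GB/s for x16) at 32 GT/s.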

The Challenge of PCIe 5.0 Technology Design

Other standards operating above 30 GT/s usually adopt PAM-4 modulation, which makes the signal’s Nyquist frequency one-quarter of the data rate at the cost of a 9.5 dB loss in signal-to-noise ratio (SNR). PCIe 5.0 architecture, however, continues to use non-return-to-zero (NRZ) signaling, so the Nyquist frequency of the signal is one-half of the data rate, or 16 GHz. The higher the frequency, the greater the attenuation, so the signal attenuation caused by channel insertion loss (IL) is the biggest challenge in PCIe 5.0 system design.

The PCIe 5.0 specification sets the bump-to-bump IL budget at 36 dB for 32 GT/s, and the bit error rate (BER) must be less than 10⁻¹². To address the high signal attenuation, the PCIe 5.0 specification defines a reference receiver whose continuous-time linear equalizer (CTLE) model includes an A_DC (adjustable DC gain) as low as -15 dB, whereas the 16 GT/s reference receiver goes only as low as -12 dB. The reference decision feedback equalizer (DFE) model includes three taps for 32 GT/s but only two taps for 16 GT/s.
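To put a BER of 10⁻¹² in perspective, the sketch below estimates the mean time between bit errors for a link running exactly at that limit. It is a back-of-the-envelope model that assumes independent errors at a constant rate, counted on raw line bits; the function name is illustrative.

```python
def seconds_per_error(ber: float, rate_gtps: float, lanes: int = 1) -> float:
    """Mean time between bit errors for a link running exactly at the BER limit."""
    bits_per_second = rate_gtps * 1e9 * lanes
    return 1 / (ber * bits_per_second)

print(f"{seconds_per_error(1e-12, 32):.2f} s on one lane")            # ~31.25 s
print(f"{seconds_per_error(1e-12, 32, lanes=16):.2f} s across x16")  # ~1.95 s
```

An error roughly every two seconds on a x16 link at the spec limit is why link-level retry and the burst-error protections discussed next matter.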

In addition, the probability of errors on the serial link rises as the data rate reaches 32 GT/s. Because the DFE circuit plays a significant role in the receiver’s overall equalization, burst errors are more likely than at 16 GT/s, since one wrong DFE decision is fed back into the decisions on the following bits. To counteract this risk, PCIe 5.0 architecture introduces precoding in the protocol. With precoding enabled at the transmitter and decoding at the receiver, the chance of burst errors is greatly reduced, enhancing the robustness of the 32 GT/s PCIe 5.0 Link.
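The effect of precoding can be illustrated with a generic 1/(1+D) differential precoder, sketched below. This is an illustration of the principle under that assumption, not a bit-exact model of the PCIe 5.0 precoding rules, which also define which symbols are precoded and how the feature is negotiated. A burst that inverts a run of consecutive received bits decodes to only two bit errors, one at each edge of the burst.

```python
from typing import List

def precode(bits: List[int]) -> List[int]:
    """1/(1+D) precoder: each output bit is the data bit XOR the previous output bit."""
    out, prev = [], 0
    for b in bits:
        prev = b ^ prev
        out.append(prev)
    return out

def decode(bits: List[int]) -> List[int]:
    """(1+D) decoder: each data bit is the received bit XOR the previous received bit."""
    out, prev = [], 0
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return out

def flip_burst(bits: List[int], start: int, length: int) -> List[int]:
    """Model a DFE error burst by inverting a run of consecutive bits."""
    return [b ^ 1 if start <= i < start + length else b for i, b in enumerate(bits)]

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
rx = flip_burst(precode(data), start=3, length=5)
errors = sum(d != r for d, r in zip(data, decode(rx)))
print(errors)  # 2: only the burst edges survive decoding
```

Without precoding, the same five-bit burst would corrupt five data bits; with it, the error count no longer depends on the burst length.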

PCIe 5.0 Technology Channel Insertion Loss Budget

Table 1 uses a typical system base board plus add-in card (AIC) application as an example to list the insertion loss budget for PCIe 4.0 architecture (16 GT/s) and PCIe 5.0 architecture (32 GT/s), with PCIe 3.0 (8 GT/s) included for reference. At 32 GT/s, after deducting 9 dB for the CPU package, 9.5 dB for the AIC, and 1.5 dB for the CEM connector, only 16 dB remains for the system base board.

Table 1: 16 GT/s and 32 GT/s Channel Insertion Loss Budget
PCIe® Revision	Total Channel Insertion Loss Budget	Root Package	CEM Connector	Add-in Card (AIC)	Remaining Budget for System Base Board
3.0 (8 GT/s)	22 dB	3.5 dB	1.7 dB	6.5 dB	10.3 dB
4.0 (16 GT/s)	28 dB	5.0 dB	1.5 dB	8.0 dB	13.5 dB
5.0 (32 GT/s)	36 dB	9.0 dB	1.5 dB	9.5 dB	16.0 dB
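The last column of Table 1 is simply the total budget minus the three component allocations. A short sketch that reproduces it from the table values (the dictionary layout and function name are illustrative):

```python
# Values from Table 1, in dB: (total budget, root package, CEM connector, add-in card)
BUDGETS_DB = {
    "3.0 (8 GT/s)": (22.0, 3.5, 1.7, 6.5),
    "4.0 (16 GT/s)": (28.0, 5.0, 1.5, 8.0),
    "5.0 (32 GT/s)": (36.0, 9.0, 1.5, 9.5),
}

def base_board_budget_db(total: float, package: float, connector: float, aic: float) -> float:
    """Insertion loss left for the system base board after the fixed allocations."""
    return total - package - connector - aic

for rev, parts in BUDGETS_DB.items():
    print(f"PCIe {rev}: {base_board_budget_db(*parts):.1f} dB for the system base board")
```

This prints 10.3 dB, 13.5 dB, and 16.0 dB, matching the table.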

However, when looking at the 16-dB system base board budget, engineers need to consider the following factors:

As the PCB temperature rises, the IL of the PCB trace becomes higher
Process fluctuation during PCB manufacturing can result in slightly narrower or wider line widths, which can lead to fluctuations in IL
The amplitude of the Nyquist-frequency signal (a 16 GHz sine wave in the case of 32 GT/s NRZ signaling) at the source is 800 mV pk-pk, which drops to about 12.7 mV after 36 dB of attenuation. This underscores the need to leave some IL margin for the receiver to account for reflections, crosstalk, and power supply noise, all of which can degrade the SNR.

Figure 1: Insertion Loss Budget for System Base Board
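The 800 mV to 12.7 mV figure follows from converting decibels to a voltage ratio, 10^(-dB/20). A minimal sketch that assumes pure insertion-loss attenuation with no other impairments (the function name is illustrative):

```python
def attenuate_mv(amplitude_mv: float, loss_db: float) -> float:
    """Amplitude after insertion loss; a voltage ratio of 10**(-dB/20)."""
    return amplitude_mv * 10 ** (-loss_db / 20)

print(f"{attenuate_mv(800, 36):.1f} mV pk-pk")  # 12.7 mV pk-pk
```

Each additional 6 dB of loss roughly halves the remaining amplitude, which is why a few dB of margin makes a visible difference at the receiver.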

Thus, the IL budget reserved for the PCB trace on the system base board should be 16 dB minus some margin reserved for the above factors. Many hardware engineers and system designers tend to leave 10-20% of the overall channel IL budget as margin for such factors. In the case of a 36-dB budget, this amounts to roughly 4-7 dB.
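The margin rule of thumb can be applied directly to the 16 dB base-board budget. The sketch below assumes, as in the text, that the margin is a fraction of the full 36 dB channel budget, and further assumes that the margin comes entirely out of the base-board allocation; the function name is illustrative.

```python
def trace_budget_db(total_db: float, base_board_db: float, margin_frac: float):
    """Return (margin reserved, budget left for PCB traces), both in dB."""
    margin_db = total_db * margin_frac
    return margin_db, base_board_db - margin_db

for frac in (0.10, 0.20):
    margin, trace = trace_budget_db(36.0, 16.0, frac)
    print(f"{frac:.0%} margin: {margin:.1f} dB reserved, {trace:.1f} dB left for PCB traces")
```

With a 10-20% margin, only about 8.8-12.4 dB of the 16 dB remains for the base-board traces themselves.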

As the demand for artificial intelligence and machine learning increases, PCIe 5.0 technology will enable more and more system topologies. The move from PCIe 4.0 to PCIe 5.0 architecture raises the channel IL budget from 28 dB to 36 dB, but the higher data rate also brings new design challenges. By leveraging advanced PCB materials and/or PCIe 5.0 Retimers to ensure sufficient end-to-end design margin, system designers can achieve a smooth upgrade to PCIe 5.0 architecture.

Reference

AN 835: PAM4 Signaling Fundamentals https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an835.pdf