400G/800G Ethernet FAQ
- At 50Gbps/lane, passive direct-attach copper (DAC) cables barely reach 3 meters. At 100Gbps/lane, DACs may have a practical reach limit of only 2 meters.
- The switch PCB consumes too much of the channel budget, which then limits the cable reach and increases cable gauge.
- DACs are rigid, heavy, and bulky, restricting airflow for system cooling and making rack servicing difficult.
- Optical modules have high power consumption. A 400G module consumes around 12W and an 800G module may consume up to 20W.
- Optical modules require advanced low loss materials which are expensive.
- Optical modules have a shorter lifespan and are less reliable compared to active copper cables. Data center operators need to constantly maintain and replace the failed optical modules.
- Active Optical Cables (AOC) can be used for rate conversion and to achieve thin wire profile. However, such optical designs incur additional costs, reliability concerns, and require more power.
- Active Copper Cables (ACC) can be used for rate conversion, have a lower design cost when compared to AOC while also supporting even thinner gauge cabling as compared to passive DACs. General purpose ACCs are limited by their lack of diagnostics and security features.
- Smart Electrical Cables (SEC) that utilize Taurus Smart Cable Modules have all the benefits of an ACC with the added “smarts” required by Cloud Service Providers.
- Switch-to-server: ToR Switches to Network Interface Cards (NIC) interconnects on a server.
- Switch-to-switch: within a spine switch and spine switch to Exit Leaf interconnects.
Taurus Smart Cable Modules can provide gearbox functionality at 200GbE from 4x50G to 8x25G.
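As a rough illustration of what a gearbox has to preserve, the sketch below checks that both sides of a lane conversion carry the same aggregate rate. The function names are hypothetical and are not part of any Taurus API; the lane counts come from the 4x50G to 8x25G example above.

```python
def aggregate_gbps(lanes: int, per_lane_gbps: float) -> float:
    """Total throughput of one side of the gearbox."""
    return lanes * per_lane_gbps

def gearbox_config_is_valid(side_a, side_b) -> bool:
    """A gearbox changes lane count and per-lane rate, not total bandwidth."""
    return aggregate_gbps(*side_a) == aggregate_gbps(*side_b)

fast_lane_side = (4, 50)   # 4 lanes at 50G
slow_lane_side = (8, 25)   # 8 lanes at 25G
valid = gearbox_config_is_valid(fast_lane_side, slow_lane_side)
```

Both sides add up to 200G, so `valid` is true; pairing 4x50G with 4x25G would fail the check because half the bandwidth would be stranded.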
- Rate mismatches between NIC and switch lead to wasted switch bandwidth.
- Traditional DAC interconnects are too short, thick and bulky to handle high speed ethernet signals between ToR switches and multiple racks.
Smart Electrical Cables (SEC) support longer reach and thinner cabling while adding security and diagnostic capability.
A Taurus Smart Cable Module with gearbox capability can be used on the NIC to resolve the per-lane rate disparity and reduce the end-to-end channel loss, thereby increasing the cable reach and/or reducing cable gauge.
In an average 3m 34 AWG copper wire, the typical channel loss is about 28dB at 12.9GHz, but might be as high as 36dB during worst cases.
Taurus Smart Cable Modules’ advanced fleet management capabilities include Full CMIS Features, Security, and Extensive Diagnostics (Cable Degradation Monitoring, Host-Cable Security, Multiple Loopback Modes, and Pattern Generation/Checking)
Taurus offers various firmware and setting updates to adapt to diverse system topologies, including firmware flexibility, in-field upgrade support, health monitoring and debug, and CMIS extension.
A user can update module management functions, adaption algorithms, and full-module firmware even after the cable is deployed to the switch system.
We offer complete CMIS Firmware update procedures in the product datasheet.
- NRZ is a modulation technique that uses two voltage levels to represent logic 0 and logic 1. PAM4 uses four voltage levels to represent the four combinations of two bits: 11, 10, 01, and 00.
- PAM4 has the advantages of halving the Nyquist frequency and doubling the throughput for the same Baud rate. This alleviates the need for designers to have to invent infrastructure like silicon and cables that would go up to 50GHz bandwidth.
- The SNR loss of a PAM4 signal compared to an NRZ signal is ~9.5 dB.
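The figures in this list can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the penalty is modeled simply as the ratio of eye amplitudes, assuming a PAM4 eye is one-third the height of an NRZ eye for the same peak-to-peak swing.

```python
import math

def nyquist_ghz(data_rate_gbps: float, bits_per_symbol: int) -> float:
    """Nyquist frequency is half the symbol (Baud) rate."""
    baud_rate = data_rate_gbps / bits_per_symbol
    return baud_rate / 2

def pam_snr_penalty_db(levels: int) -> float:
    """Eye-amplitude penalty relative to NRZ for the same swing."""
    # NRZ has one eye over the full swing; PAM-N stacks (levels - 1) eyes.
    return 20 * math.log10(levels - 1)

nrz_nyquist = nyquist_ghz(100, bits_per_symbol=1)    # 100G per lane with NRZ
pam4_nyquist = nyquist_ghz(100, bits_per_symbol=2)   # 100G per lane with PAM4
penalty = pam_snr_penalty_db(4)                      # roughly 9.5 dB
```

For a 100G lane, NRZ would need a 50 GHz Nyquist frequency while PAM4 needs 25 GHz, which is the trade described above.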
Smart Retimer FAQ
Astera Labs Aries PCIe Smart Retimers offer exceptional robustness, ease-of-use and a list of Fleet Management capabilities.
There are generally three ways to approach this:
- Channel Loss Budget Analysis
- Simulate the channel S-parameters in the Statistical Eye Analysis Simulator (SeaSim) tool to determine whether the post-equalized eye height (EH) and eye width (EW) meet the minimum eye opening requirements: ≥6 mV EH and ≥3.13 ps EW at Bit Error Ratio (BER) ≤ 10⁻⁶. Refer to PCIe Base Specification Section 8.5.1.
- Consider your cost threshold for system upgrades
A redriver amplifies a signal, whereas a Retimer retransmits a fresh copy of the signal.
For PCIe 6.x, 36 dB at 16 GHz pre-channel and 36 dB at 16 GHz post-channel. Based on the PCIe Base Specification, the maximum total insertion loss with one Retimer from Root Complex to End Point is therefore 72 dB at 16 GHz, die to die.
There is no need to fine tune a Retimer EQ setting as it participates in Link Equalization with Root Complex and End Points and automatically fine tunes the receiver EQ.
A maximum of two Retimers can be cascaded in a link, as defined in the PCIe specification.
There are no “special” considerations. During Equalization Phase 2, the Retimer’s upstream pseudo port (USPP) and the Endpoint will simultaneously train their receivers, and they have a total 64 ms at 64 GT/s speed (32 ms at lower speeds) to complete their Phase 2 training. During Equalization Phase 3, the same will happen with the downstream pseudo port (DSPP) and the root complex, and likewise a total of 64 ms at 64 GT/s speed (32 ms at lower speeds) is provided to complete Phase 3 training. The timeouts are the same regardless of whether a Retimer is present or not.
Not quite. Each port of a packet switch has a full PCIe protocol stack:
Physical Layer, Data Link Layer, and Transaction Layer.
A packet switch has one upstream port and at least one downstream port.
A Retimer, by contrast, has an upstream-facing Physical Layer and a downstream-facing Physical Layer but no Data Link or Transaction Layer.
As such, a Retimer’s ports are considered pseudo ports. Because a Retimer does not have, nor does it need, these higher-logic layers, the latency through a Retimer is much smaller than the latency through a packet switch.
The only notable differences are:
- As with all PCIe 6.x transmitters, the Retimer’s transmitters must support 64 GT/s precoding when requested by the link partner.
- As with all PCIe 6.x receivers, the Retimer’s receivers must support Lane Margining in both time and voltage.
A Retimer is required to have the same link width on its upstream-facing port and on its downstream-facing port. In other words, the link widths must match. A Retimer must also support down-configured link widths, but the width must always be the same on both ports.
Redrivers are not defined or specified within the PCIe Base Specification, so there are no formal guidelines for using a Redriver versus using a Retimer.
A Retimer’s transmitters and receivers, on both pseudo ports, must meet the PCIe Base Specifications. This means that a Retimer can support the full channel budget (nominally 36 dB at 16 GHz) on both sides — before and after the Retimer. Calculating the insertion loss (IL) budget should be done separately for each side of the Retimer, and channel compliance should be performed for each side as well, just as you would do for a Retimer-less Root-Complex-to-Endpoint link.
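A minimal sketch of the per-segment budget calculation described above. The 36 dB figure is the nominal budget quoted in the text; the class name, the loss contributors, and every loss value are hypothetical placeholders, not measured data.

```python
from dataclasses import dataclass

BUDGET_DB = 36.0  # nominal per-segment budget at 16 GHz, from the text above

@dataclass
class Segment:
    name: str
    losses_db: dict  # contributor -> insertion loss at 16 GHz, in dB

    def total_db(self) -> float:
        return sum(self.losses_db.values())

    def margin_db(self, budget_db: float = BUDGET_DB) -> float:
        return budget_db - self.total_db()

# Hypothetical loss numbers, for illustration only.
pre = Segment("Root Complex to Retimer",
              {"rc_package": 8.0, "board_trace": 14.0, "connector": 1.5})
post = Segment("Retimer to Endpoint",
               {"riser_trace": 10.0, "connector": 1.5, "add_in_card": 9.5})

# Each side is budgeted on its own, exactly as for a Retimer-less link.
results = {seg.name: seg.margin_db() for seg in (pre, post)}
link_ok = all(margin >= 0 for margin in results.values())
```

Keeping the two segments as separate objects mirrors the guidance that compliance is evaluated independently on each side of the Retimer rather than on the end-to-end sum.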
Redrivers and Retimers are active components which impact the data stream: their package imposes signal attenuation, their active circuits apply boost, and (in the case of Retimers) clock and data recovery. As such, there is no way to truly disable these components and still have data pass through. When disabled, no data will pass through a Redriver or Retimer.
- Determine if a Retimer is needed based on different PCB materials
- Define a simulation space, and identify worst-case conditions (temperature, humidity, impedance, etc.), minimum set of parameters (e.g., Transmitter Presets)
- Define the evaluation criteria, such as minimum eye height/width
- Execute and analyze results
Bit error rate (BER) is the ultimate gauge of link performance, but an accurate measure of BER is not possible in relatively short, multi-million-bit simulations.
Instead, this analysis suggests the following pass/fail criteria, which consist of two rules:
- Criterion 1: A link must meet the receiver’s eye height (EH) and eye width (EW) requirements.
- Criterion 2: A link must meet Criterion 1 for at least half of the Tx Preset settings (≥5 out of 10).
- Criterion 1 establishes that there is a viable set of settings that results in the desired BER. The specific EH and EW required by the receiver are implementation-dependent.
- Criterion 2 ensures that the link has adequate margin and is not overly sensitive to the Tx Preset setting.
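The two rules above could be applied to a table of simulation results along these lines. This is a sketch under stated assumptions: the EH/EW thresholds and the per-preset eye measurements are invented for illustration, since the real requirements depend on the receiver implementation.

```python
# Hypothetical receiver requirements; real values are implementation-dependent.
MIN_EH_MV = 15.0
MIN_EW_UI = 0.30

def preset_passes(eh_mv: float, ew_ui: float) -> bool:
    """Rule 1: the eye must meet both height and width requirements."""
    return eh_mv >= MIN_EH_MV and ew_ui >= MIN_EW_UI

def link_passes(results: dict):
    """Rule 2: at least half of the simulated Tx Presets must satisfy rule 1."""
    passing = [p for p, (eh, ew) in results.items() if preset_passes(eh, ew)]
    return 2 * len(passing) >= len(results), passing

# Made-up (eye height in mV, eye width in UI) per Tx Preset.
sim = {
    "P0": (10.0, 0.25), "P1": (12.0, 0.28), "P2": (14.0, 0.29),
    "P3": (13.0, 0.31), "P4": (18.0, 0.33), "P5": (20.0, 0.35),
    "P6": (19.0, 0.34), "P7": (22.0, 0.36), "P8": (21.0, 0.32),
    "P9": (16.0, 0.30),
}
ok, passing = link_passes(sim)
```

With these example numbers six of the ten presets pass, so the link would be judged acceptable under both rules.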
Use IBIS model and time domain simulations.
PCIe/CXL Smart Cable Modules FAQ
There are two primary applications for Aries Smart Cable Modules.
The first is to enable higher bandwidth and lower latency GPU-to-GPU and GPU-to-Switch multi-rack connectivity for larger AI clusters. As larger clusters of GPUs are deployed to address the increasing bandwidth and memory demands of AI workloads, AI infrastructure must scale GPU clusters across racks, because a server rack can only accommodate a certain number of GPUs due to power and thermal management constraints. Aries SCMs extend high-bandwidth PCIe 5.0 and CXL signal reach at 128 GB/s up to 7 meters to enable larger GPU clusters in a multi-rack architecture. Aries SCMs also improve cable routing, serviceability, and airflow with thin copper cables to maintain existing rack power and thermal density.
The second is to enable extended CXL reach for low-latency memory fabric connectivity in high-capacity in-memory compute architectures. Hyperscalers are deploying CXL memory expansion and pooling solutions to achieve higher application performance and the distances between the processor and expanded memory resources are increasing. Aries SCMs extend high-bandwidth PCIe 5.0 and CXL signal reach at 128 GB/s up to 7 meters to enable low-latency memory fabrics for scalable cloud infrastructure.
General-purpose AECs lack advanced cable and fleet management features essential for managing data center infrastructure, while AECs with Aries SCMs offer system-wide visibility and management features through COSMOS that enable enhanced security, quick debug, and flexible firmware upgrades.
There are two key differences between Ethernet AECs and PCIe AECs:
- Protocol complexity: PCIe’s backwards compatibility and link training requirements make AECs more complex for PCIe than for Ethernet.
- Interoperability: The variety of device types and ecosystem players is significantly greater for PCIe than for Ethernet.
Aries SCMs extend high-bandwidth PCIe 5.0 signal reach at 128 GB/s up to 7 meters to enable larger GPU clusters in a multi-rack architecture and low-latency memory fabrics for scalable cloud infrastructure.
Aries SCMs are an offering within Astera Labs’ COSMOS suite, which enables system baseboard/system management controllers (BMCs/SMCs) to use an array of customizable diagnostics and telemetry features for continuous monitoring of critical server-to-JBOG, JBOG-to-JBOG, and Switch-to-JBOG links. Parameters such as eye opening, equalization levels, junction temperature, and more are monitored, and interrupts to the host can be enabled whenever configurable limits are crossed. A full set of self-test features (host-side and line-side loopback, pseudo-random bit sequence (PRBS) generation and checking, etc.) enables rapid troubleshooting to minimize link downtime and accelerate fault isolation.
Aries Smart Cable Modules support multiple form factors and cable configurations for diverse AI topologies.
Aries Smart Cable Modules can support a variety of gauges up to 7 meters.
“RAS” is the ability of the system to provide resilience starting from the underlying hardware all the way to the application software through three components collectively referred to as “RAS” features:
- Reliability: the ability of the system to detect and correct faults
- Availability: how the system guarantees uninterrupted operation with minimal degradation
- Serviceability: the ability of the system to proactively diagnose, repair, upgrade or replace components at scale
PCIe® FAQ
- Within a Server: CPU to GPU, CPU to Network Interface Card (NIC), CPU to Accelerator, CPU to SSD
- Within a Rack: CPU to JBOG and JBOF through board-to-board connector or cable
- Emerging GPUs-to-GPUs or Accelerators-to-Accelerators interconnects
As the demand for artificial intelligence and machine learning grows, new system topologies based on PCIe 5.0 technology will be needed to deliver the required increases to data performance.
While the transition from PCIe 4.0 architecture to PCIe 5.0 architecture increases the channel insertion loss (IL) budget from 28 dB to 36 dB, there will be new design challenges around the higher losses at higher data rates. In the case of other standards greater than 30 GT/s, the PAM-4 modulation method is usually used to make the signal’s Nyquist frequency one-quarter of the data rate, at the cost of 9.5 dB signal-to-noise ratio (SNR).
However, PCIe 5.0 continues to use the non-return-to-zero (NRZ) signaling scheme, thus the Nyquist frequency of the signal is one-half of the data rate, which is 16 GHz. The higher the frequency, the greater the attenuation. The signal attenuation caused by the channel IL is the biggest challenge of PCIe 5.0 system design.
- CTLE & DFE: PCIe 5.0 specifies the bump-to-bump IL budget as 36 dB for 32 GT/s, and the bit error rate (BER) must be less than 10⁻¹². To address the high signal attenuation, the PCIe 5.0 standard defines the reference receiver such that the continuous-time linear equalizer (CTLE) model includes an ADC (adjustable DC gain) as low as -15 dB, whereas the reference receiver for 16 GT/s reaches only -12 dB. The reference decision feedback equalizer (DFE) model includes three taps for 32 GT/s and only two taps for 16 GT/s.
- Precoding: Due to the significant role the DFE circuit plays in the receiver’s overall equalization, burst errors are more likely to occur at 32 GT/s than at 16 GT/s. To counteract this risk, PCIe 5.0 introduces Precoding in the protocol. With precoding enabled at the transmitter side and decoding at the receiver side, the chance of burst errors is greatly reduced, thereby enhancing the robustness of the PCIe 5.0 32 GT/s Link.
16 dB, but the channel imperfections caused by vias, stubs, AC coupling capacitors and pads, and trace variation further reduce this budget.
By leveraging advanced PCB materials and/or PCIe 5.0 Retimers to ensure sufficient end-to-end design margin, system designers can ensure a smooth upgrade to PCIe 5.0 architecture.
PCIe 6.0 will adopt PAM4 signaling instead of NRZ used in previous generations to achieve 64GT/s. However, it will remain fully backwards compatible with PCIe 1.0 through PCIe 5.0. Please see our industry news sections for more resources on PCIe 6.0.
The main independent variable in PCIe Link simulations is Transmitter Preset—pre-defined combinations of pre-shoot and de-emphasis, and 10 such Presets are defined in the PCIe specification.
- As the PCB temperature rises, the insertion loss (IL) of the PCB trace becomes higher
- Process fluctuation during PCB manufacturing can result in slightly narrower or wider line widths, which can lead to fluctuations in IL
- The amplitude of the Nyquist frequency signal (16-GHz sine wave in the case of 32 GT/s NRZ signaling) at the source side is 800 mV pk-pk, which will reduce to about 12.7 mV after 36 dB of attenuation. This underscores the need to leave some IL margin for the receiver to account for reflections, crosstalk, and power supply noise that all potentially will degrade the SNR.
Thus, the IL budget reserved for the PCB trace on the system base board should be 16 dB minus some amount of margin, which is reserved for the above factors. Many hardware engineers and system designers tend to leave 10-20% of the overall channel IL budget as margin for such factors. In the case of a 36-dB budget, this amounts to 4-7 dB.
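The numbers quoted above follow from the standard decibel relation for voltage. The short sketch below reproduces them; the function names are hypothetical, and the 10-20% margin range is simply the rule of thumb mentioned in the text.

```python
def attenuated_mv(v_in_mv: float, loss_db: float) -> float:
    """Voltage amplitude remaining after loss_db of insertion loss."""
    return v_in_mv * 10 ** (-loss_db / 20)

def margin_range_db(budget_db: float, low: float = 0.10, high: float = 0.20):
    """Margin reserved when holding back low..high of the total IL budget."""
    return budget_db * low, budget_db * high

v_rx = attenuated_mv(800.0, 36.0)                # roughly 12.7 mV at the receiver
margin_low, margin_high = margin_range_db(36.0)  # 3.6 dB and 7.2 dB before rounding
```

Rounded to whole decibels, the margin range matches the 4-7 dB quoted for a 36 dB budget.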
In an add-in-card topology, only a 16 dB system board budget remains. After adding safety margin for board loss variations due to temperature and humidity, this is equivalent to a trace length of ~8 inches, even when upgrading to an expensive ultra-low-loss PCB material. However, the reach requirements can easily exceed ~8 inches in complex topologies.
PCIe 5.0 architecture, like PCIe 4.0 and 3.0 architectures, supports two clock architectures:
- Common REFCLK (CC): The same 100-MHz reference clock source is distributed to all components in the PCIe link — Root Complex, Retimer, and Endpoint. Due to REFCLK distribution via PCB routing, fanout buffers, cables, etc., the phase of the REFCLK will be different for all components.
- Independent REFCLK (IR): The Root Complex and End Point use independent reference clocks, and the Tx and Rx must meet more stringent specifications when operating in IR mode than under CC mode. The PCIe Base Specification does not specify the properties of independent reference clocks.
Burst errors are not reported any differently than regular correctable/uncorrectable errors. In fact, burst errors may cause silent data corruption, meaning multiple bits in error can lead to an undetected error event. Therefore, it is incumbent on system designers and PCIe component providers to consciously enable precoding if there is a concern or risk of burst errors in a system.
PCI-SIG does not publish official or “standard” channel models; however, the Electrical Workgroup (EWG) does post example channel models. For PCIe 5.0 specification, the reference package models are posted here: https://members.pcisig.com/wg/PCIe-Electrical/document/folder/885.
You can also find example pad-to-pad channel models shared by a few member companies during the specification development by searching *.s24p in the following folder https://members.pcisig.com/wg/PCIe-Electrical/document.
PCI-SIG defines the specifications, but not a tool for the purpose of interoperability testing. ASIC vendors and OEMs/ODMs generally provide/have these tools, for the purpose of testing and stressing the PCIe link, to make sure there are no interoperability issues.
There are multiple connector types and form factors in development, which are targeting PCIe 5.0 signal speeds, including: M.2, U.2, U.3, mezzanine connectors, and more.
There is no industry-standard definition of mid-loss, low-loss, and ultra-low-loss. It is good practice to start from the loss budget analysis to select which type of PCB material is needed for the system. Megtron-6 or other types of PCB material with similar performance as that of Megtron-6 are commonly used in PCIe 5.0 server systems where the distance from Root Complex pin to CEM connector exceeds 10″.
Test methodology is similar to that of CEM 4.0. See details from the PCIe 5.0 PHY Test Spec v0.5.
No, there is no difference.
At this moment, these are not specified in the PCIe 5.0 PHY Test Spec v0.5.
The Lane Margin Test (LMT) is defined in PCIe 5.0 PHY Test Spec v0.5, and RX Lane Margining in time and voltage is required for all PCIe 5.0 receivers. However, according to the test specification, LMT checks whether the add-in card under test implements the lane margining capability. The margin values reported are not checked against any pre-defined pass/fail criteria.
33 GHz for the PCIe 5.0 TX test. See more from PCIe 5.0 PHY Test Spec v0.5.
Passing TX compliance and RX BER test does not guarantee system-level interoperability. It is advisable to perform separate tests to exercise the LTSSM, as well as application-specific tests, such as hot unplug/hot plug, to demonstrate system-level robustness.
The enabling/disabling of Precoding is negotiated during link training. Whether Precoding is needed or not is largely dependent on the specific receiver implementation. As an example, receivers that rely heavily on DFE tap-1 may choose to request Precoding during link training. So, each receiver will make its own determination, based on the receiver architecture, as to whether it should request Precoding or not. Precoding is defined in the PCIe 5.0 specification but not in the PCIe 4.0 specification.
The PCIe 5.0 specification introduces selectable Precoding. Precoding breaks an error burst into two errors: an entry error and an exit error. However, a random single-bit error would also be converted to two errors, and therefore a net 1E-12 BER with precoding disabled would effectively become 2E-12 BER with precoding enabled.
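The entry/exit behavior can be seen with a toy model. The sketch below assumes a simple binary differential precoder (each transmitted bit is the data bit XORed with the previous transmitted bit) and models a burst as a run of consecutive flipped bits on the wire. It is a simplified illustration, not the exact encoding rules of the specification.

```python
def precode(bits):
    """Transmitter side: p[n] = d[n] XOR p[n-1], with p[-1] = 0."""
    out, prev = [], 0
    for d in bits:
        prev ^= d
        out.append(prev)
    return out

def decode(symbols):
    """Receiver side: d[n] = r[n] XOR r[n-1], with r[-1] = 0."""
    out, prev = [], 0
    for r in symbols:
        out.append(r ^ prev)
        prev = r
    return out

def count_errors(data, flipped_positions):
    """Flip the given wire positions and count errors after decoding."""
    tx = precode(data)
    rx = [b ^ 1 if i in flipped_positions else b for i, b in enumerate(tx)]
    return sum(a != b for a, b in zip(data, decode(rx)))

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
burst_errors = count_errors(data, {4, 5, 6, 7, 8})   # 5-bit burst on the wire
single_error = count_errors(data, {4})               # isolated bit error
```

In this model both the five-bit burst and the isolated flip decode to exactly two data errors, which illustrates why precoding helps against bursts but doubles the effective rate of random single-bit errors.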
PAM4 stands for Pulse Amplitude Modulation with 4 levels. It is a type of signaling that carries 2 bits (00, 01, 10, or 11) at a time instead of the 1 bit (0 or 1) used in previous PCIe generations.
The largest challenge will be handling higher error rates. To address this, the PCIe 6.0 standard will also begin to implement Forward Error Correction (FEC).
CXL® FAQ
CXL is needed to overcome CPU-memory and memory-storage bottlenecks faced by computer architects. Future data centers need heterogeneous compute, new memory and storage hierarchy, and an agnostic interconnect to tie it all together. CXL maintains memory coherency between the processor memory space and memory on attached devices to enable pooling and sharing of resources to provide higher performance, reduce software stack complexity, and lower overall system cost.
Traditional DRAM and persistent storage class memory (SCM) are supported, allowing for flexibility between performance and cost.
Compute Express Link (CXL) is an open industry standard interconnect offering high-bandwidth, low-latency connectivity between the host processor and devices including accelerators, memory expansion, and smart I/O devices. CXL utilizes the PCIe® 5.0 physical layer infrastructure and the PCIe alternate protocol to address the demanding needs of high-performance computational workloads in Artificial Intelligence, Machine Learning, communication systems, and HPC through the enablement of coherency and memory semantics across heterogeneous processing and memory systems.
The CXL protocol supports three different types of devices:
- Type 1 Caching Devices / Accelerators
- Type 2 Accelerators with Memory
- Type 3 Memory Buffer
- Memory tiering in which additional capacity is applied with a variable mix of lower-latency direct-attached memory and higher-latency large capacity memory
- Higher VM density per system by having more memory capacity attached
- Large databases can use a caching layer provided by SCM to improve the performance
- CXL.io is used for initialization, link-up, device discovery and enumeration, and register access. It provides a non-coherent load/store interface for I/O devices similar to PCIe® 5.0.
- CXL.cache defines interactions between a Host and Device, which allows CXL devices to cache host memory with low latency.
- CXL.mem provides a Host processor with direct access to Device-attached memory using load/store commands.
CXL runs on PCIe® 5.0 electrical signals. CXL runs on PCIe PHY and supports x16, x8, and x4 link widths natively.
CXL 2.0 adds support for switching, persistent memory, and security as well as memory pooling support to maximize memory utilization, reducing or eliminating the need to over-provision memory.
In traditional servers, memory is directly connected to a specific CPU or GPU (i.e., locked behind the host) and can result in over-provisioning of memory resources when applications are not using the available memory. When the memory is over-provisioned to a specific host, the memory is now stranded and cannot be accessed by other hosts, thereby increasing data center costs. In addition, when memory is locked behind a host, the data being processed by the application needs to be copied through high latency interconnects if a different CPU or GPU needs access to the data.
Memory pooling allows multiple hosts in a heterogeneous topology to access a common memory address range with each host being assigned a non-overlapping address range from the “pool” of memory resources. Memory pooling allows system integrators to dynamically allocate memory from this pool, which reduces costs by reducing stranded memory and increasing memory utilization. Memory pooling is part of a growing trend for resource disaggregation or composability for heterogeneous solutions.
Memory sharing allows multiple hosts in a heterogeneous topology to access a common memory address range with each host being assigned the same address range as the other host. This improves memory utilization similar to memory pooling, but also provides an added benefit of data flow efficiency since multiple hosts can access the same data. With memory sharing, coherency needs to be managed between the hosts to ensure data is not overwritten by another host incorrectly.
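The difference between pooling and sharing comes down to how address ranges are handed out. The following sketch is purely conceptual: the function names, host names, and sizes are hypothetical, and it models only range assignment, not coherency or any real CXL fabric manager interface.

```python
def pool_allocate(pool_size_gb: int, requests_gb: dict) -> dict:
    """Pooling: carve non-overlapping [start, end) ranges out of one pool."""
    ranges, cursor = {}, 0
    for host, size in requests_gb.items():
        if cursor + size > pool_size_gb:
            raise ValueError(f"pool exhausted while serving {host}")
        ranges[host] = (cursor, cursor + size)
        cursor += size
    return ranges

def share_region(hosts, start_gb: int, size_gb: int) -> dict:
    """Sharing: every host maps the same [start, end) range."""
    return {host: (start_gb, start_gb + size_gb) for host in hosts}

def overlaps(a, b) -> bool:
    """True when two [start, end) ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

pooled = pool_allocate(1024, {"host_a": 256, "host_b": 512})
shared = share_region(["host_a", "host_b"], start_gb=768, size_gb=128)
```

In the pooled case the two hosts receive disjoint ranges, while in the shared case both hosts see the identical range, which is why sharing additionally requires coherency management between hosts.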
CXL 2.0 supports Integrity and Data Encryption (IDE) and key exchange protocols to provide end-to-end protection of data on the CXL link.
“RAS” is the ability of the system to provide resilience starting from the underlying hardware all the way to the application software through three components collectively referred to as “RAS” features:
- Reliability: the ability of the system to detect and correct faults
- Availability: how the system guarantees uninterrupted operation with minimal degradation
- Serviceability: the ability of the system to proactively diagnose, repair, upgrade or replace components at scale
The CXL 3.0 specification doubles the bandwidth to 64 GT/s while enabling additional usage models beyond the CXL 2.0 specification through introduction of advanced fabric capabilities such as the following:
- Global Fabric Attached Memory (GFAM)
- Enhanced link level integrity and data encryption (CXL IDE) for 256B flits
- Improved resource utilization for composable disaggregated infrastructure through multi-level switching, multi-headed devices, and multiple type1/type2 devices per root port
Smart Memory Controllers FAQ
CXL is needed to overcome CPU-memory and memory-storage bottlenecks faced by computer architects. Future data centers need heterogeneous compute, new memory and storage hierarchy, and an agnostic interconnect to tie it all together. CXL maintains memory coherency between the processor memory space and memory on attached devices to enable pooling and sharing of resources to provide higher performance, reduce software stack complexity, and lower overall system cost.
Traditional DRAM and persistent storage class memory (SCM) are supported, allowing for flexibility between performance and cost.
Compute Express Link (CXL) is an open industry standard interconnect offering high-bandwidth, low-latency connectivity between the host processor and devices including accelerators, memory expansion, and smart I/O devices. CXL utilizes the PCIe® 5.0 physical layer infrastructure and the PCIe alternate protocol to address the demanding needs of high-performance computational workloads in Artificial Intelligence, Machine Learning, communication systems, and HPC through the enablement of coherency and memory semantics across heterogeneous processing and memory systems.
The CXL protocol supports three different types of devices:
- Type 1 Caching Devices / Accelerators
- Type 2 Accelerators with Memory
- Type 3 Memory Buffer
- Memory tiering in which additional capacity is applied with a variable mix of lower-latency direct-attached memory and higher-latency large capacity memory
- Higher VM density per system by having more memory capacity attached
- Large databases can use a caching layer provided by SCM to improve the performance
- CXL.io is used for initialization, link-up, device discovery and enumeration, and register access. It provides a non-coherent load/store interface for I/O devices similar to PCIe® 5.0.
- CXL.cache defines interactions between a Host and Device, which allows CXL devices to cache host memory with low latency.
- CXL.mem provides a Host processor with direct access to Device-attached memory using load/store commands.
CXL runs on PCIe® 5.0 electrical signals. CXL runs on PCIe PHY and supports x16, x8, and x4 link widths natively.
CXL 2.0 adds support for switching, persistent memory, and security as well as memory pooling support to maximize memory utilization, reducing or eliminating the need to over-provision memory.
In traditional servers, memory is directly connected to a specific CPU or GPU (i.e., locked behind the host) and can result in over-provisioning of memory resources when applications are not using the available memory. When the memory is over-provisioned to a specific host, the memory is now stranded and cannot be accessed by other hosts, thereby increasing data center costs. In addition, when memory is locked behind a host, the data being processed by the application needs to be copied through high latency interconnects if a different CPU or GPU needs access to the data.
Memory pooling allows multiple hosts in a heterogeneous topology to access a common memory address range with each host being assigned a non-overlapping address range from the “pool” of memory resources. Memory pooling allows system integrators to dynamically allocate memory from this pool, which reduces costs by reducing stranded memory and increasing memory utilization. Memory pooling is part of a growing trend for resource disaggregation or composability for heterogeneous solutions.
Memory sharing allows multiple hosts in a heterogeneous topology to access a common memory address range with each host being assigned the same address range as the other host. This improves memory utilization similar to memory pooling, but also provides an added benefit of data flow efficiency since multiple hosts can access the same data. With memory sharing, coherency needs to be managed between the hosts to ensure data is not overwritten by another host incorrectly.
CXL 2.0 supports Integrity and Data Encryption (IDE) and key exchange protocols to provide end-to-end protection of data on the CXL link.
“RAS” is the ability of the system to provide resilience starting from the underlying hardware all the way to the application software through three components collectively referred to as “RAS” features:
- Reliability: the ability of the system to detect and correct faults
- Availability: how the system guarantees uninterrupted operation with minimal degradation
- Serviceability: the ability of the system to proactively diagnose, repair, upgrade or replace components at scale
Quality FAQ
- If you need to return potentially defective material, please contact Astera Labs’s Customer Service organization.
- The Quality team will run an evaluation based upon customer-generated diagnostic logs, production test results, and PCIe system testing, and will share the results using an 8D process.
All device qualification data, including FIT calculation, is included in the qualification summary document. Contact us or ask your Astera Labs Sales Manager for further information.
Our goal is to provide consumers with the highest quality products by assuring their performance, consistency and reliability.
Our team values are integral to who we are and how we operate as a company.
To ensure a consistent supply that meets our customers’ high-volume demands, Astera Labs implements multi-vendor and multi-site manufacturing. This approach gives us a strong business continuity/contingency plan in case of catastrophic events (e.g., earthquake, tsunami, flood, fire, etc.) to ensure or recover supply quickly.
Ordering FAQ
Customers can order directly from Astera Labs, or can order from one of our franchised partners, which currently include Mouser, EDOM, Eastronics, and Intron.
Purpose-built Retimer ICs, Riser Cards, Extender Cards, and Booster Cards for High-performance Server, Storage, Cloud, and Workload-Optimized Systems
Please review the Astera Labs Terms of Sale.