UALink™ 200G 1.0 Specification Overview


By: Anish Mathew, Arif Khan, Joe Chen, and Gautam Singampalli (Cadence)

 

Scaling up Using UALink

Current standards fall short when it comes to scaling AI, which is precisely where Ultra Accelerator Link™ (UALink™) proves to be a transformative solution. UALink is an open standard designed to enable effective accelerator communications and expansion within a pod. It offers significant performance headroom to handle intensive workloads, making it ideal for AI applications. The scale-up memory fabric ensures low latency and simple load/store semantics while supporting atomic operations and direct memory access (DMA). Additionally, UALink provides high bandwidth per unit of power and area, making it a highly efficient solution for scaling up AI infrastructure.

As AI workloads continue to grow, the need for both scale-up and scale-out strategies becomes critical. Scale-up involves enhancing the capacity and bandwidth at the rack level to handle more data and more complex models. This is where UALink excels, providing the necessary infrastructure to support increased computational demands.

Figure 1: Motivation behind scale-up

 

Key UALink Features

UALink offers a robust set of features designed to enhance performance and scalability. Using memory semantics with read, write, and atomic operations, UALink ensures efficient data handling. Cache coherency is maintained in software, and transactions range from 64 to 256 bytes. The simple source/destination-based routing can handle up to 1024 endpoints, with future releases potentially extending this to 4096 endpoints. UALink utilizes a vendor-defined, 57-bit physical address space; request addresses are ordered from source to destination, though completions are not.
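
As a rough illustration of these parameters, the sketch below models a request descriptor with the field ranges listed above. The struct layout, field names, and validation helper are assumptions for this sketch, not the UALink wire format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative request descriptor built from the feature list above.
 * Field names and layout are assumptions, not the UALink wire format. */
typedef enum { UA_OP_READ, UA_OP_WRITE, UA_OP_ATOMIC } ua_op_t;

typedef struct {
    ua_op_t  op;          /* read, write, or atomic memory operation      */
    uint64_t phys_addr;   /* vendor-defined 57-bit physical address       */
    uint16_t dest_id;     /* destination endpoint ID (up to 1024 in v1.0) */
    uint16_t size;        /* transaction size: 64 to 256 bytes            */
} ua_request_t;

/* Sanity checks derived from the parameters described above. */
static bool ua_request_valid(const ua_request_t *req)
{
    return req->phys_addr < (1ULL << 57) &&      /* fits the 57-bit space */
           req->dest_id   < 1024         &&      /* 1024 endpoints in v1.0 */
           req->size >= 64 && req->size <= 256;  /* 64- to 256-byte payloads */
}
```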

A system node in UALink consists of one or more hosts connected to one or more accelerators using specific interconnects such as PCIe®, CXL®, or CHI C2C. These nodes are typically managed under a single OS image. UALink is designed to provide a low-latency interconnect across system nodes, ensuring efficient communication and data transfer. However, it is important to note that a host CPU may not access memory attached to a remote system node, maintaining the integrity and performance of the local system.

Figure 2: UALink connectivity across system nodes

 

The Multi-Node Accelerator System (M Accelerators x N Ports) for UALink is designed to enhance scalability and performance in AI compute environments. A UALink station is defined as a group of four UALink lanes, each running at 200 Gb/s, for a maximum of 800 Gb/s of traffic per station in v1.0. This system allows multiple accelerators to connect through numerous ports, facilitating efficient data transfer and communication across nodes. By leveraging UALink’s low-latency interconnect, the system ensures optimal performance for intensive AI workloads, enabling seamless expansion and integration of accelerators within a network.
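
To sanity-check the arithmetic, the snippet below computes per-station and aggregate raw bandwidth for a hypothetical configuration of M accelerators with N ports each. The accelerator and port counts are example values, not figures from the specification.

```c
#include <stdio.h>

/* Bandwidth arithmetic for the v1.0 numbers quoted above:
 * one station = 4 lanes x 200 Gb/s = 800 Gb/s. */
#define LANES_PER_STATION 4
#define GBPS_PER_LANE     200

int main(void)
{
    unsigned accelerators = 8;   /* M: hypothetical example value            */
    unsigned ports_each   = 4;   /* N: one station per port, example value   */

    unsigned station_gbps = LANES_PER_STATION * GBPS_PER_LANE;   /* 800    */
    unsigned per_accel    = ports_each * station_gbps;           /* 3200   */
    unsigned aggregate    = accelerators * per_accel;            /* 25600  */

    printf("Per station:     %u Gb/s\n", station_gbps);
    printf("Per accelerator: %u Gb/s\n", per_accel);
    printf("Aggregate (raw): %u Gb/s\n", aggregate);
    return 0;
}
```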

Figure 3: Scalable multi-node accelerator system with UALink high-speed interconnect

 

Translation in the Partitioned Global Address Space

In the Partitioned Global Address Space, the Memory Management Unit (MMU) maintains page table entries (PTEs) to manage memory imports via OpenSHMEM or a custom shared memory library. These PTEs correspond to in-domain or cross-domain addresses, identified by an address translation (AT) bit. Address pointers are split into a 10-bit accelerator ID and an address handle, which are managed from the host over the Ethernet network. When importing memory, the accelerator creates a PTE entry for the destination handle and destination accelerator ID. Conversely, the exporting accelerator creates a PTE entry for the source accelerator ID and the address handle, ensuring efficient memory translation and access across the system.
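
As a rough sketch of the pointer split and PTE described above, the code below carves a 64-bit pointer into a 10-bit accelerator ID and an address handle, and defines a PTE-style entry carrying the AT bit. The handle width, bit positions, and field names are assumptions for illustration, not the specification's encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative PGAS pointer split: 10-bit accelerator ID + address handle.
 * The 54-bit handle width and bit positions are assumptions for this sketch. */
#define ACC_ID_BITS 10
#define HANDLE_BITS 54

static inline uint16_t pgas_acc_id(uint64_t ptr)
{
    /* Top 10 bits hold the accelerator ID in this sketch. */
    return (uint16_t)((ptr >> HANDLE_BITS) & ((1u << ACC_ID_BITS) - 1));
}

static inline uint64_t pgas_handle(uint64_t ptr)
{
    /* Remaining low bits hold the address handle. */
    return ptr & ((1ULL << HANDLE_BITS) - 1);
}

/* PTE-style entry as described above: the AT bit distinguishes
 * in-domain from cross-domain addresses. */
typedef struct {
    bool     at;       /* address translation bit: cross-domain if set       */
    uint16_t acc_id;   /* remote accelerator ID (import or export peer)      */
    uint64_t handle;   /* address handle exchanged from the host over Ethernet */
} pgas_pte_t;
```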

The diagram in Figure 4 shows how address translation and data transfer work between a source and destination accelerator:

  1. Source Accelerator: Starts with a virtual address, translated by the ACC MMU to a physical address.
  2. Network PA / Local Host SPA: Depending on whether the address is in-domain or cross-domain, the request uses either the Local Host SPA or a Network PA.
  3. Cross-Domain Request: The Link MMU handles cross-domain address translation.
  4. Destination Accelerator: Receives the translated address for data transfer.

 

This process ensures efficient communication and data handling between accelerators.

Figure 4: Address translation and data transfer in partitioned global address space
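
The sketch below walks the four steps above as a single control path, assuming hypothetical helper names (acc_mmu_translate, link_mmu_translate) and trivial stubs in place of the real MMU hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address types standing in for the steps in Figure 4. */
typedef uint64_t virt_addr_t;
typedef struct {
    bool     cross_domain;   /* set when the target lives in another domain  */
    uint64_t addr;           /* Network PA (cross-domain) or Local Host SPA  */
} phys_addr_t;

/* Step 1: accelerator MMU lookup. Trivial stub standing in for hardware;
 * the "top bit means cross-domain" rule is an assumption for this sketch. */
static phys_addr_t acc_mmu_translate(virt_addr_t va)
{
    phys_addr_t pa = { .cross_domain = (va >> 63) & 1, .addr = va };
    return pa;
}

/* Step 3: Link MMU cross-domain translation. Identity stub for the sketch. */
static uint64_t link_mmu_translate(uint64_t network_pa)
{
    return network_pa;
}

/* Steps 1-4 end to end: virtual address -> ACC MMU -> Network PA or Local
 * Host SPA -> Link MMU (cross-domain only) -> address used at destination. */
static uint64_t resolve_destination_address(virt_addr_t va)
{
    phys_addr_t pa = acc_mmu_translate(va);     /* step 1 */

    if (pa.cross_domain)                        /* step 2: Network PA path   */
        return link_mmu_translate(pa.addr);     /* step 3: cross-domain hop  */

    return pa.addr;                             /* step 2: Local Host SPA    */
}
```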

 

Addressing in UALink

Figure 5 shows how address translation and data transfer work between a source and destination accelerator. The source accelerator starts with a virtual address, which the Accelerator MMU translates into a Network Physical Address (NPA) or System Physical Address (SPA). The Link MMU then handles cross-domain requests, translating these addresses for the destination accelerator. This process ensures efficient communication and data handling across different system nodes.

Figure 5: UALink efficient address translation and data transfer across system accelerators

 

UALink Protocol Stack

UALink consists of a well-defined protocol stack, which includes the following layers:

  • Protocol Layer: UPLI (UALink Protocol Level Interface) is compatible with UALink switches and offers features such as read, write, and atomic memory operations. It supports split request/response messages to improve bandwidth utilization and uses identification tags per port to handle multiple outstanding requests.
  • Transaction Layer: The Transaction Layer (TL) manages credits and handles the packing and unpacking of TL flits. It connects two UPLI interfaces: one from a UPLI Originator and one from a UPLI Completer.
  • Data Link Layer: The Data Link Layer (DL) handles the transfer of data between the Transaction Layer (TL) and the Physical Layer (PL).
  • Physical Layer: The Physical (PHY) Layer includes the Physical Coding Sublayer (PCS) and is based on the IEEE P802.3dj draft standard, allowing UALink to leverage existing Ethernet technology while adapting it to its specific needs.

 

Each layer plays a crucial role in ensuring efficient communication and data transfer within the system. This layered approach allows for modularity and flexibility, making it easier to manage and optimize the protocol for various applications.

Figure 6: UALink protocol stack—a layered design for efficient system communication
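
For a compact view of the layering, the snippet below enumerates the four layers and their roles as described above. The type and function names are illustrative, not interfaces defined by the specification.

```c
#include <stdio.h>

/* Layer identifiers for the UALink stack described above. */
typedef enum {
    UA_LAYER_PROTOCOL,      /* UPLI: read/write/atomic, split request/response */
    UA_LAYER_TRANSACTION,   /* TL: credit management, flit pack/unpack         */
    UA_LAYER_DATA_LINK,     /* DL: transfers data between TL and PHY           */
    UA_LAYER_PHYSICAL,      /* PHY/PCS: based on the IEEE P802.3dj draft       */
    UA_LAYER_COUNT
} ua_layer_t;

static const char *ua_layer_name(ua_layer_t layer)
{
    switch (layer) {
    case UA_LAYER_PROTOCOL:    return "Protocol (UPLI)";
    case UA_LAYER_TRANSACTION: return "Transaction (TL)";
    case UA_LAYER_DATA_LINK:   return "Data Link (DL)";
    case UA_LAYER_PHYSICAL:    return "Physical (PHY/PCS)";
    default:                   return "unknown";
    }
}

int main(void)
{
    /* A transmit-side request traverses the layers top to bottom. */
    for (int l = UA_LAYER_PROTOCOL; l < UA_LAYER_COUNT; l++)
        printf("%d: %s\n", l, ua_layer_name((ua_layer_t)l));
    return 0;
}
```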

 

Efficient scaling of AI accelerators is crucial for achieving high performance and throughput. UALink, the de facto standard for AI accelerator interconnects, plays a pivotal role in enabling this scalability by providing low-latency, high-bandwidth communication between accelerators. As an active member of the UALink Consortium, Cadence offers fully verified UALink IP subsystems, including both controllers and silicon-proven PHYs. These subsystems are optimized for robust performance in both short- and long-reach applications, delivering industry-leading power, performance, and area (PPA).

Download the UALink 200G 1.0 Specification and the UALink 1.0 Specification White Paper to learn more.
