By: Josh Collier and Mark Davis (Intel Corporation)
Large Language Models (LLMs) have quickly become one of the most important workloads for modern AI platforms in the datacenter. The High-BW Domain (HBD) required to support emerging LLMs continues to grow, driven not just by first-token response time or by fitting the workload into high-bandwidth memory, but increasingly by tokens/s response time, which is under growing pressure from both agentic and reasoning use cases. Scaling the performance and capacity of these emerging AI workloads requires a significant investment in scale-up networking.
At Intel, we appreciate that UALink provides an open (without restriction), standards-based baseline and is purpose-built to support these emerging AI workloads at Rack Scale. A few of the more critical, differentiating features of UALink include:
- Memory Semantics Protocol
- Low Latency Switching
- Efficient and Scalable Bandwidth
- In-Cast and Congestion Tolerance
- Network Reliability
- Low Cost and Optimized TCO
- Collective Offloads
- Confidential Compute
- Open Standard and Ecosystem
This blog takes an in-depth look at each of these features and explores how UALink helps deliver them.
Memory Semantics Protocol
Memory semantic networks naturally align with highly programmable accelerator workloads, which already use memory semantics (load/store/atomics operations) to access multiple types of memory (accelerator HBM, CPU DDR, PCIe peer device MMIO, …).
- Buffers on remote accelerators can easily be mapped into the same accelerator virtual address space as buffers that map to other memory types.
- Accelerators already have on-die memory-semantic networks for accessing local memory, so a bridge to a scale-up memory-semantic network adds little complexity or cost.
Memory semantics achieves low latency communication between accelerator engines, which is critical for scaling workloads.
- Store or write operations can move data from an initiating accelerator’s engine directly to a remote accelerator’s cache or memory, where the remote accelerator’s engines can consume it with low latency.
- Load or Read operations can fetch data from a remote accelerator’s cache or memory and return the data to the initiating accelerator’s engine for immediate use.
- Atomic operations can modify a remote accelerator’s cache or memory, such as signaling a semaphore (illustrated in the sketch below).
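Conceptually, once a remote accelerator’s buffer is mapped into the local virtual address space, the operations above reduce to ordinary pointer and atomic operations. The C sketch below illustrates that idea under stated assumptions: the peer_buffer_t handle and the runtime mapping it implies are hypothetical, not part of the UALink specification.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle to a peer accelerator's buffer that a vendor runtime
 * has mapped into this accelerator's virtual address space. The mapping API
 * is an assumption, not defined by UALink. */
typedef struct {
    float            *data; /* remote HBM buffer, directly load/store addressable */
    _Atomic uint64_t *flag; /* remote semaphore used to signal completion         */
} peer_buffer_t;

/* Writes: push a tile of results directly into the peer's memory, then signal
 * completion with a remote atomic so the peer's engines can consume it. */
static void push_tile(peer_buffer_t *peer, const float *tile, size_t n)
{
    for (size_t i = 0; i < n; i++)
        peer->data[i] = tile[i];                      /* remote stores */
    atomic_fetch_add_explicit(peer->flag, 1,
                              memory_order_release);  /* remote atomic */
}

/* Reads: fetch data from the peer's memory and return it for immediate use. */
static float sum_remote_tile(const peer_buffer_t *peer, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += peer->data[i];                         /* remote loads  */
    return acc;
}
```

Because the same pointer arithmetic works whether a buffer lives in local HBM, CPU DDR, or a peer accelerator, no separate send/receive descriptor path is needed on the accelerator.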
Low Latency Switching
Network latency is an overhead when accelerators communicate and is additive to the overall collective latency. Collective operations often involve multiple steps or phases of communication, further amplifying the impact of the network latency. UALink allows for very low switch latencies by simplifying transaction processing and routing and by reducing the FEC burden. Taking it a step further, memory semantics enables low-latency communication protocols that use writes over the network, avoiding a full RTT (Round Trip Time) in the critical latency path of communication between accelerators.
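As a back-of-the-envelope illustration of both effects, the model below multiplies a per-phase latency by the number of collective phases and compares a write-based step (one network traversal) against a read or request/response step (a full RTT). All numbers are illustrative assumptions, not UALink specification values.

```c
#include <stdio.h>

/* Back-of-the-envelope model of how network latency stacks up across the
 * phases of a collective, and why write-based signaling beats a full RTT
 * in the critical path. All numbers are illustrative assumptions. */
int main(void)
{
    double switch_ns = 250.0;   /* assumed per-switch latency          */
    double wire_ns   = 50.0;    /* assumed per-hop cable/SerDes delay  */
    double accel_ns  = 300.0;   /* assumed accelerator processing time */
    int    phases    = 16;      /* e.g., steps in a ring collective    */

    double one_way    = wire_ns + switch_ns + wire_ns;  /* accel -> switch -> accel */
    double write_step = one_way + accel_ns;             /* half-RTT critical path   */
    double read_step  = 2.0 * one_way + accel_ns;       /* full-RTT critical path   */

    printf("write-based collective: %.1f us\n", phases * write_step / 1e3);
    printf("read-based collective:  %.1f us\n", phases * read_step  / 1e3);
    return 0;
}
```

Because the per-phase network term repeats once per phase, even a few hundred nanoseconds of extra switch or protocol latency shows up many times over in the end-to-end collective time.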
Efficient and Scalable Bandwidth
Accelerator scale-up bandwidth requirements vary based on the workload and the capabilities of the accelerator (Bytes:FLOP), but can be measured in TB/s. These high bandwidths are not achieved through a single interface, which would be problematic to scale. Instead, accelerators achieve high bandwidth through a combination of a highly efficient wire protocol optimized for memory semantics as provided by UALink, along with load balancing over many parallel interfaces, often referred to as multiple rails.
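The load-balancing side of this can be sketched very simply. The snippet below stripes a large transfer round-robin across parallel rails; the rail_send_fn hook and the rail count are hypothetical, and real implementations balance at the transaction level and manage ordering, which is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_RAILS 8   /* assumed number of parallel scale-up interfaces (rails) */

/* Hypothetical per-rail transmit hook; in a real stack this would hand a
 * chunk to one interface's transaction engine. */
typedef void (*rail_send_fn)(int rail, const uint8_t *chunk, size_t len);

/* Stripe one large buffer round-robin across all rails so the aggregate
 * bandwidth is roughly NUM_RAILS times that of a single interface. */
static void send_striped(const uint8_t *buf, size_t len, size_t chunk,
                         rail_send_fn send)
{
    size_t offset = 0;
    int    rail   = 0;
    while (offset < len) {
        size_t n = (len - offset < chunk) ? (len - offset) : chunk;
        send(rail, buf + offset, n);      /* each rail carries a share   */
        offset += n;
        rail = (rail + 1) % NUM_RAILS;    /* round-robin load balancing  */
    }
}
```

With this approach, aggregate scale-up bandwidth grows roughly linearly with the number of rails, which is how per-accelerator bandwidth reaches the TB/s range without a single impractically wide interface.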
In-Cast and Congestion Tolerance
AI workloads can scale to many accelerators. Workloads try to avoid in-cast and congestion by coordinating collective communications, such as using ring or tree algorithms, but some in-cast is often unavoidable. UALink provides credit-based flow control, which easily handles periods of in-cast and congestion, allowing the network to progress without adverse effects.
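The core of credit-based flow control fits in a few lines. The sketch below is a minimal illustration (field and function names are assumptions, not UALink definitions); its point is that a sender without credits simply waits rather than the fabric dropping traffic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of credit-based flow control: a transmitter may only send
 * when the receiver has advertised buffer credits, so in-cast and congestion
 * stall the sender instead of dropping packets. */
typedef struct {
    uint32_t credits;       /* flits the receiver currently has room for */
} tx_link_t;

static bool try_send_flit(tx_link_t *link)
{
    if (link->credits == 0)
        return false;       /* back-pressure: wait, never drop   */
    link->credits--;        /* consume one credit per flit sent  */
    /* ... transmit the flit ... */
    return true;
}

/* Called when the receiver frees buffer space and returns credits. */
static void on_credit_return(tx_link_t *link, uint32_t returned)
{
    link->credits += returned;
}
```

Credits are returned only when the receiver has actually freed buffer space, so in-cast pressure propagates back to the senders as back-pressure instead of packet loss.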
Network Reliability
Networks have many potential sources of errors, both persistent and transient. Persistent errors should have a low probability and require intervention to resolve, but transient errors can occur frequently, especially in large-scale or high-bandwidth networks, and require mitigation to avoid workload impact. Typical transient errors include uncorrectable FEC or CRC errors and other link errors, soft errors in data paths, and, in some networks, traffic drops under congestion due to a lack of buffering. UALink is specifically designed to be a reliable network by:
- Avoiding drops due to congestion by utilizing credit-based flow control.
- Avoiding large tail latencies for networks that only offer end-to-end reliability by utilizing LLR (Link Level Replay) to recover from uncorrectable FEC, CRC, or link errors in under a microsecond (sketched below).
- Avoiding data path errors by advocating for reliable data path designs utilizing ECC and other techniques.
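For illustration, a minimal sketch of the link-level replay idea: every transmitted flit is retained until acknowledged, and an uncorrectable FEC/CRC error triggers retransmission from the failed sequence number. The buffer depth, flit size, and ack/nak handshake here are assumptions, not the UALink wire protocol.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define REPLAY_DEPTH 64   /* assumed replay-buffer depth, in flits */
#define FLIT_BYTES   256  /* assumed flit size                     */

typedef struct {
    uint8_t  ring[REPLAY_DEPTH][FLIT_BYTES]; /* retained copies of sent flits    */
    uint32_t seq_head;                       /* next sequence number to transmit */
    uint32_t seq_acked;                      /* oldest unacknowledged sequence   */
} llr_tx_t;

/* Transmit a flit, keeping a copy until the receiver acknowledges it. */
static bool llr_send(llr_tx_t *tx, const uint8_t flit[FLIT_BYTES])
{
    if (tx->seq_head - tx->seq_acked >= REPLAY_DEPTH)
        return false;                                  /* replay buffer full */
    memcpy(tx->ring[tx->seq_head % REPLAY_DEPTH], flit, FLIT_BYTES);
    /* ... put the flit on the wire tagged with sequence tx->seq_head ... */
    tx->seq_head++;
    return true;
}

/* Receiver confirmed good CRC through 'seq': free the retained copies. */
static void llr_on_ack(llr_tx_t *tx, uint32_t seq)
{
    tx->seq_acked = seq + 1;
}

/* Receiver saw an uncorrectable FEC/CRC error at 'bad_seq': replay from there. */
static void llr_on_nak(llr_tx_t *tx, uint32_t bad_seq)
{
    for (uint32_t s = bad_seq; s != tx->seq_head; s++) {
        const uint8_t *copy = tx->ring[s % REPLAY_DEPTH];
        (void)copy; /* ... retransmit this copy with sequence s ... */
    }
}
```

Because recovery happens hop-by-hop at the link and completes in well under a microsecond, transient errors do not surface as end-to-end retries or the long tail latencies they cause.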
Low Cost and Optimized TCO (perf/$, perf/W)
The cost of the scale-up network directly impacts the TCO of the product and continues to grow as a percentage of the overall solution or Rack TCO. The obvious goal is to meet system functional and performance requirements with the lowest cost solution. The scale-up network incurs costs that include:
- Accelerator die IP and package area, shoreline, and power distribution.
- Switching ASICs, power distribution, boards, manageability/control plane overhead, and infrastructure/chassis.
- Channel costs include electrical connectors, cables/cable walls/midplanes, and potentially optical solutions.
While many of these costs are unavoidable with all scale-up fabrics, UALink achieves low cost and an optimized TCO by first reducing die area and die power on accelerators and switches by avoiding bulky and complex NIC logic. Additionally, UALink optimizes switching through simple and small routing tables, shallow buffering, and most significantly by vastly simplifying the management interface.
Collective Offloads
Offloading collectives to the scale-up network seeks to achieve a latency benefit for blocking communications or an offload benefit for non-blocking communications, where accelerator engines continue working on other important tasks. The desired benefit depends on the workload, factoring in the ability to overlap compute and communications as well as the latency sensitivity of the communications; however, for AI inference workloads, reducing latency is typically preferred. UALink 2.0 will offer a collective offload feature based on the memory semantics protocol.
Collective latency depends on many factors, including the collective type (reduce, all-reduce, broadcast, …), the size (KB, MB, GB), the scale (# of accelerators), and the latencies of both the network and accelerators. Some of the significant latency benefits for collective offload include:
- Reduced network latency since communications are between an accelerator and a switch instead of between accelerators, resulting in reduced RTT.
- Traditional P2P collective algorithms scale the number of accelerators using ring or dual binary tree algorithms, resulting in either linear or logarithmic latency based on the network RTT and accelerator overheads. Collective offload in a switch can internally implement a ring or tree reduction, and since the latency between member “ports” inside the switch can be measured in tens of nanoseconds, compared to a microsecond or more for P2P (Point-to-Point) communication, the overall latency is reduced.
- A typical All-Reduce using P2P communications requires both transmitting and receiving almost 2X the data, while an offloaded All-Reduce requires transmitting and receiving only 1X the data, which potentially yields a 2X effective increase in All-Reduce bandwidth and helps significantly reduce the latency of larger collective operations (see the model below).
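A simple data-movement model, using illustrative message size, scale, and per-accelerator bandwidth assumptions, shows where that roughly 2X effective bandwidth comes from.

```c
#include <stdio.h>

/* Data-movement model for an All-Reduce of M bytes across N accelerators.
 * All values are illustrative assumptions used only to show the ~2X effect
 * of offloading the reduction into the switch. */
int main(void)
{
    double M = 1e9;               /* 1 GB message                      */
    int    N = 64;                /* accelerators participating        */
    double link_GBps = 400.0;     /* assumed per-accelerator bandwidth */

    /* P2P ring All-Reduce: reduce-scatter + all-gather, each moving
     * (N-1)/N * M bytes in and out of every accelerator. */
    double p2p_bytes     = 2.0 * (N - 1) / N * M;

    /* Switch-offloaded All-Reduce: send M once toward the switch reduction,
     * receive the reduced result once. */
    double offload_bytes = 1.0 * M;

    printf("P2P ring:       %.2f ms on the wire per accelerator\n",
           p2p_bytes / (link_GBps * 1e9) * 1e3);
    printf("Switch offload: %.2f ms on the wire per accelerator\n",
           offload_bytes / (link_GBps * 1e9) * 1e3);
    return 0;
}
```

For large accelerator counts the ring term 2(N-1)/N approaches 2, so the offloaded All-Reduce moves about half the bytes per accelerator and, for bandwidth-bound sizes, takes roughly half the wire time.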
Confidential Compute
Many workloads are sensitive because they embody significant investments of intellectual property or handle confidential data. Running these workloads securely requires both confidentiality, which prevents the workload or data from being exposed, and integrity, which prevents tampering with the workload. UALink provides efficient security for a high-bandwidth memory semantic protocol, achieving both confidentiality and integrity while optimizing the security costs, including die area, power, complexity, performance and latency, and ease of manageability.
Open Standard and Vast Ecosystem
- UALink released the 1.0 specification in April 2025, focused on optimized memory semantics. The specification enables many IP, switch, and product vendors without adoption restrictions or IP mandates, facilitating an open standard.
- The UALink switching (and retimer) ecosystem is significant and continues to grow, allowing for the development of many standard and custom product offerings.
In summary, UALink scale-up networking plays an important role in our future datacenter AI product offerings at Intel, and we are thrilled to fully support this open, standards-based ecosystem. UALink 1.0 is just the beginning; we are hard at work with the consortium members on innovations, like collective offload, that may be part of UALink 2.0, the next evolution of the specification. We can’t wait to see how the industry collectively innovates around UALink scale-up networking to address the rapidly growing and constantly evolving portfolio of AI workloads, including latency-sensitive agentic and reasoning LLMs.

Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
All product plans and roadmaps are subject to change without notice.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.