UALink for Scale-Up and UEC for Scale-Out: An in-depth discussion with AMD


Experts from UALink board member AMD recently sat down with the UALink Press team to talk about the complementary aspects, differences, and benefits of UALink for scale-up and UEC for scale-out networks. Following is a peek behind the curtain of this dynamic and engaging discussion with Kurtis Bowman, Director of Architecture and Strategy, Aaron Grabein, Director of Commercial, AI PR, and Danny Moore, Director, AI Technology and Ecosystem Strategy, AMD.

 

UALink: Let’s start simply, can you describe for us the difference between scale-up and scale-out networks?

 

Kurtis: Scale-up involves many GPUs or accelerators. Think of big units and duplicates of those units, so there are multiple accelerators all grouped together; we call these groups pods. Within a pod, the accelerators are all connected and share information at extremely high speeds, which requires a lot of bandwidth. In the scale-out environment, the pods are clustered together, and those clusters need rapid, functional connections. UEC comes into play by supporting clusters of up to a million GPUs.

 

It is important to note at this point that UALink and UEC have an active working relationship, having established a Memorandum of Understanding, or MOU. This not only allows but encourages the organizations to talk with each other and work together to ensure that industry standards developed by the two will work cohesively. This level of collaboration is one of the many benefits that open industry standards provide, as shared by the group later in this post.

 

UALink: Can you provide a deeper dive into the benefits of UALink for scale-up, as you see them?

 

Aaron: Well, to start, UALink was designed and is optimized for a scale-up network, which means we have a very efficient protocol. What we are seeing at 200G is a 93% efficiency rate, with more bandwidth across links compared to previous solutions.

 

Danny: We are very focused on latency. And because the switch and output are simpler, end-to-end latency is 2-3x lower than a typical Ethernet solution.

 

Kurtis: Because of the simplicity, the silicon size is smaller. It’s actually 3x smaller than a typical Ethernet solution, which means it consumes less power. At 200G, power can be measured at 75-100 watts per GPU. That saved power can be taken as a financial benefit, or it can be applied back to the compute unit, leading to better compute performance.

 

UALink: I’m gathering that latency and efficiency are extremely important aspects of UALink and UEC in network environments.

 

Danny: Absolutely! Latency is more deterministic, with a narrower window, and we are seeing consistent performance improvements with UALink-based solutions.

 

Kurtis: Efficiency, whether per packet or per flit, is the key, and this is fundamentally different between UALink and Ethernet, but they both have a place in improving network environments. UEC or RoCE requires 82 bytes of overhead, while UALink needs roughly 16 bytes.

 

Aaron: If you compare payload against header, that is, the non-payload overhead, you’re getting much better efficiency with UALink, specifically for memory semantics, but this is a purely technical delta between the two options.
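To make the payload-versus-overhead comparison concrete, here is a minimal sketch that computes the fraction of transferred bytes that carry payload. The overhead figures (82 bytes for UEC or RoCE, about 16 bytes for UALink) come from the discussion above; the payload sizes, the function name `payload_efficiency`, and the simple one-header-per-transfer model are assumptions for illustration only. Real links pack data into packets or flits differently, so this is not expected to reproduce the efficiency figures quoted in the interview.

```python
# Overhead figures as quoted in the interview; the UALink value is approximate ("16-ish").
UEC_ROCE_OVERHEAD_BYTES = 82
UALINK_OVERHEAD_BYTES = 16


def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Return the fraction of transferred bytes that are payload.

    Simplified model (an assumption): each transfer carries one payload
    plus a fixed amount of non-payload overhead.
    """
    if payload_bytes <= 0 or overhead_bytes < 0:
        raise ValueError("payload must be positive and overhead non-negative")
    return payload_bytes / (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    # Hypothetical payload sizes, chosen only to show the trend.
    for payload in (64, 256, 4096):
        uec = payload_efficiency(payload, UEC_ROCE_OVERHEAD_BYTES)
        ual = payload_efficiency(payload, UALINK_OVERHEAD_BYTES)
        print(f"{payload:>5} B payload: UEC/RoCE {uec:.1%}, UALink {ual:.1%}")
```

Under this model the gap is widest for small transfers, which is consistent with the point that the benefit shows up specifically for memory semantics, where individual reads and writes tend to be small.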

 

UALink: Can you tell us a bit more about the benefits of UALink and UEC working together to tackle this scaling challenge?

 

Kurtis: Well, the big thing is that UALink envisions up to 1k devices in a pod. While we can take that up to 4k devices, UALink isn’t targeting the very large scale that UEC is; UEC is considering 1M devices in a cluster. Given these differences, the two organizations complement each other in supporting training models as they grow in parameter size, which could require tens of thousands of GPUs very quickly.
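As a rough illustration of how the two scales fit together, the sketch below counts how many scale-up pods a scale-out network would need to connect for a given total number of devices. It assumes "1k", "4k", and "1M" mean 1,000, 4,000, and 1,000,000 (they could equally mean powers of two), and the function name `pods_needed` is hypothetical.

```python
def pods_needed(total_devices: int, devices_per_pod: int) -> int:
    """Return the minimum number of pods required to hold total_devices."""
    if total_devices < 0 or devices_per_pod <= 0:
        raise ValueError("total_devices must be >= 0 and devices_per_pod > 0")
    # Integer ceiling division: a partially filled pod still counts as a pod.
    return -(-total_devices // devices_per_pod)


if __name__ == "__main__":
    # Assumed readings of the figures quoted in the interview.
    POD_SIZES = (1_000, 4_000)
    CLUSTER_SIZES = (50_000, 1_000_000)  # 50,000 is a hypothetical "tens of thousands" case

    for cluster in CLUSTER_SIZES:
        for pod in POD_SIZES:
            print(f"{cluster:>9,} devices at {pod:,} per pod: {pods_needed(cluster, pod):,} pods")
```

Even at the larger pod size, a cluster in the tens of thousands of devices spans multiple pods, which is where the scale-out network takes over.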

 

Aaron: Basically, you need the pods created using scale-up and clusters developed using scale-out to create the systems. So, the relationship between UALink and UEC is essential as it allows our groups to clearly define the ways we work together. There are simple things, like co-marketing efforts, but there are also deep technical discussions around efficiently bridging scale-up and scale-out environments.

 

Danny: There are also storage considerations. We need the ability to have these checkpoint conversations along the way so it can be developed efficiently. Having the interface between the two groups defined by an MOU makes all the difference in advancing the technologies.

 

UALink: Are you worried at all about overlap between the two organizations or proprietary solutions that could offset the work of UALink and UEC?

 

Kurtis: Not at all. Because we are both standards bodies with more than 200 industry-leading members across a variety of verticals, we are able to tackle a broader spectrum of challenges. And since much of our membership consists of the same companies all looking for compatible solutions that work for them, we are bringing together innovators from multiple technology spaces to meet the market’s needs. For example, if something doesn’t fit well in a scale-up solution, we can talk to UEC and see if they might be a better fit. By collaborating, we are allowing innovation to take off. And it also improves the flexibility of the technology to meet demand. With multiple companies providing input and standards bodies communicating with one another, we can ensure interoperability between standards and leading technology solutions.

 

Danny: I couldn’t agree more. I’ve been involved with industry standards for some time, and I really feel that they are the key to successful technologies. By bringing the best minds together to create and implement industry standards, we can ensure interoperability, as opposed to a one-off or proprietary solution environment.

 

Aaron: I think the meta point here is openness. The future of AI will not be built by one company. We believe strongly that having open engagement, standards, platforms, and technologies – from software to hardware – is for the betterment of everyone. Being shackled to a single solution option doesn’t work. Developers need flexibility to do what is best for their customers and end users. “Open” is the needed proxy to drive the industry forward into what it can and should do in the future.

 

Kurtis: That’s right. That’s why we do this.

 

Learn more about UALink technology and the benefits of open standards bodies in other blog posts.
