Heterogeneous Memory Architecture Gains hUMA

Company: AMD
Author: James Prior
Editor: Sean Ridgeley
Date: April 30th, 2013

With benefits for all

Today AMD announces details of an important new technology for the Heterogeneous System Architecture (HSA) platform specification: hUMA. hUMA stands for heterogeneous Uniform Memory Access, and refers to a key capability required for the true integration of CPU and GPU workloads: the ability for either device to use shared memory without copies or transfers.
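To make 'without copies or transfers' concrete, here is a minimal sketch in C. The gpu_* helpers are hypothetical stand-ins, stubbed with host code so the example compiles and runs; they are not an AMD API:

```c
#include <stdlib.h>
#include <string.h>

#define N (1024u * 1024u)

/* Hypothetical driver calls, stubbed with host code for illustration. */
static void *gpu_alloc(size_t size)                        { return malloc(size); }
static void  gpu_memcpy(void *d, const void *s, size_t n)  { memcpy(d, s, n); }
static void  gpu_kernel(float *data, size_t n)             /* stand-in for GPU work */
{
    for (size_t i = 0; i < n; i++) data[i] *= 2.0f;
}

/* Before hUMA: the GPU sees only its own buffer, so data round-trips. */
static void process_with_copies(float *host_data)
{
    float *dev = gpu_alloc(N * sizeof *dev);
    gpu_memcpy(dev, host_data, N * sizeof *dev);   /* copy in   */
    gpu_kernel(dev, N);                            /* GPU works */
    gpu_memcpy(host_data, dev, N * sizeof *dev);   /* copy out  */
    free(dev);
}

/* With hUMA: one address space, so the GPU takes the CPU's pointer. */
static void process_shared(float *host_data)
{
    gpu_kernel(host_data, N);   /* no copies, no transfers */
}

int main(void)
{
    float *data = calloc(N, sizeof *data);
    if (!data) return 1;
    process_with_copies(data);
    process_shared(data);
    free(data);
    return 0;
}
```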

Heterogeneous Memory Architecture

In a phone briefing with Phil Rogers, AMD Corporate Fellow, and Joe Macri, AMD Corporate Vice President and Product CTO, we heard about the latest innovation in the memory model. This was telegraphed by the Sony PlayStation 4 announcement, which revealed that the semi-custom AMD APU in the PS4 provides a unified memory space backed by 8GB of high-speed, graphics-type memory. Phil and Joe are the remaining two members of the original AMD braintrust that created HSA, which also included Eric Demers (now Qualcomm VP of Engineering) and the late Chuck Moore.

'Preserving investment in software and programming models is paramount'

- Phil Rogers, AMD Corporate Fellow

This innovation is driven by the need for heterogeneous computing, with a 'move the compute, not the data' mantra. Ten years ago, the memory innovation was 64-bit addressing, where AMD's Athlon 64 succeeded against various competitor designs at expanding the address space. AMD64 was successful because it was simple: engineers could adopt 64-bit easily, as AMD's solution was backwards compatible. Phil says AMD64 delivered on a paramount need for software developers, preserving their investment in then-current programming and design models.

'We cannot damage what we have already built'

- Joe Macri, AMD Corporate Vice President and Product CTO

The big explosion in compute performance is currently being delivered by GPUs, which now offer more than 10 times the throughput of CPUs (a slightly unfair comparison, as add-in board specification limits give GPUs a much higher power ceiling than CPUs). In current APUs, the x86 cores deliver around one-fifth the FLOPS of the GPU, which, while a long way from the headline 10x figure, is a large enough gap to warrant migrating workloads to the GPU where it makes sense. Performance per watt is key in every area of major systems design, and GPUs are superb at massively parallel operations.

Evolution of the HSA

Modern operating systems limit page-locked memory to half the installed addressable memory, a constraint we've seen with integrated and APU graphics for a while: you can't overcommit or even dynamically allocate memory to a GPU device. hUMA changes this, permitting a single pool of memory to be used by either device (CPU or GPU) and letting tasks move seamlessly between them.
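A rough sketch of what removing that limit means for allocation. The gpu_pin and gpu_run_paged helpers below are invented stand-ins for driver behavior, not a real API, stubbed so the example compiles and runs:

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins for driver behavior, stubbed for illustration. */
static size_t pin_limit = (size_t)2 << 30;      /* e.g. half of 4GB installed */

static int gpu_pin(void *p, size_t size)        /* old model: pin or fail     */
{
    (void)p;
    return size <= pin_limit ? 0 : -1;
}

static void gpu_run_paged(void *p, size_t size) /* hUMA: pageable is fine     */
{
    (void)p; (void)size;
    printf("GPU faults pages in on demand; no up-front pinning\n");
}

int main(void)
{
    size_t want = (size_t)3 << 30;              /* a 3GB working set          */
    void *data = malloc(want);
    if (!data) return 1;

    /* Old model: the OS caps page-locked memory, so a large working
     * set simply cannot be handed to the GPU in one piece. */
    if (gpu_pin(data, want) != 0)
        fprintf(stderr, "pin failed: working set exceeds page-lock limit\n");

    /* hUMA model: CPU and GPU share page tables, so the GPU can work
     * on the ordinary pageable allocation directly. */
    gpu_run_paged(data, want);

    free(data);
    return 0;
}
```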

This also extends to embedded DRAM (and, to a degree, discrete GPUs). hUMA will allow eDRAM to be mapped into the address space as conventional memory or as cache, and will work well either way. Discrete GPUs will be able to work directly with main system memory, but not with another discrete GPU's VRAM, and vice versa: the main system will be able to query dGPU memory through extended addressing, but without native access or control.

Platform Memory Architecture Differences

AMD's APU codenamed Kaveri is the first product we'll see from AMD with this feature; interestingly, Kabini doesn't have it. Kaveri will pair Steamroller cores with GCN graphics (Sea Islands family, not Volcanic Islands), and should be an extension of the FM2 platform currently in place for AMD's Trinity and Richland APUs.

Cache Coherency in the HSA Platform

Resource pooling is the paradigm that has been moving through enterprise computing for a while now, breaking down silos of storage, CPU, and RAM and making them shared and available to everyone. This is happening at every level, from datacenter consolidation (the cloud movement) to platform providers (virtualization, storage), and it is now hitting the compute platform itself, in the form of hUMA. The first applications to benefit from these advancements will be games.

'Programmers aren't engineers, they should be treated as artists'

- Joe Macri, AMD Corporate Vice President and Product CTO

Heterogeneous unified memory architecture platforms like those from AMD's semi-custom design team will be able to handle much larger textures, and new game engine designs that don't treat GPU render cycles as 'dark' time but instead spin up multiple concurrent, asynchronous command threads for the GPU. The end result is getting the hardware out of the programmer's way, enabling more capabilities and removing a bottleneck to creativity.
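As a minimal sketch of that engine pattern: instead of one thread blocking while the GPU renders, several CPU threads keep independent command queues fed. The gpu_queue type and gpu_queue_submit function are hypothetical stand-ins, stubbed with a print so the example runs; a real driver would enqueue work asynchronously:

```c
#include <pthread.h>
#include <stdio.h>

#define QUEUES 4

/* Hypothetical command-queue API, stubbed here for illustration. */
typedef struct { int id; } gpu_queue;

static void gpu_queue_submit(gpu_queue *q, int batch)
{
    /* A real driver would enqueue asynchronously; we just log. */
    printf("queue %d: submitted batch %d\n", q->id, batch);
}

/* Each worker records and submits its own command stream, so no CPU
 * thread sits idle ('dark') waiting for the GPU to finish a frame. */
static void *worker(void *arg)
{
    gpu_queue *q = arg;
    for (int batch = 0; batch < 3; batch++)
        gpu_queue_submit(q, batch);
    return NULL;
}

int main(void)
{
    pthread_t threads[QUEUES];
    gpu_queue queues[QUEUES];

    for (int i = 0; i < QUEUES; i++) {
        queues[i].id = i;
        pthread_create(&threads[i], NULL, worker, &queues[i]);
    }
    for (int i = 0; i < QUEUES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```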

To this end, hUMA is fully compatible with the major modern memory models, including those of Python, Java, C11, and .NET. It's also extremely virtualization-friendly, with the GPU page table walker permitting nesting of multiple levels of hypervisor, physical, and user-mode access. Existing, properly formed applications will find hUMA robust and stable; the aim is not to break what already works within the standards and documented features.
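To give a sense of what compatibility with the C11 memory model buys: the acquire/release publication idiom used between CPU threads today would carry over unchanged to CPU-GPU sharing on coherent hUMA hardware. In this minimal C11 sketch, a second CPU thread stands in for the GPU side, since no public hUMA programming interface accompanies today's announcement:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Shared pool: on a hUMA system both the CPU and the GPU would see
 * these locations coherently. Here a second CPU thread stands in for
 * the GPU side of the handshake, purely for illustration. */
static int         payload;
static atomic_bool ready = ATOMIC_VAR_INIT(false);

static void *consumer(void *arg)
{
    (void)arg;
    /* Acquire load: pairs with the release store below, so the
     * payload write is guaranteed visible once ready reads true. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("consumer saw payload = %d\n", payload);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    payload = 42;                                              /* produce data */
    atomic_store_explicit(&ready, true, memory_order_release); /* publish it   */

    pthread_join(t, NULL);
    return 0;
}
```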

HSA Market Opportunities

The Heterogeneous System Architecture standard and platform represent AMD's path toward creating a standards-based, level playing field in which it can leverage its IP, collaboration, and design engineering expertise to capture market share. Breaking loose of traditional x86 platform design and innovating with a new set of peer-level devices and partners gives AMD a tremendous opportunity.

AMD is holding the APU Developer Summit (formerly the AMD Fusion Developer Summit in 2011 and 2012) in San Jose, California in November. Expect to see industry-leading figures presenting new ideas and concepts in the keynotes, sessions, and demonstrations.