AFDS 2011 Wrapup & AMD Lynx Platform Tests



Company: AMD
Author: James Prior
Editor: Charles Oliver
Date: July 18th, 2011

Keynote: Evolution of AMD Graphics

Eric Demers Keynote at AFDS '11

For graphics enthusiasts and the ATI faithful, the final keynote of the Summit was the one they were waiting for - Eric Demers and Graphics Core Next Gen. It was spoiled somewhat by the previous evening's presentation from Mike Houston and Mike Mantor, the first real scheduling fail. The order should have been keynote first and deep dive an hour later, not deep dive at 5pm the previous evening and keynote the next day.

1st Era Graphics

The keynote spent some time looking back at earlier graphics processors, starting with the first era of fixed function, graphics specific hardware - and a familiar name, the ATI Rage. After 2002, the first era ended and the simple shader era began, which included the awesome R300 in the ATI Radeon 9700 Pro. After 2006, the R600 stumble and the AMD acquisition, the third era of the graphics parallel core began. This is the current era, where graphics is still key, but a unified shader architecture is required and is capable of processing basic general purpose compute.

3rd Era Graphics Evolves with Compute

During the third era we have seen the rise of the VLIW-5 architecture, which held graphics performance as its primary objective. VLIW-4, with its symmetrical stream core design, was introduced in 2010 with Cayman and the AMD Radeon HD 6900 series. It'll be the compute engine inside next year's APUs, and likely in most of the next series of discrete GPUs, due to launch in the fourth quarter of 2011. Cayman provided support for multiple virtual workloads and introduced PowerTune, to keep the high power cards in check.

AMD Graphics Core Next Gen

The new graphics architecture is based around compute units, which abandon VLIW and move to a 'pure SIMD' design. The big changes are the addition of x86 virtual memory and context switching, with observable and controllable processing. These changes are fundamental for AMD to be chosen as a viable heterogeneous compute platform, and for the GPU to be elevated from a bolt-on device to a peer level processor, expected to be in the system and required for platform operation.

To process this new compute workload, several changes were made to the cache, including the adoption of a full read/write level-2 cache design. Each compute unit (CU) has a 16KB L1 cache, and each CU's L1 is interconnected to all the L2s (each 64KB with read/write). PCI-Express tunneling will be used to handle coherency between the CU's L2 and the CPU, and PCI-Express 2.1/3.0 equipped platforms will benefit the most with their support for command/data compression and switching. This extends the capabilities offered by the DirectX 11 specification, and the hardware context switching first seen in Northern Islands' Cayman architecture. Another new capability is taking part in x86 memory addressing through address translation, so that code inside and outside of the GPU can see other memory locations. Full ECC is now supported, improving on the CRC method previously used to detect errors.
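The address translation idea can be pictured with a toy model. The sketch below is our own illustration, not AMD's design: the `SharedPageTable` class and the page and frame numbers are hypothetical, and the only real-world figure is the 4KB x86 page size. It shows why a shared translation table matters: the CPU and the GPU can hand each other the same pointer and land on the same physical memory.

```python
PAGE_SIZE = 4096  # 4KB, the standard small x86 page size


class SharedPageTable:
    """Toy page table consulted by both CPU and GPU (hypothetical model)."""

    def __init__(self):
        # virtual page number -> physical frame number
        self._frames = {}

    def map(self, virtual_page, physical_frame):
        self._frames[virtual_page] = physical_frame

    def translate(self, virtual_address):
        # Split the address into a page number and an offset within the page.
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self._frames:
            raise KeyError(f"page fault at {virtual_address:#x}")
        return self._frames[page] * PAGE_SIZE + offset


table = SharedPageTable()
table.map(virtual_page=0x10, physical_frame=0x2A)  # made-up mapping

# Both processors walk the same table, so the same pointer resolves identically.
cpu_view = table.translate(0x10123)
gpu_view = table.translate(0x10123)
print(hex(cpu_view), hex(gpu_view))
```

Real hardware walks multi-level tables and caches the results, none of which is modelled here; the point is only that one table serves both sides.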

In addition to the vector processing capabilities, the keynote showed how scalar processing had been baked in under various guises over the years. The point here is that it's not just a vector processor; it has scalar processing abilities, and elements from various architectures like MIMD, SIMD, and SMT help make the architecture more capable. The new architecture also has support for x86 virtual memory, where the IOMMU in the northbridge (inside the APU, in AMD's new FSA) will be able to address memory attached to the GPU device. Moving the data is out, switching the compute is in; the GPU offers support for the same unified address space as the CPU, under the operating system's control. Having a unified address space for discrete GPUs could also sound the death knell for 32-bit operating systems, as the common standard of 4GB for a desktop/laptop adds a 1GB or 2GB dGPU to the mix.
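The 32-bit arithmetic is easy to check. The sketch below is a simplification under our own assumptions: it treats the address space as one flat range, counts 4GB as 4 GiB, and ignores real-world details such as address ranges reserved for devices. The function name is ours, purely for illustration.

```python
GIB = 1024 ** 3


def fits_in_address_space(system_ram_gib, gpu_mem_gib, address_bits):
    """Return True if system RAM plus GPU memory fit in one flat address space."""
    total_bytes = (system_ram_gib + gpu_mem_gib) * GIB
    return total_bytes <= 2 ** address_bits


# 4GB of system memory plus a 1GB or 2GB discrete GPU.
for gpu_gib in (1, 2):
    print(
        f"4GB RAM + {gpu_gib}GB dGPU:",
        "32-bit fits" if fits_in_address_space(4, gpu_gib, 32) else "32-bit overflows",
        "/",
        "64-bit fits" if fits_in_address_space(4, gpu_gib, 64) else "64-bit overflows",
    )
```

A 32-bit address reaches exactly 4 GiB, so 4GB of system memory already fills it before a single byte of GPU memory is mapped in.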

Also exciting is the support of multiple command streams - independent from each other, and asynchronous. This allows applications to utilize the compute power even while others do so, and it's not restricted to compute; it applies to graphics as well. Prioritization and resource allocation make it possible to play full screen games while (for example) transcoding video in the background for upload to social media, or creating a picture library index based on facial recognition, and so on.
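To make the idea concrete, here is a toy model of two independent command streams sharing one device. The keynote did not describe the actual arbitration policy, so the weighted round-robin used here is purely our assumption, and the stream names, shares and commands are made up.

```python
from collections import deque


def interleave(streams):
    """Issue commands from independent streams in weighted round-robin order.

    streams: list of (name, share, commands); each round, a stream may issue
    up to `share` commands before the next stream gets its turn.
    """
    queues = [(name, share, deque(commands)) for name, share, commands in streams]
    issued = []
    while any(queue for _, _, queue in queues):
        for name, share, queue in queues:
            for _ in range(share):
                if not queue:
                    break
                issued.append((name, queue.popleft()))
    return issued


order = interleave([
    ("game", 3, ["draw0", "draw1", "draw2", "draw3"]),  # foreground, larger share
    ("transcode", 1, ["frame0", "frame1"]),             # background, smaller share
])
for name, command in order:
    print(name, command)
```

The game gets three slots for every one the transcoder gets, but the background job still makes steady progress rather than waiting for the game to finish.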

What's missing from the presentation is any indication of the sizing of the compute units - how many SIMDs. Semi-accurate.com recently reported that AMD's next generation has taped out, and that the introduction of the new cards is waiting for the manufacturing process node to ramp up. Speaking with Dirk Meyer, we were able to discern that AMD is aiming for Q4 for the new cards, although Eric Demers hinted that we might see APUs with the new architecture in them before we see discrete add-in boards - although he wasn't specific about which generation he was referring to. Our best guess right now is that GlobalFoundries is waiting on baking Trinity for the end of the year with volume in Q1 2012. Following the keynote we were lucky enough to grab some one-on-one time with Eric Demers, and you can read that interview here.