Eric Demers and GCN, Part II



Company: AMD
Author: James Prior
Date: December 22nd, 2011

Triangles, Tessellation and Cache

Today AMD launched their new Radeon HD 7970 product, part of the Southern Islands family and codenamed Tahiti. The HD 7970 is the new single-GPU performance leader - more on that in our review, soon(ish) - and brings a brand-new architecture to the table: Graphics Core Next. The first details of GCN came to light at the AMD Fusion Developer Summit, and you can read our initial coverage of GCN here and here.

AMD Radeon HD 7970 with model, Eric Demers

Eric Demers, Corporate Vice President and CTO of AMD's Graphics Division, sat down with Rage3D's James Prior at the recent AMD Southern Islands tech day, held in Austin, Texas. At the AMD Lonestar campus, demonstrations and presentations of the Tahiti product were shown, and we wanted some more information on a couple of items. Another gentleman was in attendance, one of the main architects of the new design, but our scrawled note of his name proved indecipherable beyond his first name, Tom; we'll fill in the details as soon as we get them from AMD PR.

James Prior, Rage3D: I'd like to get a little more information on what you did on the front end, to improve setup from Cayman to this new architecture, Tahiti. One of the criticisms leveled, fairly or unfairly, is that the triangle setup rate isn't quite where it needs to be - it's a bottleneck. What do you think about that?

Eric Demers: The truth is we have improved the efficiency. We didn't improve the peak primitive and vertex rate - this part matches Cayman - but we've improved the efficiency: things like increasing the buffer we use to store the results of vertex processing while we're rasterizing pixels, so we'll always have vertices ready. A lot of that has significantly improved efficiency when running close to our peak speeds.
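To make that buffering point concrete, here's a toy producer/consumer model - entirely our own illustration, with made-up sizes and rates, not AMD's actual design - of why a deeper buffer between vertex processing and rasterization means fewer front-end stalls:

```python
import random

def stalls(buffer_slots, cycles=10_000, seed=0):
    """Count cycles where the vertex front end stalls on a full buffer."""
    rng = random.Random(seed)
    occupancy, stalled = 0, 0
    for _ in range(cycles):
        produced = rng.randint(0, 2)   # bursty vertex output this cycle
        consumed = rng.randint(0, 2)   # bursty rasterizer drain this cycle
        occupancy = max(0, occupancy - consumed)
        if occupancy + produced > buffer_slots:
            stalled += 1               # buffer full: front end must wait
            occupancy = buffer_slots
        else:
            occupancy += produced
    return stalled

for slots in (4, 8, 16, 32):
    print(f"{slots:2d} slots -> {stalls(slots)} stalled cycles")
```

With matched average rates but bursty instantaneous ones, the fraction of time the front end spends stalled on a full buffer falls as the buffer deepens - the kind of efficiency gain Eric is describing.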

Eric: Now, the reality is I can't think of an application that comes close to our peak speed today, but they might have been hitting some of our efficiency limits, so those will be significantly improved in some cases with Tahiti. From that standpoint I think we're OK. Now, reality-wise, could we do better? Y'know, I just don't see, at least in a lot of cases, pure vertex rates being that much of a bottleneck. Where I've seen more of a bottleneck is the tessellation rate, and that's artificially inflated by how people are using tessellation. I'm all for people doing it, like we showed in Battlefield 3 and in DiRT 3 - that's really cool to me. Our own demo of partially resident textures - it looks awesome to me to do that. But when you're doing something like Crysis 2, where you just add triangles and do nothing with them, that's just silly.
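The usual fix for the pattern Eric criticizes is screen-space adaptive tessellation: pick the tessellation factor from an edge's projected size, so triangles stay near a target pixel size instead of everything being subdivided by a fixed amount. Here's a minimal sketch - the function, numbers, and simplified projection model are our own illustration, not AMD or game code:

```python
def tess_factor(edge_len_world, distance, fov_scale, screen_height,
                target_px=8.0, max_factor=64.0):
    """Tessellation factor so sub-edges project to roughly target_px pixels."""
    # Approximate projected edge length in pixels (simple pinhole model).
    edge_px = edge_len_world * fov_scale * screen_height / max(distance, 1e-6)
    # One subdivision per target_px pixels, clamped to a hardware-style limit.
    return max(1.0, min(max_factor, edge_px / target_px))

# A nearby edge gets a high factor; the same edge far away gets factor 1,
# so distant geometry is not wastefully subdivided.
print(tess_factor(1.0, distance=2.0, fov_scale=1.0, screen_height=1080))    # ~64
print(tess_factor(1.0, distance=500.0, fov_scale=1.0, screen_height=1080))  # 1.0
```

A nearby edge gets a high factor while the same edge in the distance gets none, so the GPU isn't asked to set up sub-pixel triangles that "do nothing."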

Eric: I think it's artificially pointing a finger just because our competition has done things a different way. But we've done things differently too; for example, the [tessellation] performance differences from our top end to our bottom end are very similar, so you can use the same geometry on both those engines and just scale your fillrate. I actually think that's a better solution in general - it allows an ISV to develop one set of geometry and use that on multiple boards.

Eric: I think a lot of these bottlenecks are artificial, and in some part caused by how our friends have gone and implemented code in other people's stuff. Having said that, with this architecture - as you saw on the tessellation slide, but it's also true for geometry, shading and a lot of things that are vertex bound - you'll see a significant increase in performance. That translates into much higher performance, at least in the high-tessellation games that have that type of front-end load. Like I said, some of those limits might be somewhat artificial, but this architecture will do a much better job there.

Tahiti chip. Eric Demers licked it. :(

R3D: Now, the different caches: you've got 32 compute units feeding into 12 L2s. Does that have the potential for a bottleneck, where you're going to have a lot of misses because everything is trying to hit L2 at the same time, or is that just not a realistic workload?

Eric: Well, sure, we'll get misses on the L2 and that'll cause bottlenecks; there's no way around that. But we have 50% more bandwidth and 50% more L2 this time around than before. Fundamentally, even though the CU count increased by more than 40%, we increased our bandwidth by 50%, so we should, if anything, be better off than we were before. Also, the L1s are twice the size now, so from that standpoint the number of misses is going to be reduced as well. And when we do miss, the memory bandwidth is now 50% higher, so in general things should balance out to be better than before. I can't imagine any case where we'd be worse. I actually can't imagine a case where we'd merely match; I think we're always going to be ahead, and well ahead, of where we were.

Eric: The only exception would be if you were doing a lot of read/write - but then if you're doing a lot of read/write, we're going to blow our previous generation away, because we didn't cache any of that, we always went through memory - now we're caching read/writes all the time. I actually think we'll see an integer-multiple performance improvement for any kind of read/write activity. In general, I think this new cache architecture is just going to be much better.

Eric: For graphics, having 50-100% more cache and more bandwidth is going to help, too. For compute, it's going to help even more.
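A quick back-of-the-envelope check of those ratios - taken straight from Eric's figures above, not independent measurements - shows why more units sharing the caches doesn't have to mean less bandwidth per unit:

```python
# Ratios as quoted by Eric, not measured values.
cu_growth        = 1.40   # "CU count increased by more than 40%"
bandwidth_growth = 1.50   # "memory bandwidth is now 50% higher"
l2_growth        = 1.50   # "50% more L2 this time around"

print(f"bandwidth per CU: {bandwidth_growth / cu_growth:.2f}x")  # ~1.07x
print(f"L2 per CU:        {l2_growth / cu_growth:.2f}x")         # ~1.07x
```

Even at his quoted 40% unit growth, each compute unit ends up with slightly more bandwidth and L2 than before, which is the core of his "better off than we were" argument.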

R3D: The reorganization of the compute units - no VLIW5 or VLIW4 - seems like it would be more efficient for some of the older graphics workloads, like DX9 games. Are we going to see an improvement in performance beyond the regular uplift from clocks and process?

Eric: The fillrates have gone up, up to 50% [more], the texture rate has gone up from 96 to 128 [texels], and the clock rates are higher, so you are going to get performance right off the bat - right away about 20-30% over our previous 6970, as kind of minimum numbers. If you had a DX9 application that was really shader limited, then yes, we would expect it to run faster. Trying to think of examples - the 3DMark ones should scale a little bit more, but in general what we've seen in DX9 apps is that the shader use is still pretty simple. I think they're getting pretty good performance uplifts over the 6970, anywhere from 1.2-1.5x, and maybe even 2x in some cases. They'll run that gamut; you guys will have to run the tests. Certainly if you're CPU limited, then you may not get any uplift because you're not 'feeding the beast'.

Eric: Generally if you're running low resolution - if you're running 1280 or even 1600 - the benefits are much smaller than if you're running 2560 or Eyefinity. If you're running Eyefinity, this thing kicks butt. If you're running a single 1920 monitor, you're going to have to throw in AA to really start pushing the machine heavily. Actually, Cayman is pretty good at 1920, so this guy starts stretching its legs above that kind of level.

Tom: You start getting CPU bound among other things.

Eric: You can get CPU bound; it's also a new architecture for us, so there are other bottlenecks we're going to address over the coming months, and it'll keep on getting better. The heuristics it's using are fundamentally different from our previous architecture; there are easier parts and there are parts that we're still working on. There's a lot - in fact, we've got some games that have gone up 5-fold in performance since we started, from August to now, through driver changes to take advantage of the architecture. Probably another 2-3 months will really allow us to showcase the card. In one of the presentations you saw that performance typically changes 15-40% [through driver updates] - in this generation you'll see bigger numbers.
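Finally, our own arithmetic on the standard display modes Eric mentions earlier (not AMD data) puts some numbers behind the 'feeding the beast' point:

```python
# Assuming "1600" means 1600x1200 and Eyefinity means a 3x1080p wall.
modes = {
    "1280x1024":           1280 * 1024,
    "1600x1200":           1600 * 1200,
    "1920x1080":           1920 * 1080,
    "2560x1600":           2560 * 1600,
    "Eyefinity 5760x1080": 5760 * 1080,
}

base = modes["1280x1024"]
for name, pixels in modes.items():
    print(f"{name:>20}: {pixels / 1e6:5.2f} MPix ({pixels / base:.1f}x)")
```

At nearly five times the pixels per frame, an Eyefinity wall keeps the GPU as the limiter long after a single 1280x1024 panel has handed the bottleneck to the CPU - which is why the uplift only fully shows at the high end.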