“Who are you gonna call? GHOSTBUSTERS!!” ... err, wait, that's wrong. We don't have any ghost issues here ... we coexist in blissful harmony - but there's that GPU thing, the RV770, that ~260 mm² chunk of silicon that took the world by surprise - figuring that one out is hard. So who are we gonna call to get answers for our myriad of questions and bring peace to our inquisitive minds? Why, ATi's Eric Demers of course!
Eric Demers is a Fellow at AMD, in the graphics products group. He’s been at AMD, and before that at ATI, since the Art-X acquisition in 2000. He’s been involved in the design and architecture of the Radeon products since the R300. He’s also been involved in console work and AMD’s future Fusion products. He is an industry veteran of 18 years who has worked at other companies such as Silicon Graphics, S-MOS and Matrox. He was also chief architect for the R600, which can safely be considered the RV770's granddad.
What the above paragraph fails to illustrate is that Mr. Demers is an extremely nice guy ... and an exquisitely intelligent one as well. If you're not humbled after discussing GPUs with him, you're probably one smart cookie. We know we rushed back to the book after ending the 1 hour long conversation and, as you'll soon see, had the opportunity to apply a few face-palm maneuvers on ourselves.
Without further ado, here are the words of sireric himself:
Rage3D: Hello Eric, it's a pleasure to finally meet you!
Eric: Well, we are only meeting on the phone…
R3D: It's a step forward from exchanging emails.
Eric: Oh, comparing to that it is indeed a step forward. How shall we proceed?
R3D: The plan is to start with the RV770 and then move into the R700, if that's ok with you?
Eric: That's fine...we can alternate topics as well, no need for a fixed arrangement.
R3D: Great...that means that we can open the RV770 discussion with a question about the R600 (:P): would you say that the R600 was the result of mis-predicting where the market would go? Was the focus on future-looking features like full rate 64-bit filtering wrong?
Eric: If you look at the RV670, which was pretty much a shrunk R600, that was a very successful part- so the underlying architecture couldn't have been all that bad, could it? If we got something truly wrong with the R600, that would be the process-node: the chip simply ended up being too large and too hot on the 80nm process we used. Had we known at the time when 65nm was coming out, we would have opted for that instead. As for full rate 64-bit filtering, the R600 was geared towards using FP formats- we had the bandwidth for it, TUs were engineered for that - but usage rates weren't as high as we expected. So this is something that we changed around in the RV770 (going from full rate to half rate filtering for 64-bit).
R3D: In hindsight, do you still think it was a good idea to go with a superscalar arrangement for the Shader Core, instead of the scalar approach your competitor employs?
Eric: Actually, it's not really superscalar...more like VLIW (at this point the interviewer applied a face-palm maneuver for asking a silly question). And yes, I still think it was a very good idea. Typical graphics workloads map really well to our architecture. I'd say extracting full performance from our SPs was never an issue. You could construct some highly serial code where our arrangement would be at a disadvantage, but that's a purely theoretical case that never happens in practice. Will we use this arrangement forever? I can't answer that question, and we're always investigating ways of improving things, but you'll have to wait and see how new architectures from us will look.
R3D: How would you rate your compiler's evolution from the introduction of the 2900 up to today? We've seen traditionally difficult scenarios improve significantly in recent times- are the compiler guys finding new and improved ways of handling certain types of code?
Eric: Our compiler is constantly evolving and improving, and the team behind it is always finding new, better ways of leveraging the underlying hardware. You know, the team isn't that large...I think we have under a dozen guys working on it, but all of them are very very good. They're always finding innovative solutions...
R3D: They must be a merry bunch, staring at code all day long.
Eric: :laughs: I don't know about that. I know their boss and he's a great guy. I think that it's a challenge for them to find new, superior solutions for the compiler.
Somebody set up us the bomb
R3D: Since we're in the neighborhood, this is a good time to ask about GPGPU. Are you fully focused on OpenCL?
Eric: We're strong supporters of OpenCL. We've worked closely with Apple, we're members of the Khronos board responsible for OpenCL. We're always going to be focused on open standards that move the industry forward ... relying on proprietary solutions isn't an adequate long-term solution. OpenCL holds a lot of promise.
R3D: That's an elegant way of saying that you won't support CUDA, in spite of nVidia's claims that you could/should/must. :)
Eric: Irrespective of what nVidia says, the truth is that CUDA is their proprietary solution, which means that if we were to use it we'd be stuck being second place and following their lead; that's not really a position we want to be in. Another characteristic of CUDA is that it's very G80 centric, so first we'd have to build a G80 in order to have a reason to use CUDA. So, no, we're not going to support/use CUDA in any foreseeable future.
R3D: Getting back to the hardware side of things, what was the secret to achieving seemingly impossible size savings with the RV770 (it's too small and yet too big)? Did you do significant layout/design tweaking “by hand”? Are you using custom design for significant parts of the chip?
Eric: The secret was good engineering and a lot of it. We spent quite a bit of time reviewing our previous products (all of them) and seeing where we could improve both raw performance and performance per mm². We spent a lot of engineering effort redesigning blocks to make them more efficient and smaller, using all we had learned to achieve that. Also, we re-balanced compute and BW, to achieve a more balanced ratio of capabilities to bandwidth, more in line with current applications. We also changed the memory interface configuration, going for a more tuned per channel/client organization for high bandwidth clients.
Also there was a significant amount of layout work done to achieve the small dies we did. Nearly a year before we sent the chip out for fabrication, we started our floor planning work and physical design work. The last months of the design are spent solely on physical design and achieving our projected area targets. While there is quite a bit of custom work done for all our chips (for example, all the I/O), the core design was a standard cell design for the logic section, but with custom memories to optimize area.
Finally, there is some magic too. It's the dedication of the engineering staff working in the labs, working on design, working on the drivers and software, from the most junior engineer to the chip architect; these guys worked amazingly hard and delivered probably the best product we've ever had, and one we believe to be the best product in the industry today. I'm very proud of all of them and their accomplishments.
R3D: How did you change the Ultra Threaded Dispatch Processor for the RV770? I presume it's been widened due to the 6 additional SIMDs, but did you retain the dual Arbiter/Sequencer arrangement that the R6xx had? Do vertex and texture fetches still have one arbiter-sequencer pair dedicated to each, thus making it possible to schedule them independently from math done in the SIMDs? Were the command queues increased?
Eric: The dispatcher was changed through nearly 1.5 years of work from a design team standpoint. It was made to be much more scalable in design, allowing for the additional SIMDs, while also offering new features and better performance. It's an evolution of the previous version and inherited the best parts, like the ability to issue to multiple blocks in parallel, such as texture, ALU and others. As for the command queues, some tweaking and optimizing of sizes was done to achieve better balance in the new design.
R3D: What's the triangle setup rate for the RV770? The R600 was 1 tri/clock, but the RV670 made a step back to 0.5 tri/clock, as far as I know, but you didn't detail whether the RV770 behaves like the former or the latter. Additionally, do you regard triangle setup as a limitation that has become stringent in recent times, and that has to be addressed in order to allow better scaling between architectural updates?
Eric: All of our parts have peak rates of 1 prim / cycle (this is the point where the interviewer applies the second face-palm of the day). Even the RV610 in some conditions, was able to do that. In fact, in tessellation demos, both the ATI Radeon™ HD 3870 and the ATI Radeon™ HD 4870 have been shown to sustain rates of over 600 MTris/sec, if not more. But there are many factors that determine primitive rates, including BW, so actual performance may vary significantly from one part to another depending on what apps do. As for limitations, certainly there are cases, such as tessellation or short vector lines, where the setup rate is a bottleneck. But it is less frequent than other limitations, at this time. We do expect primitive rates to increase as people switch to using more fine tessellation and displacement surfaces. As for future changes, well, you'll have to see.
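As a rough illustration of the numbers in this answer, the peak setup rate is just the core clock multiplied by the primitives set up per clock. The sketch below assumes a 750 MHz core clock for the ATI Radeon HD 4870 (its reference clock, not a figure quoted in the interview), and the function name is purely illustrative. It compares the "over 600 MTris/sec" sustained figure with that theoretical peak.

```python
def peak_prim_rate_mtris(core_clock_mhz, prims_per_clock=1.0):
    """Peak primitive rate in millions of triangles per second.

    One primitive per clock at N MHz gives N million primitives per second.
    """
    return core_clock_mhz * prims_per_clock


# Assumption: 750 MHz reference core clock for the HD 4870 (not from the interview).
HD4870_CORE_MHZ = 750

peak = peak_prim_rate_mtris(HD4870_CORE_MHZ)
sustained = 600  # MTris/sec, the lower bound quoted for the tessellation demos

print(peak)              # 750.0
print(sustained / peak)  # 0.8
```

Under that assumed clock, the quoted sustained rate would sit at roughly 80% of the theoretical peak, which fits the remark that bandwidth and other factors keep real rates below 1 prim/cycle.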
R3D: On the topic of tessellation, when are we going to see the tessellation SDK and/or documents detailing how to leverage the tessellator? Those are pretty much non-existent in public form as far as I know.
Eric: Hmm, that's an interesting question. The tessellation SDK is not yet done, and there's no firm date for its release at this point in time. We're working with interested developers directly. Driver support is also not yet completely finalized; we plan to introduce it in the November 8.11 Catalysts. So you'll have to wait just a little bit more before you can start playing with it.
For Great Justice!
R3D: How is the Edge-Detection part done for Edge-Detect AA? Is it purely analytical, e.g. based on applying a Sobel/Canny-Deriche filter, or using differential edge detection, or do you rely on Z-compares/checking whether a tile is fully compressed or not?
Eric: We don't want to reveal too much about our algorithm, it's a secret sauce of sorts. But in a nutshell, it's a mixture of hardware and software, and it uses a technique to do coarse edge location and isolate the regions of the screen that contain edges, and then applies a filter on these areas to do fine location of the edges. Then an adaptive kernel filter is applied for the resolve that is location and edge aware.
R3D: On to a rather burning issue: looking back, do you still think it was a good idea to setup the RV770 cooling solution as you did, with a focus on silence rather than thermal performance?
Eric: We went through a lot of feedback we received, as well as further investigating the topic, and it turns out that the majority of users are far more sensitive to the noise associated with the cooling solution than they are to GPU temperatures, which led us to aim for the most silent possible operation. For stock operation, the cooling solutions we have implemented are more than adequate in terms of thermal performance, as they keep the chips well within specifications: we rate the chips to work up to 105°C, which is quite a bit higher than the temperatures you'll normally be seeing. Of course, if you're considering significant overclocking, you'll probably be better served by the alternate cooling solutions that our AIB partners bring, as some of those are better suited to such a task than our reference design.
R3D: Is the fact that both the ATI Radeon HD 4870 and the ATI Radeon HD 4870 x2 downclock the core to 500 MHz, as opposed to the ATI Radeon HD 4850 which goes to 160, caused by the fact that those two use GDDR5?
Eric: The GDDR5 ATI Radeon HD 4870 boards are tuned to operate with higher memory and core speeds to get the highest performance, as compared to the ATI Radeon HD 4850 boards. As a result, they are currently more limited than the ATI Radeon HD 4850 GDDR3 boards in terms of their ability to operate at scaled down clocks when idle. It's a result of multiple constraints, but nothing inherent in the GDDR5 protocol. However, we are working on ways of improving the range of clock speeds we can support with GDDR5 boards, so we can further reduce idle power without affecting peak performance. Currently, the ATI Radeon HD 4870 boards have an idle power in the typical range for a performance board.
R3D: Further elaborating on this topic, where does the difference between the 4870 and the 4870X2 stem from? The X2 downclocks to 507/500 at idle, whilst the 4870 does not downclock the memory at all ... since they're both GDDR5 parts, it can't be GDDR5 related, can it?
Eric: Lower speed GDDR5 modules are in the works, so it's not an inherent GDDR5 limitation. Having said that, the trouble with GDDR5 at clocks below the 500MHz mark is that you have to shift the operating mode towards a more GDDR4-like one. This requires some software work to be done. With the 4870 we didn't do it since other things took precedence, and because we were already getting good thermal and power characteristics. The 4870's power draw is in line with what you'd expect from a performance part. On the other hand, for the 4870X2 we had to deal with having what is practically 2 4870s on the same PCB, with an extra 1GB of RAM, so we implemented a more aggressive downclocking. With that in mind, we are looking at changing the way the 4870 behaves and having a similar clocking strategy for it, but that requires implementing certain changes in the software stack, and it's not the primary priority currently, since the card has characteristics that are within the envelope that is specific to the segment it targets.
R3D: Now that we've started touching sensitive topics, can we find out what's up with the interconnect? Why is it disabled? Are naysayers correct in stating that it's up to the AIB whether or not the traces for it get built into the PCB, and that it's likely to not be enabled?
Eric: The sideport interconnect is fully functional in the reference design. Though we've found that with the current AFR mode of multi-GPU support, the additional bandwidth brought by the interconnect does not translate to a significant improvement in performance. However, we are continuously working on optimizing how our ATI CrossFireX™ technology scales and trying different methods, and could decide to enable the sideport if a method is found which gives better results and benefits from it.
R3D: So what happened? How did a good idea stop being a good idea? Are you disabling it due to heat/power concerns?
Eric: It's still a good idea...you have to consider that we've been working on Crossfire and on our AFR implementation for quite a few years, optimizing it. It's actually gotten to a point where it's really quite good, and difficult to improve upon performance-wise. As for heat and power concerns, that's hardly the case...the interconnect is pretty much an extra PCIE2.0 link. The power draw for that is in the realm of a few watts at most. The only reason for which it's not enabled currently is, as I've already mentioned, that it does not significantly impact performance ... and if you're not using something there's no point in enabling it.
R3D: What kind of data can be exchanged via the interconnect? Is it a faster/lower latency avenue for transferring non-frame data like RTs that must be copied from frame n to frame n+1 and so on via a p2p write in normal CF configurations? Is this one possible application of it?
Eric: The interconnect offers the same features as the PCIe interconnect, plus it allows for a GPU to broadcast memory writes to its own memory and to the other GPU's. The data exchangeable is any data that the driver would desire. The latency versus the on-board PCIe interconnect used by the ATI Radeon HD 4800 X2 series isn't very different, but it's generally better than that offered by using the northbridge. Nevertheless, the static latencies I'm discussing here generally don't matter that much, and bandwidth is the key factor.
R3D: Could the broadcast writes function be used for avoiding swapping a persistent RT over the PLX switch? So instead of GPU2 waiting for GPU1 to finish working on it and then wait for it to be swapped via a peer-write, you'd have GPU1 updating the RT both in its memory and in that of GPU2 so GPU2 could start working right away?
Eric: Yes (pause for dramatic effect). That was one of the reasons we have it there. Keep in mind that it does not mean that this approach would be necessarily better than what we're normally doing in AFR.
R3D: Are you considering using a more extensive/granular SuperAA implementation, for example, 2X AA being done by having the 2 GPUs render the frame with no AA and jittering the frames in the compositing chip, 4X AA has the 2 GPUs render the scene with 2X AA etc. (an accumulation-buffer like approach, if you will)?
Eric: I'm not going to comment on future algorithms that we might implement for multi-GPU. Let's just say that we believe that we have the highest quality and performance available today, and that we have many options available to us down the stream. We already support SuperAA resolution when doing multi-GPU, which allows for higher quality and higher number of samples.
All your base are belong to us
R3D: Are you considering alternate ways of scaling the load to multiple GPUs, beyond AFR?
Eric: We are constantly evaluating all our algorithms, including multi-GPU. But there's nothing that we would want to comment on at this point, except to state that we believe that multiple GPUs are inherently a strong feature in AMD's arsenal, and that we plan to continue to support it and improve it over time.
R3D: So do you still consider AFR a viable solution once we go beyond 2 GPUs? That brings the cost of increased input lag, and also assumes that one can extract up to 4 frames with limited to no inter-dependencies for good scaling.
Eric: It largely depends ... the more you shift towards being GPU bound, by increasing resolution and AA and AF, the more benefits you'll see from such solutions. If you're going to game at 12x10 or even 16x12, 4-GPUs aren't really what you'd need. But at 25x16 you'd really feel the difference. So yes, AFR is a viable solution under those circumstances.
R3D: What is the feedback you've been receiving from developers with regards to coding in an AFR-friendly way?
Eric: The feedback we're getting from devs is actually quite positive. They're starting to take AFR into consideration when coding their games, so I'd say that moving forward things are going to improve for multi-GPUs.
R3D: So Crysis is the oddball title? Since that game is a rather AFR unfriendly beast.
Eric: I can't talk about what Crysis is doing exactly, since I haven't looked at the application engine itself ... although Crossfire does get good scaling in it.
R3D: For 2 GPUs; beyond that it's quite bad.
Eric: Yes, for 2-GPU solutions. So, as I said, I can't comment on what Crysis is doing since I don't know that exactly. But with Crysis you have to consider that it's such a demanding application, on a system-wide scale, not only at the GPU level. It's CPU-bound in many scenarios, it also uses quite a bit of RAM ... and the amount of work that went into making it so demanding and impressive was quite staggering. Now, could the application scale better, as a result of combined work from our driver guys, our dev-rel team and Crytek? That's possible...although I'm not sure that Crytek are doing any more work with the original Crysis, I think they might've moved on to the expansion.
R3D: What is your take on micro-stuttering?
Eric: Micro-stuttering can be caused by multiple things. For example, for our previous product, the ATI Radeon™ HD 3870, one of the causes of micro-stuttering was that the graphics clock was being increased and decreased too frequently during games. The ATI Radeon HD 3870 was one of the first AMD parts to introduce a programmable micro-controller to monitor and control the chip power through clocks and voltage. The ATI Radeon HD 3870 was able to detect times when the application was not using it, and reduce its clock speed to conserve power. What we found is that within a single frame, when the CPU load was high, there were times where there was enough "starvation" to cause the ATI Radeon HD 3870 to reduce its clock, even though it was running a game. When the next part of the frame came up, the graphics clock had already been reduced, so the rendering was slowed down until the chip detected a heavy load and resumed high clocks. This up/down on the clock saved power, but reduced overall performance and caused micro-stuttering. This was fixed in February with a driver that taught the chip how to behave in this kind of situation (don't drop the clock in the middle of a 3D app). However, for the ATI Radeon HD 4870, we changed how the micro-controller worked from the beginning, to make it monitor multiple "windows" in the chip (both frequent changes per frame, and long term changes over multiple frames), and take appropriate action. This allowed the ATI Radeon HD 4870 to launch with all the power gating fully enabled and no stuttering due to clock changes.
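The idea of monitoring multiple "windows" can be sketched in a few lines. This is only a toy model under stated assumptions, not AMD's actual micro-controller logic: the class name, the window lengths, the utilization samples and the 0.2 idle threshold are all hypothetical. It keeps a short window (changes within a frame) and a long window (changes over several frames), and only drops the clock when both agree that the chip is idle.

```python
from collections import deque


class ClockGovernor:
    """Toy two-window clock governor (hypothetical, for illustration only)."""

    def __init__(self, short_len=4, long_len=32, idle_threshold=0.2):
        self.short = deque(maxlen=short_len)  # recent samples: intra-frame behaviour
        self.long = deque(maxlen=long_len)    # older samples: multi-frame trend
        self.idle_threshold = idle_threshold

    def sample(self, utilization):
        """Record one utilization sample (0.0 to 1.0) and return the clock state."""
        self.short.append(utilization)
        self.long.append(utilization)
        short_avg = sum(self.short) / len(self.short)
        long_avg = sum(self.long) / len(self.long)
        # Downclock only when both windows report an idle chip, so a brief
        # starvation in the middle of a frame does not trigger a clock drop.
        if short_avg < self.idle_threshold and long_avg < self.idle_threshold:
            return "low"
        return "high"


gov = ClockGovernor()
game = [gov.sample(0.9) for _ in range(32)]    # steady 3D load
starved = [gov.sample(0.0) for _ in range(4)]  # brief CPU-side starvation
desktop = [gov.sample(0.05) for _ in range(40)]  # genuinely idle afterwards

print(starved[-1])  # high
print(desktop[-1])  # low
```

A governor that looked only at the short window would have dropped the clock during the four starved samples, which is the behaviour described for the ATI Radeon HD 3870 before the driver fix.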
There are other potential sources of micro-stuttering. Some of them, for example, have to do with moving memory around, which can cause blackouts for either the CPU or the GPU. Others exist when the CPU and GPU are more unbalanced (fast GPU, slow CPU), for example where the CPU will not generate any frames for a while, then generate 2 frames. It could be that in that case, we get an average time for frame 1 which is the idle time plus render, while frame 2 will be only render. That could lead to 16ms and 1ms frame times, which would appear as stuttering (assuming 15ms idle, 1ms render times). Multi-GPU makes the problem worse, as the GPU consumption rate is even higher. We are investigating these and others, though it's a tall task to fix all of them while also achieving peak performance.
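The frame-time arithmetic in this example can be worked through with a minimal sketch. It uses the 15ms idle and 1ms render figures from the answer; the function name and the burst model (the CPU stalls, then submits two frames back to back) are illustrative assumptions.

```python
def frame_times(idle_ms, render_ms, frames_per_burst, bursts):
    """Per-frame times when the CPU stalls, then submits a burst of frames."""
    times = []
    for _ in range(bursts):
        for i in range(frames_per_burst):
            # The first frame of each burst absorbs the idle wait;
            # the remaining frames only cost their render time.
            times.append(idle_ms + render_ms if i == 0 else render_ms)
    return times


times = frame_times(idle_ms=15, render_ms=1, frames_per_burst=2, bursts=3)
average = sum(times) / len(times)

print(times)    # [16, 1, 16, 1, 16, 1]
print(average)  # 8.5
```

The average of 8.5ms per frame looks healthy on a frame-rate counter, but the alternating 16ms and 1ms delivery is what shows up as stutter.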
R3D: Thank you for your time Mr. Demers! It's been a great discussion!
Eric: No problem, it's been fun!