"Oh man..."
Was probably a frequently heard opener whenever two ATi engineers met and started talking about the R600 vs. G80 battle. The R600 wasn't an incredibly bad chip (there have been quite a few of those in GPU history) but, for the amount of effort poured into it, the time it took to get it to the market along with its complete disregard for size, heat and power-draw (which might have been ok for a king-of-the-hill GPU) it was underwhelming. Not competitive with the 8800GTX, hampered by being seemingly overdesigned on some aspects (the 512-bit bus sticks out like a sore thumb) and underequipped on others, it was ultimately a failure - not due to some hidden mysterious bugs, but mostly due to bringing knives to a number of gunfights that no modern GPU should risk losing.
If you're wondering what's up with the history lesson, that's easy to explain: in order to gain perspective on today's hero and understand it wholly, one must grasp just what its predecessor, the R600, was. But what's with this hero we're talking about ... time to meet him in all of his silicony glory.
The RV770
You'll have to excuse the nudity ... chips tend to be show-offs prior to being bound to a PCB, and engineers tend to be proud when they have a great chip to photograph. What is there to be proud about, you ask? Quite a lot, in fact.
The RV770 is, to a certain extent, a consequence of the R600's lackluster history. The RV770 was designed from the get-go to be a very efficient chip, and to solve all of the problems associated with the R600 architecture, with virtually every ATi design team involved in the new architecture.
As an opener to our launch day coverage, we'll be taking a plunge into the RV770 and seeing what is new, what has changed, what stays the same (quickie note: not much) and how it all fits together. We'll be following pretty much the same structure that Scott Hartog, chief architect for the RV770 series, used in his presentation of the architecture, as it flows nicely from one aspect of the new chip to another, using Richard Huddy's R600 architecture overview for comparisons and to underline the differences between the chips. We'll also check whether we can actually achieve the theoretical numbers using a number of synthetic tests ... don't worry, the whole gaming enchilada will soon follow, we haven't lost our focus.
In order to fully grasp the goals of the RV770, we'll take a snippet from the presentation that Rick Bergman, Senior Vice President and General Manager, gave about ATi's new strategy:
It's obvious that the RV770 marks a turning point for ATi, the final departure from the "one monolithic GPU to rule them all" mentality. This very approach was probably the R600's undoing: things kept piling on during the design process, the chip kept getting larger, TDP was rising ... you can see where this is going. One could argue that the RV670 marked the beginning of this new strategy Mr. Bergman talks about, but we beg to differ: the two projects ran in parallel, with the RV670 being a refinement that allowed an early peek at how things would be tackled in the future, and the RV770 being built from the ground up to fit the new vision.
Looking at the last slide, one begins to realize that the goals for the RV770 generation were quite lofty: solid performance, accessible pricing, thermals that allow scaling the architecture upwards (the Ultra Enthusiast segment is important, in spite of its relatively low absolute volume - it's no good dominating the mainstream and eventually performance segments with a chip that's already too hot and too large for its own good, thus making it unfeasible for multi-GPU approaches meant to cater to that very Ultra Enthusiast mentality). Did it all materialize? Time to find out - here's the opening slide from Scott Hartog's presentation:

Now we have an idea of what to expect: a chip aimed at more than doubling the processing power of ATi's previous generation; one that addresses one of the biggest weak points of the whole R600 generation, namely AA performance, whilst being better suited for GPGPU tasks and maintaining and expanding the solid feature-set ATi introduced with its RV670 tweak of the R600. Not quite a piece of cake, is it? To see how it was achieved, we're introducing the best-kept secret in recent GPU history (if you ignore the whole early-NDA lifting shenanigans), the Terascale graphics engine:
This is the part where we leave Kansas, Dorothy, as it actually becomes interesting and we can discuss the architecture. The big, and truly unexpected, bomb is the number of ALUs or, as ATi calls them, stream processing units. Most insider gurus were pegging the RV770 at 480 ALUs - which would've meant 6 SIMDs. Well, that rumor can be laid to rest now: it's a 10 SIMD part.
Throughout the rest of this article, we'll be covering all of the bullet-points on that slide in detail. The last slide shows the general configuration of the 4850 we're using as a representative of the RV770 line so, whenever in doubt, check it out to see its specs again.
Comparing the RV770 with the R600, we notice that the SIMDs have taken a 90 degree turn, going from being vertically oriented to horizontal, with each SIMD being tied to and serviced by its own discrete Texture Unit (TU), as opposed to the R600 where the TUs were tied to "quads" of 4 Shader processors within the SIMDs (4 "quads" per SIMD sum up to the 16 SP/80 ALU arrangement we know), and then to the Shader Core in its entirety. This is a clever optimization that will allow future scaling of chips by adding SIMD+TU dynamic duos whilst keeping the 4:1 ALU:TEX ratio that ATi seems to favor. Keep in mind that this is a 1:1 relationship, with each SIMD:TU pair being able to communicate with another only through the 16 KB data share (more on that later). We'll look closer at the SIMDs in just a bit, after extracting some more interesting tidbits from the general diagram.
Distributing the Cache
Also notable is the change in the way cache is arranged. In the R600 we had a unified L2 texture cache, with a reported 256KB of L2 for the 2900XT. The RV770 moves to a distributed arrangement, with each TU having its own discrete L1 cache; thus we get 10 L1s, which go through a crossbar in order to communicate with 4 discrete L2 caches, each one of these being dedicated to one of the 4 memory controllers (MCs). So, if our understanding is correct, the relationship between the TUs and the L1 goes from many:one in the case of the R600 to one:one with the RV770, whilst the relationship between the L1 and the L2 goes from one:one to many:many.
The Z/Stencil and Color caches remain similarly arranged, with each RBE getting its own discrete chunk of cache; the R600 diagram is misleading in this respect, as the R600 had its Z/Stencil and color caches similarly distributed.
We couldn't extract the exact size of each cache from ATi, but they did say that overall there are about 5MB of SRAM on the chip, and that they're using a custom design for the SRAM, which is another explanation for the high density they seem to be achieving with their chip - a 260mm2 chip with 956 million transistors and the number of functional units the RV770 has is quite an impressive feat.
Also, if you look carefully, you'll note that the read/write cache from R600 was deleted. ATi discovered that it was not large enough to be effective - and the effective size was prohibitive in silicon area. The RV770 has a write combining buffer for stream computing outputs, and the input data for stream computing now comes in through the texture/vertex data path.
One final aspect can be gleaned from the high level diagram of the RV770, and that is the absence of the ring-bus. More on that in just a bit, but it was worth mentioning. Now, let's have a closer look at those surprisingly numerous SIMDs.
Since batch size scales with the number of ALUs per SIMD, which remains pegged at 80, we're still looking at 64 pixels/vertices per batch.
Peeling the onion further and looking at an individual SP, the arrangement is similar between the two architectures: still 5 ALUs, with one of them (the "fatty" one) handling transcendentals like SIN, COS, LOG, EXP. An addition to the ALUs is the fact that all of them can do integer bitshift operations now, whereas in the R600 only the above mentioned fat-guy could pull it off. This is another aspect that won't necessarily turn Crysis into a scorcher but will be appreciated by the GPGPU crowd as well as by video processing, and since it probably wasn't all that expensive to add the functionality it made sense for ATi to do so. Each ALU retains per-clock MADD capability.
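For those who like to keep score, the SIMD/SP/ALU hierarchy described so far boils down to a couple of multiplications. The snippet below is a minimal sketch of that bookkeeping; the constant and function names are ours, picked purely for illustration, and the figures are the ones quoted in this article (5 ALUs per SP, 16 SPs per SIMD).

```python
# Shader core bookkeeping, using the figures quoted in the article.
# All names here are illustrative, not ATi terminology.
ALUS_PER_SP = 5       # four "thin" ALUs plus the "fat" transcendental one
SPS_PER_SIMD = 16     # four quads of four shader processors
ALUS_PER_SIMD = ALUS_PER_SP * SPS_PER_SIMD  # the familiar 80


def total_alus(simd_count: int) -> int:
    """Number of ALUs for a chip with the given number of SIMDs."""
    return simd_count * ALUS_PER_SIMD


def simds_for(alu_count: int) -> int:
    """Number of SIMDs implied by an ALU count (must divide evenly)."""
    simds, leftover = divmod(alu_count, ALUS_PER_SIMD)
    if leftover:
        raise ValueError(f"{alu_count} ALUs is not a whole number of SIMDs")
    return simds


print(total_alus(10))   # RV770: 800
print(simds_for(480))   # the rumored configuration: 6
print(simds_for(320))   # RV670: 4
```

The same arithmetic shows why the rumor mill's 480 ALU guess implied a 6 SIMD part.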
Double precision is new compared to the R600, but not so when compared to the RV670 (which proves that the RV670 wasn't simply a shrink). Doubles seem to be handled at 1/5 the single precision rate (240 GFLOPS=1200/5, in case you're wondering, so the number in the slide is based on the higher-clocked, higher-performing HD 4870).
Finally, another noteworthy aspect is the fact that the SIMDs themselves are quite a bit smaller than they were in prior architectures, which makes sense when taking into account the efficiency mantra that dominated the design process for the RV770. We inquired about how exactly this was achieved, but the only answer we got was that it was a consequence of looking at each and every aspect of the chip (and thus also of the SIMDs), and identifying all possible savings that could be achieved. Sounds easy, doesn't it? It most certainly wasn't.
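The 1/5 rate is easy to sanity check. Here's a tiny sketch, assuming the ratio really is a flat 1:5 (that's our reading of the slide, not a confirmed hardware detail); the function name is made up for the occasion.

```python
# Double precision throughput, assuming a flat 1:5 ratio versus single
# precision (an inference from the slide, not a confirmed hardware detail).
def double_precision_gflops(single_gflops: float, dp_ratio: int = 5) -> float:
    """Estimated DP rate from the SP rate and an assumed SP:DP ratio."""
    return single_gflops / dp_ratio


print(double_precision_gflops(1200))  # HD 4870 slide figure: 240.0
print(double_precision_gflops(1000))  # a 1 TFLOP part would land at 200.0
```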
To close our dealings with the SIMDs for the moment, we should actually check them out in practice. To do that we'll be using RightMark3D 2.0, made by the guys over at Digit-Life, as it's one of the few DX10 tests available, and it also has two mainly ALU-bound shaders to test (Mineral and Fire), so we can actually see how ALU power scales with the 2.5x increase. You'll also want to take a close look at how the Fire procedural shader behaves, as it's not exactly one that plays nicely with the SPs in ATi's architecture: it has a weighting function that does quite a bit of scalar math and a gradient calculation made up of MULs and transcendentals that's quite demanding, both of which are run 32 times to generate noise, which is afterwards fed to a small loop over scalar floats to accumulate. This has been a troublesome shader for quite a while, but a few driver releases ago the compiler seems to have properly mastered it, which is impressive and goes to show that the general reports of doom and gloom associated with the VLIW nature of the shader core are somewhat overstated. Here are the numbers:

For the RV670 numbers we're using a stock clocked HD 3870 card, which means that, doing a quickie calculation factoring in clock speed and number of ALUs, we should be seeing roughly twice the math performance from the RV770 ((625MHz*800ALUs)/(775MHz*320ALUs), or about 2.02x) - and we are. There's very little texturing going on in either shader, so the influence of the increases in that area should be minimal here. All in all, we're getting what we should be getting in terms of math power increases, and going by our experience, extracting the full Teraflop from the RV770 is not hard at all.
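The quickie calculation can be spelled out as follows. This is a back-of-the-envelope sketch, assuming the 625MHz core clock quoted elsewhere in this article for the HD 4850 and counting each MADD as 2 FLOPs; the helper name is ours.

```python
# Theoretical ALU throughput, counting one MADD (2 FLOPs) per ALU per clock.
# Clocks are in MHz; 625MHz for the HD 4850 is the figure used elsewhere
# in the article, 775MHz is the stock HD 3870 clock.
FLOPS_PER_MADD = 2


def peak_gflops(clock_mhz: float, alus: int) -> float:
    """Theoretical single precision peak in GFLOPS."""
    return clock_mhz * alus * FLOPS_PER_MADD / 1000.0


rv770 = peak_gflops(625, 800)  # HD 4850
rv670 = peak_gflops(775, 320)  # HD 3870
ratio = rv770 / rv670

print(f"RV770: {rv770:.0f} GFLOPS, RV670: {rv670:.0f} GFLOPS, ratio: {ratio:.2f}x")
```

These are paper numbers only; what the shaders actually extract depends on how well the compiler packs the 5-wide SPs.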
The texturing capability of the R600 was rightfully considered unimpressive. Arguably, it was one of the architecture's main weak-points. It was also a most speculated upon trait, with forumite predictions floating around either maintaining the same 4 TU count (which would've been a really bad idea, at least as far as we're concerned), or doubling that to 8 TUs. Well, ATi decided to surprise everyone here as well and opted for having 10 TUs, which fits nicely with the new arrangement of the SIMDs and the adjustment of the SIMD-TU relationship.
The TUs themselves are changed from what we used to have in the R600:
Whilst each R600 TU could set up 8 addresses per clock, out of which only 4 values could be filtered, the RV770 does away with the extra addressing capability and maintains a 1:1 ratio between Texture Address Processors (TA) and Texture Filtering Units (TFU). This means that we go from having 32 addresses per clock in the R600 to having 40 of them in the RV770. The reduction in texture addressing oomph (hah) per TU isn't exactly a loss, given the fact that the extra capability found in the R600 was only useful in non-filtered scenarios, which are hardly ubiquitous (yes, we're taking into account Vertex Texture Fetch). It could be said that the R600 had too much addressing capability for what it could normally take advantage of, which probably made the TUs larger than needed.
The number of FP32 texture samplers also takes a step-back (logically, since they were also relevant only for non-filtered situations and tied to the extra addressing capability somewhat), going from 20 per TU to 16 per TU, but given the 2.5x increase in number of TUs that nets the RV770 double the texture fetching capability, at 160 texture fetches per clock (the R600 could do only 80).
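To keep the per-TU and per-chip figures straight, here's a small sketch that tabulates them. The class and field names are hypothetical; the per-TU numbers are the ones quoted above, and the chip-wide totals are simply those multiplied by the TU count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextureSetup:
    """Per-TU texturing resources (illustrative names, article figures)."""
    name: str
    texture_units: int
    addresses_per_tu: int
    filtered_per_tu: int
    fp32_samplers_per_tu: int

    def per_clock(self) -> dict:
        """Chip-wide per-clock totals: per-TU figures times the TU count."""
        return {
            "addresses": self.texture_units * self.addresses_per_tu,
            "filtered": self.texture_units * self.filtered_per_tu,
            "fp32_fetches": self.texture_units * self.fp32_samplers_per_tu,
        }


R600 = TextureSetup("R600", 4, 8, 4, 20)
RV770 = TextureSetup("RV770", 10, 4, 4, 16)

for chip in (R600, RV770):
    print(chip.name, chip.per_clock())
```

So the RV770 gives up addressing and sampler width per TU, but the 2.5x TU count more than makes up for it chip-wide.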
An interesting tidbit is that 64-bit filtering capability has gone from one value per clock to half of that, which means 2 clocks per bilinear filtered 64-bit color value. This was done because having full rate 64-bit filtering consumed quite a chunk of silicon, and internal analysis showed that perf/W and perf/mm2 would increase if the filtering rate was reduced and thus transistors were saved, allowing for more TUs to be implemented overall. Going by the quoted performance numbers and what we've seen in practice in terms of 64-bit filtering, this was a correct engineering choice.
Remember those caches we talked about earlier? Time to take a closer look at them. Now, this will be mostly speculative, as ATi has been secretive about the exact size of the caches. We know that the R600 had 128KB of total L1 cache, with each TU getting 32KB. Going by the slide, that would suggest that the RV770 has 64KB per individual L1, which sums up to 640KB of total L1 cache - reason for joy, because 640/128 is exactly 5, namely the suggested overall increase. We've also managed to roughly measure the L1 cache size through testing, and it lines up with the 64KB value quite nicely. This is where we go guessing that the ratio between L1 and L2 is maintained from the R600, which would mean that the L2 should be 2X the L1 (it was 256KB in the R600). This would mean that the RV770 has 1280KB of L2 cache in total, split into 4 MC-aligned banks of 320KB each.
What is certain is that bandwidth has been significantly bumped both between the TUs and the L1 (480GB/sec) and between the L1 and the L2 (384GB/sec), double that of the RV670 (and, by extension, the R600), which, as Mr. Hartog himself underlined in his presentation, was necessary in order to ensure that the chip wouldn't run out of internal bandwidth.
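Since this is guesswork, it helps to lay the chain of assumptions out explicitly. The sketch below does just that: the 64KB per-L1 figure and the "same L1:L2 ratio as the R600" step are our speculation, not ATi numbers, and the variable names are purely illustrative.

```python
# Known R600 figures (from the article).
R600_L1_TOTAL_KB = 128
R600_L2_TOTAL_KB = 256

# RV770: TU and memory controller counts are known, the rest is speculation.
RV770_TUS = 10
RV770_MEMORY_CONTROLLERS = 4
RV770_L1_PER_TU_KB = 64  # ASSUMPTION: inferred from the slide and our testing

l1_total_kb = RV770_TUS * RV770_L1_PER_TU_KB              # 640
l1_growth = l1_total_kb // R600_L1_TOTAL_KB               # 5x over the R600

# ASSUMPTION: the R600's L2:L1 ratio (256KB:128KB = 2:1) carries over.
l2_to_l1_ratio = R600_L2_TOTAL_KB // R600_L1_TOTAL_KB     # 2
l2_total_kb = l1_total_kb * l2_to_l1_ratio                # 1280
l2_per_bank_kb = l2_total_kb // RV770_MEMORY_CONTROLLERS  # 320

print(f"L1: {l1_total_kb}KB total ({l1_growth}x the R600)")
print(f"L2: {l2_total_kb}KB total, {l2_per_bank_kb}KB per MC bank")
```

If either assumption is off, the L2 numbers move with it, so treat them as an estimate rather than a spec.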
Since some of you are certainly wondering about it, it should be mentioned that Anisotropic Filtering is the same between the R600 and the RV770, so it still has a slightly more pronounced angle-dependency when compared to competing solutions.
We've tested the RV770's fillrate with a number of apps, and always came up with numbers in the 19-20 GigaTexel range ... too low for a 10 TU/40 TA&TF part, which should be near the 25 GigaTexel mark (625*40/1000). For all those of you who cried "FOUL!", relax for a bit - we actually went and asked ATi why we were getting this behavior, and they explained that in spite of having 10 TUs on chip, there are only 32 interpolators, so the INT8 bilinear rate will be limited by that. Once you get filtering limited, with Anisotropic Filtering or by filtering FP textures, the architecture should behave like a 10 TU one. We've verified this using FP texturing, because that has also allowed us to check the assumption that the RV770 does FP filtering at half-rate:
[Table: texel fillrate test results]
We hacked an older fillrate testing application in order to vary the FP format being used. Due to the nature of the test and some of its inbuilt inefficiencies we're getting a bit under the theoretical figures, but the numbers correlate nicely with the above: the INT8 bilinear fillrate is limited by the 32 interpolators, the FP filtering modes run at half speed, and their performance indicates a 40 TA (and thus 40 TF) chip, as the rates we're seeing there would be a bit too much for a 32 TF one. Thanks go to Rys from B3D for pointing us in the right direction.
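The theoretical ceilings behind this reasoning are quick to work out. The sketch below assumes the 625MHz core clock and treats the interpolators as a simple cap on the INT8 bilinear rate, which is our simplified reading of ATi's explanation; the names are illustrative, and the results are paper ceilings, not our measurements.

```python
# Theoretical texel rates in GigaTexels/sec for a 625MHz RV770.
# Simplified model: the INT8 bilinear rate is capped by whichever is
# smaller, the filter units or the interpolators.
CORE_CLOCK_MHZ = 625
TEXTURE_FILTER_UNITS = 40
INTERPOLATORS = 32


def gtexels(units_per_clock: int, clock_mhz: int, rate: float = 1.0) -> float:
    """Texels per second, in billions, at the given per-unit rate."""
    return units_per_clock * clock_mhz * rate / 1000.0


filter_limit = gtexels(TEXTURE_FILTER_UNITS, CORE_CLOCK_MHZ)      # 25.0
interpolator_limit = gtexels(INTERPOLATORS, CORE_CLOCK_MHZ)       # 20.0
int8_bilinear = min(filter_limit, interpolator_limit)             # 20.0

# Half-rate FP filtering: what 40 filter units give versus what a
# hypothetical 32 TF part would give.
fp_half_rate_40tf = gtexels(TEXTURE_FILTER_UNITS, CORE_CLOCK_MHZ, 0.5)  # 12.5
fp_half_rate_32tf = gtexels(32, CORE_CLOCK_MHZ, 0.5)                    # 10.0

print(f"INT8 bilinear ceiling: {int8_bilinear} GTexels/s")
print(f"Half-rate FP ceiling: {fp_half_rate_40tf} (40 TF) vs {fp_half_rate_32tf} (32 TF)")
```

A measured half-rate FP result above the 10 GTexels/s ceiling of a 32 TF part is what points towards 40 filter units.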
This arrangement was also determined by internal ATi testing: the interpolation rate can become a limiting factor only if you have a shader doing a lot of texture fetching and not much else, but that's hardly a scenario that's likely in a real-world setting. However, it will show up in synthetic tests (like the above).


The RBEs were another primary focus for improvement, mainly in order to increase AA performance, another R600 weak-point. Gone is the dedicated alpha/fog unit - its use was rather questionable in terms of benefits anyhow. Depth and stencil capacity is doubled, which means that the RV770 is actually capable of Quad-Z/Stencil rates without AA (the R600 was capable of Double-Z/Stencil). We've tested this with Archmark 0.50:
A few things to note here. The RV670 has a higher Color-fillrate due to its higher core clock (775 vs 625). Most other fillrate testing apps, aside from Archmark, fail to expose Quad-Z for the RV770 in no AA scenarios, producing erroneous 2.5X-2.6X rates (too low to match what the architecture should be capable of, too high for a Dual-Z solution). On the other hand, Archmark tends to overestimate Z-Only fillrate for the RV670 (that's why the little asterisk is there). Stencil-only numbers are in line with what both architectures should be achieving. Correlating these numbers with the ones we'll be showing you in just a bit for AA scenarios, we'd lean towards considering the numbers Archmark produces for the RV770 correct.
Z and Stencil compression rates are the same as on the R600, so still at 16:1 with no MSAA and 128:1 with 8X MSAA.
Having gotten that out of the way, time to tackle the AA question, one that has been asked repeatedly. As is obvious from Mr. Hartog's presentation, the most significant RBE improvements are centered around this area: the RV770 handles 4 AA samples per clock and is thus capable of outputting 16 pixels per clock even with 4X AA, whereas the R600 could only manage 2 samples per clock, which resulted in it outputting only 8 pixels per clock with 4X AA enabled. This means an overall doubling of fillrate with AA compared to the R600. As a bonus, non-AA fillrate is also doubled for FP64 (16 bit per component) color formats, going from 8 pixels per clock to the full 16 pixels per clock.
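The samples-per-clock figures translate into pixel output with a simple model. The sketch below assumes output scales linearly with the sample budget and is capped at 16 pixels per clock; that linear model is our simplification (only the 4X AA figures are confirmed above), and the function name is made up.

```python
# Simplified RBE output model: each of the 16 pixel slots can process a
# fixed number of AA samples per clock, and pixel output drops once the
# requested sample count exceeds that budget. ASSUMPTION: linear scaling.
MAX_PIXELS_PER_CLOCK = 16
R600_SAMPLES_PER_CLOCK = 2
RV770_SAMPLES_PER_CLOCK = 4


def pixels_per_clock(samples_per_pixel_per_clock: int, aa_samples: int) -> int:
    """Pixels written per clock at a given MSAA level (1 means no AA)."""
    if aa_samples < 1:
        raise ValueError("aa_samples must be at least 1")
    sample_budget = MAX_PIXELS_PER_CLOCK * samples_per_pixel_per_clock
    return min(MAX_PIXELS_PER_CLOCK, sample_budget // aa_samples)


for aa in (1, 2, 4, 8):
    r600 = pixels_per_clock(R600_SAMPLES_PER_CLOCK, aa)
    rv770 = pixels_per_clock(RV770_SAMPLES_PER_CLOCK, aa)
    print(f"{aa}X: R600 {r600} px/clk, RV770 {rv770} px/clk")
```

Under this model the two chips only diverge from 4X AA upwards, which is where the doubling shows.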
Many of these improvements are tied to the redesign of the Color Blender block (CB), which took away unnecessary functionality and allowed the aforementioned doubling of AA rates. Here, the main design focus was on significantly increasing performance rather than on reducing area.
We checked to see how color and Z fillrates vary with number of AA samples using Mdolenc's Fillrate Tester (seems to produce accurate results under such circumstances):

The contrast with the RV670 is fairly striking - we should be expecting very impressive AA performance under real-world conditions, based on what we're seeing here. The behavior of the RV670 with 2X AA is surprising though, as it should in theory be capable of taking 2 AA samples per clock. Looking at the Z numbers, we can see that the RV770 actually does Quad-Z whilst the RV670 is only Dual-Z.
The sampling patterns for AA remain the same, which you can see in our AA&AF investigation. We'll be having a more in-depth look at image quality quite soon, but for the moment you can check those out and know that the RV770 uses the same patterns. Note that DX10.1 actually allows the application to control the sampling pattern being used if it so chooses. We'll also be looking at Edge-Detect CFAA, which has received significant improvements in terms of performance and compatibility - before that, try it in your DX9 games, you'll probably end up liking it quite a lot.
A final point to touch upon is resolve - is it fixed (was it broken?)? Well, to the glee of many, the fixed, "box" resolve is handled by dedicated hardware in the RBEs. The dedicated HW can run at full rendering rate, but tends to be memory bandwidth limited as it's a fast, simultaneous read/write operation in memory. Instantaneous resolve rate is also affected by the amount of fragmentation within a pixel, but the average number of fragments over all pixels is almost always very close to one.
Moving on to ATi's proprietary CFAA (yes, this is a proprietary implementation that relies on cooperation between dedicated hardware in the RBEs and in the Shader Core), it also has been overhauled on the RV770 compared to the R600, with performance being much better. For the moment, CFAA is a DX9 and OpenGL affair only, with DX10 support available very soon.
Thought all of the big surprises had been exposed, didn't you? Well, a major one is yet to come: the ring bus is gone!
One of the most distinctive characteristics of ATi GPUs for the last 3 generations is no longer included. Does this spell doom & gloom? Probably not. Look at the slide from Mr. Huddy's presentation and at the benefits it mentions - all are related to design simplification and none concerns performance. If the layout issues are resolved, a crossbar setup has better performance/area than a ringbus, with loads that are quite evenly balanced. The ringbus also presented serious challenges related to avoiding stalls or deadlock conditions, which implied a certain amount of overdesign. Those characteristics don't quite fit the focus on efficiency that was at the core of the RV770 design process. Ultimately, considering the near superlinear fashion in which chip area is increasing, as opposed to the linear manner in which the area required for a crossbar grows, the very reason that brought about the ringbus (the increased difficulty of adding wire-channels to a chip) becomes non-existent.
The new architecture has fully distributed memory controllers with the high bandwidth consumers like RBEs/Caches right next to them. The goal of this arrangement is to increase efficiency of the memory (these increases have been measured under real world conditions in ATi's labs in order to validate the entire idea), and is actually another reason for the stellar performance the chip has with AA enabled...performance you don't yet know about (from us at least) as the performance investigation goes online a bit later today.
Alongside the memory controllers, you have a hub that's destined to handle low bandwidth communication (as you can see from the slide). The new Memory Controller has also been engineered to play nicely with GDDR5, which is quite a jump on all accounts when compared to prior GDDR types. We'll talk more about GDDR5 once the 4870 gets here and we'll actually have material support for what we'll be saying, but a minor note should be made lest you become confused: you might see GDDR5 frequencies being quoted as 900MHz - this refers to internal RAM chip clocks, but communication still takes place at 1.8 GHz with data transfers occurring both on the falling and rising edges of the clock signal, which equates to 3.6 Gbps/pin, the way ATi would like you to think about it, or the 3.6GHz figure you'll see more frequently quoted.
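The various GDDR5 "frequencies" are all the same number seen from different angles, as this small sketch shows. The function names are ours, and the 256-bit bus width used for the bandwidth line is an assumption for illustration only, since this article doesn't state the RV770's memory bus width.

```python
# GDDR5 clock bookkeeping: the I/O clock runs at twice the quoted internal
# clock, and data moves on both the rising and falling edges of it.
def gddr5_gbps_per_pin(internal_clock_mhz: float) -> float:
    """Per-pin data rate in Gbps for a given internal RAM clock."""
    io_clock_mhz = internal_clock_mhz * 2      # 900MHz -> 1.8GHz
    transfers_per_sec_m = io_clock_mhz * 2     # both clock edges
    return transfers_per_sec_m / 1000.0


def bandwidth_gb_per_sec(gbps_per_pin: float, bus_width_bits: int) -> float:
    """Aggregate bandwidth in GB/sec for a given bus width."""
    return gbps_per_pin * bus_width_bits / 8


rate = gddr5_gbps_per_pin(900)
print(f"{rate} Gbps/pin")

# ASSUMPTION: 256-bit bus, used here purely as an example width.
print(f"{bandwidth_gb_per_sec(rate, 256):.1f} GB/sec on a 256-bit bus")
```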
We've barely scratched the surface of the little wonder that the RV770 is, but rest assured that we'll look into it in even greater detail throughout this day (be sure to check back at regular intervals as we push out the next articles), and in the coming weeks as we grow more familiar with the architecture. There are a few things left to touch upon before closing though:
The presentations we've seen alluded to a very significant increase in Geometry Shader performance and, more specifically, in geometry amplification performance, due to increased on-chip storage capacity for GS-generated data and support for 4 times more threads in flight. We've done a bit of in-house testing and managed to get quantifiable GS performance increases. So that you can check for yourselves, here are some RightMark3D 2.0 Hyperlight numbers (the only publicly available test that has chances of actually being GS bound, by its use of amplification, in spite of the fact that it does far too much texturing for its own good - it's actually probably texture limited on the RV670). Both the Geometry and GS loads were set to high:

GPGPU is an important emergent area of application and, dare we say, battleground. The RV770 brings significant increases here as well, but the subject itself deserves to be treated separately so we'll defer looking into GPGPU to a later date.
Going by the wording in one of the presentations we have, the tessellation unit has been improved from the one included in the R600/RV670. This is as yet hard to verify in practice, but stay tuned as new, interesting developments in this area are bound to happen far sooner than you'd expect (yes, games will start taking advantage of it, as surprising as it might seem to some).
In another slightly unexpected move, ATi opted to implement a hardware solution for power management, based on an on-chip microcontroller that, going by the figures ATi quotes, does a rather great job. Check out the slide underneath for slightly more details:
This concludes our preliminary look at the RV770 architecture. There are many things left to say and find out (we hear Beyond3D's sexy overlord Rys might have a thing or two to say about the topic, so be sure not to miss that as Rys' work is always great). Later today we'll be showing you how all of these improvements pan out in real life scenarios, and whether or not the 4850, the first RV770 representative that made it into the lab, is the little chip that could. Until then, we're going to leave you to look at what is referred to, in highly academic circles, as "geek porn":
Rage3D wishes to thank ATi's Scott Hartog, Dave Baumann, and the rest of the ATi Team for their support, without which this series of RV770 articles might never have come to fruition.