Product: BFG 8800 GTX and XFX 8800 GTS
Company: NVIDIA
Author: Mark "Ratchet" Thorne
Date: November 27th, 2006
G80 Architecture

Looking at the G80 architecture diagram below, you’ll see no mention of the things you’d normally expect to find in abundance on a typical GPU diagram, like pixel or vertex shaders. What you see instead are eight clusters of little green and blue squares that represent the innards of the thread processing clusters. Each green square, 16 per cluster for 128 in total, represents what NVIDIA calls a “Stream Processor”, which is a generalized floating-point unit that can handle pixel, vertex, geometry, or even physics operations.

These stream processors, which, unlike most of the rest of the chip, run at an astounding 1.35GHz, are at the heart of the first DX10 graphics card on the market and the first to feature a fully unified shader architecture.


[Figure: G80 architecture diagram, with a closer look at the thread processing cluster]

As much as it feels like it, the concept of a unified shader isn’t new. In fact, Microsoft’s Xbox 360 gaming console has a graphics core, designed by ATI some years ago, that is based on the principles of the unified shader.

Shader unification is all about efficiency. In a traditional GPU, one with discrete pixel and vertex shader units, there exists a real possibility that for any given scene either the pixel shader engine or the vertex shader engine could go underutilized, leaving hardware idle and affecting potential performance.

In a scene with a lot of pixel shader effects and few vertex shader effects, for example, the pixel shader units are working at or near 100% of capacity while the vertex hardware might be doing very little. In a vertex-heavy scene, the opposite would be true. Idle hardware is wasted hardware, which is why chip engineers from all fields work to make sure hardware is working as close to 100% capacity as possible. With discrete pixel and vertex shader units, and the way games are designed, it’s very difficult to make a GPU work as efficiently as possible.

With a unified shader architecture the GPU can perform vertex or pixel calculations in the same units. If a scene has a lot of pixel shader effects and few vertex shader effects, then most of the units would be used for the pixel shader effects while only a few would be used for the vertex shader effects. If there are a lot of vertex effects, then the opposite is true. It is flexibility here that makes a unified architecture efficient, and improvements to efficiency lead to improvements in overall performance. It’s like a spork, and we all know how much a good spork kicks ass.
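To put some rough numbers on the efficiency argument, here is a toy model in Python. It is only a sketch and says nothing about how G80 actually schedules work: the unit counts (24 pixel units plus 8 vertex units versus 32 unified units) and the workloads are made-up illustrative values, and it assumes work splits perfectly across units.

```python
def discrete_time(pixel_work, vertex_work, pixel_units=24, vertex_units=8):
    """Frame time with separate, fixed pools of pixel and vertex units.

    Hypothetical model: the frame is finished only when the slower pool is done.
    """
    return max(pixel_work / pixel_units, vertex_work / vertex_units)


def unified_time(pixel_work, vertex_work, units=32):
    """Frame time when any unit can take either kind of work (hypothetical model)."""
    return (pixel_work + vertex_work) / units


# Made-up workloads as (pixel_work, vertex_work), in arbitrary units of work.
scenes = {
    "pixel heavy": (960, 32),
    "vertex heavy": (240, 400),
    "balanced": (720, 240),
}

for name, (pixel_work, vertex_work) in scenes.items():
    d = discrete_time(pixel_work, vertex_work)
    u = unified_time(pixel_work, vertex_work)
    print(f"{name:13s} discrete={d:6.2f}  unified={u:6.2f}  speedup={d / u:4.2f}x")
```

In this model the two designs only tie when the workload happens to match the fixed 24:8 split. Any other mix leaves one of the discrete pools waiting on the other, while the unified pool stays busy.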

With NVIDIA’s G80 and ATI’s upcoming R600 GPU, it seems that both companies have decided that the unified shader architecture approach is the way forward.

Texturing Powerhouse

In the cluster diagram to the right you can see both the "stream processors" (scalar arithmetic units, or ALUs) and the texture units. Every cluster has a scheduler which is coupled to 16 stream processors (one ALU block), 16 interpolation units (which will not be further discussed here) and 4 texture address units.

These addressing units each have two texture filtering units at their disposal, unlike in earlier NVIDIA and ATI designs where each addressing unit was coupled to a single filtering unit.

What this means is that simple operations, such as bilinear filtering, run as if there were 32 TMUs on the chip, because throughput is limited by the 32 addressing units. More complex filtering, however, such as trilinear, anisotropic, and FP16 filtering, can keep both filtering units per addressing unit busy and will operate as if there were 64 TMUs.

That's 18.4 billion filtered textured fragments per second for the 8800 GTX with either bilinear or more advanced filtering (unlike the stream processors, the texturing units operate at the core speed).
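The fill rate figure can be reproduced with a few lines of arithmetic. The sketch below uses the unit counts given in this section; the 575MHz core clock is the 8800 GTX's published specification and is not stated in the text above, so treat it as an outside input. The variable names are purely illustrative.

```python
# Unit counts as described in this section.
CLUSTERS = 8
ADDRESS_UNITS_PER_CLUSTER = 4
FILTER_UNITS_PER_ADDRESS_UNIT = 2

# Assumption: 8800 GTX core clock from the card's published specs (not given above).
CORE_CLOCK_MHZ = 575

address_units = CLUSTERS * ADDRESS_UNITS_PER_CLUSTER            # texture address units
filter_units = address_units * FILTER_UNITS_PER_ADDRESS_UNIT    # texture filtering units

# One filtered fragment per address unit per core clock, in billions per second.
fill_rate_billions = address_units * CORE_CLOCK_MHZ / 1000

print(f"address units: {address_units}")
print(f"filter units:  {filter_units}")
print(f"fill rate:     {fill_rate_billions:.1f} billion fragments/s")
```

The rate is set by the 32 addressing units. The second filtering unit per addressing unit does not raise that ceiling; it lets the more expensive filtering modes run at that same rate.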

In practice, when High Quality rendering is selected in the control panel (the setting used for all tests throughout this review), all the pixels on the screen are filtered with at least trilinear filtering, and possibly with some extra anisotropic filtering, so performance is very similar to a graphics card with 64 TMUs.

Overall this is an interesting design choice. Even without its doubled texture filtering ratio (which allows for the nearly free anisotropic filtering), the G80 would still have substantially more texturing power per clock than anything before it.

Also, unlike previous NVIDIA GPUs, the texturing units are decoupled from the arithmetic units. This results in higher efficiency with advanced filtering, as the arithmetic units will not stall while waiting for the texture units to finish. ATI’s R5x0 family also has decoupled texture units, which certainly helped its 16 TMUs compete against the 24 TMUs of NVIDIA's G7x. Just like on the R580, the texture units on the G80 remain hardwired to specific ALU blocks.

ROPs and Memory Subsystem

The 8800 GTX has six Raster Operation Partitions, or ROPs, and each of those can process 4 pixels per clock for a total of 24 pixels/clock with color and Z processing (the 8800 GTS has one of the ROP partitions disabled, resulting in a total of 20 pixels/clock). For Z-only processing NVIDIA has implemented a new technique that allows for an amazing 192 samples/clock to be processed when a single sample is used per pixel (check the chart below for Z-only results; G80, putting the ZOMG in Z-only).
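The ROP totals follow directly from the partition counts. The short sketch below redoes that arithmetic; the per-partition Z-only figure is simply 192 divided by six, which is implied by the numbers above rather than stated outright, and the function name is illustrative.

```python
PIXELS_PER_PARTITION_PER_CLOCK = 4  # color + Z, as stated above


def rop_pixels_per_clock(partitions):
    """Total color + Z pixels per clock for a given number of ROP partitions."""
    return partitions * PIXELS_PER_PARTITION_PER_CLOCK


gtx_pixels = rop_pixels_per_clock(6)  # 8800 GTX: all six partitions enabled
gts_pixels = rop_pixels_per_clock(5)  # 8800 GTS: one partition disabled

# Derived, not quoted: Z-only samples per clock per partition on the GTX.
z_only_per_partition = 192 // 6

print(gtx_pixels, gts_pixels, z_only_per_partition)
```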

The G80 ROPs support multisampled, supersampled, and transparency adaptive antialiasing with four new antialiasing modes: 8x, 8xQ, 16x, and 16xQ (more on those later), along with the normal 2x and 4x AA modes we’ve all grown accustomed to. They also support blending of FP16 and FP32 render targets, finally allowing HDR+AA on NVIDIA hardware (maybe now that NVIDIA supports it we’ll actually see it used in games!).

Each of the six memory partitions on the 8800 GTX GPU provides a 64-bit interface to memory, giving us a combined 384-bit interface width. On the 8800 GTS one of the memory partitions has been removed, resulting in a 320-bit interface width. A high-speed crossbar, similar to the one on NVIDIA’s previous 7-series GPUs, supports DDR1, DDR2, DDR3, GDDR3, and GDDR4 memory types on both SKUs. The 8800 GTX uses GDDR3 clocked at 900MHz (1.8GHz effective), providing 86.4 GB/s of raw memory bandwidth. The 8800 GTS also uses GDDR3, but has it clocked at 800MHz (1.6GHz effective), providing 64.0 GB/s of bandwidth (the exact same bandwidth as ATI’s X1950 XTX, incidentally).
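For anyone who wants to check the bandwidth figures, the calculation is bus width in bytes times the effective data rate. The sketch below uses the partition counts and clocks from the paragraph above; the X1950 XTX line assumes that card's published 256-bit bus and 2.0GHz effective GDDR4, which are not given in this article. The function name is illustrative.

```python
def memory_bandwidth_gbs(bus_width_bits, effective_mhz):
    """Raw memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_mhz / 1000


gtx_bus = 6 * 64  # six 64-bit partitions
gts_bus = 5 * 64  # one partition removed

gtx_bw = memory_bandwidth_gbs(gtx_bus, 1800)  # GDDR3 at 900MHz, double data rate
gts_bw = memory_bandwidth_gbs(gts_bus, 1600)  # GDDR3 at 800MHz, double data rate

# Assumption: X1950 XTX specs (256-bit, 2000MHz effective) come from outside this article.
x1950_bw = memory_bandwidth_gbs(256, 2000)

print(f"8800 GTX:  {gtx_bus}-bit, {gtx_bw:.1f} GB/s")
print(f"8800 GTS:  {gts_bus}-bit, {gts_bw:.1f} GB/s")
print(f"X1950 XTX: 256-bit, {x1950_bw:.1f} GB/s")
```

The GTS gets to the same number as the X1950 XTX by a different route: a wider bus running slower memory.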



Copyright 2009 © Rage3D.com

You may not use content, graphics, or code elements from this page without express written consent from Rage3D.com

All logos are trademarks of their original owners. Used with permission.