
The above paragraph serves to outline one of the major perils associated with analyzing the perpetual AMD-Intel struggle: becoming enamored with the image of the heroic if tiny AMD, battling the evil, scheming blue giant. This only serves to undermine one's critical thinking and leads him down a perilous path of over-indulgence. Of course as with all things, the opposite extreme exists: it's a vitriol drenched corner where words such as "fail" or grim prophecies about immediate death of the green-team constitute the norm. We had to walk a rather fine line, trying to avoid falling in love with either of the above mentioned ideologies. Hopefully we've managed to do just that and this article ended up being that which it was meant to be: a lucid look at AMD's new CPU, the Phenom II (codename Deneb), that gives you a useful data-point in your quest for knowledge.
Having cleared those secondary aspects (please, don't misconstrue the above as a way of claiming pure objectivity: we're humans, and it's part of our nature to be subjective - we've just strived to ensure that that doesn't cloud our judgment), it's time to start our trip from a rather interesting base-camp: the year 2003.
Once upon a time, there was K8 ... and it was good
Six years ago AMD introduced its K8 architecture, betting quite a bit on it. The Sunnyvale company had managed for the first time in its history to be rather competitive with its considerably larger "foe", Intel, with its K7 CPU. However, the K7 was beginning to lose its breath and a substitution was needed, otherwise the game would've become as boring as it had been in older times. The K8 was similar to Ronaldinho in his heyday: it came on the "pitch", outmatching its opponents with ease. This was helped by the fact that at the time Intel had taken an alternative path with its CPUs, opting for a rather deep pipeline and reduced work per cycle/clock, aiming to offset these two hindrances with very high clocks. However, those very high clocks ultimately proved to be unrealistic targets, and the Pentium 4 ended up losing that round.
Another significant breakthrough AMD achieved with the K8 was the establishment of its Opterons as a serious option for the highly lucrative x86 server market, which had been serviced almost exclusively by Intel up to that point. In fact AMD ended up being so interesting that it became capacity constrained, thus slightly shooting itself in the foot. There's definitely something to be said about the merits of properly forecasting/evaluating what you can realistically supply to your customers. However all was beyond peachy and for all intents and purposes it seemed like a new era was dawning, one in which the ruler of the roost had green feathers. Of course this is all quite simplified, and quite a bit of time could be spent on discussing the differing aspects that led to the state of things during that timeframe. However, living in the past is a thing better left to the Quantum Leaping Scott Bakula, and we'd be far better off hitting the fast forward button and leaping to the year of our lord (insert your deity of choice here) 2006.
It took Intel 3 years to turn the tides of “war”, but those were 3 years well spent: the new Core microarchitecture that resulted was (and still is) quite excellent. We'll get into the juicy details hiding in the seas of silicon a tad bit later; at this point in time it should suffice to say that with its introduction, Intel had the best desktop CPU in all regards. It also had a good server proposition, which was important since it was in the server market where the greatest delta existed before. For multi-socket servers and for certain workloads, AMD retained something of an advantage due to some architectural traits of its CPUs, namely HyperTransport and the Integrated Memory Controller (IMC), but more on these later.
In case you're wondering what AMD was up to, the answer to that question is multi-faceted. Apparently they were doing little, introducing a number of mediocre tweaks to the K8 architecture, and giving the impression they were merely resting on their laurels. Less obvious to the outside world was the internal turmoil centered around bringing forth a new architecture. It's unclear just how things went on, but it is known that at least 2 architectures were considered, went into development and were subsequently scrapped. This involved wasted money and resources, and when your competitor outguns you in those areas significantly, such losses are seldom without consequence. Finally, the design of choice was the one we'd come to know as K8L/K10/Barcelona/Agena, which represented a natural evolution of K8 rather than a complete re-imagination of the successful chip.
AMD presented its new baby in 2006, but availability was planned for 2007 which gave Intel a rather serious head-start. However, going by what was disclosed and by the confidence shown by AMD officials, it appeared as if the K8L was going to mark a return to competition. What happened in practice was quite different from those early predictions.
Barcelona - beautiful city, awesome processor?
Late in 2007 AMD's much talked about new processor was finally released, initially in its server-targeted Opteron incarnation, and subsequently as the desktop-oriented Phenom. To say that it was quite a flop would be something of a truism. Given the amount of hype that had been generated and the general expectations that had been established, partly based on overly-optimistic estimates by AMD representatives (“definitely in the double digits” rings a bell?), the K8L should've been the mother of all CPUs ever seen up to that date. Unfortunately for AMD it actually ended up being a rather troubled part that had significant issues with achieving satisfactory clock-speeds. Adding extreme insult to grievous injury, it also failed to achieve clock for clock performance parity with the competing Intel parts.
So, much lower clocks, lower clock-for-clock performance... could anything else go wrong? Actually, yes: AMD did a wonderful job FUD-ifying itself over an erratum involving the L3 TLB. Had a competitor attempted to create more Fear, Uncertainty or Doubt over the thing, it is doubtful it could've done a “better” job. Contradictory statements, apparent disorientation, and rather fatalistic reports helped amplify what would've normally been a moderately trivial thing into a completely show-stopping issue. When things were finally brought back under control the damage had already been done.
The 65nm Phenom drudged around (what else was there to do?), slowly ramping up clocks, getting a TLB-bug free B3 revision, and ending up near the end of its lifecycle in intense competition with the Core 2 Q6600... Intel's lowest end quad-core CPU. For servers things were slightly different since AMD's architecture had a number of advantages there, so it managed to be more competitive in the server space. Even still, the overall picture was not a rosy one.
The short history “lesson” above was meant to give an idea of where the 45 nm shrink of the K8L came in and what work was cut out for it. It had to fix the mess its older brother created, and had to bring at least some level of competitiveness back to AMD's offerings. Had it failed to do so, the company would've been relegated to a state of complete irrelevancy on the CPU market for at least the next 2-3 years, which would've also more or less meant the near complete undermining of its status as a viable competitor to Intel. Throughout this article we'll try to see if those aims were reached. We'll also look at the K8L architecture and compare it with its competitors, try to grasp where AMD was coming from with its design choices, and hopefully provide you with an entertaining and detailed read.
With all that in mind, be aware that the first part detailing the architectures will be devoid of benchmarks, so if you're on the prowl for graphs and uninterested in silicon fueled yappings you'll have to absorb the next couple pages first.
One of the preferred fuels for angry flames and/or marketing vehicles has been the true/untrue multicore debate. On one hand AMD prided itself on having a “true” dual-core CPU in the Athlon X2s, whereas the competition, Intel, had a “fake”, glued-together counterpart in the Pentium Ds. In the ramp up to Barcelona (Agena), some extra chest beating happened over the fact that it would be the first native, true quad-core CPU. So what's all this true/untrue ruckus about, you ask? Well, mostly about choices, trade-offs, and target-markets.
Simplifying things a bit, a monolithic design (the “true” multi-core approach) implies that the CPU is a fully integrated affair, which is to say that the die contains the logic for all N cores, their caches, and possibly the memory-controller. By contrast, a non-monolithic multi-chip design involves taking M separate N/M core dice and packaging them together to get an N core CPU (e.g. take two 2-core dice and package them together to get a single 4 core package). Integration is obviously reduced in this case, with the memory controller being typically off package. The inter-core data sharing then happens via main RAM which tends to be a slow affair (compared to the alternative).
Now, remember that we were saying something about trade-offs? Time to see what these are. First of all, opting for a monolithic approach is one of the fundamental design choices made by the architects and it must be made in the beginning. It affects most, if not all subsequent aspects of the architecture. The complexity of going monolithic is higher than using a multi-chip package approach. Achieving target frequencies is more challenging and yields tend to be lower since the die itself is larger. Overall a monolithic design represents a fairly considerable challenge for the one attempting to create and produce it.
But what is there to gain by opting for going monolithic? Why would architects make all those trade-offs? Primarily because of the gains they bring to bandwidth sensitive workloads that involve datasets that don't fit into caches. Data-mining would be an example. Before fleshing out the topic more, it would make sense to look at how AMD and Intel went about their multi-core business. Note that throughout this article we'll focus on Deneb (45nm K8L) and Yorkfield (45nm Core 2) - however, most of everything that will be said also applies to the older 65nm implementations, and some applies to less than 4 core CPUs:


As you can see, the two competitors are rather different beasts, with AMD opting for a monolithic 4-core implementation, and Intel deciding to package together 2 monolithic dual-cores in order to get a quad-core device. Each approach yields certain specific advantages, and each is adequate/proper for the market it was intended for.
Starting from the top with stuff that's directly comparable: for Deneb we have twice the L1 per core versus Yorkfield (64K L1D+64K L1I versus 32K L1D+32K L1I). Getting to the L2, for Deneb there's a discrete 512K pool for each core, whereas in Yorkfield there's a fairly huge shared 6MB pool per 2 cores (or per discrete die). Deneb has an extra 3rd level cache that is shared between all 4 cores. Cache coherency between the cores is handled via this third added level. This third level was one of the things K8L added, and its larger size is one of the extra “tricks” Deneb gained versus Agena, which had only a 2 MB L3 cache. But more on this when we discuss the memory subsystem. Speaking of cache coherency, for Yorkfield it's obviously different: for cores on the same die it's handled via the shared L2 (under special circumstances, the cores can share L1 cache-lines as well), whilst for cores on separate dies it's done through RAM.
Since we're on the topic, the way the CPUs communicate with the outside world is also significantly different: AMD has the memory controller integrated on-die, whereas Intel places the memory controller off-die within the northbridge and communicates with it via the Front-side Bus(FSB). In fact, all traffic goes through the FSB in Intel's case: coherency traffic between multiple processors in multi processor systems, memory accessing, command dispatch to GPUs etc. AMD's approach is more elegant, with memory accessing being decoupled from all other traffic by virtue of the aforementioned Integrated Memory Controller(IMC), and the rest being handled by the low-latency point-to-point HyperTransport link(s).
Time for the reality check: most of this is irrelevant in the desktop space, since the FSB is seldom (if ever) a limiting factor or overly congested as traffic is quite low. Also, desktop applications are unlikely to ever be truly bandwidth bound, with data locality being typically excellent (datasets fit within caches nicely so you seldom end up refilling from RAM). This counteracts the advantage of the IMC's higher throughput, and lower latencies aren't something that'll make your Crysis experience better or mean a lot for desktop users (more on the topic when we discuss the memory subsystem). However, for servers things change quite dramatically: you have to account for significant coherency traffic between separate processors, you also have cores sharing cachelines and datasets seldom fit into cache, so you're also accessing RAM quite intensely. If you're thinking this clogs the tubes for Intel you'd be right (this is one of the major things Nehalem fixed). The fact that the FSB imposes an odd topology in which the northbridge acts as the nexus/arbitrator for all CPUs doesn't help at all. You'll see AMD winning many server benchmarks against Intel CPUs prior to Nehalem by virtue of its better I/O system.
All of this lengthy exposition was meant to give a preliminary insight into what was the goal for AMD's architects as they were (feverishly) working on K8L: design an excellent server CPU that builds on existing strengths and exploits the competitor's weaknesses, and scale it downwards into the desktop space. Whilst this generally means an excellent desktop CPU as well, that's not necessarily a given since the demands imposed by the two market segments are not quite congruent. As we continue to explore the underlying architectures, you'll see how some design decisions derived from AMD's goals may have bitten them in the gluteus maximus in the desktop battle against the Core 2s, which were primarily designed as excellent desktop CPUs that could be scaled upwards into the server space.
All of you are probably wondering what hides under those boring heatspreaders / beyond those colorful representations we drew above. You'd like to flex your muscular muscles and rip the soldered metal off to explore the cores themselves (yes, ladies, the world of hardware enthusiasts is filled with muscular guys that rip stuff ... really!), and gain intimate knowledge of them. Well, luckily for us, AMD and Intel already did that for their software optimization guides... and they even took pictures! All joking aside, here are the simplified depictions of how a Deneb / Yorkfield core looks and works:
The role of this part of a modern out-of-order (OOO) CPU is to handle instruction decoding, and keep the OOO engine “fed” thus preventing costly stalls. As such its components typically handle Branch Prediction, Instruction Fetch, and Instruction Decode. Let's do a comparative analysis for these. Over the next few pages we'll follow the basic structure for each section, consisting of: Deneb detail, Yorkfield detail, and then a comparison between the two.
Branch Prediction
Branch Prediction: Deneb
AMD uses a Branch Target address Buffer (BTB), a Global History Bimodal Counter (GHBC) table, and a Return Address Stack (RAS)
The BTB is a 2048-entry table containing predicted branch target addresses whilst the GHBC is also a table, with 16384 entries containing 2-bit saturating counters used to predict if a conditional branch is taken, and indexed using the outcome of the last 12 conditional branches
There's also a separate Indirect Branch Predictor(IBP) used to predict indirect branches with multiple dynamic targets (single target indirect branches use the BTB table). There are at least 512 entries in the array of targets used here, but probably more based on some documentation AMD sent out alongside Deneb
The return address stack is 24-entry deep, and is employed in predicting the return addresses from near or far calls via push-popping that brought back OpenGL programming memories: during call fetching, the next return address is pushed onto the stack with subsequent returns popping a predicted address off the top
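The GHBC lookup described above can be sketched roughly as follows. This is a toy model and not AMD's actual design: the text only gives the table size (16384 two-bit counters) and the history length (12 outcomes), so the way the remaining index bits are formed is an assumption (two low bits of the branch address are simply concatenated with the history), and the names `GlobalHistoryPredictor`, `predict`, `update` and `accuracy` are purely illustrative.

```python
class GlobalHistoryPredictor:
    """Toy global-history predictor built from 2-bit saturating counters."""

    HISTORY_BITS = 12      # outcomes of the last 12 conditional branches (from the text)
    TABLE_ENTRIES = 16384  # number of 2-bit counters (from the text)

    def __init__(self):
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        # Starting every counter at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * self.TABLE_ENTRIES
        self.history = 0  # most recent branch outcome lives in bit 0

    def _index(self, branch_address):
        # ASSUMPTION: 12 history bits plus 2 address bits form the 14-bit index.
        address_bits = branch_address & 0b11
        return (address_bits << self.HISTORY_BITS) | self.history

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(predictor, branch_address, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(branch_address) == taken
        predictor.update(branch_address, taken)
    return hits / len(outcomes)


if __name__ == "__main__":
    # A loop branch that is taken three times and then falls through.
    pattern = [True, True, True, False] * 100
    print(accuracy(GlobalHistoryPredictor(), 0x400, pattern))
```

Because 12 outcomes of history cover several repetitions of such a short pattern, each history value ends up steering its own counter, which is why even this simplistic model learns the loop exit after a brief warm-up.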
Branch Prediction: Yorkfield
Intel's bag of tricks is more loaded: alongside the “common” (in purpose, implementations probably differ) BTB, Global History Table, RAS and Indirect Branch Predictor, there is also the Loop Stream Detector (LSD - no, not that LSD)
The BTB and global history table are similar in purpose and operation to those in Deneb, however Intel does not share the specific details about the entry count for its target address or history tables
The IBP picks its targets based on global history and can handle both single target indirect branches as well as multiple target dynamic ones
RAS depth is lower than Deneb's, at 16 entries
The LSD seeks to detect loops that are candidates for the Instruction Queue(IQ), and in the case such a loop is detected it's allowed to stream from the IQ until a mis-predict happens; a candidate loop must be smaller than 18 instructions and should contain no more than 4 conditional branches and no RET instructions
Another tweak involves the queuing of Branch Prediction lookups: predictions are made for 32 bytes at a time (2x the width of the fetch engine), which means that contrary to prior architectures in which predicted branches introduced a 1 cycle penalty, the penalty for taken branches in Yorkfield is generally 0 cycles
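The LSD eligibility rules listed above can be written down as a small check. This is only a sketch of the three criteria quoted in the text (fewer than 18 instructions, at most 4 conditional branches, no RET); the `Instruction` type and the function name are made up for illustration, and the real hardware may well apply further conditions that aren't covered here.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    mnemonic: str
    is_conditional_branch: bool = False


MAX_LOOP_INSTRUCTIONS = 18    # a candidate must be smaller than this (from the text)
MAX_CONDITIONAL_BRANCHES = 4  # from the text


def is_lsd_candidate(loop_body):
    """Apply the three loop criteria quoted in the text to a list of Instructions."""
    if len(loop_body) >= MAX_LOOP_INSTRUCTIONS:
        return False
    if any(ins.mnemonic.upper() == "RET" for ins in loop_body):
        return False
    branches = sum(ins.is_conditional_branch for ins in loop_body)
    return branches <= MAX_CONDITIONAL_BRANCHES


if __name__ == "__main__":
    small_loop = [
        Instruction("mov"),
        Instruction("add"),
        Instruction("dec"),
        Instruction("jnz", is_conditional_branch=True),
    ]
    print(is_lsd_candidate(small_loop))
```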
Branch Prediction: Deneb vs Yorkfield
A tough comparison to make since Intel shares fewer details than AMD, with neither discussing the BP algorithms being used (these are regarded as trade secrets). However the additional elements in Yorkfield, coupled with anecdotal evidence, indicate that it generally achieves a better prediction success rate than its competitor (we'll have some data on this later on).
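One of the few directly comparable numbers is the return address stack depth (24 entries for Deneb, 16 for Yorkfield), and its effect can be illustrated with a small model. What happens on overflow isn't covered in the text, so the sketch below assumes the oldest entry is simply discarded; `ReturnAddressStack` and `correct_returns` are illustrative names rather than anything from vendor documentation.

```python
class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # ASSUMPTION: when the stack is full, the oldest entry is dropped.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def on_return(self):
        # Pop the predicted return address; None means "no prediction".
        return self.entries.pop() if self.entries else None


def correct_returns(ras_depth, call_depth):
    """Nest call_depth calls, unwind them, and count correctly predicted returns."""
    ras = ReturnAddressStack(ras_depth)
    addresses = [0x1000 + 4 * i for i in range(call_depth)]
    for address in addresses:
        ras.on_call(address)
    hits = 0
    for address in reversed(addresses):
        hits += ras.on_return() == address
    return hits


if __name__ == "__main__":
    for name, depth in (("Deneb", 24), ("Yorkfield", 16)):
        print(name, correct_returns(depth, call_depth=20))
```

Under these assumptions a chain of 20 nested calls unwinds with all 20 returns predicted on the 24-entry stack and only 16 on the 16-entry one, so the difference only shows up in fairly deep call chains.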
Instruction Fetch&PreDecode
Instruction Fetch&PreDecode: Deneb
To maintain decoding rate, the K8L predecodes instructions as the cachelines are being fetched into the L1 Instruction Cache(L1I), storing instruction labeling info in special fields in the cache (3 bits predecode info per each instruction byte)
Instructions are fetched from the L1I cache in a 32-byte window, which is double the width of its K8 predecessor or its Intel competitors
This wide fetch was chosen in order to ensure that no front-end starvation would occur when dealing with SIMD and 64-bit instructions which can have lengths as high as 7-10 bytes: a 32-byte fetch means that decoder-empty cycles are reduced to a minimum
After the instruction window is obtained from the L1I, its contents are examined to establish whether they're DirectPath (simple instructions<=2 macro-ops) or VectorPath (complex instructions>2 macro-ops) decodable
Instruction Fetch&PreDecode: Yorkfield
Contrary to Deneb, the fetch window for Yorkfield is 16-byte wide. This choice underlines the desktop focus since SIMD and 64-bit instructions are hardly as widely utilized in this space. Most typical programs average about 4 bytes per instruction (according to Intel's estimates)
After the instruction window is obtained, instruction predecode takes place, and instruction length, prefix decoding and instruction properties are established
The predecode unit can write up to 6 instructions per cycle into the IQ (instruction queue). In the rare case in which the fetch window contains more than 6 instructions, predecoding continues to take place at 6 instructions per cycle with all subsequent fetches entering predecoding only after the current fetch finishes. For example, an 8 instruction fetch would be predecoded across 2 cycles with the first cycle handling 6 instructions, the next handling the other 2, and the third cycle handling the next 16-byte fetch window
It makes sense to discuss the IQ here, since it sits somewhere between predecode and decode: as already mentioned, it's 18 instructions deep and can send instructions into the decoders at a rate of 5 per cycle. It may have a certain buffering effect when dealing with suboptimal fetching scenarios (e.g. long instructions that don't fit into the fetch window nicely), by absorbing some of the latency generated and keeping the decoders fed at a steady rate instead of inducing decoder empty cycles due to low fetch rates
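The predecode timing from the example above (an 8-instruction fetch taking 2 cycles, with the next fetch window starting on the third) boils down to a ceiling division. A minimal sketch, with function names chosen here for illustration and with the simplifying assumption that a fetch window always costs at least one cycle:

```python
import math

PREDECODE_WIDTH = 6  # instructions written into the IQ per cycle (from the text)


def predecode_cycles(instructions_in_window):
    """Cycles needed to predecode one fetch window."""
    return max(1, math.ceil(instructions_in_window / PREDECODE_WIDTH))


def total_predecode_cycles(windows):
    """Windows are handled back to back; a new one starts only after the previous finishes."""
    return sum(predecode_cycles(count) for count in windows)


if __name__ == "__main__":
    # An 8-instruction window followed by a 4-instruction window.
    print(total_predecode_cycles([8, 4]))
```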
Instruction Fetch&PreDecode: Deneb vs Yorkfield
Deneb holds an advantage due to its wider fetch window, one that ensures the decoders will always be “fed” at a steady rate (as long as good data locality is in place of course; if the needed instructions aren't in the L1I the widest of fetch widths won't help one iota). This is bound to matter in more server-centric scenarios where long 64-bit SIMD instructions are used, think HPC for example, with these scenarios being likely to make Yorkfield fetch-starved. For typical desktop workloads it's not likely to make a significant difference. Also, the IQ may help Intel to hide some of the latency associated with fetch starvation.
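The fetch-starvation argument is easy to put into numbers. The snippet below is a back-of-the-envelope estimate only: it divides the window size by an average instruction length and ignores alignment and instructions that straddle two windows, which is an assumption made to keep the arithmetic simple.

```python
FETCH_WINDOW_BYTES = {"Deneb": 32, "Yorkfield": 16}  # from the text


def instructions_per_fetch(window_bytes, average_instruction_bytes):
    """Average number of instructions delivered by one fetch window."""
    return window_bytes / average_instruction_bytes


if __name__ == "__main__":
    # 4 bytes is Intel's quoted desktop average; 7-10 bytes is the long
    # SIMD / 64-bit case mentioned in the text.
    for length in (4, 7, 10):
        for cpu, window in FETCH_WINDOW_BYTES.items():
            rate = instructions_per_fetch(window, length)
            print(f"{cpu:9s} {length:2d}-byte instructions: {rate:.1f} per fetch")
```

At the 4-byte desktop average the 16-byte window still delivers 4 instructions per fetch, enough for Yorkfield's 4 decoders; with 7 to 10 byte instructions it drops to roughly 2 or fewer, whilst the 32-byte window stays at 3 or more.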
Instruction Decode
Instruction Decode: Deneb
There are 2 separate decoders, one for DirectPath instructions and one for VectorPath instructions
A DirectPath instruction can be decoded into one (Single) or two (Double) macro-ops. A macro-op consists of one integer or floating-point micro-op and one load/store micro-op
A VectorPath instruction is decoded in 3 or more macro-ops, and is decoded using the on-chip microcode-engine
The outputs of both paths keep instructions in program order and each can generate up to three macro-ops per cycle. The outputs are multiplexed together and passed to the next stage in groups of three. Due to the fact that decoding a VectorPath instruction may prevent the simultaneous decoding of a DirectPath instruction, or due to fetching stalls, it's possible that the decoder generates only 1 or 2 macro-ops in total, in which case the group of 3 is filled with empty macro-ops and passed to the next stage
For optimized handling of stack instructions, there's the Sideband Stack Optimizer(SSO). Typically there's an implicit dependency between successive PUSH/POP instructions which the SSO seeks to remove in order to allow parallel execution of more such instructions. To do so the SSO consists of dedicated circuitry that relies on two registers, one that stores the original value of the stack pointer, and another that tracks the changes made to the stack pointer value and is modified by a dedicated ALU whenever a stack modifying instruction is detected
Instructions that benefit from the SSO are Near CALL, Near RET, LEAVE, those that specify the stack pointer as source register and those that specify the stack pointer in the addressing mode of a memory operand without an index register
The SSO can't remove dependencies between the above and other instructions that refer explicitly or implicitly to the stack pointer
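The two-register idea behind the SSO can be sketched as below. This is a conceptual model under stated assumptions, not a description of the actual circuit: the 8-byte step assumes 64-bit mode, the "fold the delta back into the base when another instruction reads the stack pointer" behaviour is a guess at how the remaining dependencies get resolved, and all names are illustrative.

```python
class SidebandStackSketch:
    """Tracks stack pointer changes separately from the stack pointer itself."""

    WORD = 8  # bytes moved per PUSH/POP, assuming 64-bit mode

    def __init__(self, stack_pointer):
        self.base = stack_pointer  # original stack pointer value
        self.delta = 0             # accumulated change, kept by the dedicated adder
        self.syncs = 0             # how often the two registers had to be merged

    def push(self):
        # Only delta changes, so back-to-back pushes don't wait on each other.
        self.delta -= self.WORD
        return self.base + self.delta  # address the PUSH writes to

    def pop(self):
        address = self.base + self.delta  # address the POP reads from
        self.delta += self.WORD
        return address

    def read_stack_pointer(self):
        # ASSUMPTION: an instruction using the stack pointer explicitly
        # forces delta to be folded back into base first.
        if self.delta:
            self.base += self.delta
            self.delta = 0
            self.syncs += 1
        return self.base


if __name__ == "__main__":
    sso = SidebandStackSketch(0x1000)
    print(hex(sso.push()), hex(sso.push()), hex(sso.pop()))
    print(hex(sso.read_stack_pointer()), sso.syncs)
```

The point of the split is that each PUSH/POP address is known from `base` and `delta` alone, without waiting for the previous stack instruction to produce a new stack pointer value.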
Instruction Decode: Yorkfield
First of all, there's a nomenclature difference between AMD and Intel that needs to be pointed out: Intel tags X86 instructions as macro-ops, and the simple, architecture specific instructions being run on the execution units as micro-ops. Keep the difference in mind if you're sifting through either company's documentation
There are 4 instruction decoders contained in the front-end of each Yorkfield core, 3 being simple decoders that handle instructions that translate into single micro-ops, with the 4th being able to decode instructions up to 4 micro-ops in length
For instructions that decode into more than 4 micro-ops, the micro-sequencer is used at a rate of 3 micro-ops per cycle
The generated micro-ops are “fed” into the micro-op buffer that is at least 7 entries deep, from which they get sent into the execution engine at a rate of 4 micro-ops per cycle
In order to optimize the decoding process further Intel also implemented macro-fusion and micro-fusion support
Macro-fusion merges two incoming macro-ops into a single micro-op at a rate of one fusion per cycle (fusing typically involves compare or test instructions and jumps and is available only in 32-bit operation)
Micro-fusion, as the name suggests, handles the fusing of multiple micro-ops pertaining to a single instruction into a single more complex micro-op. This should increase effective bandwidth between the decode stage and execution
Finally, there is dedicated hardware to handle stack instructions, which functions quite similarly to the SSO discussed above but goes by a different, more intuitive name (before anyone gets riled up, Intel preceded AMD in implementing it), namely the Stack Pointer Tracker.
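Macro-fusion as described above can be illustrated with a simple pass over a list of mnemonics. This is a deliberately crude sketch: it treats any mnemonic starting with "J" other than JMP as a fusible conditional jump (the real set of fusible pairs is narrower), it doesn't model the one-fusion-per-cycle limit, and the function names are made up for illustration.

```python
COMPARE_OPS = {"CMP", "TEST"}


def is_conditional_jump(mnemonic):
    # ASSUMPTION: every Jcc-style mnemonic counts as a fusible conditional jump.
    m = mnemonic.upper()
    return m.startswith("J") and m != "JMP"


def macro_fuse(mnemonics, is_64bit_mode=False):
    """Return the decoded slots; a fused pair appears as e.g. 'CMP+JNE'."""
    ops = [m.upper() for m in mnemonics]
    if is_64bit_mode:
        return ops  # per the text, fusion is only available in 32-bit operation
    fused = []
    i = 0
    while i < len(ops):
        has_next = i + 1 < len(ops)
        if ops[i] in COMPARE_OPS and has_next and is_conditional_jump(ops[i + 1]):
            fused.append(f"{ops[i]}+{ops[i + 1]}")
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


if __name__ == "__main__":
    loop = ["mov", "add", "cmp", "jne"]
    print(macro_fuse(loop))
    print(macro_fuse(loop, is_64bit_mode=True))
```

In this model the four-instruction loop body occupies three decode slots in 32-bit mode and four in 64-bit mode, which is the kind of difference the text has in mind when noting that the feature is unavailable in 64-bit operation.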
Instruction Decode: Deneb vs Yorkfield
Looking at the competitors' parameters, it would appear that Intel's CPU should be better at instruction decoding than AMD's, having all of the latter's features and in addition including further tweaks like micro and macro-op fusion and a higher decoding width. It's probable that in 64-bit mode, where macro-op fusion doesn't work, a part of this advantage will disappear.
Front End
Front-end: Deneb vs Yorkfield
Coalescing all that has been discussed up to now, we can surmise that the battle of front-ends is somewhat balanced, with an edge to Intel. AMD's CPU has a wider fetch-width, and thus is better at effectively getting instructions “fed” into the decoders, but it is probable that Intel's is better at branch-prediction, and is also a tad more efficient at instruction decoding. Be aware that whilst the function is similar, the results of instruction decoding are hardly equal between differing architectures, with the same X86 instruction generating differing micro-op counts depending on the CPU doing the decoding. For example, the P4's decoders used to generate a high number of micro-ops per X86 instruction (on purpose), and it's commonly considered that AMD tends to output fewer micro-ops than Intel, which may act to somewhat offset its decoding disadvantage.
However, most of these differences can be qualified as minute since in practice it's unlikely that either CPU will suffer from front-end starvation, excluding particular cases (for example: long instructions not aligning nicely in Yorkfield's instruction fetch window, or branch heavy code coupled with an intense mix of simple and complex instructions forcing DirectPath-VectorPath ping-ponging and causing decode bubbles on Deneb). So we wouldn't be too horribly mistaken to say that the real battle will be fought at the next level of our analysis - the Out-Of-Order Execution Engine.
The execution engine for an Out-of-Order (OOO) CPU is a complex affair, and is also the make or break part of it. It has to deal with extracting instruction level parallelism by foregoing typical serial execution, and then after completion, re-order the results in the succession that was initially intended. Be aware that whilst that may sound simple, the number of tasks involved is quite daunting and the execution engine is one of the main causes for extra gray hairs on CPU architects' heads... and also much joy for them when they hit the proverbial “home-run” in this area. Deneb and Yorkfield are extremely different in their approaches as you'll soon see.
The Execution Engine: Deneb
The “master of disaster” for Deneb is the so called Instruction Control Unit (ICU), which manages the centralized reorder buffer, the integer and floating-point schedulers (you might've noted already that AMD opted for discrete integer/floating-point reservation stations). It is responsible for macro-op dispatch and retirement (remember, per AMD macro-op=1 INT/FP micro-op and 1 load/store), register renaming, management of execution resources, interrupts, exceptions, as well as handling branch misprediction cases (when pipeline needs to be flushed and the execution stage restored to the pre-branch state). As you can see, the ICU is a pretty busy guy indeed.
Each cycle the ICU takes the group of three macro-ops provided by the decoders in the front-end and places it in the 72-entry deep centralized reorder buffer. The buffer is organized as 24 lines of 3 macro-ops, not 72 independent entries as one may believe. A correspondence is in place between each of the 3 macro-op lanes and the reservation stations down-stream. Once an instruction has entered a specific lane it remains fixed there without the possibility of being moved around, and this is particularly important for the integer unit since it can lead to stalls if the macro-op arrangement in the Reorder Buffer (ROB) is not optimal. At the same time, the ICU dispatches up to 6 macro-ops each cycle to both the integer and floating-point schedulers (up to 3 integer and 3 floating-point macro-ops). It is also responsible for handling macro-op retirement, a task that is handled at a peak rate of 3 macro-ops per cycle.
The ICU controls 40 registers dedicated to Integer work, and 88 registers for Floating-point work. The Integer registers are organized as the Integer Future File and Register File (use that as a pickup line the next time you go out, it's bound to paint you as a science buff), and split into three sets:
1 set for the Architectural Register File, consisting of the 16 64-bit non-speculative registers demanded by the x86-64 ISA - these registers are only modified by retired instructions
1 set for the Future File, consisting of 16 speculative equivalents of the architectural registers and containing their most recent speculative state
1 set of 8 registers used as scratch-space for Micro-Ops
For Floating-point (FP) ops there's a 120-entry FP register file coupled with one architectural and one future file array, each containing pointers to registers in the register file that contain the non-speculative and respectively speculative states of the x87, MMX and XMM (SSE) registers, as well as registers intended to be used as scratch-space.
Moving further downstream, it's time to look at the Integer and Floating-point units:
The Integer unit consists of two components, one being the scheduler and the other being the execution unit itself. The scheduler is comprised of three discrete 8 entry reservation stations that correspond to the three lanes in the centralized ROB. A Macro-Op from Lane 0 will enter Scheduler (reservation station) 0 and so on, with Macro-ops being broken down into integer and address generation Micro-Ops at this stage.
The execution unit consists of three identical pipes, each pipe being an arithmetic-logic unit (ALU - here be math) and an address generation unit (AGU - here be special case math for generating logical addresses for load/store Micro-Ops) tag-team. Each pipe corresponds to a Scheduler, and can only be issued work from its respective scheduler. Micro-op execution depends on operand availability either from the register file or the result buses, and Micro-Ops pertaining to a single operation can be executed out-of-order. Upon complete execution of the outstanding Micro-Ops pertaining to a certain Macro-Op, a completion signal is sent to the to the ICU by the scheduler.
All ALUs within the execution unit are symmetrical, being able to handle all integer operations with three notable exceptions: integer multiplication and the SSE4a instructions POPCNT (population count) and LZCNT (leading zero count). For multiplication, a pipelined multiplier is attached to Pipe 0, so multiplication Micro-Ops always issue there. The issue logic creates bubbles in the result bus in pipes 0 and 1 by preventing non-multiply Micro-Ops from issuing at the appropriate time. This is done in order to create space from the multiplier to place its results on the bus without conflicting with results from the ALUs. The same story applies to POPCNT and LZCNT, but the unit handling them is attached to Pipe 2 as you can see, and result bus bubbles are created only in that pipe.
Remember from our discussion of instruction decode in the front-end that the three Macro-Op group sent into the ROB by the decoders may contain empty Macro-Ops when fewer than 3 effective Macro-Ops are generated. This can be carried downstream and cause one of the integer reservation stations to see high usage whilst another sits idle, consuming empty Macro-Ops. This is a suboptimal pattern, as ideally you'd like symmetrical usage of the reservation stations and, by extension, symmetrical full-time ALU utilization. Time to move on to the Floating-Point Unit!
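Before leaving the integer side, the lane imbalance described above can be illustrated with a toy model. This is only a sketch under heavy assumptions of our own: every op is independent and takes a single cycle, all ops are already waiting in the stations, and the "unified" variant is a hypothetical comparison point rather than anything Deneb implements. The function names are ours.

```python
import math

def lane_bound_cycles(groups):
    """Cycles to drain the ops when each lane feeds only its own pipe."""
    ops_per_lane = [0, 0, 0]
    for group in groups:
        for lane, op in enumerate(group):
            if op is not None:          # None marks an empty Macro-Op slot
                ops_per_lane[lane] += 1
    return max(ops_per_lane)            # the busiest lane sets the pace

def unified_cycles(groups, pipes=3):
    """Cycles to drain the same ops if any pipe could take any op."""
    total_ops = sum(op is not None for group in groups for op in group)
    return math.ceil(total_ops / pipes)

# Six decode groups that each produced one effective Macro-Op, all in lane 0.
groups = [("add", None, None)] * 6
print("lane-bound:", lane_bound_cycles(groups), "cycles")
print("unified:   ", unified_cycles(groups), "cycles")
```

In this contrived worst case the lane-bound arrangement keeps one pipe busy for six cycles while the other two idle; real code rarely behaves this badly, but the direction of the effect is the same.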
As was the case for the Integer Unit, there are two components making up the Floating-Point one: a scheduler and the execution unit itself. The scheduler is congruent with the one in the Integer Unit in purpose, handling more or less the same tasks, and in Macro-Op uptake rate, being able to accept three Macro-Ops formed from any of the following instruction types: x87 FP, 3DNow!, MMX, SSE1/2/3/4a. Macro-Ops from the decoding stage are fed into a dedicated 42-entry scheduler buffer organized as 14 lines of 3 Macro-Ops, with each Macro-Op lane corresponding to a lane in the ROB (Reorder Buffer). Float to Integer movement and conversions are handled via a 64-bit wide bus.
The execution unit itself is made up of three 128-bit asymmetrical pipes, dubbed FADD, FMUL and FSTORE (FMISC in certain contexts, including our drawing). That arrangement means Deneb is capable of achieving 2 double-precision FLOPs per cycle: for example, an ADD and a MUL, with something extra happening on the FSTORE pipe (such as executing the Micro-Ops pertaining to an instruction involving Integer to Float conversion like CVTDQ2PD). However, certain instructions require the use of both the FADD or FMUL pipe and the FSTORE pipe, in which case throughput is reduced. One example is CVTPD2DQ, which is similar to the instruction quoted above but reversed, in that it's a Float to Integer conversion. Loads, as opposed to Stores, don't use FPU execution units.
The execution engine: Yorkfield
There's no point in restating much of what has already been said above, since the execution song and dance is pretty similar for OOO CPUs, with more or less the same moves. The dresses may differ, though, and some dance better than others. What that means is that within Yorkfield we're still looking at a centralized ROB, and we still need to handle register renaming, Micro-Op issue and retirement, recovery from mis-predictions and so on. Since seeing is believing, here's how the execution engine looks:
The ROB is 96 entries deep, accepting 4 Micro-Ops from upstream and being able to write the results of 4 retired Micro-Ops into software visible registers per cycle. Intel opted for a single, general 32-entry Scheduler/Reservation Station, as opposed to AMD's dedicated Float/Integer schedulers/reservation stations. Each cycle up to 6 Micro-Ops can be dispatched from the RS through the issue ports. We'll ignore ports 2, 3 and 4 for the time being, since these are dedicated to memory operations, and focus on ports 0, 1 and 5, since these are connected to the math-crunching units. Each of these ports connects to a 64-bit ALU and a 128-bit SSE unit, with port 1 tying into a 128-bit FADD unit and port 0 tying into an equally wide FMUL unit. As was the case with AMD, the ALUs aren't perfectly symmetrical, with ALU 1 lacking 64-bit IMUL, for example, but other than that the differences are small. FP Moves can be handled via all of these ports, whilst 128-bit shuffles and packs/unpacks are the exclusive domain of port 5. Division (DIV) and square root (SQRT) are port 0 affairs as well. Yorkfield is well capable of 2 double-precision FLOPs per cycle, as you've probably already deduced.
The execution engine: Deneb vs Yorkfield
Looking strictly at the execution units available in each architecture, one would be tempted to assume that they're quite evenly matched. However, that assumption would be an oversimplification. In practice, Intel has a number of advantages, the most pronounced ones being in the area of scheduling. Yorkfield completely decouples memory ops from computational ops, with each having dedicated dispatch ports. It also uses a unified Reservation Station, with no fixed correspondence between the arrangement of Micro-Ops and the execution unit they're dispatched to, which means it will see better utilization rates in many cases. AMD's arrangement is more restrictive and is more likely to suffer from suboptimal scheduling, especially in the integer pipeline. Width is also an important aspect, with Yorkfield's execution engine being 33% wider (4-wide versus 3-wide for Deneb), but in practice you seldom if ever hit maximum theoretical IPC rates, so it's not a prime differentiator. Finally, Intel has a better divider implementation: radix-16 (2^4) versus AMD's radix-4 (2^2). In theory this should allow it to compute twice as many quotient and remainder bits per iteration, thus halving overall division time and, by extension, improving other operations that rely on the divider (square root, for example).
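The radix argument can be made concrete with a little arithmetic. The sketch below assumes an idealized digit-recurrence divider that retires log2(radix) quotient bits per iteration and has no setup, normalization or rounding overhead, which real dividers do have. The function name and the 64-bit example are ours, not figures from either vendor.

```python
def divider_iterations(quotient_bits: int, radix: int) -> int:
    """Iterations an idealized digit-recurrence divider would need.

    Assumption: each iteration produces log2(radix) quotient bits and
    there is no fixed per-division overhead.
    """
    bits_per_iteration = radix.bit_length() - 1
    if radix < 2 or (1 << bits_per_iteration) != radix:
        raise ValueError("radix must be a power of two, at least 2")
    # Ceiling division without going through floating point.
    return -(-quotient_bits // bits_per_iteration)

for radix in (4, 16):
    count = divider_iterations(64, radix)
    print(f"radix-{radix}: {count} iterations for a 64-bit quotient")
```

Under these assumptions the radix-16 design needs half the iterations of the radix-4 one for any quotient width, which is the halving referred to above.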
All in all, Yorkfield should be better for Integer workloads based on how the architectures look, whilst for Floating-Point and SSE workloads it's harder to ascertain a priori, since the FP engine in Deneb is less restrictive with regards to scheduling compared to the Integer one and otherwise quite apt.
Since talk is said to be cheap, and we're too poor (as reviewers) to afford to be cheap, it's time to break the monotony of wordplay by looking at some tests. More specifically, let's look at how the two execution engines fare when dealing with synthetic Integer, Floating-Point and SSE workloads:



The results aren't exactly flattering for Deneb: on average it's about 17% slower in Integer workloads, with Floating-Point and SSE ones being 16% and 6% slower respectively. The instruction mixes used are quite simple, and the working sets are geared towards ensuring the best data locality (read: they fit into cache) in order to ensure the highest possible IPC, so you could consider these tests as optimum case conditions. Still, these numbers do confirm that Yorkfield's execution engine is superior in terms of throughput.
The memory system: Deneb
As already mentioned, Deneb works with a three-tiered cache hierarchy. On the first level we find the 64K L1 instruction (L1I) and L1 data (L1D) caches (a total of 128K per core), characterized by 2-way set associativity and using 64 byte lines. The consequence of a miss in the L1I is the fetching of two cachelines, the one including the sought instructions and the next sequential one. This is based on the fact that code typically has good spatial locality, and as a consequence, by doing this sort of prefetch, decode stalls can be avoided. Cachelines are replaced according to a Least Recently Used (LRU) policy.
The L1D is a dual-ported writeback cache, with port width being 128-bit, also using an LRU cacheline replacement policy. Cache coherency is maintained via the MOESI protocol, with ECC support being in place as well. Finally, AMD quotes a 3-cycle load-to-use latency for its L1D.
Of course, this says nothing about exactly how operands are fetched from the L1D into the execution engine, or about how results are stored into it. For this we need to look at Deneb's Load-Store unit, depicted below:
Recall that each Macro-Op is comprised of an Integer or Floating-point Micro-Op and a memory Micro-Op (load/store), that the Macro-Ops get broken into Micro-Ops within the schedulers, and that the AGUs (address generation units) are tied into the Integer schedulers. Memory operations start there, being dispatched to both the AGUs and to the first queue in the Load/Store unit (LSU), dubbed LSU1, which is at least 12 entries deep (we say at least because there are indications that AMD might have increased its depth in Deneb). Address generation is a 1 cycle affair, with results being forwarded to the same LSU1 so that data can be accessed. LSU1 can issue two L1 cache operations per cycle, and it can issue loads in a limited out-of-order fashion: a load may be issued ahead of another load, but it can only move ahead of a store if the two access different addresses. If the store address is not yet known, a load cannot be reordered ahead of that store, even if it later turns out that they accessed different addresses. The second queue in the LSU, LSU2, is at least 32 entries deep and holds requests that missed the L1 cache. It's also exclusively responsible for handling stores. A peculiarity is that whilst Deneb is well capable of handling 2 128-bit loads per cycle, a 128-bit store is split into 2 64-bit writes, thus taking 2 places in LSU2. The above also means that per cycle Deneb is capable of either 2 128-bit loads, a 128-bit load and a 64-bit store, 2 64-bit stores, or a single 128-bit store.
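The reordering rule and the per-cycle combinations listed above can be written down as two small predicates. This is a simplified sketch, not AMD's logic: it compares whole addresses instead of overlapping byte ranges, it models the two L1 operations per cycle as two "slots" with a 128-bit store costing both, and the function names are our own.

```python
from typing import Optional, Sequence, Tuple

def load_may_pass_store(load_addr: int, store_addr: Optional[int]) -> bool:
    """Can a load be issued ahead of an older store?

    store_addr is None while the store's address has not been generated yet.
    """
    if store_addr is None:
        return False                  # unknown address: the load must wait
    return load_addr != store_addr    # known and different: safe to reorder

def fits_in_one_cycle(ops: Sequence[Tuple[str, int]]) -> bool:
    """Check a mix of (kind, width_bits) L1 accesses against two slots."""
    slots = 0
    for kind, width_bits in ops:
        if kind == "load":
            slots += 1                            # loads up to 128 bits: one slot
        elif kind == "store":
            slots += 2 if width_bits > 64 else 1  # 128-bit store = 2 x 64-bit
        else:
            raise ValueError(f"unknown access kind: {kind}")
    return slots <= 2

print(load_may_pass_store(0x1000, None))
print(fits_in_one_cycle([("load", 128), ("load", 128)]))
print(fits_in_one_cycle([("load", 128), ("store", 128)]))
```

The second predicate accepts exactly the four combinations named in the text and rejects, for instance, a 128-bit load paired with a 128-bit store.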
Now that we know how the L1D-Execution Engine relationship works, we can get back to the caches themselves. In the event of an L1 cache miss, LSU2 escalates the request to the 512K L2 cache. The L2 runs at core frequency, is 16-way set associative and uses the same 64 byte line size as the L1. It is exclusive of the L1, acting as a victim cache, holding only cachelines that were evicted from the L1 in order to make room for new ones being fetched into it. The data paths between the L1D and the L2 are 256 bits wide – 128 bits transmit and 128 receive – which means a single 64 byte cacheline is transferred in 4 cycles. Upon fetching a cacheline from L2 into L1, it is deleted from the L2 in an attempt to eliminate redundancy. AMD quotes a latency of 9 cycles beyond that of the L1, which places L2 latency at around 12 cycles.
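The two figures in this paragraph follow from simple arithmetic, sketched below. The only inputs are the numbers quoted above (64 byte lines, a 128-bit path per direction, 3 cycles for the L1D and 9 more for the L2); the function and constant names are ours.

```python
import math

def line_transfer_cycles(line_bytes: int = 64, path_bits: int = 128) -> int:
    """Cycles to move one cacheline over a path of the given width."""
    return math.ceil(line_bytes * 8 / path_bits)

L1D_LATENCY_CYCLES = 3   # AMD's quoted L1D load-to-use latency
L2_EXTRA_CYCLES = 9      # AMD's quoted latency beyond the L1

def l2_latency_cycles() -> int:
    """Approximate L2 load-to-use latency implied by the quoted figures."""
    return L1D_LATENCY_CYCLES + L2_EXTRA_CYCLES

print("cycles per 64 byte line:", line_transfer_cycles())
print("approximate L2 latency: ", l2_latency_cycles())
```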
If the L2 doesn't hold the sought data either, the request is escalated to the 6MB L3 cache. The L3 was a novelty for AMD back when it was introduced in the original Barcelona and was a source of many problems for that CPU, some of which were fixed in Deneb. It is a non-exclusive, non-inclusive cache (AMD calls it a non-inclusive victim cache), and is 48-way associative using a 64 byte line size. The L3 holds L2 victims. When a cacheline is fetched from the L3 into the L1D, it can either be deleted from the L3, as was the case with the L2 and as a wholly exclusive arrangement would do, or retained in the L3 if it is likely to be shared. Sharing history is tracked, and based on that it is determined whether the cacheline has been shared before or not. In addition, a cacheline is considered shared, and thus not deleted, if it contains code. Eviction policies also take sharing into account, with LRU (least recently used) unshared lines being preferred candidates for eviction ahead of LRU shared lines. The L3 is dynamically shared between all cores, with a round-robin algorithm being in place in order to arbitrate access.
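The retention and eviction rules just described can be condensed into a short sketch. It is an illustration of the stated policy only: the L3Line record, its field names and the use of a simple timestamp for LRU ordering are assumptions of ours, and the real hardware tracks this state per set and in its own encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class L3Line:
    tag: int
    shared_before: bool = False   # sharing history says it was shared
    holds_code: bool = False      # code lines are treated as shared
    last_used: int = 0            # smaller value = less recently used

def is_shared(line: L3Line) -> bool:
    return line.shared_before or line.holds_code

def keep_in_l3_after_fill(line: L3Line) -> bool:
    """On a fetch into the L1D: keep a copy in the L3, or delete it?"""
    return is_shared(line)

def pick_victim(lines: List[L3Line]) -> L3Line:
    """Prefer the LRU unshared line; fall back to the LRU shared line."""
    unshared = [line for line in lines if not is_shared(line)]
    candidates = unshared or lines
    return min(candidates, key=lambda line: line.last_used)

cache_set = [
    L3Line(tag=1, shared_before=True, last_used=1),
    L3Line(tag=2, last_used=5),
    L3Line(tag=3, last_used=3),
]
print("victim tag:", pick_victim(cache_set).tag)
```

Note that the oldest line in the example survives because it is marked as shared; the victim is the least recently used line among the unshared ones.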
One of the (many) interesting aspects of the L3 is that it runs in another frequency domain, alongside the integrated Northbridge, and on a different power plane (the last part depends on the motherboard implementing the feature). Initially the goal was to have the L3 running at a potentially higher frequency than the cores. In practice the reverse happened. There is also circumstantial evidence that AMD had problems with handling clock-crossings. At any rate, AMD quotes a best-case latency of 29 cycles for Deneb as opposed to 34 cycles for Barcelona, in spite of the associativity increase for the former (higher associativity typically means an increase in latency due to the need to check more tags). This seems to indicate they've done some fixing and optimizing in their L3. As should be fairly obvious, the L3 is characterized by variable latency, with latency depending on the NorthBridge (NB)/L3 frequency, as well as on the number of concurrent accesses (so the best case latency is likely to be representative of single-threaded workloads with low bandwidth requirements).
If the programmer has really been a bad boy and decided to be really mean to the CPU, you'll get an L3 miss, which means data will have to be fetched from RAM. Deneb includes an integrated low-latency, high-bandwidth memory controller that runs at NB/L3 frequency. It can be configured to operate either as 2 independent 64-bit channels or as 1 128-bit wide channel. Whilst the second case is fairly old news, the first one is quite interesting (as it was back in the day when DEC had the same idea for Alpha), since it's nicely tailored for multi-threaded environments in which different threads may do different memory operations. A little known aspect is that the IMC has included DDR3 support for quite a while, namely since Barcelona's introduction, but that capability was left unexposed because it made no economic sense (DDR3 was too expensive). With Deneb that changes, and CPUs with DDR3 support have been brought to the market (the integrated memory controller was slightly tweaked as well), albeit they rely on a new AM3 socket which, as a safety measure, does not support AM2+ CPUs. AMD also included a pattern predictor, which uses page accesses and per-bank access history in order to establish whether or not a page should be kept open. This should improve both latency and achieved bandwidth, since pages that are likely to be accessed are kept open as opposed to being haphazardly closed.
A 20-entry write buffer is also present in the memory controller, which means memory writes can be queued in order to avoid bus turnarounds. Finally, a data prefetcher capable of capturing both positive and negative stride values, as well as more complicated access patterns is included. Prefetched data is kept in the memory controller and not speculatively filled into the caches.
Speaking of prefetching, each core within a Deneb has a set of 8 data prefetchers (separate from those in the memory controller mind you) that speculatively load data into the L1D. In fact all data loads go into the L1D, which is a reverse of traditional approaches which placed data into inferior cache levels from which it was fed into the L1.
The memory system: Yorkfield
Unlike Deneb, Yorkfield contends with a “traditional” 2-tiered cache hierarchy. Reprising the same top to bottom trip from above, we are first greeted by a 32K L1I and 32K L1D (a total of 64K per core), with 8-way associativity and 64 byte lines. The L1D is dual-ported, with a 128-bit port width as well as writeback. Cache coherency is maintained via the MESI protocol, and ECC support is present, with Intel quoting a 3 cycle latency for its L1D with the mention that there's 1 extra cycle of latency for FP Loads.
Of course, we now need to figure out how the execution engine interacts with the L1 cache. Yorkfield completely decouples memory ops from compute ones, issuing them via their own dedicated ports as can be seen below:
Ports 2, 3 and 4 are destined for memory ops, which are dispatched through them to the Load, Store Address and Store Data units. Each of these contains an AGU, in case you were wondering where these were hidden. Results are moved into the Memory Ordering Buffer (MOB), which enables speculative out of order issuing of loads and stores, and ensures load/store data correctness upon retirement as well as proper load and store ordering. Up to one 128-bit Load and one 128-bit Store can be executed per cycle. Loads can be issued before preceding stores when their respective addresses are known not to conflict, just like in Deneb's case. However, Intel took things a step further, allowing for the speculative issuing of loads before stores with an unknown address, based on the assumption that the store will not target the same address (Intel dubs this Memory Disambiguation). If a conflict occurs, the conflicting load and all succeeding instructions are re-executed. However, research shows that in over 90% of cases memory addresses don't alias, so conflicts should be rare, whilst the advantage of avoiding stalls due to unknown address stores is quite consistent.
An L1 miss means the request is escalated into the 6MB L2 shared by two cores (for a total of 12MB for most Yorkfield models). The L2 is 24-way associative and uses, as you've probably guessed by now, 64 byte lines. It's also non-inclusive non-exclusive, and it's connected to the L1 via a 256-bit wide path (128-bits for transmit and 128-bits for receive respectively). Intel's papers quote a 15 cycle access latency for the L2, and L2 access is arbitrated in a round-robin fashion. Keep in mind that the L2's latency is variable.
If the L2 doesn't hold the needed data, unsurprisingly it has to be fetched from RAM. We'll not discuss memory controllers for Yorkfield though, since these are chipset dependent and can come from either Intel, or from 3rd parties like nVidia or even ATI (the former is a special case, and only for dual-core Wolfdales). Latency for fetching data from RAM is equal to 15 core cycles + 5.5 bus cycles (the FSB runs at a different frequency from the core, typically 1333 MHz) + the latency of the memory operation itself. This is also the cost of fetching a modified cacheline from the L1D or L2 of the other die, since the transfer is done via RAM.
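To get a feel for the formula above, it helps to convert the three terms into nanoseconds. The sketch below does just that; the clock values and the DRAM latency in the example are placeholders of ours, not measurements, and which clock the 5.5 bus cycles are counted in is left as a parameter because the text doesn't spell it out (a "1333" FSB is quad-pumped from a 333 MHz base clock).

```python
def ram_fetch_latency_ns(core_ghz: float, bus_mhz: float, dram_ns: float,
                         core_cycles: float = 15.0,
                         bus_cycles: float = 5.5) -> float:
    """Sum of the core-side, bus-side and DRAM terms, in nanoseconds."""
    core_ns = core_cycles / core_ghz          # 1 cycle at 1 GHz lasts 1 ns
    bus_ns = bus_cycles * 1000.0 / bus_mhz    # 1 cycle at 1000 MHz lasts 1 ns
    return core_ns + bus_ns + dram_ns

# Hypothetical inputs: 3.0 GHz core, 333 MHz bus base clock, 50 ns DRAM access.
total = ram_fetch_latency_ns(core_ghz=3.0, bus_mhz=333.0, dram_ns=50.0)
print(f"estimated fetch latency: {total:.1f} ns")
```

Whatever the exact inputs, the bus and DRAM terms dominate, which is why the core-side 15 cycles matter far less than they might appear to.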
Intel was more aggressive than AMD with regards to its prefetchers, including them for both the L1 and the L2. For the L1 there are two hardware prefetchers per core (a total of 8 for a quad-core Yorkfield), one for data and one for instructions respectively. Moving down, the L2 prefetch logic speculatively fetches data based on the history of past L1D refill requests. Two independent arrays are maintained to store addresses from the L1D, a 12 entry one for positive strides and a 4 entry one for negative ones (entries for each of the two cores within a die are handled separately). The prefetcher is also capable of detecting more complicated stride patterns and can issue up to 2 prefetch requests on every L2 lookup, being able to run up to 8 cachelines ahead of the Load request. However, this is dependent on available FSB bandwidth and request count, with far ahead prefetches being possible only if FSB utilization is low.
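A stride prefetcher of the kind described above can be sketched in a few lines. This is a deliberately naive illustration, not Intel's algorithm: it only looks at the last three refill addresses, requires two equal non-zero strides before predicting (a threshold we picked), and ignores the separate history arrays and FSB load. The limits of 2 requests and 8 lines ahead are the ones quoted in the text; the function name is ours.

```python
LINE_BYTES = 64

def next_prefetches(refill_addrs, max_requests=2, max_lines_ahead=8):
    """Return byte addresses of cachelines to prefetch, possibly none."""
    lines = [addr // LINE_BYTES for addr in refill_addrs]
    if len(lines) < 3:
        return []                      # not enough history yet
    stride = lines[-1] - lines[-2]
    if stride == 0 or lines[-2] - lines[-3] != stride:
        return []                      # no stable stride detected
    prefetches = []
    for k in range(1, max_requests + 1):
        target = lines[-1] + k * stride
        if abs(k * stride) > max_lines_ahead or target < 0:
            break                      # too far ahead, or off the address map
        prefetches.append(target * LINE_BYTES)
    return prefetches

print(next_prefetches([0, 64, 128]))       # ascending walk
print(next_prefetches([1024, 960, 896]))   # descending walk
print(next_prefetches([0, 64, 256]))       # irregular pattern
```

Positive and negative strides fall out of the same subtraction here, whereas the hardware described above keeps separate arrays for the two directions.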
The memory system: Deneb vs Yorkfield
Going by what is regarded as common knowledge across the Intarweb, this part should be deceptively easy to write, and it should amount to saying that Deneb rules the memory roost. However, things are murkier than that and require context: Deneb does indeed have a superior memory system when considering the general case of a quad-core dealing with heavily multi-threaded code. Its integrated memory controller brings lower latency and higher throughput, and the unified L3 is an elegant and faster way of sharing data between threads. The trouble is that, on one hand, the aforementioned general case is anything but frequent in the desktop space (albeit it is the way of the future), with most applications opting for simple threading with little to no inter-thread dependencies. On the other hand, Intel engages in rather aggressive prefetching to hide its higher latency RAM accesses, whilst also being better at out of order execution of memory ops (remember the aspects about memory disambiguation). Of course, prefetching has its own perils, which affect the two competitors differently in different contexts:
It consumes bandwidth, so it will degrade performance in bandwidth intensive scenarios. This is almost a non-issue for desktop applications, and less of a problem for AMD, which has a more ample supply of bandwidth
If the working set fits too “snugly” into the cache, aggressive prefetching will cause thrashing, as lines that are still needed are evicted to make space for prefetched ones. This is a problem for both, and it's exacerbated on Deneb due to its policy of prefetching only to the L1 and the exclusive nature of the L2. This can cause an evicted L1 cacheline to end up in the lower speed L3 or be deleted entirely, only to be fetched from RAM in the next cycle
On top of that, if the programmers followed Intel's guidelines and kept data sharing threads local to a die (or both), then cachelines are shared via the L2. This is significantly faster than Deneb's L3, so in those scenarios Yorkfield would be at an advantage (dual-threaded workloads would fit nicely in this mold, for example). Summing up and simplifying, it is another case of Deneb being significantly better for typical server workloads, where bandwidth and the impact of maintaining cache coherency are significant determinants of performance, but not holding an extraordinary advantage in desktop applications, since both of the aforementioned aspects have a reduced importance there.
In closing, keeping with the example we set before, here's a quick synthetic verification of the theoretical considerations made above:

The latency numbers were obtained via the CPUID latency tool, whereas the bandwidth numbers come from Everest's inbuilt test. We've verified both via other means and they're reasonably accurate, albeit the RAM numbers can be better or worse depending on how one sets up the test. In terms of latency, results line up nicely with the numbers stated by AMD and Intel. Looking there is also a good reminder of just how slow main RAM truly is compared to caches. In terms of bandwidth, one can see Deneb's capacity to do two 128-bit loads per cycle from the L1 exposed in its Read numbers for that cache level, as well as its 64-bit Store width limitation in the Copy tests, where it loses to Yorkfield, which can do one 128-bit Load and one 128-bit Store per cycle.
AMD's L3 is pretty 'meh' in terms of latency and throughput, with the latency being high (nota bene, it can be even higher than this!) and the throughput being low. For desktop applications (you've probably grown tired of the formula, but it's important for establishing context), it's not at all unlikely to see scenarios where Yorkfield would fetch from its L2 whereas Deneb would have to fetch from its L3, which would put it at a disadvantage. Of course, if coherency traffic is high across all cores and memory is also being accessed intensely, the L3 would be in a position to prove its worth by providing better latency and bandwidth for inter-core cacheline sharing, whilst also maximizing RAM bandwidth by decoupling the aforementioned sharing from main memory. Such a scenario is again the preserve of server applications.
With this bit, the purely architectural trip is over. We could also discuss HyperTransport and the FSB, but to be frank their impact on performance is quite limited for desktop workloads. Since a rather excellent investigation has already been undertaken by our friend David Kanter over at RealWorldTech, we'll direct you to those articles instead:
To conclude this section, here's how all the disjointed pieces look when put together in order to form the architectures we've been babbling about across the last few pages:
Brief look at tests, specs and other such things
In case anybody is still awake after going through the preceding pages of technical hubbub, and for those that knew that the best option was to simply skip it since it was fairly devoid of graphs, it's time to get into the more mundane aspects of this review ... ehm, we mean the real world real important stuff. We've structured the testing into 4 segments: synthetics, applications, games and an experiment we've attempted on this occasion. The first three segments represented by synthetics, applications and games will be fleshed out in this write-up. The experiment will come in another piece in a few days due to the fact that this article is large enough as it is, so this is a good way to keep it from achieving yokozuna-worthy proportions. At any rate, below you'll find the list containing all tests so that you know what to expect, as well as the specs for the testing systems:


We'd like to make you aware of a few quirks that the Biostar motherboard we used exhibited:
NorthBridge clocks have to be adjusted manually in the BIOS, otherwise the Auto setting underclocks to 1.6GHz versus 1.8GHz which is the default for the PHII 940 we used
The HyperTransport link is underclocked to 1.0GHz by the Auto setting, and somewhat annoyingly can't be set higher than 1.6GHz via the BIOS, forcing us to use AOD to set the proper 1.8GHz clock
The motherboard does not support Split Power Planes, and it also does not support QuadCF configurations due to the way PCIE lane splitting is controlled
This would be a good point to discuss other things like why is Nehalem missing or why we are focusing on a clock-per-clock comparison between AMD's and Intel's CPUs in spite of pricing being not quite on par. Let's be totally rebellious and discuss them at the end shall we? And with that being said let's see how the architectural deductions we've made so far pan out in practice!
In all fairness, we've already dealt with synthetic tests when we were looking at Execution Engine throughput for differing types of workloads or at the Memory System's numbers. However, a further step is needed before transitioning to typical applications: a step in which we look at synthetics that test common algorithms/scenarios in an attempt to predict how real applications using them would perform. Consider them a crossover of sorts, focusing less on simple execution of very optimized code, but still more optimized than your average program.
First, let's look at the tests built into the popular Everest suite, which deal with both Integer and Floating-Point workloads, use either SSE or legacy code, and involve some frequently encountered workloads:

The tests marked CPU are purely Integer whilst the FPU ones are (not surprisingly) Floating-Point. The Queen test solves one of the more famous problems in computing, that of non-attacking Queens. Its performance is significantly dependent on branch prediction efficiency. Performance here may confirm that Intel's CPU is better at branch prediction, as we had assumed, since the math involved is (hopefully) simple enough to not overemphasize Yorkfield's better Integer performance. We'll have the opportunity to further look at this topic later.
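For readers who haven't met the non-attacking Queens problem, here is a minimal backtracking solver. It is not Everest's implementation, which we haven't seen; it merely shows why such code leans on the branch predictor: the work is almost entirely data-dependent "is this square attacked?" checks wrapped around trivial integer math. The function name is ours.

```python
def count_queen_placements(n: int) -> int:
    """Count the ways to place n non-attacking queens on an n x n board."""
    def place(row, cols, diag_down, diag_up):
        if row == n:
            return 1                          # every row holds a queen
        found = 0
        for col in range(n):
            # Data-dependent branch: is this square attacked already?
            if col in cols or (row - col) in diag_down or (row + col) in diag_up:
                continue
            found += place(row + 1,
                           cols | {col},
                           diag_down | {row - col},
                           diag_up | {row + col})
        return found

    empty = frozenset()
    return place(0, empty, empty, empty)

print(count_queen_placements(8))   # the classic 8x8 board has 92 solutions
```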
PhotoWorxx performs image manipulation and emphasizes Integer ADD and MUL as well as memory controller performance. Deneb wins here consistently due to its superior memory controller and especially due to the ability to run it in Unganged mode as dual 64-bit channels. Running it like this produced significantly higher results than Ganged mode (which was about on par with Yorkfield). This makes sense given the fact that the test itself is intended to be read & write heavy while also multithreaded, so running the channels independently improved command rate and CPU request servicing.
We'll not go through the other Everest tests, since they don't emphasize particular architectural aspects the way the two mentioned above do, and detailing each would not exactly be word effective. For those interested, this information can be found in the Everest help file or on the Internet. It's worth mentioning though that we can't exactly put a finger on why the FPU Julia test, which uses single precision floats, is out of line with the others (which comply nicely with the performance delta that the FPU tests run before suggested). A bit of profiling work would probably clear up the mystery, but that's a task for another day.
Another synthetic test we've already met is MetaBench. Aside from the simple op (ADD, SUB, MUL, DIV) execution rate tests already shown, it also has a broader testing suite that dabbles with some more complex workloads:

In tasks that tend to be sensitive to memory latency/throughput like sorting (lots of MOVs), simple OGG audio decoding, or encryption, Deneb does better than its competitor. It is difficult to say why heap sorting and large object sorting are comparatively slower without knowing the way the test sets up its code and how it breaks down the work-sets. In purely compute intensive tasks like raytracing, compression, or FFTs, the tables are turned. This was to be expected based on how both CPUs fared in terms of execution. One big outlier is OGG encoding, so we'll keep an eye on how that particular task is handled in the real world.
Applications don't need a helluva lot of explaining, do they? They're the programs you're going to run on your PC. Some of them may be the reason for a processor upgrade, or the determining factor in choosing one processor or another. Be aware that as a general rule we always used the multithreaded option where present (CineBench is a good example of this, since it produces single threaded results as well, which we ignored). We also went for the 64-bit version where available. We did split this subsection into a few categories, starting with Rendering:

We tried covering as much ground as possible, and calling Paint.NET a rendering test is perhaps incorrect since it's a photo manipulation suite. However, be a tad indulgent with us as it would've been lonely all by itself. What all these applications share is that they're computationally intensive, and usually quite optimized in the quest for extracting the best possible performance. Considering the above it's not that surprising that Yorkfield maintains a comfortable lead across the entire battlefield. Paint.NET is a pretty poor performer when matched with the K8L architecture (we had already seen this back with the Phenom 9950 versus Q6600 comparison we did for the 790GX review), and does not seem to match the results Everest's Photoworxx suggested. However, the scope of the included image manipulation workloads is far wider than that of Photoworxx and this is a real world scenario.
Out of all the rendering tests presented, 3DS Max may be the most important since it is the one that has the most chances of being used in a production environment (no disrespect intended for the others, mind you). As you can see, on a clock per clock basis Deneb is 7% slower when using the Mental Ray renderer (Flyby), and 9% slower when using MAX's inbuilt raytracer (Architecture). Not exactly a flattering result. Going forward, media encoding represents another important task for which CPU performance is quite important:

Things are slightly less clear here, with the combatants being evenly matched in x264 and WMV encoding, and Yorkfield winning heavily in DivX (one of its traditional favorites) and in MP3 and OGG encoding (confirming what we saw in the synthetic OGG encode test, albeit with reduced amplitude). What we've noted via some profiling work is that Deneb gets a significantly lower branch-prediction success rate in our DivX test, which, coupled with the fact that its L2 hit rate is inferior to Yorkfield's, explains the performance delta to a large extent. For the audio encoders it's a case of arguably simple code (they're mostly X87+MMX, with sprinklings of scalar SSE and sometimes packed SSE instructions) running as fast as possible. Since the hit rate for the L1 is very high, and work sets fit into the L1+L2 hierarchy for both CPUs, Yorkfield's wider and more efficient execution engine pays dividends in this case.
By contrast x264 relies quite significantly on more complex SSE instructions (as a minor bit of trivia, it's the only application that uses SSE4a - more specifically, the LZCNT instruction - from what we know), and is also more memory system intensive. This matches Deneb's advantages in instruction fetch and memory accessing quite nicely.
Now that you've encoded your videos you may want to get something out of a .rar or .zip file and maybe even zip something up yourself:

Little to comment here except noting that the WinRAR inbuilt test which we used is a decompression test, and as such it's quite sensitive to memory latency and bandwidth. Finally there's one more area that gets CPUs sweating and may hold relevance for typical desktop users, and that is scientific computing:

You may take issue with deeming the FritzChess benchmark a scientific computing test but it would've fit nowhere else. It does give some insight into how these CPUs handle a fairly typical decision (game) theory problem, focusing primarily on the process of computing the consequences of decision nodes (as far as we can tell from the benchmark itself that is).
Euler3D is actually grabbed straight from the scientific domain, being put together by the fine folks from Oklahoma State University's Computational Aeroservoelasticity Laboratory. It is FPU intensive and does quite a bit of memory accessing. It is in fact one of the more memory access intensive “desktop” apps we've seen. Given that second trait one would immediately assume that Deneb should win, but this is a case where the tri-level cache hierarchy may hurt it: the test scales across all cores, but inter-core data sharing is limited if not non-existent. This means the L3 brings no advantages on that front whilst Yorkfield doesn't have to pay the price of inter-die sharing via main RAM. Since the test scales to all cores, L3 latency is likely to be higher than what we measured/AMD touts as a best case scenario, since all cores would be hitting the L3 and arbitration would come into play. A consequence of this is that the latency for a RAM access is also increased. Finally, from what we've seen most of Euler3D's instructions are legacy X87. It's likely that both Deneb and Yorkfield are fetching from the L1I gleefully and keeping their execution engines away from front-end starvation. This is a case in which Yorkfield holds an advantage, not to mention that it also gets a slightly better branch prediction success rate (97.03% versus 93.82% for Deneb).
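A gap of roughly three points in success rate looks small, but it reads differently when flipped into misprediction rates. The following is a minimal sketch that only rearranges the two figures quoted above; the function name is ours, and it assumes both CPUs execute a comparable number of branches in this test, which we haven't verified.

```python
def mispredict_rate(success_pct: float) -> float:
    """Convert a branch prediction success rate (in %) into a misprediction rate (in %)."""
    return 100.0 - success_pct

# Success rates measured in Euler3D, as quoted in the text.
YORKFIELD_SUCCESS = 97.03
DENEB_SUCCESS = 93.82

yorkfield_miss = mispredict_rate(YORKFIELD_SUCCESS)
deneb_miss = mispredict_rate(DENEB_SUCCESS)

# How many mispredictions Deneb takes for each one Yorkfield takes,
# assuming (hypothetically) the same number of branches on both CPUs.
ratio = deneb_miss / yorkfield_miss

print(f"Yorkfield mispredicts {yorkfield_miss:.2f}% of branches")
print(f"Deneb mispredicts {deneb_miss:.2f}% of branches")
print(f"Deneb takes about {ratio:.2f}x as many mispredictions")
```

In other words, under that assumption Deneb pays the misprediction penalty about twice as often, which is a less "slight" difference than the raw success rates suggest.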
Closing the Applications section and the Scientific Computing subsection you can see in the chart above a long list of R tests. The R project is, quoting its very site, “a free software environment for statistical computing and graphics”. The range of problems that can be programmed into R is quite extensive and, making a pretty long story short, for those that have use for such tools it's quite awesome (this comes from experience). It's also something that is pretty seldom tested, since it would appear that the headcount for the aforementioned user category is not that impressive, or they're not vocal enough. At any rate we've decided to test it, and to do so we've used the 2.5 version of the R-benchmark script (link). This includes a vast array of calculations/functions, although it's in no way exhaustive, and it is growing a tad long in the tooth. While properly discussing every subtest would require more words than this article can accommodate at this point, we have to discuss the 3 tests where differences are huge, namely 1, 12 and 13. What tests 1 and 12 have in common is that they both include computing the transpose of a matrix. Since both show scale breaking advantages for Yorkfield, it's safe to assume that there's an issue with this particular computation in the R version/BLAS dll we used. Test 13 is more difficult to figure out (admittedly we haven't yet), but our guess at this point in time is that it is also an R+Deneb oddity, or once again the BLAS dll we used. We've run the same calculation through Scilab 5.1 (a somewhat similar application), and while Yorkfield was still faster, it was within the confines of reason, to the extent we'd expect from its better compute muscle.
Ah, finally, the section that's likely to see the most accessing. Games are pretty self explanatory aren't they? It will definitely be nice to see how the “best CPU for games” moniker that Deneb received pre-launch holds up to scrutiny. Ok we're cheating, we already know the answer to that... thinking about it, you probably do too given how reviewed to death the Phenom IIs were. First of all here's how and what we tested in this chunk:

Note that we used in-built tests wherever possible to minimize variability between runs. (Yes, we know it's not really really real world but something has got to give, and when differences are small enough added noise is something you want to avoid at all costs.) Also note that whilst we maximized in-game settings, we did NOT enable GPU exclusive goodies like AA and AF since that would've de-emphasized the CPU's merits or lack thereof.
Of a similarly elevated importance is the fact that we went for a different configuration for the game tests, switching to a DFI LanParty 790FXB-MRSH motherboard and plugging in 2 4870X2s for a quadsome. The motherboard switch was needed since the Biostar doesn't allow Quad CrossFire as we already mentioned. Catalyst drivers are also a newer build than those used for the other tests, version 8.61.1. All else remains equal and with those not particularly exciting details out of the way let's see how average “life” flows:

What an incredibly surprising turn of events that couldn't have been anticipated... umm, sorry, wrong pre-baked line, that's for something else. Where were we again? Ah, yes, we were at the point where we note that gaming performance is in line with what one would come to expect/could extrapolate from prior results. Games tend to like branch-prediction and compute muscle. They also tend to have good data locality with worksets nicely tailored for caches. At the other end of the spectrum, sensitivity to memory bandwidth and latency tends to be reduced. This combined with the above makes games a nice match for Yorkfield (as Intel most probably anticipated as it was designing the CPU). The effects of this arrangement can be seen across the entire spectrum of games barring 2 notable exceptions: Mass Effect and GRID. Grabbing a chunk from the future “experiment” article, we'll mention that GRID is memory intensive compared to other apps we've profiled (to a surprising extent!). It also has a pretty granular multi-threading approach in which inter-thread data sharing is not a rarity. These aspects conspire to bring Deneb its victory, and whilst we haven't done a proper profiling job on Mass Effect or other Unreal Engine 3 games, we'd be willing to bet the situation is similar there. Whether or not these are portents of the future remains to be seen however, since other modern games don't exhibit such a behavior.
As we all know, even an average life has its lows and as such it only makes sense to explore those too:

Heh, would you believe that? Once again no surprises. Some Minimums are missing as you might've noticed: that's due to them not being reported/properly reported. Aside from that there is little to discuss that we haven't already discussed. We could get into the whole “smoother” debate, but quite frankly that's well beyond our abilities – sorry to disappoint you guys! And with this admission of our own lack of adequacy, we can conclude the performance investigating part. Time to move on to closure, and not a moment too soon!
Looking at many of the tests you may be asking yourself that question. You may also think we're being major douches for being mean to Deneb. But in a completely surprising turn of events we'll admit that not only do we like it, but we're also impressed with what AMD managed to do with it!

Let's qualify and flesh out that statement: AMD got right everything that they could get right by means of a simple process shrink. They brought core clocks out from the deep pit in which Barcelona and its troubled 65nm existence had brought them, evening out that playing field (this really was a major disadvantage). They bumped L3 size up from the rather pedestrian 2MB the original Phenom sported. While not exactly lightning fast nor particularly size efficient, the L3 plays an important role given the rather small L2 size for AMD's architecture. Its latency is still hugely smaller than that of a RAM access. With 2 MB being shared by 4 cores, congestion was almost a given in multi-threaded workloads, but with 6MB you're keeping work on-die far more often to a notable benefit. Finally, not only did they solve the clock scaling issue, but they also brought power draw and heat output to sane levels. For all of these AMD deserves kudos... so where's the problem?
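To put a rough number on the congestion argument, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the shared L3 is split evenly between four busy cores; a real shared cache does no such thing, so treat the function and its output as a hypothetical upper-level estimate rather than a measurement.

```python
CORES = 4  # both the original Phenom and Phenom II (Deneb) are quad-cores

def l3_share_mb(l3_size_mb: float, cores: int = CORES) -> float:
    """Naive even split of a shared L3 across all cores, in MB (illustrative only)."""
    return l3_size_mb / cores

phenom_share = l3_share_mb(2)     # original Phenom: 2MB L3
phenom_ii_share = l3_share_mb(6)  # Phenom II: 6MB L3

print(f"Phenom:    {phenom_share:.1f}MB of L3 per core")
print(f"Phenom II: {phenom_ii_share:.1f}MB of L3 per core")
```

Under that simplification each core goes from half a megabyte of L3 to one and a half, which gives a feel for why more of the work stays on-die.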
Well, the problem is the base architecture, and the fact that in terms of compute throughput it can't quite match Core 2 in a head to head deathmatch. We'd pin this difference primarily on its rigid decode and schedule arrangement, with its two symmetrical simple-complex decoders and the fixed structure of the Reorder Buffer to name just two notable aspects. We're also not convinced that the space efficiency benefits of the exclusive scheme used for the caches (excluding the L3) outweigh the costs of more ample snooping (you have to check all levels in the hierarchy). Changing these is something beyond what can be done within the confines of a process shrink. We'll have to wait until Bulldozer comes to see how AMD has decided to move forward from its respectable yet long in the tooth K8 base architecture (its heritage is quite present in K8L/Phenom). The Integrated Memory Controller helps to cover some of this delta in some desktop apps, but for server ones the tables are turned since workloads there favor Deneb's architecture significantly. The Core 2s are hamstrung by their reliance on the FSB and their MCM approach for creating a quad-core CPU. However, for server-oriented investigations you'll have to turn to other authors who are better equipped to handle them, among whom we'll once again single out our friend David Kanter.
At this point in time you may also be wondering why there's no mention of Nehalem, Intel's latest and greatest. Beyond our raging fanboyism, there are a few other reasons for that: we don't have a Nehalem sample on hand, total platform cost (TPC) for it remains a tad too high due mainly to the costly X58 based motherboards and, perhaps more importantly, it can be easily summed up by saying that it is better. Simplifying, Nehalem keeps the best parts from Core 2 (its Front-End and its Execution Engine) and adds a few tweaks to them while changing the memory system to something eerily similar to what Deneb has. This is a testament to the merits of that memory system in the context of increasing core counts and more complex multi-threading, though Intel's take on it has a few tweaks of its own. Can you see any reason why it wouldn't be better? All in all this remains on the to-do list.
Actually we did, and we deferred discussing that to this part. This piece was intended as an architectural comparison more than anything and its goal was to see how Intel's and AMD's architectures fared when equally clocked with other differences kept at a minimum. As tech geeks that's what interests us first and foremost. However that's only part of the equation.
At this time the Phenom II 940 is priced at $214.99 at Newegg, with a $25 discount bringing it down to $189.99. At the same time the Q9650 is priced at $339.99 ($309.99 after a $30 discount) at the same shop ... roughly 60% more means that these CPUs obviously aren't intended to compete with each other directly, not to mention that the speed advantage the Q9650 has doesn't justify such a price premium. Pricewise the 940 is positioned against the Q9400 (cheaper than it, actually), which, relative to the Q9650, has a 12% core clock deficit and half the L2 cache (3MB per die for a 6MB total). Making the ignorant assumption that the cache reduction won't matter, the core down-clock alone should make it possible for the 940 to jump ahead of this only adequate competitor. Platform costs for both should be pretty much equal, contrary to what you might've heard (platform costs for Nehalem are indeed higher, but that's not here is it?). AMD has also recently further fleshed out its product line and there's a throng of products out there. We'd dare say that up to Q9550, and even 9650 levels AMD is quite competitive.
You'll note that we haven't talked much about DDR3. Although support for that has been exposed via the new AM3 CPUs, to be frank it's not at all interesting at this point in time either price-wise or performance-wise. Since the more accessible lower speed bins have higher latencies than their equally accessible DDR2 equivalents, and the higher, more expensive speed bins can't be entirely leveraged with current NorthBridge clocks, the performance benefits don't justify the price premium at all. DDR2 is dirt cheap even for 1066MHz sticks with decent latencies, so grab 8GB of that and don't look back.
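Coming back to the price gap between the 940 and the Q9650 for a moment: the premium is easy to check from the Newegg prices quoted above. The sketch below does just that; the helper name is ours, and the only inputs are the four prices from the text.

```python
def premium_pct(price: float, baseline: float) -> float:
    """Return how much more `price` costs than `baseline`, in percent."""
    return (price / baseline - 1.0) * 100.0

# Newegg prices quoted in the text, in USD.
PHENOM_II_940_LIST, PHENOM_II_940_DISCOUNTED = 214.99, 189.99
Q9650_LIST, Q9650_DISCOUNTED = 339.99, 309.99

list_premium = premium_pct(Q9650_LIST, PHENOM_II_940_LIST)
discounted_premium = premium_pct(Q9650_DISCOUNTED, PHENOM_II_940_DISCOUNTED)

print(f"Q9650 premium at list price:   {list_premium:.0f}%")
print(f"Q9650 premium after discounts: {discounted_premium:.0f}%")
```

At list prices the Q9650 comes out a bit under 60% more expensive, and a bit over 60% once both discounts are applied, since the 940's discount is proportionally larger.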
Going forward we'll go ahead and make a wild guess that AMD will release a part that slightly outperforms the 9650 just like the 940 outperforms the 9400. On a tech level, from what we've seen there may be parts that have 2.2GHz NorthBridge clocks and 64-way associativity for the L3, though we're not ready to bet the farm on that. 6-core CPUs are on the verge of coming out in server incarnations, although whether or not they'll grace our desktops anytime soon is uncertain. What is certain is that until AMD comes out with a new architecture its competitive position isn't great, despite Deneb marking a return to competitiveness. For the time being they'll have to content themselves with hoping that this Argentine Tango they're dancing, in which Intel is the leader, won't involve many sacadas, since they're not likely to weather an all out offensive. And that's all folks ... for now!