AMD Bulldozer Dual-Interlagos Benchmarks @ Phoronix


    AMD Bulldozer Dual-Interlagos Benchmarks @ Phoronix

    Lately we have been talking a lot about Intel's latest Sandy Bridge processors under Linux due to their very competitive performance and interesting graphics abilities, but on the AMD side there has not been too much to talk about. On the low-end there are the intriguing Fusion APUs, but on the high-end they don't have an answer to Sandy Bridge until delivering their new "Bulldozer" products closer to the summer. Fortunately, we have the first Linux scoop and performance benchmarks from engineering samples of their 16-core Interlagos server chip.

    #2
    It's kinda unfair to compare a 32-core system with a single Sandy Bridge, imo.
    "There is no beginning, and there is no end. There is no alpha, and there is no omega. You never began, and you will never end."



      #3
      Originally posted by badsykes
      It's kinda unfair to compare a 32-core system with a single Sandy Bridge, imo.
      so....bulldozer performance dismal?

      considering the i5 2500k only did it in a little over twice the time with 4 cores against a 32 core machine, wonder what the power draw is on bulldozer?

      i'd be interested in seeing if 2 bulldozer chips = double the performance, would that mean one chip could do it in 50...then with 4 cores 200 seconds...so comparing it against another 4 core intel chip would be its ... well, i suppose it is only clocked at 1.8ghz.



        #4
        the data points are meaningless, no direct comparison can be made between the configurations. trying to infer single thread performance or IPC from those results is futile.



          #5
          Originally posted by caveman-jim
          the data points are meaningless, no direct comparison can be made between the configurations. trying to infer single thread performance or IPC from those results is futile.
          While determination of IPC or individual thread performance becomes difficult due to these oddly matched configs, I don't agree with your assessment of futility. I think the 32 core x86 versus 4 core x86 is a VERY apt consideration. And the point of it achieving a little over double makes for a VERY useful data point to make some pre-calculation based suppositions from.

          First off, given that a processor with 8 times the thread count only won by 2x, there are some of the normal SMP scaling issues to work out. Obviously, you never see 100% scaling from each core added except in the most artificial scenarios. But you should be able to expect a heavily threaded app to at least show something better than 2x performance for 8x the number of cores. That much is common sense.

          At least 4x the performance was a minimum it needed to hit to be exciting. Preferably quite a bit more than 4x would be optimal when you invest in a server/workstation setup with THAT many cores and then compare against a desktop setup. So, while it is impossible to make actual calculations, it is not impossible to set reasonable expectations. 8x the core count needs more than 2x the performance to be worth consideration unless that 8x core count is being offered at a ridiculously low price.
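
A rough sketch of that expectation check, using the 61 s (i5 2500K) and 25 s (dual Interlagos) C-Ray times quoted from Phoronix in this thread. The function names are made up for illustration, and the check treats cores from the two chips as interchangeable, which is an assumption and not something the data supports.

```python
# C-Ray times quoted in this thread (seconds, lower is better).
I5_2500K_SECONDS = 61.0     # 4 cores
INTERLAGOS_SECONDS = 25.0   # 32 cores (2 x 16), engineering samples


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate finished than the baseline."""
    return baseline_seconds / candidate_seconds


def scaling_ratio(speedup_value: float, core_ratio: float) -> float:
    """Speedup obtained per unit of extra core count (1.0 = perfect scaling)."""
    return speedup_value / core_ratio


core_ratio = 32 / 4
observed = speedup(I5_2500K_SECONDS, INTERLAGOS_SECONDS)
ratio = scaling_ratio(observed, core_ratio)

print(f"core ratio: {core_ratio:.0f}x")
print(f"observed speedup: {observed:.2f}x")
print(f"speedup per unit of core ratio: {ratio:.3f}")
# Time the 32-core box would have needed to clear the 4x bar argued for above.
print(f"time needed for 4x: {I5_2500K_SECONDS / 4:.2f} s")
```

This only restates the expectation in numbers; it says nothing about IPC or per-thread performance.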
          If you feel like I'm hurting your wittle feelings too much, refer me to this thread : A new nicer moshpit???
          "Go screw yourself Apple."



            #6
            You don't know TDP, and as you say, what the benchmark workload is intended to demonstrate. Speculating IPC or single CPU performance from that data is futile, insufficient data.



              #7
              One of the early dual Interlagos results from the 32 cores running at 1.8GHz indicates that its C-Ray time is a mere 25 seconds. C-Ray happens to be one of our favorite multi-threaded ray-tracing benchmarks. What does this compare to? Well, running an easy OpenBenchmarking.org comparison shows just how fast AMD's Bulldozer is looking to be. While there are other software/hardware differences in play too, the 32-core 1.80GHz Bulldozer system's 25 seconds compared to the Intel Core i5 2500K (quad-core + Hyper Threading; 3.3GHz + 3.7GHz Intel Turbo Boost) at 61 seconds or the dual quad-core AMD Opteron 2384 system of ours taking 127 seconds to complete. The Intel Core i7 970 (six cores + Hyper Threading; 3.2GHz Base Frequency + 3.46GHz Turbo Boost) comes in at about 61 seconds too.
              And so... The 32 core AMD at 1.8GHZ gets owned by...

              However, this is not the fastest C-Ray result we have encountered on our open and collaborative benchmarking platform. A Dell PowerEdge server that's packing four Intel Xeon X7550 CPUs that each have six-cores and Hyper Threading with a 2GHz base frequency with 2.4GHz Turbo Frequency and 18MB of L3 cache, is the current winner in that category as shown by doing this dynamic comparison.
              Intel: 4 x 6 = 24 x 2000mhz per core = 48Ghz @ 13.47 seconds - 57.6Ghz as well assuming max TurboBoost during full run which is very unlikely.

              AMD: 2 x 16 = 32 x 1800mhz per core = 57.6Ghz @ 25.97 seconds

              I'd say there is a startling lack of IPC here for the AMD team.
              Last edited by gamefoo21; Mar 23, 2011, 06:33 PM.
              "Curiosity is the very basis of education and if you tell me that curiosity killed the cat, I say only that the cat died nobly." - Arnold Edinborough



                #8
                Originally posted by gamefoo21
                And so... The 32 core AMD at 1.8GHZ gets owned by...



                Intel: 4 x 6 = 24 x 2000mhz per core = 48Ghz @ 13.47 seconds - 57.6Ghz as well assuming max TurboBoost during full run which is very unlikely.

                AMD: 2 x 16 = 32 x 1800mhz per core = 57.6Ghz @ 25.97 seconds

                I'd say there is a startling lack of IPC here for the AMD team.
                You can't determine IPC because you don't know how well the application scales across all cores. Without that number, it's impossible to know. However, doing some quick math, the application scales close to 1:1 with clocks. The i3 370m is clocked 15% higher than the 330m and is faster by a little over 12%. My guess is multithread scaling on that application is less than 50%.



                  #9
                  You can't compare scaling with one architecture to another to determine a non-benchmarked performance metric. You don't know if the benchmark uses the same codepath for both architectures or how it is optimized/scaled.

                  Insufficient data.



                    #10
                    Yeah, we need more information and benchmarks.

                    AMD Phenom II X2 555 @ stock clock
                    Xigamtek Knight cooler
                    ASUS M4A79XTD EVO
                    G.Skill 8GB DDR3 1333 (4x4GB)
                    Intel 530 240GB SSD
                    XFX ATI Radeon 4870 1GB
                    Antec Truepower 750W
                    NZXT Source 210
                    Windows 7 x64



                    AMD FX-8350 @ stock clock
                    Gigabyte GA-990FX-UD5 R5
                    G.Skill Sniper 16GB (8x2) DDR3 1866
                    Arctic Freezer 7 Pro 7 rev. 2
                    Gigabyte Windforce 7950 3GB Ghz Edition
                    Samsung 840 Pro 128GB SSD
                    EVGA SuperNova 650W
                    NZXT Source 210 w/ two Noctua F-12 fans
                    Ubuntu MATE 64-bit
                    Intel i5 3570K @ stock clock | G.Skill 16GB (8GBx2) DDR3 1866 | Silicon Power 60GB SSD | Win 10 Pro x64 | NZXT Source 210



                      #11
                      Hey Lupine,



                        #12
                        I love math...

                        Originally posted by gamefoo21
                        And so... The 32 core AMD at 1.8GHZ gets owned by...

                        Intel: 4 x 6 = 24 x 2000mhz per core = 48Ghz @ 13.47 seconds - 57.6Ghz as well assuming max TurboBoost during full run which is very unlikely.

                        AMD: 2 x 16 = 32 x 1800mhz per core = 57.6Ghz @ 25.97 seconds

                        I'd say there is a startling lack of IPC here for the AMD team.

                        You have some bad info here:

                        The X7550 is 8-cores, and 16-threads...

                        Intel Xeon X7550:
                        8 Cores, 16 threads, 2GHz, 18MB L3
                        w/o HT:
                        4x8 = 32 cores
                        32 * 2 = 64GHz

                        w/ HT ( as 1/4 cores )
                        8 HT cores * 4 = 32
                        32 / 4 = 8 core equiv
                        8 * 2 = 16GHz
                        64 + 16 = 80GHz HTAdj

                        AMD Interlagos:
                        16 Cores & Threads, 1.8 GHz, 8(?)MB L3
                        16 * 2 = 32 Cores
                        32 * 1.8 = 57.6GHz

                        So the comparisons are as follows:

                        X7550= 64/80GHz
                        AMD B= 57.6GHz
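
The tallies above can be reproduced with a short sketch. The "aggregate GHz" figure (cores times clock) and the 1/4-core weight for each Hyper-Threading thread are this post's own rough conventions, not an established metric; the function names are invented for illustration.

```python
def aggregate_ghz(sockets: int, cores_per_socket: int, clock_ghz: float) -> float:
    """Cores times clock, summed over all sockets."""
    return sockets * cores_per_socket * clock_ghz


def ht_adjusted_ghz(sockets: int, cores_per_socket: int, clock_ghz: float,
                    ht_weight: float = 0.25) -> float:
    """Aggregate GHz plus one extra HT thread per core, counted as ht_weight of a core."""
    physical = aggregate_ghz(sockets, cores_per_socket, clock_ghz)
    ht_core_equivalents = sockets * cores_per_socket * ht_weight
    return physical + ht_core_equivalents * clock_ghz


x7550_plain = aggregate_ghz(4, 8, 2.0)        # four 8-core Xeon X7550 at 2 GHz
x7550_ht = ht_adjusted_ghz(4, 8, 2.0)         # same, with HT counted as 1/4 core
interlagos = aggregate_ghz(2, 16, 1.8)        # two 16-core Interlagos at 1.8 GHz

print(f"X7550: {x7550_plain:.1f} / {x7550_ht:.1f} GHz")
print(f"Interlagos: {interlagos:.1f} GHz")
print(f"Interlagos : X7550 (HT adjusted) = {interlagos / x7550_ht:.2f} : 1")
```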

                        We have to ignore price, TDP, and L3...

                        So the X7550 time-core performance is ~18 secs, while Interlagos is 25 secs

                        However, the X7550 has a thread advantage, running double the threads, and also a clock advantage which is also exaggerated by the thread count.

                        The end result is comparing a 2GHz 40 core system to a 32 core 1.8GHz system. ( or 80GHz vs 57.6 GHz ).

                        .72:1

                        This actually makes normalized performance identical @ 25 seconds between the X7550 and the Bulldozer.

                        If you were to extrapolate this to IPC, this would mean that Bulldozer, in this test, is more or less equal to Xeon if you take away hyper-threading and clock differences in multi-threaded processing for this single application.

                        Now, with comparing to i5 2500

                        i5 2500:
                        4x3.3 = 13.2 GHz @ 61 secs
                        Interlagos
                        32x1.8 = 57.6 GHz @ 25 secs

                        .229:1

                        Direct comparison here becomes impossible due to package limitations.
                        The 32 cores on Interlagos are in four groups of 8, using two double-8 core packages while i5 is on a single package. So the following numbers are entirely useless.

                        So, normalized you have

                        i5 @ 61secs
                        bd @ 109secs
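
For anyone checking the arithmetic, here is a minimal sketch of that normalization. It assumes run time scales linearly with aggregate GHz (cores times clock), which is exactly the crude assumption being made here; `normalize_time` is a made-up helper name.

```python
def normalize_time(measured_seconds: float, measured_ghz: float,
                   target_ghz: float) -> float:
    """Estimate the run time at target_ghz, assuming time scales as 1 / aggregate GHz."""
    return measured_seconds * measured_ghz / target_ghz


i5_ghz = 4 * 3.3            # i5 2500: 4 cores at 3.3 GHz
interlagos_ghz = 32 * 1.8   # dual Interlagos: 32 cores at 1.8 GHz

ratio = i5_ghz / interlagos_ghz
bd_normalized = normalize_time(25.0, interlagos_ghz, i5_ghz)

print(f"i5 : Interlagos aggregate GHz = {ratio:.3f} : 1")
print(f"i5 measured: 61 s, Interlagos normalized: {bd_normalized:.0f} s")
```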

                        It looks bad, with the i5 apparently being >50% faster, but we have much with which to be concerned:

                        Each 8 cores has a performance hit relative to the other groups, so let us extrapolate (GUESS) some overhead.

                        We have four places where performance is lost on the Interlagos platform as tested:

                        2x on-package
                        1x inter-package
                        1x scheduling overhead

                        I know a fair amount about the performance of existing platforms, so I'll try to be accurate.

                        We must assume worst-case usage, but common-case losses as we are comparing performance under full duress, with no cores/threads being left idle, and all being given different data, and results from one core being needed by others non-linearly to complete the overall task set ( this is how many real-world server work-loads behave if they actually maxed ). BTW, the Bulldozer would look MUCH better in a real-world SERVER work load versus the i5 because these overheads would be diminished, whereas the i5 would become over-taxed with the work load - which is why one is used for servers, and the other for clients.

                        The intra-package overhead easily costs a full core in the common-case, so we will subtract 2 cores overall.

                        32 - 2 = 30

                        inter-package overhead can cost a huge number in poorly optimized situations, and still is very high in normal full-load, we'll call this two cores for this load ( these are 1.8 GHz cores, after-all, and we have 16 core packages communicating here ).

                        so 28 cores effective out of 32.

                        Now, we have an issue with scheduling overhead. This is not just the kernel calculation overhead, but also the application's locking, juggling of memory from one package to another due to scheduling imperfections, and exhausting the L3 cache... so we'll cut out another three cores, even though the loss can be much greater in the worst case.

                        In fact I wrote an application that had so much overhead in these channels ( on purpose ) that each added core cut performance almost in half, to prove my point about locking performance to colleagues ( explaining why HT caused slow downs in so many apps ).

                        So 25 cores vs 4 cores.

                        i5 2500:
                        4x3.3 = 13.2 GHz @ 61 secs
                        Interlagos ( w/ overhead - moderate )
                        25x1.8 = 45 GHz @ 25 secs

                        .293:1

                        61secs vs 85 secs
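
The overhead-adjusted numbers can be sketched the same way. The core-equivalents subtracted below are the guesses made in this post, not measurements, and the linear scaling with aggregate GHz is still an assumption; the dictionary labels are only illustrative.

```python
CLOCK_GHZ = 1.8
TOTAL_CORES = 32

# Guessed core-equivalents lost to each kind of overhead (values from this post).
GUESSED_LOSSES = {
    "on-package (2 packages x 1 core)": 2,
    "inter-package": 2,
    "scheduling, locking, cache": 3,
}

effective_cores = TOTAL_CORES - sum(GUESSED_LOSSES.values())
effective_ghz = effective_cores * CLOCK_GHZ

I5_GHZ = 4 * 3.3
INTERLAGOS_SECONDS = 25.0

ratio = I5_GHZ / effective_ghz
normalized_seconds = INTERLAGOS_SECONDS / ratio

print(f"effective cores: {effective_cores} of {TOTAL_CORES}")
print(f"effective aggregate clock: {effective_ghz:.0f} GHz")
print(f"i5 : Interlagos = {ratio:.3f} : 1")
print(f"i5 measured: 61 s, Interlagos normalized: {normalized_seconds:.0f} s")
```

Changing any of the guessed losses moves the normalized time directly, so the result is only as good as those guesses.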

                        Still, the Bulldozer looks bad in IPC for this particular application vs Intel's i5 2500, but it looks better than Phenom II by about 10-15% - which is what I expected from AMD's official comments... so I think this math may actually be a fair indication of IPC, even given the extreme crudeness.

                        Of course, Interlagos is throughput-optimized, not speed-optimized like the Desktop versions would be.

                        This means that the Bulldozer CPUs should match the first generation of i7s, with just a touch better performance, which would be a good enough result for AMD. It then becomes a clock-based race again.

                        Bulldozer's single-threaded FPU performance should be awesome, looking to be about 70% faster after FPU scheduler overhead is considered... which is amazing...

                        --The loon
