As you're probably aware, Nehalem's desktop iteration, the Core i7, is currently the fastest desktop CPU available. It's a significant improvement over Penryn in most areas, especially in well-threaded code. A quick chat with one of the most important people in the HPC scene shows that this will also be the case for that particular market.
Kazushige Goto is the creator and lead developer of GotoBLAS, the top-performing BLAS (Basic Linear Algebra Subprograms) implementation - one that is even faster than Intel's own Math Kernel Library.
BLAS implementations are used extensively in scientific computing and also underpin Linpack, one of the leading HPC benchmarks today and one of the most commonly used to tune the performance of supercomputers around the world.
Kazushige is currently tuning the GotoBLAS library for Nehalem CPUs and has already achieved remarkable results, even though the work is not yet complete.
These are efficiency results, compared to the theoretical peak performance:
AMD Opteron "Shangai" 6MB L3 : 90% of peak
Intel "Nehalem" Core i7 : 95.2% of peak
Intel Itanium 2 : 95.6% of peak
Nehalem achieves a remarkable result in Linpack. Although this is mostly a synthetic result, well-tuned programs can come close to these numbers. Hyper-Threading is disabled, since it doesn't provide benefits for these applications: the execution units are already well fed.
Since these results concern efficiency, AMD is in trouble. One of the last strongholds the Opteron held was the HPC market, where the "Barcelona" core outshined the Core-architecture-based Xeons, especially in four-socket systems, due to Intel's traditional FSB implementation. I'm working with 1333MHz FSB Penryn-based Xeons, and bandwidth limitations force me to use only 6 of the 8 cores available per node.
The new Xeon "Gainestown" fixes that, since it supports QPI, the HyperTransport-like system interconnect - four- and eight-socket systems will no longer pose a problem.
With this kind of efficiency and higher clocks (compared to the Shanghai Opteron), Intel will definitely manage to grab an even larger slice of the HPC market.
The top Opteron, clocked at 2.7GHz, has a peak floating-point throughput of 4 x 4 x 2.7GHz = 43.2GFlops/s, which translates to 38.8GFlops/s in Linpack.
Nehalem Xeons, of which the W5580 model at 3.2GHz will be the top of the line, can then achieve, also using SSE, 4 x 4 x 3.2GHz x 95.2% = 48.74GFlops/s. That's 26% more performance than the Opteron.
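For reference, here is that arithmetic spelled out as a tiny C sketch. It assumes 4 double-precision flops per cycle per core with SSE (one 2-wide add plus one 2-wide multiply), which is the same assumption used in the formulas above:

#include <stdio.h>

int main(void)
{
    int    cores      = 4;
    double flops_clk  = 4.0;    /* DP flops per cycle per core with SSE */
    double ghz        = 3.2;    /* Nehalem Xeon W5580 clock */
    double efficiency = 0.952;  /* measured Linpack efficiency */

    double peak    = cores * flops_clk * ghz;  /* 51.2 GFlops/s */
    double linpack = peak * efficiency;        /* ~48.74 GFlops/s */

    printf("peak: %.2f GFlops/s, Linpack estimate: %.2f GFlops/s\n",
           peak, linpack);
    return 0;
}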
The new Xeons will be one of the next big things to look for in HPC. AMD will need very competitive pricing to keep its current share because, until now, the Opterons were really good CPUs for HPC: they had an architecture designed for it. Now, Intel has managed to do even better.
11 comments:
You misspelled "strength".
Fixed, thanks!
I'm using threaded Goto BLAS (penryn optimized) on a 3.2 GHz Core i7 with hyper-threading enabled. Will my calculation run faster without HT?
It will, as confirmed also by Kazushige, at least during intensive matrix multiplication operations.
HyperThreading is only an advantage if you're not feeding the execution units of each core properly. In most scientific codes you should leave it off, as each thread should already be keeping them busy.
Still, you should run a quick test with your codes - something like the sketch below. Run your program with 8 threads and HT enabled and compare with running just four threads with HT deactivated.
If you run only four threads with HT enabled and your OS isn't scheduling properly, you will lose performance when threads land on HT's virtual processors instead of only on physical ones.
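A minimal timing sketch for that kind of test, assuming GotoBLAS was built with its CBLAS interface; set the thread count before each run (e.g. through the GOTO_NUM_THREADS environment variable) and compare the GFlops/s you get:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

static double wall(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    int n = 4000;  /* three n x n matrices, roughly 384MB total */
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    double t0 = wall();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double t1 = wall();

    /* dense GEMM does 2*n^3 flops */
    double gflops = 2.0 * n * (double)n * n / (t1 - t0) / 1e9;
    printf("n=%d: %.2f s, %.2f GFlops/s\n", n, t1 - t0, gflops);

    free(a); free(b); free(c);
    return 0;
}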
Best regards
Thanks! I'll try to run my code with HT deactivated.
The result for Opteron is inaccurate. With ACML 4.2 you can get ~93.5%. I just measured this on a 2.6GHz 1S system with 1333MHz DDR3 memory using openSUSE 11.0...
Tim Wilkens, Ph.D.
Hi Tim,
Today, I can't find an Opteron CPU for 2S systems that has support for more than DDR2-800. The above results are for a DDR2 system.
These tests are usually done with 2S system CPUs in mind - probably the most common node setup for clusters - for which AMD hasn't got DDR3 yet.
Are you running an AM3 based Opteron 1xxx or Phenom II?
The extra bandwidth of DDR3 would explain the higher efficiency you got, as Goto's approach has always yielded better results than other implementations of BLAS - although I haven't tested the latest versions of ACML.
Best regards
(cont...)
Thank you for sharing your results with me and please follow up if you get to read my reply,
Best regards
Tiago,
My home system is a 2.8GHz DDR2-800 1S system. I'll run it and post the results. I'm the principal author of the BLAS/FFT code in ACML, and the efficiency will be identical on a DDR2-800 system. All L3 BLAS functions can be blocked to reuse data from the L2 and L1 and thus minimize the memory bandwidth required. With ACML 4.2 I believe the usage is ~1GB/s at a 2.6GHz core clock. There may be an effect of a fraction of a percent - a few tenths - and that is likely more impacted by latency than bandwidth (whose efficiency itself is impacted by latency).
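Roughly, the blocking idea is something like this toy sketch (not the actual ACML kernel - real kernels pack tiles and use hand-tuned SSE code, and BLOCK here is only an illustrative value):

#define BLOCK 64

/* Naive blocked C += A*B (row-major, n x n): a BLOCK x BLOCK tile of A
   and of B is reused from cache across the whole inner triple loop, so
   main-memory traffic per flop stays low. */
void dgemm_blocked(int n, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}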
A couple of points, though: the efficiencies you quote are, I assume, asymptotic efficiencies, and on a single thread. Multi-threaded results can be similar or equal, but you have to go to a larger problem size. There's also the manner in which you test. So the accuracy of the efficiencies you quote varies with the number of cores used, the testing methodology (do you do more than one timing?), and the problem size at which you test.
Hope this helps..
Tim Wilkens, Ph.D.
Tiago,
Last response in regards to the premise of the article. HPC math library performance is critical, no doubt. Yet there are other factors which are equally important. Many applications use HPC math libraries containing BLAS/LAPACK/FFTs, yet many (I'd say the majority) do not. Of those that do, is the math library the critical linchpin in performance? For some yes and for others no. Memory performance is equally important for a class of applications like Computational Fluid Dynamics (CFD), where problems are non-linear. This is true in weather modeling, oil and gas reservoir simulations, drag coefficient and vehicle/structure fluid dynamics.
Additionally, cache structure can be very important depending upon the working set size of the application in question. I'd say 256KB is a bit small and hinders performance in "a class" of applications.
Another factor which may hinder performance is the load throughput of the L1. On NH you can only do 1 load every cycle (I've observed that it's more like 7 loads every 8 cycles in GEMM) vs 2 on GH (loads being 128-bit). This can be an obstacle in applications as well. Try doing a tri-vector operation such as a[]*b[]+c[] (like the loop sketched below) and you'll see what I mean. Lastly, scalability with multiple cores is another hindrance. This can be limited by your northbridge or by the memory topology/interconnect used to glue all the processors together.
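The tri-vector operation mentioned above is just this kind of plain C loop - three loads and one store per element, so the per-cycle L1 load rate becomes the limit long before the floating-point units do:

void triad(int n, double *restrict d, const double *restrict a,
           const double *restrict b, const double *restrict c)
{
    /* d[i] = a[i]*b[i] + c[i]: 3 loads, 1 store, 2 flops per element */
    for (int i = 0; i < n; i++)
        d[i] = a[i] * b[i] + c[i];
}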
HPC math libraries are very important, but I just wanted to temper their performance with other considerations which are equally and in some cases more important.
Tim Wilkens, Ph.D.
Hi Tim,
I don't have access to Nehalem-based hardware, nor Shanghai, only to Harpertown-based Xeons on 2S systems, so I can't run the tests myself. I have contacted Kazushige to check if he can provide more details on how he ran the benchmarks.
Let me say that you did an excellent job in tuning ACML to achieve that kind of efficiency. I didn't take 90% as a bad result, just not the "current" best.
The 256KB L2 is indeed small in Nehalem, but it has lower latency, 11 vs 15 cycles, and memory access latency is also lower, ~40 vs ~60, which puts AMD's offerings in a tough spot. As you say, cache blocking plays a big role in Linpack, and here it seems that 256KB is not a limitation - other applications will surely be different.
I would say, without being 100% sure, that most codes people run don't care about 512KB vs 256KB: they are so badly written that what dictates performance is the bandwidth and latency of the caches and memory subsystem, where (IMHO) AMD will be trailing Intel once Gainestown comes out - the higher clock speed will also help mitigate the smaller L2.
Most people I talked with - who have access to both Penryns and Barcelonas - say that although the architecture is horrible - and I agree completely - the higher clock and more efficient execution units of the Core-based Xeons pay off, even though you have to run some codes without using all 8 cores of a 2S system at the same time, more like 6 or 7. This is related to memory bandwidth limitations; as you mention, things aren't "glued" right in those platforms. That has been solved with QPI, which in most situations puts Intel's Nehalem at parity. Dropping FB-DIMMs for RDDR3 will also bring 2-3x the aggregate bandwidth that Xeons currently have, and that is something most people in the HPC crowd are really excited about - most notably, some are already using Core i7s in Beowulf-type clusters with good results.
I hate that I have to work with Harpertown Xeons, but due to timing issues I couldn't have Barcelonas (which I would *very* much rather have) in the cluster where I do performance tuning. Today the situation is different and seems, by most accounts, to weigh in Intel's favor.
If you can, do send me your results from your DDR2-800 system; I look forward to comparing them and posting them here. I'll post benchmark details once Kazushige shares them with me.
Best regards