Graphics Cards

Nvidia GT300 "Fermi" Architecture Unveiled

Image courtesy of bit-tech.net

Yesterday, at Nvidia's new GPU Technology Conference, the new Tesla cards were shown and the chip's architecture was revealed. The focus is again on CUDA.

The new chip is huge:


It's expected to be close to 500 mm², while AMD's Radeon 5800 series weighs in at 334 mm². It is manufactured on TSMC's 40 nm process, but we still don't know any specifications regarding core, shader or memory clocks. What we do know is that the card has a 384-bit memory bus, which at equal memory speed would deliver 50% more bandwidth than the 256-bit bus of AMD's Radeon 5870. The presentation was mostly meant to scare off potential Radeon 5000 buyers and to address the main target of this new architecture: High Performance Computing.
The card shown by Huang, by contrast, is smaller than current-generation high-end cards. This isn't strange: Tesla cards have always had lower power consumption because the parts of the die dedicated to graphics rendering are disabled, hence the smaller form factor.
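
The 50% bandwidth figure is easy to check on the back of an envelope. The sketch below assumes both cards run their memory at the same effective data rate. The 256-bit bus and 4.8 GT/s rate are the Radeon 5870's published figures; the rate used for the new chip is only a placeholder, since Nvidia has not announced memory clocks. The function name is my own.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gt_s):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gt_s

# Radeon 5870: 256-bit bus, 4.8 GT/s effective GDDR5 data rate.
radeon_5870 = peak_bandwidth_gb_s(256, 4.8)

# Assumption: the new chip's memory runs at the same data rate (not announced).
fermi_guess = peak_bandwidth_gb_s(384, 4.8)

advantage = fermi_guess / radeon_5870 - 1
print(f"Radeon 5870: {radeon_5870:.1f} GB/s")
print(f"384-bit bus at the same rate: {fermi_guess:.1f} GB/s")
print(f"Advantage: {advantage:.0%}")
```

If the new card ships with slower memory than the 5870, the gap shrinks accordingly.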

There's no doubt about the HPC/CUDA focus; just look at what the new architecture brings:
  • ECC support everywhere: registers, memory bus, memory and caches. This is straight from Jen-Hsun Huang's mouth as he presented the chip. R800 chips from AMD only have ECC for memory bus errors.
  • 8x the double-precision floating-point performance of GT200, reaching half of peak single-precision performance.
  • Unified address space, which enables support for object-oriented programming (C++).
  • L1 cache and shared memory are interchangeable: either one can be 16 KiB with the other at 48 KiB, as the programmer chooses.
  • Addition of a 768 KiB L2 cache (the blue area in the middle of the chip), shared by all cores.
  • The 384-bit memory bus ensures further scalability over the previous generation, in compute but also in games. (AMD forgot this and just touts peak performance numbers.)
  • Can execute more than one compute kernel from the same context at the same time.
  • Can switch between contexts (full programs) up to 20x faster than the GT200 architecture.
  • Full IEEE 754-2008 compliance for all floating-point calculations.
  • 64-bit memory addressing, up from 32-bit (4 GiB), although limited to 1 TiB for now.
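
The addressing and cache numbers in the list can be sanity-checked with a few lines of arithmetic. This is only a sketch: the 40-bit figure is my own derivation from the 1 TiB limit, not something stated in the presentation, and the 64 KiB pool is what the 16/48 figures suggest if both sizes come out of the same memory. All names are mine.

```python
KIB = 2**10
GIB = 2**30
TIB = 2**40

def addressable_bytes(address_bits):
    """Bytes reachable with a flat address of the given width."""
    return 2**address_bits

def bits_needed(size_bytes):
    """Smallest address width that can reach every byte of size_bytes."""
    return (size_bytes - 1).bit_length()

# 32-bit addressing tops out at 4 GiB, as on GT200.
gt200_limit = addressable_bytes(32)

# Derived, not from the presentation: a 1 TiB limit corresponds to 40 address bits.
fermi_bits_in_use = bits_needed(1 * TIB)

# The two L1/shared-memory splits; both add up to the same pool (assumed shared).
splits = [(16 * KIB, 48 * KIB), (48 * KIB, 16 * KIB)]
pool_sizes = {l1 + shared for l1, shared in splits}

print(gt200_limit // GIB, "GiB with 32 bits")
print(fermi_bits_in_use, "bits for 1 TiB")
print([size // KIB for size in pool_sizes], "KiB pool per split")
```
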
The addition of an L2 cache is major: it brings general-purpose programming on GPUs to a level not possible before. Memory alignment requirements are relaxed, and you can actually do some efficient programming of short loops and some serial code without having to move data back to the CPU. On GT200, running serial code on the GPU is as costly as moving data back to the CPU, mostly due to the memory controller design. Some pieces of code in an algorithm can't be parallelized, and previous generations were very bad at dealing with such code. Moving the data back to the CPU to run them is the last thing you want to do, and it is a big limitation on general-purpose GPU computing.
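
To see why the serial part matters so much, here is a toy cost model. Every number in it is hypothetical and picked only for illustration; none comes from Nvidia or from measurements, and the function is my own.

```python
def total_time(serial_s, parallel_s, gpu_speedup, serial_penalty):
    """Run time when the parallel part is sped up and the serial part is slowed down.

    serial_s, parallel_s: CPU-only times of the two parts, in seconds.
    gpu_speedup: how much faster the GPU runs the parallel part.
    serial_penalty: how much slower the serial part gets, whether from
        running it on the GPU or from copying data back to the CPU.
    """
    return serial_s * serial_penalty + parallel_s / gpu_speedup

# Hypothetical workload: 5 s serial, 95 s parallel on the CPU alone.
cpu_only = total_time(5.0, 95.0, gpu_speedup=1.0, serial_penalty=1.0)

# Hypothetical GPU that is 50x faster on parallel code but 10x slower on serial code.
bad_serial = total_time(5.0, 95.0, gpu_speedup=50.0, serial_penalty=10.0)

# Same GPU with the serial penalty cut to 2x.
better_serial = total_time(5.0, 95.0, gpu_speedup=50.0, serial_penalty=2.0)

for label, t in [("CPU only", cpu_only), ("10x penalty", bad_serial), ("2x penalty", better_serial)]:
    print(f"{label}: {t:.1f} s, overall speedup {cpu_only / t:.1f}x")
```

In this model the serial 5% dominates the run time as soon as it carries a large penalty, so reducing that penalty helps far more than adding parallel throughput.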

I personally don't care about ECC: I've done some scientific programming on GPUs and noticed no problems, even with G92-based GPUs. I'm sure someone does care, though, as ECC is a requirement for most HPC machines, and for those Nvidia has it covered.
IEEE 754-2008 compliance is very important, and another hindrance that has been removed. The previous architecture only allowed full compliance on 64-bit calculations, and no one wants that: losing half the performance, even on this architecture, is too much. Double precision will be used strictly where needed, which was the main reason for including some DP units in the GT200 architecture. Sadly, GT200 lacked full precision on 32-bit calculations, which complicated comparing results from CPUs with results from GPUs, among other problems.
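
To give an idea of why 32-bit rounding differences make CPU and GPU results hard to compare, here is a small sketch. It does not reproduce GT200's actual rounding behavior; it only shows that the same multiply-add gives different 32-bit answers depending on whether the intermediate product is rounded. The helper `to_f32` and the input values are my own choices.

```python
import struct

def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = to_f32(1 + 2**-12)
c = to_f32(-(1 + 2**-11))

# Variant 1: round the product to 32 bits, then add and round again.
separate = to_f32(to_f32(a * a) + c)

# Variant 2: keep the exact product, add, and round once at the end
# (what a fused multiply-add does). The double-precision arithmetic here
# happens to be exact for these particular inputs.
fused = to_f32(a * a + c)

print("separate rounding:", separate)
print("single rounding:  ", fused)
```

Both answers are defensible on their own, which is exactly the problem when you try to validate a GPU result against a CPU reference bit for bit.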

Just like the GT200 chip, this new architecture is heavily built towards HPC, a market Nvidia wants to address all by itself. The compute power is there, and the architecture is more refined than ever. This focus, however, may hurt Nvidia's ability to compete in the graphics card market, just as with the GT200. The architectural changes made to support these applications increased die size and took up space that other features, useful for gaming, could have used. It remains to be seen if Nvidia will be able to compete with AMD on price and performance. I do expect the chip to perform at least 60% faster than the GT200 at the same clock, possibly more due to the addition of the L2 cache.
The Radeon 5870X2 is looking very able to fend off Nvidia's new card, although with all the problems that come from using multi GPU setups.
Once "Fermi" is released, the crown of fastest single card on the market will be very tough to take from Nvidia. Still, AMD might be able to strike back with a cheaper, updated R800 card, just like the Radeon 4890. Timing is everything and, despite the somewhat underperforming 5870, AMD has the advantage right now.
