Nvidia Announces Tesla 20 Series

Nvidia unveils more details about its number-crunching baby, and we take a deeper look inside the chip.

The following new specs were released:

520-630 GFLOPS/s
Total Memory: 3 or 6 GiB GDDR5, 2.625 or 5.25 GiB when using ECC
Power consumption: 190W
Two models: Tesla C2050 and Tesla C2070
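
The 3 GiB and 2.625 GiB pair in the list is what you get if ECC reserves one eighth of the memory for check bits. The sketch below only illustrates that arithmetic; the one-eighth fraction is my assumption inferred from those two figures, and the function name is made up.

```python
def usable_with_ecc(total_gib, ecc_fraction=1 / 8):
    """Memory left for user data, assuming ECC reserves a fixed fraction.

    ecc_fraction is an assumption inferred from the 3 GiB -> 2.625 GiB
    pair in the spec list, not a number stated in the announcement.
    """
    return total_gib * (1 - ecc_fraction)


print(usable_with_ecc(3))  # 2.625
```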

ECC is optional, which is great as far as I'm concerned. Power consumption is not bad and the desktop cards should be similar. Since the card seems to have rendering hardware enabled (DVI is there), power consumption should only be lower on cards inside the Tesla S 1U part.
We do have an interesting number: 520-630 GFLOPS/s in double precision, a disappointing number for someone who only looks at theoretical performance. Single precision performance will be lower than ATI's; my predictions point to 1.66 TFLOPS/s, and this number is detailed further down. AMD cards can do 544 GFLOPS/s in double precision, due to the way the shader processors are designed:

AMD hardware can't do two double precision FP MAD operations per clock, only one, so what you get is 850 MHz * 320 SPs * 2 ops (MAD = ADD+MUL) = 544 GFLOPS/s. A good number; it's a shame that the hardware is capped. Don't take AMD's theoretical capabilities to be reflective of relative performance. Nvidia, among other things, has a bandwidth advantage of 50% with "Fermi".
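
For anyone who wants to replay the AMD figure, here is the same multiplication as a small sketch. The helper name and its parameters are mine; the inputs are the ones quoted in the paragraph above.

```python
def peak_gflops(clock_mhz, units, flops_per_unit_per_clock):
    """Theoretical peak in GFLOPS: clock * units * FLOPs per unit per clock."""
    return clock_mhz * units * flops_per_unit_per_clock / 1000.0


# 850 MHz, 320 double precision capable SPs, one MAD (2 FLOPs) per clock
amd_dp = peak_gflops(850, 320, 2)
print(amd_dp)  # 544.0
```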

Let's move along to a more detailed explanation of Nvidia's architecture, to see where that number, 1.66 TFLOPS/s, comes from.

Third Generation Streaming Multiprocessor (SM)
32 CUDA cores per SM, 4x over GT200
8x the peak double precision floating point performance over GT200
Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
64 KB of RAM with a configurable partitioning of shared memory and L1 cache

This is the core of the new Tesla cards: 16 shader multiprocessors (SMs) with 32 cores each. These cores are more tightly coupled together than in previous architectures, where Nvidia grouped SMs two (G80) or three (GT200) to a thread processor cluster (TPC):

Picture courtesy of Anandtech

"Fermi" looks more like this:

The scheduler can dispatch two independent warps (a warp is a group of 32 threads from the same thread block) at a time, each to one of the groups of 16 cores, the SFUs or the load/store units. Previous hardware dispatched half-warps and ran them twice on the SPs, once per clock cycle for each set of 8 threads in a half-warp. The same happens in the new hardware, but only full warps are scheduled. Nvidia had already hinted this would happen in future hardware when it referred to warps in some parts of the CUDA programming guide.
The hardware sends a warp to each group of SPs and they process 32 threads over two clocks, 16 per clock.
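
To make the "32 threads over two clocks" point concrete, here is a toy model of how a batch of threads is spread over clock cycles on a group of cores. It is only an illustration of the counting, not of how the real scheduler is built; the function and constant names are made up.

```python
WARP_SIZE = 32        # threads in a warp
CORES_PER_GROUP = 16  # cores in one "Fermi" execution group


def issue(threads, cores=CORES_PER_GROUP):
    """Split a batch of thread ids into per-clock slices of `cores` threads."""
    return [threads[i:i + cores] for i in range(0, len(threads), cores)]


# "Fermi": a full warp on a group of 16 cores takes two clocks
fermi_clocks = issue(list(range(WARP_SIZE)))
print(len(fermi_clocks))  # 2

# Older hardware: a half-warp (16 threads) on 8 SPs also takes two clocks
older_clocks = issue(list(range(WARP_SIZE // 2)), cores=8)
print(len(older_clocks))  # 2
```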

Theoretical peak performance on Nvidia's hardware can only be achieved by counting the SFUs as units capable of 4 single precision FLOPs per clock each (MUL or MAD, but not FMAD). This is due to the flexibility of the interpolation hardware, which allows this to happen. Assuming that SFUs are unchanged from the GT200 design (edit: they are, see the bottom for updates on specs), that equals:

(32 SPs * 2 (FMAD) + 4 * 4 SFUs) * 16 (SMs) * 1.296 GHz (core clock) = 1.66 TFLOPS/s
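
The same estimate as a small sketch, so the individual terms can be changed and replayed. The function name and parameter names are mine, the defaults are the values assumed in this post (including the 1.296 GHz clock, which is my estimate rather than a published spec).

```python
def fermi_sp_peak_gflops(clock_ghz=1.296, sms=16, sps_per_sm=32,
                         flops_per_sp=2, sfus_per_sm=4, flops_per_sfu=4):
    """Single precision peak in GFLOPS, counting SPs plus SFUs per SM."""
    flops_per_sm_per_clock = sps_per_sm * flops_per_sp + sfus_per_sm * flops_per_sfu
    return flops_per_sm_per_clock * sms * clock_ghz


with_sfus = fermi_sp_peak_gflops()
sps_only = fermi_sp_peak_gflops(flops_per_sfu=0)
print(round(with_sfus, 2))  # 1658.88, i.e. about 1.66 TFLOPS
print(round(sps_only, 2))   # 1327.1, the figure if SFUs are not counted
```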

This is possible because the SFUs can be scheduled in parallel with the SPs. The clock frequency is obtained from the number of double precision operations that we know the hardware supports:

630 GFLOPS / (256 ops * 2 (FMA)) = 1.23 GHz ~ 1.296 GHz, a number more common in Nvidia's graphics cards due to the clock generators generally used. The C2050 card will work at a slower ~1020 MHz.
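
Here is the clock derivation as a sketch, run for both ends of the announced 520-630 GFLOPS range. The helper name is made up; the 256 double precision FMAs per clock is the figure used in the division above.

```python
def clock_ghz_from_dp(dp_gflops, dp_fma_per_clock=256, flops_per_fma=2):
    """Core clock in GHz implied by a double precision peak figure."""
    return dp_gflops / (dp_fma_per_clock * flops_per_fma)


print(clock_ghz_from_dp(630))  # 1.23046875, the faster part
print(clock_ghz_from_dp(520))  # 1.015625, roughly the ~1020 MHz part
```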
If the SFUs are upgraded to process more single precision FLOPs per clock, the hardware may be capable of 1.89 TFLOPS/s, assuming the same proportion as in the GT200. This doesn't seem likely, which would explain why Nvidia released just the double precision performance (slightly better than AMD's): otherwise they would risk having some people jump the fence to the other green side. The AMD card is capable of 64% more theoretical performance, at 2.72 TFLOPS/s. Remember, ATI throttles their new cards to an almost ridiculous extent, so it's very likely that such a peak is like the Pentium 4's performance: an order from the marketing department.
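
The relative figures can be checked with a tiny helper. The function name is mine; the inputs are the peak numbers quoted in this post.

```python
def advantage_percent(a, b):
    """How much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# Single precision: AMD at 2.72 TFLOPS against the 1.66 TFLOPS estimate
print(round(advantage_percent(2.72, 1.66)))  # 64

# Double precision: Nvidia's 630 GFLOPS against AMD's 544 GFLOPS
print(round(advantage_percent(630, 544)))    # 16
```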

The apparently low theoretical peak says nothing about the performance of the card in games. HPC users already know what they're getting, but we don't know anything about the hardware dedicated to graphics rendering that is not at the core. We know Nvidia has bandwidth on its side, and I believe that will help Nvidia regain the single-GPU graphics card performance crown. At what cost and when, that is the real question. Two or three months from now, a properly overclocked AMD Radeon card may make for a very disappointing launch.

My estimate is that the card will be 50-60% faster than a GTX 285 (updated on November 19th). The "Fermi" desktop card will enjoy a bandwidth advantage of 50% and 56% more SP processing power, assuming the conservative figure of 1.66 TFLOPS/s. This increase in bandwidth and compute performance will have to be accompanied by similar increases in texture and render units, but the new cache hierarchy may also yield some extra performance that's hard to predict without a disclosure of the full architecture.
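
As a rough check of the 56% figure, the sketch below compares the 1.66 TFLOPS estimate with a GTX 285 peak. The GTX 285 inputs (240 SPs, 3 FLOPs per SP per clock, 1.476 GHz shader clock) are not given anywhere in this post; they are the commonly quoted values and should be read as my assumption. The helper name is made up.

```python
def peak_gflops(units, flops_per_unit_per_clock, clock_ghz):
    """Theoretical peak in GFLOPS."""
    return units * flops_per_unit_per_clock * clock_ghz


# Assumed GTX 285 figures (not from this post): 240 SPs, MAD+MUL, 1.476 GHz
gtx285 = peak_gflops(240, 3, 1.476)
fermi_estimate = 1660.0  # the conservative 1.66 TFLOPS figure, in GFLOPS

gain_percent = (fermi_estimate / gtx285 - 1.0) * 100.0
print(round(gtx285, 2))     # 1062.72
print(round(gain_percent))  # 56
```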

The Radeon 5970 is out today. Nvidia will have to hurry up or be forced to move to a better place.

Update: the C2050 and C2070 are cut down to 448 shaders. Clocks are higher than the ones calculated above with 512 shaders, so the theoretical performance figures still apply.
