AMD's "Bulldozer" And AMD's Take On Threading

AMD has revealed some details about its next-generation microarchitecture, codenamed "Bulldozer". From the looks of it, this architecture has been on the drawing board for years and could have been the source of the rumors about the K9 processor and its cancellation.

As we can see from the above picture, AMD's "Bulldozer" core will feature two integer clusters per core. This will increase multithreading capability at the cost of 5% extra die space. Both clusters are fed from the instruction decoder, which chooses the path each instruction takes: to the FPU or to one of the integer clusters. Each cluster has its own L1 data cache to provide data locality, since each cluster will be exposed to the OS as a processor, but the two share the FPUs.
AMD has not opted to implement simultaneous multithreading (SMT, as in Intel's HyperThreading) in the new design, and has instead chosen to provide dedicated execution hardware. This move should yield a bigger performance increase than SMT would, since the second thread gets its own integer execution resources instead of competing for shared ones. AMD is targeting an 80% improvement in integer performance per core:

This slide shows a 50% increase in die area, but that value applies only to the cores. Most of the die today is taken up by caches, both L2 and L3:

This is the six-core "Istanbul" Opteron. Increasing the total die size by 50% just by doubling the L1 data caches per core and adding some extra hardware would be quite a feat, so the 5% figure sounds about right. Also remember that the L3 takes up an even bigger part of the die in designs not limited by die size, as is the case with the regular quad-core Opteron and Phenom II.
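As a rough back-of-the-envelope check on how the 50% and 5% figures can both be true, the sketch below assumes total die growth is simply the grown block's share of the die multiplied by that block's own growth. The function names are made up for illustration, and the resulting share is only what the two quoted figures imply, not a number AMD has published.

```python
def total_die_growth(block_fraction: float, block_growth: float) -> float:
    """Growth of the whole die when one block, occupying `block_fraction`
    of the die, grows by `block_growth` and everything else stays the same."""
    return block_fraction * block_growth


def implied_block_fraction(total_growth: float, block_growth: float) -> float:
    """Share of the die the grown block must occupy for both figures to hold."""
    return total_growth / block_growth


# Figures quoted in the article: 50% growth of the core logic, 5% of the die.
core_growth = 0.50
die_growth = 0.05

core_share = implied_block_fraction(die_growth, core_growth)
print(f"Implied share of the die that grows: {core_share:.0%}")
print(f"Check, total die growth: {total_die_growth(core_share, core_growth):.0%}")
```

Under that assumption, the logic being duplicated would account for roughly a tenth of the die, with the remainder being caches and other shared hardware.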

AMD evidently doesn't believe that SMT will make a difference in the coming years, and I agree with them when it comes to desktop processors. On servers there are different motivations when designing processors, and Sun's Niagara-based T2+ is a prime example of that. Sun's chip is capable of 8 threads per core using SMT, and is engineered to extract the best possible performance per mm² and per watt. The T2 and T2+ are similar to what Nvidia has built with the "Fermi" architecture, but to a lesser extent: while Nvidia has enough parallelism available in graphics to build a core that can execute a different thread each clock (~22 at the same time, per SM, in the GT200), it doesn't make much sense from the software side for Sun to go that far. The T2 also has bigger caches available to hide latency (its pipeline is also smaller).
Since the amount of hugely parallel software is still small in desktop applications, it doesn't make much sense to move desktop processors to simpler designs that rely heavily on SMT as a way to hide latency. If it could do that, AMD would be able to remove the big chunks of hardware needed to support a core built for very efficient instruction-level parallelism, like today's Phenom II and Core i5/i7, and could target an even more parallel design for "Bulldozer" CPUs. Instead of 8 cores and 16 threads, imagine 32 cores and at least 128 threads per CPU, something more akin to today's GPUs, which is where we're heading in the future.
Today AMD can opt to increase the die area used per core. On the day that process shrinks become even harder than they already are, and die size becomes an even bigger issue than power, be sure we'll see the mass adoption of simpler, highly threaded cores. For today it is the right move, as are the current out-of-order designs and big caches.

The floating point capabilities of the design are well covered by two 128-bit FPUs capable of executing the upcoming 256-bit AVX vector extensions created by Intel. Since the FPU is capable of performing one 256-bit FMAC (or FMAD) per clock, that equates to 256 bits * 2 (MADD) / 32 bits per float = 16 single precision FLOPs/clock per module (each with two integer cores). That equates to 64 single precision FLOPs per clock cycle in a four-module, eight-core design. At a "safe" 3GHz clock, such a design would be capable of 192 GFLOPS, a bit more than half that of Nvidia's G80 GPU, and up from the 124.8 GFLOPS of the highly capable six-core "Istanbul" at 2.6GHz.
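The arithmetic above can be reproduced with a small sketch. The helper name `peak_sp_gflops` is made up for illustration. The figure of 8 single precision FLOPs per clock per "Istanbul" core is not stated in the text: it is what the quoted 124.8 GFLOPS at 2.6GHz across six cores implies. These are theoretical peaks, not measured results.

```python
def peak_sp_gflops(units: int, flops_per_unit_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak single precision GFLOPS: units * FLOPs/clock * GHz."""
    return units * flops_per_unit_per_clock * clock_ghz


VECTOR_BITS = 256    # one 256-bit FMAC per clock per module
FLOAT_BITS = 32      # single precision
FLOPS_PER_FMAC = 2   # a fused multiply-add counts as two operations

# 256 bits * 2 / 32 bits per float
flops_per_module = VECTOR_BITS * FLOPS_PER_FMAC // FLOAT_BITS

# Four-module (eight-core) "Bulldozer" at the article's "safe" 3GHz
bulldozer = peak_sp_gflops(units=4, flops_per_unit_per_clock=flops_per_module, clock_ghz=3.0)

# Six-core "Istanbul" at 2.6GHz; 8 FLOPs/clock per core is an inferred assumption
istanbul = peak_sp_gflops(units=6, flops_per_unit_per_clock=8, clock_ghz=2.6)

print(f"FLOPs/clock per module: {flops_per_module}")
print(f"Bulldozer, 4 modules @ 3.0GHz: {bulldozer:.1f} GFLOPS")
print(f"Istanbul, 6 cores @ 2.6GHz: {istanbul:.1f} GFLOPS")
```

Changing `clock_ghz` or the module count shows how directly the peak figure scales with either, which is why the 3GHz clock assumption matters so much to the comparison.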

While we haven't looked at Intel's forthcoming new microarchitecture, rest assured that AMD has a very efficient, well-thought-out core on its hands.
The only lingering question is whether the timeframe can be met and whether 2011 is an appropriate release date. Intel seems well underway with its next design, and it has a foothold on the high-end desktop that will last for the coming year.

