CUDA, AMD Stream SDK And Unexpected Transcoding Results


Both AMD and Nvidia have promised that GPUs would accelerate everyday computing tasks. More than anything, it's an effort to sway people away from relying on the processor, a market where Intel is dominant and where AMD will profit either way.
Most of the consumer-grade software that has become available has failed to deliver either proper performance or proper quality when compared with software that runs on the CPU. One such case is that of video transcoding applications.

Xbitlabs recently compared AMD's most recent integrated chipset - the 785G - to Intel's best, the G45. This is what they found when testing transcoding:


The performance benefit from using ATI Stream is close to 40%, which is a very good result considering that the AMD machine is already running a Phenom II X2 550 clocked at 3.1GHz.
The IGP doesn't have more bandwidth than the CPU to begin with (it actually has less), so it has to rely on the raw processing power of the 40 shader processors embedded inside. Those 40 shaders are really eight 5-way SIMD processors, similar to the cores in the Phenom II but without advanced branch prediction and some of the other more complex machinery, running at a lower clock, yet with the same SIMD capabilities. If one were to classify the Phenom II as a shader processor, it would be roughly equivalent to 8 SPs (two cores, each capable of 128-bit, i.e. 4x32-bit, floating-point operations per clock).
As for processing power, the PII X2 550 delivers around 24 GFLOPS of peak throughput (3.1GHz * 2 cores * 128/32 lanes), while the 785G is capable of 40 GFLOPS (500MHz * 40 SPs * 2 ops) once you factor in the FMADD capability of the shader cores, which can perform a multiply and an add in the same clock. That amounts to roughly 67% more processing power than the Phenom II, which is about what the real figures translate to - a 40% gain isn't a bad result at all.
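
As a quick sanity check on that arithmetic, here is a minimal sketch in C that simply multiplies out the peak-throughput figures from the numbers quoted above (clocks, unit counts and operations per clock are the ones stated in this post, not vendor datasheet values):

#include <stdio.h>

/* Peak throughput in GFLOPS = clock (GHz) * number of execution units * FLOPs per unit per clock. */
static double peak_gflops(double clock_ghz, int units, int flops_per_unit_per_clock)
{
    return clock_ghz * units * flops_per_unit_per_clock;
}

int main(void)
{
    /* Phenom II X2 550: 3.1GHz, 2 cores, 128-bit SIMD = 4 single-precision ops per clock. */
    double cpu = peak_gflops(3.1, 2, 4);
    /* 785G IGP: 500MHz, 40 shader processors, FMADD = 2 ops per clock. */
    double igp = peak_gflops(0.5, 40, 2);

    printf("Phenom II X2 550 peak: %.1f GFLOPS\n", cpu);  /* ~24.8 */
    printf("785G IGP peak:         %.1f GFLOPS\n", igp);  /* 40.0  */
    return 0;
}

Compile it with any C compiler (e.g. gcc peak.c && ./a.out) and it prints roughly 24.8 GFLOPS for the CPU and 40 GFLOPS for the IGP.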
The reviewer mentions that CPU utilization is close to zero during transcoding with ATI Stream, so it's obvious that there's no cooperation between the CPU and the GPU. The GPU isn't merely helping; it's doing all the work.

This kind of performance gain is great if one wants to keep power consumption low or only has a slow CPU, as in sub-laptop machines. AMD currently only delivers 690G-based sub-laptops, but has announced Stream-capable 780G variants to succeed the likes of the HP DV2 and Gateway's Athlon Neo-powered sibling.

The big problem with video transcoding on the GPU is the unexpected quality issues. Video transcoded on the CPU always comes out with different, and better, quality. Artifacts and other kinds of problems pop up when using GPU transcoding software. Why is this?

One of the main constraints when performing calculations on GPUs is that you must use single precision (32 bits) to achieve decent performance gains over the CPU. There's a problem with GPUs though: they aren't compliant with the IEEE 754 standard for floating-point arithmetic.
Usually the results don't differ enough to be significant, but sometimes they can be quite bad for a given application. The problem may come from the fact that some rounding modes aren't supported by either Nvidia's or ATI's GPUs, or from what happens internally with regard to the number of bits used to store intermediate results. From my experience, the GPU result tends to differ from the CPU result only in the last few bits of the 32-bit floating-point format, which is usually negligible but may be significant in some cases. Neither AMD nor Nvidia fully documents what happens inside the hardware, so figuring out what goes on is more of a guessing game than anything else.
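
To make this concrete, here's a small C sketch - my own illustration, not code from any transcoder - showing how the same single-precision sum drifts away from a double-precision reference simply because it is accumulated in a different order, which is exactly what happens when work is split across many shader processors:

#include <stdio.h>

#define N 10000000

int main(void)
{
    double ref = 0.0;            /* reference sum, accumulated in double precision */
    float  seq = 0.0f;           /* single precision, sequential (typical scalar CPU loop) */
    float  partial[8] = {0.0f};  /* single precision, 8 partial sums (GPU-style reduction) */

    for (int i = 0; i < N; i++) {
        float x = 1.0f / (float)(i + 1);  /* a series of progressively smaller terms */
        ref += (double)x;
        seq += x;
        partial[i % 8] += x;
    }

    float par = 0.0f;
    for (int i = 0; i < 8; i++)
        par += partial[i];

    printf("double reference      : %.7f\n", ref);
    printf("float, sequential     : %.7f\n", seq);
    printf("float, 8 partial sums : %.7f\n", par);
    return 0;
}

The three results agree only in the first few digits; neither float result is "wrong", they just round differently along the way. Add unsupported rounding modes and undocumented intermediate widths on top of that, and two correct-looking pipelines can diverge enough to show up as visible artifacts.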

On the other hand, starting with the Radeon HD 3870 (RV670) and Nvidia's GeForce GTX 200 series (GT200), both companies support IEEE 754 compliant 64-bit floating-point arithmetic. The problem is that the 32-bit implementation hasn't been fixed yet and remains a hurdle when porting some code. Using 64-bit double-precision floating point is possible in some algorithms, but the performance drop makes it impossible to use efficiently in these kinds of applications. While CPUs drop to half of what they can do in 32 bits, the GT200 chip drops to one tenth of its performance and ATI to approximately one fifth. Nvidia added dedicated 64-bit ALUs - one per multiprocessor, i.e. one for every eight shader processors - while AMD reuses the 128-bit SIMD cores for the job, processing two doubles per clock, although the transcendental ALU can't handle doubles and the SIMD core can't do a double-precision FP MADD.
Since ATI has fewer but wider shader processors, the RV670 and later chips see a smaller drop in performance than Nvidia's cards. It's too big of a drop either way, so there's not much hope for doubles, at least for video transcoding. Some scientific algorithms do use them, but typically in something like 1% or less of all operations performed, so performance doesn't suffer considerably.
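
As a rough back-of-the-envelope model - my own simplification, not something from the review - you can treat this like Amdahl's law: if a fraction f of the operations runs r times slower in double precision, overall throughput falls to 1 / ((1 - f) + f*r) of the all-single-precision rate. A quick sketch in C:

#include <stdio.h>

/* Effective throughput (relative to all-single-precision) when a fraction
 * dp_fraction of the operations runs dp_slowdown times slower in double precision. */
static double effective_speed(double dp_fraction, double dp_slowdown)
{
    return 1.0 / ((1.0 - dp_fraction) + dp_fraction * dp_slowdown);
}

int main(void)
{
    /* 1% doubles on a chip where DP is 10x slower than SP (GT200-like ratio). */
    printf("1%% DP at 1/10 speed  : %.2fx of SP throughput\n", effective_speed(0.01, 10.0));
    /* 1% doubles where DP is 5x slower (RV670/RV770-like ratio). */
    printf("1%% DP at 1/5 speed   : %.2fx of SP throughput\n", effective_speed(0.01, 5.0));
    /* A transcoder doing everything in doubles on the GT200-like chip. */
    printf("100%% DP at 1/10 speed: %.2fx of SP throughput\n", effective_speed(1.0, 10.0));
    return 0;
}

With 1% of operations in double precision the penalty is only a few percent, which is why those scientific codes get away with it; a transcoder that moved everything to doubles would run at a tenth of the speed.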

The software itself is also to blame, not just the cards: Nvidia's software usually produces better results than equivalent AMD Stream-capable transcoding software, although the single-precision hardware implementations aren't considered meaningfully different between the two vendors - both comply with what the DirectX and OpenGL standards demand for shader processing.

In the end, as long as both companies fail to fix the lack of IEEE 754 compliance in single precision, there will be applications that always produce bad or strange results. AMD seems to have essentially glued two RV770 chips together for the Radeon HD 5800 series, so it will be hard to find the proper silicon there. Nvidia, being more committed to GPGPU initiatives - given its lack of an x86 license and the impending chipset and Larrabee "apocalypse" - will be pushing harder on that front, and we may see it fixed in the GT300 - I wouldn't bet on it though.
