Processors, Software

Optimizing your Intel Atom platform


Ever since Intel published the Atom specifications, one thing has struck me about its in-order pipeline: it can be exploited further in software, recovering some of the performance lost by not having an out-of-order architecture.
A 1.6GHz core that really performs like a typical 1.6GHz desktop CPU, at 4W of power consumption, would be something to take seriously.


In-order processors aren't a bad thing, as most game consoles will show you. The PS3 and Xbox 360 both use in-order cores, and the Wii possibly does too (not many details are available about "Broadway"). That is the sensible option to take, since you throw a lot of chip complexity out the window, especially when you control the software "ecosystem" and how it is compiled. A proper compiler can predictably schedule operations at compile time, bypassing most of the need for instruction re-ordering at runtime. This is possible because, since you're compiling code for precisely one architecture, you know how many cycles a cache access takes, how long the pipeline is, etc. You don't have to deal with many different variations, so you don't have to compile generically - that knowledge moves from the runtime re-ordering hardware to the compiler.
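To make the scheduling idea concrete, here is a minimal sketch in C of the kind of transformation a compiler (or a programmer) can apply for an in-order core. The function names are made up for illustration, and the actual gain depends on the instruction latencies of the specific CPU. The first version forms one long dependency chain; the second interleaves two independent chains so the pipeline has something to issue while the previous result is still in flight.

```c
#include <stddef.h>
#include <stdint.h>

/* Chained form: every accumulation depends on the previous one, so an
   in-order core has to wait for each result before it can continue. */
int64_t dot_chained(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return acc;
}

/* Interleaved form: two independent accumulators give the pipeline two
   separate dependency chains to alternate between. Integer arithmetic is
   used so that both versions return exactly the same value. */
int64_t dot_interleaved(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc0 = 0, acc1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += (int64_t)a[i] * b[i];
        acc1 += (int64_t)a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        acc0 += (int64_t)a[i] * b[i];

    return acc0 + acc1;
}
```

An out-of-order core can discover this kind of overlap by itself at runtime; on an in-order core it has to be present in the instruction stream already, which is exactly the work an architecture-aware compiler is expected to do.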

The Atom isn't Intel's return to in-order architectures; that was the Itanium. Being a VLIW-based architecture, it can, and should, rely heavily on the compiler to optimize performance, and so it does. Of course, in the "Wintel" world, where one binary has to run on many different CPUs, this is not possible, and out-of-order execution is king. The latest proof of that is Via's switch to this type of architecture with the Nano. It all depends on the market you're competing in.

Of course, compile-time scheduling will not always close the gap: an out-of-order CPU will always keep some advantages, especially in the more recent designs, e.g. when accessing memory.

But, back to the Atom:
Besides using HyperThreading, we can exploit that extra performance and hide the additional latency by using more recent compilers together with open-source software.
If you have picked up an Atom system that already comes with Linux, all you have to do is recompile the most demanding applications with a proper compiler for the Atom, currently Intel's C compiler (GCC still doesn't support it).
Using at least ICC 10.1 with the -xL flag, you can compile software with optimizations for the Atom's in-order architecture.
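When you rebuild an application this way, it helps to confirm which compiler actually produced the binary. The following is a small sketch in C, not part of any real project: the function names and the file name in the comment are hypothetical, and only the -xL flag comes from the text above (-O2 is an assumed, ordinary optimization level). It relies on the predefined macros that ICC, Clang and GCC set.

```c
#include <stdio.h>

/* Example build commands (assumed invocations; only -xL is from the text):
 *   icc -O2 -xL -o which_compiler which_compiler.c
 *   gcc -O2     -o which_compiler which_compiler.c
 */

/* Returns a short name for the compiler that built this file.
   ICC and Clang also define __GNUC__ on Linux, so they are checked first. */
const char *compiler_name(void)
{
#if defined(__INTEL_COMPILER)
    return "icc";
#elif defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";
#else
    return "unknown";
#endif
}

/* Writes a one-line build report, e.g. to stdout or a log file. */
void print_build_info(FILE *out)
{
    fprintf(out, "built with %s\n", compiler_name());
}
```

Calling print_build_info(stdout) at startup is a cheap way to make sure a benchmark run is really measuring the ICC build and not a leftover GCC binary.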

A blogger has posted some benchmarks of these optimizations applied to video encoding. There are performance increases, although modest ones (10-15%), and x264 doesn't show any improvement.


Keep in mind that these are video encoding benchmarks, a workload that is already well suited to this type of CPU. The CPU streams through a block of data rather than accessing it randomly and, therefore, doesn't suffer additional penalties from being unable to do out-of-order memory accesses, for example.
I'd expect a completely different type of application to benefit even more from proper architecture optimization, something that will also improve with time, as new versions of ICC are released and GCC gains support.
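The difference between streaming and random access can be sketched with two small C functions. These are illustrative only (the names are made up), and they compute the same kind of sum; what differs is the order in which memory is touched.

```c
#include <stddef.h>

/* Streaming access: addresses increase one element at a time, which
   caches and hardware prefetchers generally handle well. This is the
   pattern a video encoder mostly follows. */
long sum_streaming(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Indexed (gather) access: each element is fetched through an index
   table, so consecutive loads can land anywhere in memory. Every entry
   of idx must be a valid index into data. */
long sum_gather(const int *data, const size_t *idx, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[idx[i]];
    return total;
}
```

In the second function the loads are independent of each other, so an out-of-order core can typically keep several cache misses in flight at once, while an in-order core tends to stall as soon as it needs the result of the first load that missed. Code shaped like sum_gather is where I would expect the gap between the two designs, and the room for compiler help, to be largest.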
In fact, these benchmarks are already outdated, since Intel has since released ICC 11.0, which an Intel employee has described as bringing additional performance improvements:

With the upcoming 11.0 compiler the option switch will be changed to /QxSSE3_ATOM and additional optimizations focussing on address generation for memory access and taking advantage of Atom specific instructions are being implemented.

This is not something out of the ordinary, as a commercial product has already taken a similar approach: the Fit-PC, a Geode LX thin client, can be purchased with Gentoo installed instead of (or alongside) Ubuntu, built with proper compiler optimizations for the Geode rather than just generic i586 optimizations.

Just head to Intel's website and download ICC, since it is free for non-commercial use. While I don't have an Atom to play with, I have already used ICC on some Xeon processors, where it increased performance in calculation-intensive applications by about 30% compared to stock GCC 4.3.2, and by 10-15% compared to GCC 4.3.2 with major tuning.
