The New Intel C Compiler for Linux
Recently, Intel released its newest C/C++ and Fortran 77/90 compilers for the Linux platform. Calling them simply the C++ and Fortran Compilers 5.0 does them a bit of a disservice. Developers have long awaited these compilers, and it seems the wait has been worth it. Most serious developers on the Intel architecture know that while these processors are capable of performance rivaling that of high-end workstation CPUs, reaching that peak performance has required buying commercial libraries or spending hours on tedious assembly coding. The reason is a feature found on every Intel processor since the Pentium III, known as SSE (Streaming SIMD Extensions). SSE is more than a marketing term: it is an extension to the instruction set that adds eight 128-bit registers, each holding four single-precision floating point values that a single SSE instruction can operate on at once. As you will see later in the article, using these extra registers to their fullest potential can lead to tremendous performance increases in applications that make heavy use of floating point calculations.
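To make the register description concrete, here is a minimal sketch in C using the SSE intrinsics declared in xmmintrin.h. It assumes an x86 processor with SSE and a compiler that provides that header (recent versions of both icc and GCC do); the function names add4 and add_arrays are hypothetical, not from any library. Each __m128 value occupies one 128-bit register holding four floats, so one _mm_add_ps performs four additions.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Add four floats to four floats with a single packed SSE addition. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);      /* load a[0..3] into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);      /* load b[0..3] */
    __m128 vsum = _mm_add_ps(va, vb); /* four additions, one instruction */
    _mm_storeu_ps(out, vsum);         /* write the four sums back to memory */
}

/* Add two float arrays four elements at a time.
   This sketch requires n to be a multiple of 4 (no scalar tail loop). */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        add4(a + i, b + i, out + i);
}
```

The unaligned load and store variants are used so the sketch works on any float array; the aligned forms (_mm_load_ps, _mm_store_ps) require 16-byte-aligned addresses.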
Fortunately for those of us who use Linux, the compilers are free for non-commercial use. For Windows users in an academic environment, the cost is also relatively low, at under $100 per user.
Besides the price, the Intel compilers have several other benefits:
|Commonality with GCC||If you are familiar with GCC's syntax and switches, the transition to the Intel compilers will not be difficult.|
|Increased performance||The baseline performance increase can be expected to be around 30%, and over 100% under certain circumstances.|
|Visual Studio compatibility||While we have not tested this feature here at the MCSR, Intel claims that code produced by the Microsoft compiler is fully compatible with code produced by the Intel compiler.|
There are few drawbacks to using the Intel compilers. One thing I noticed is that icc (the command that invokes the Intel compiler) is a bit pickier about the code it will compile. In a few instances gcc let me get away with a bad malloc() statement without even a warning, but icc refused to compile the exact same code. After a bit of debugging, I found that I had commented out the proper malloc() statement and replaced it with a bad one, for reasons unknown. The point is this: icc, especially when compiling C++ code, is pickier about syntax and about things like memory allocation. Whether or not this is actually a bad thing is left to the reader to decide.
The only serious drawback for Linux users is that the Intel compilers cannot compile the Linux kernel.
When testing the Intel compiler's performance, I used three optimization switches.
|-O3||This is roughly similar to GCC's -O3 switch. It enables aggressive high-level optimizations.|
|-xK||This switch turns on SSE optimization by generating code specialized for SSE-capable processors such as the Pentium III; the resulting program requires a processor with SSE. It makes a huge impact on performance.|
|-wp_ipo||This switch turns on whole-program interprocedural optimization. It does many different things, such as procedure inlining and loop unrolling, or whatever else the compiler thinks will improve the performance of your code. Note that this switch only works if the entire source code for a program is in one file. The multi-file equivalent is -ipo.|
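The kind of code -xK is meant to help is a simple loop like the one below: unit stride, single precision, and no dependence between iterations, so a vectorizing compiler can turn it into packed SSE instructions with no intrinsics in the source. This is a hypothetical illustration (scale_add is not taken from the article), and whether a particular compiler version actually vectorizes it depends on whether it can prove the arrays do not overlap; the C99 restrict qualifiers are one way to make that promise.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]
   Unit-stride access, no loop-carried dependence, and restrict-qualified
   (non-overlapping) pointers make this a good candidate for automatic
   SSE vectorization. */
void scale_add(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Keeping the loop body free of function calls and branches is what gives the vectorizer room to work; interprocedural optimization such as -wp_ipo can help by inlining small helpers into loops like this one.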
When compiling the programs with GCC, I used the following switches.
-O9 -funroll-loops -ffast-math -fomit-frame-pointer -malign-double -mcpu=pentiumpro -finline-functions -march=pentiumpro -fno-exceptions
Note that I was using GCC version 2.95.3. GCC treats any optimization level above -O3, such as -O9, the same as -O3.
Performance vs. GCC
I used three main benchmarks: Stream, a memory performance benchmark; Whetstone, a floating point performance benchmark; and a program that I wrote that is also floating point intensive. To save time and frustration, I set up an environment variable "FASTGCC" that contains the switches mentioned above.
Stream is a well-known benchmark used to test the memory performance of computers. I edited its makefile to reflect my choice of compilers and to ensure that the correct optimization switches were used for each one. There was a significant performance difference between the two compilers. The difference likely comes from the fact that the Intel compiler aligns data structures in memory more aggressively, letting the memory controller perform at its best, while gcc takes a more "get in where you fit in" approach.
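The alignment point can be illustrated with a minimal sketch. This is not STREAM's source code: it allocates arrays on 16-byte boundaries (the width of one SSE register) with the POSIX posix_memalign() call and runs STREAM's Triad kernel, a[j] = b[j] + scalar*c[j], over them. STREAM counts three array accesses per Triad element (two reads, one write), or 24 bytes per element in double precision, when it converts run time into bandwidth. The helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n doubles starting on a 16-byte boundary; NULL on failure.
   Release the memory with free(). */
double *alloc_aligned_doubles(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}

/* Nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* STREAM-style Triad kernel: a[j] = b[j] + scalar * c[j]. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Bytes counted for one Triad pass: read b and c, write a. */
double triad_bytes(size_t n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Bandwidth in MB/s (10^6 bytes per second) for one pass taking `seconds`. */
double triad_mbps(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return triad_bytes(n) / seconds / 1.0e6;
}
```

A real measurement would time many Triad passes over arrays much larger than the processor caches and keep the best time, as STREAM itself does.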
Note that this benchmark requires a bit of tweaking from system to system in order to produce valid results. On my system, I increased the array size to
Results using icc
Results using gcc
Whetstone is a well-known benchmark normally used to test the floating point performance of processors. I edited its makefile in the same way. When running the benchmark, the user must specify one parameter, the number of loops to perform. As you can see, I chose 1 million. The results below are typical; variance between runs was less than 5%.
Using icc
jjake@ars:~/bench$ ./iccWhet 1000000
Loops: 1000000, Iterations: 1, Duration: 138 sec.
C Converted Double Precision Whetstones: 724.6 MIPS
Using gcc
jjake@ars:~/bench$ ./gccWhet 1000000
Loops: 1000000, Iterations: 1, Duration: 277 sec.
C Converted Double Precision Whetstones: 361.0 MIPS
The results are impressive. The Intel compiler enjoys a performance lead of approximately 100%, finishing in 138 seconds against gcc's 277. This is most likely the result of using the SSE registers. GCC 2.95.3 leaves these untapped and is therefore thoroughly defeated.
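The printed ratings can be checked against the durations. The transcripts are consistent with a rating of 100 × Loops × Iterations / Duration thousand Whetstone instructions per second (KIPS), divided by 1000 to give MIPS; this formula is reconstructed from the printed numbers, not quoted from the benchmark source. The sketch below, with hypothetical helpers whetstone_mips and percent_faster, makes the arithmetic explicit: halving the run time doubles the rating, which is where the roughly 100% lead comes from.

```c
#include <assert.h>

/* Whetstone rating in MIPS, as reconstructed from the transcripts:
   100 * loops * iterations / seconds gives KIPS; divide by 1000 for MIPS. */
double whetstone_mips(long loops, long iterations, double seconds)
{
    assert(seconds > 0.0);
    double kips = 100.0 * (double)loops * (double)iterations / seconds;
    return kips / 1000.0;
}

/* Percentage by which rating `fast` exceeds rating `slow`.
   A ratio of 2.0 is a 100% increase; a ratio of 3.0 is a 200% increase. */
double percent_faster(double fast, double slow)
{
    assert(slow > 0.0);
    return (fast / slow - 1.0) * 100.0;
}
```

For example, whetstone_mips(1000000, 1, 138.0) gives about 724.6, whetstone_mips(1000000, 1, 277.0) gives about 361.0, and percent_faster applied to the two is about 100.7.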
One oddity did occur during the testing of this benchmark, however. When using the -wp_ipo optimization switch with the Intel compiler, performance skyrocketed. The results are included here, but take them with a grain of salt. This seems almost too good to be true.
Questionable icc results
jjake@ars:~/bench$ ./otherWhet 1000000
Loops: 1000000, Iterations: 1, Duration: 41 sec.
C Converted Double Precision Whetstones: 2439.0 MIPS
This is a truly amazing result: 2439.0 MIPS is about 6.76 times gcc's 361.0 MIPS, an increase of about 576%! It is certainly not typical, however, so your mileage may vary.
The third benchmark, the floating point intensive program I wrote, performs millions of floating point adds, subtracts, compares, and multiplies. It also uses rand() from the C standard library and the cos(), sin(), tan(), and sqrt() functions from the C math library. The results below are representative; the differences between runs of this benchmark were not significant, varying by less than 5%.
As you can see, I used the Unix command "time" to measure how long each of these programs took to execute. The results are clear. A typical run of the exact same code compiled with "icc -O3 -xK -wp_ipo test.c -o iccTest -lm" and "gcc $FASTGCC test.c -o gccTest -lm" shows over a 30 second difference in execution time.
jjake@ars:~/handyStuff$ time gccTest
The difference is over 400% in favor of the Intel compiler. That is a very significant speed increase over gcc.
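The shell's "time" command measures a whole process. To time just one section of code from inside a C program, a common approach on Linux is the POSIX clock_gettime() call with CLOCK_MONOTONIC, sketched below in a hypothetical seconds_now() helper; busy_sum() is a placeholder workload, not the article's test program. On older glibc versions, programs using clock_gettime() must also be linked with -lrt.

```c
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime() and CLOCK_MONOTONIC */
#include <assert.h>
#include <time.h>

/* Seconds on a monotonic clock. It is unaffected by changes to the
   wall-clock time, so the difference of two readings is elapsed time. */
double seconds_now(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG disables assert */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1.0e9;
}

/* Placeholder floating point workload: the n-th harmonic number. */
double busy_sum(long n)
{
    double acc = 0.0;
    for (long i = 1; i <= n; i++)
        acc += 1.0 / (double)i;
    return acc;
}
```

Usage: read seconds_now() before and after the call to busy_sum() (or any other code under test) and subtract; storing the result in a volatile variable keeps an optimizing compiler from discarding work whose result is otherwise unused.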
Keep in mind that these benchmarks focused only on two areas where the Intel compilers are known to be superior: memory performance and floating point performance. The differences when compiling code that consists mainly of integer operations will not be as significant. Also, two compilers cannot be compared properly using only the benchmarks presented here. This article should only serve to increase your interest in these new tools.
Making the switch to a different compiler is not a trivial task. Currently, Intel provides the compiler in RPM format for Red Hat Linux 6.2 and 7.1 only. With some creative work, however, it can be made to work on other Linux distributions.
In time, the MCSR may opt to install the Intel compilers on Mimosa, our Linux cluster. Stay tuned to the MCSR homepage for news regarding this.