Some Notes on the Abstraction Penalty for IA86 C++ Compilers

In recent years there have been many interesting developments in C++ coding styles. We've seen the introduction of C++ metaprogramming techniques1, many of which offer great advantages to development as they allow simultaneous abstraction (making code more readable and maintainable) and efficiency (because templates allow computations such as the determination of type information to be executed at compile time, not runtime).

Unfortunately many compilers simply aren't up to the task of offering the efficiency advantages. Writing code with abstraction instead results in extra layers of function calling that slow down execution and, more importantly, separate blocks of code which would be optimised away if only they were brought together.

What follows is a short investigation of the compilation of a short piece of code that arose as a result of some real development.

Here is the function under consideration:


int test1(int z) {
    Matrix<Static<int,3,3> > m;
    Vector<Static<int,3> > x;
    m[0][0] = 1;
    m[0][1] = 0;
    m[0][2] = 0;
    m[1][0] = 0;
    m[1][1] = z;
    m[1][2] = 0;
    m[2][0] = 0;
    m[2][1] = z;
    m[2][2] = z;
    x[0] = 1;
    x[1] = 2;
    x[2] = 3;
    return (m*x)[2];
}
(See below for links to the full code.) operator*() has been overriden to provide the usual matrix/vector multiply operation and operator[]() accesses the elements of a vector or matrix in the way you might expect. A quick computation shows that this routine in fact returns 5*z. But because of these overrides the above code implicitly contains many function calls. For example there are 22 calls to operator[](). As a result it is a highly non-trivial task for the compiler to unpack all of these function calls, and perform the required algebra, to spot that the same result can be found with a single multiplication.

GCC

So let's start with gcc. According to urban legend gcc optimises poorly. Unfortunately much of the open source in existence today, including almost every installation of Linux (and MacOSX), is built with gcc. Here is what version 3.3.1 of gcc generates:

	g++ -O5 -S -o main.S main.cpp

	pushl	%ebp
	movl	$1, %eax
	movl	%esp, %ebp
	pushl	%esi
	leal	-56(%ebp), %ecx
	pushl	%ebx
	subl	$240, %esp

	... 120 lines deleted...

	movl	%edx, 4(%esp)
	call	__ZmlI6StaticIiLi3ELi3EES0_IiLi3ELi0EEE6VectorI17TimesMatrixVectorIT_T0_EERK6MatrixIS5_ERKS3_IS6_E
	movl	%eax, -224(%ebp)
	movl	%edx, -220(%ebp)
	movl	-224(%ebp), %edx
	movl	-220(%ebp), %ecx
	addl	$24, %edx
	.p2align 4,,15
L56:
	movl	(%ecx), %eax
	addl	$4, %ecx
	imull	(%edx), %eax
	addl	$4, %edx
	addl	%eax, %esi
	decl	%ebx
	jns	L56
	addl	$240, %esp
	movl	%esi, %eax
	popl	%ebx
	popl	%esi
	popl	%ebp
	ret
That's about 150 instrcutions. But actually it's much worse than it looks. Notice that it's actually calling a separate matrix times vector routine that hasn't been inlined. And that is where the actual guts of the code lie. What you've seen above is merely the code to set up a call to this routine. It's astonishingly bad.

We can give gcc some clues by asking it to inline functions and unroll all loops. This trims about 5% off the length of this function.

Microsoft

The Microsoft compiler is the de facto standard for the Windows platform. It isn't particularly famed for its optimisation and is frequently the butt of jokes about the dreaded Internal Compiler Error. Here is the code generated by the Microsoft C/C++ Compiler, version 13.10.3052 which comes with the free Microsoft Visual C++ Toolkit 2003:

	cl /GX /Ox /Fa main.cpp

	mov	eax, DWORD PTR _z$[esp-4]
	lea	eax, DWORD PTR [eax+eax*4]
	ret	0
Note that the core of this routine is the single lea instruction that performs an integer multiply by five.

Intel

According to folklore the Intel compiler is the best that can be bought for the Intel platforms. This doesn't accord well with my personal experience. I frequently find my code is 20-50% slower than that generated by the Microsoft compiler. Unfortunately I am unable to run a full test now as Intel don't appear to provide a stand alone compiler. They appear only provide a piece of software that is parasitic on a full Microsoft Visual C++ installation and can't function on its own. (The error message was icl: error: unable to run '<Microsoft VC++ Dir>\\Bin'.) When I had access to Visual C++ I tested my example code with version 8.1 of the Intel compiler for IA86 and I found it compiled to something like 50 instructions with lots of optimisations switched on. This is much better than gcc, but an order of magnitude (or two!) worse than the Microsoft compiler.

Digital Mars

Tell me how to get an assembly listing out of the free evaluation of this compiler and I'll test my code.

Conclusions

For modern C++ coding styles the abstraction penalty can be severe. Through the use of constant folding and constant propagation the Microsoft compiler is able to alleviate much of that penalty. gcc and the Intel compiler still have room for improvement. Unfortunately this assessment isn't reflected in benchmarks. Many C++ benchmarks use code that with very little abstraction and is quite Fortran-like. The Intel compiler excels with this type of code. But for abstract code, I unhesitatingly recommend use of the Microsoft compiler.

References

[1] Template Metaprograms, Todd Veldhuizen

Downloads