Irritating GCC code generation
I try not to hand-optimize code, preferring always to let the compiler do it. Often this works out just fine: I can sometimes see how to shorten the code a little, but the compiler may be picking faster instructions, so it’s best to just leave the output alone.
But sometimes the output is absurdly bad. I’m experimenting with the following functions in XCB:
XCBGetWindowAttributesCookie
XCBGetWindowAttributes (XCBConnection *c,
                        XCBWINDOW      window)
{
    return GetWindowAttributes(XCB_REQUEST_CHECKED, c, window);
}

XCBGetWindowAttributesCookie
XCBGetWindowAttributesBlind (XCBConnection *c,
                             XCBWINDOW      window)
{
    return GetWindowAttributes(0, c, window);
}

XCBGetWindowAttributesRep *
XCBGetWindowAttributesReply (XCBConnection                 *c,
                             XCBGetWindowAttributesCookie   cookie,
                             XCBGenericError              **e)
{
    return (XCBGetWindowAttributesRep *) XCBWaitForReply(c, cookie.sequence, e);
}
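For anyone who wants to try this without the XCB headers, here is a minimal sketch of the same wrapper shape. Everything in it is a hypothetical stand-in rather than XCB's real code: the types Connection, Window and Cookie, the constant REQUEST_CHECKED, and the body of get_window_attributes are all invented for illustration. The noinline attribute is a GCC extension, used here only so the worker stays a separate function.

```c
/* Hypothetical stand-ins for the XCB types; the real ones live in XCB's headers. */
typedef struct { int fd; } Connection;
typedef struct { unsigned int xid; } Window;
typedef struct { unsigned int sequence; } Cookie;

enum { REQUEST_CHECKED = 1 };

/* Stand-in for the shared worker. Its body is invented: it just packs the
 * window id and the low flag bit into the cookie so the result is observable.
 * noinline (a GCC extension) keeps it from being folded into the wrappers. */
__attribute__((noinline))
Cookie get_window_attributes(int flags, Connection *c, Window window)
{
    Cookie cookie;
    (void) c; /* unused in this sketch */
    cookie.sequence = (window.xid << 1) | (unsigned int) (flags & 1);
    return cookie;
}

/* The two wrappers differ only in the constant they prepend. */
Cookie get_window_attributes_checked(Connection *c, Window window)
{
    return get_window_attributes(REQUEST_CHECKED, c, window);
}

Cookie get_window_attributes_blind(Connection *c, Window window)
{
    return get_window_attributes(0, c, window);
}
```

Compiling this with something like `gcc -O2 -c` and running `objdump -d` on the object file should show what your own compiler makes of the wrappers. The output will differ by GCC version and target, so don't expect it to match the listing in this post exactly.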
GCC 4.0.3 on i386/Linux produces the following code with -O2:
00000210 XCBGetWindowAttributes:
210: 55 push %ebp
211: 89 e5 mov %esp,%ebp
213: 53 push %ebx
214: 83 ec 14 sub $0x14,%esp
217: 8b 45 10 mov 0x10(%ebp),%eax
21a: 8d 55 f8 lea 0xfffffff8(%ebp),%edx
21d: 8b 5d 08 mov 0x8(%ebp),%ebx
220: 89 14 24 mov %edx,(%esp)
223: 89 44 24 0c mov %eax,0xc(%esp)
227: 8b 45 0c mov 0xc(%ebp),%eax
22a: 89 44 24 08 mov %eax,0x8(%esp)
22e: b8 01 00 00 00 mov $0x1,%eax
233: 89 44 24 04 mov %eax,0x4(%esp)
237: e8 fc ff ff ff call GetWindowAttributes
23c: 8b 45 f8 mov 0xfffffff8(%ebp),%eax
23f: 89 03 mov %eax,(%ebx)
241: 89 d8 mov %ebx,%eax
243: 8b 5d fc mov 0xfffffffc(%ebp),%ebx
246: 83 ec 04 sub $0x4,%esp
249: c9 leave
24a: c2 04 00 ret $0x4
24d: 8d 76 00 lea 0x0(%esi),%esi
00000250 XCBGetWindowAttributesBlind:
250: 55 push %ebp
251: 89 e5 mov %esp,%ebp
253: 53 push %ebx
254: 83 ec 14 sub $0x14,%esp
257: 8b 45 10 mov 0x10(%ebp),%eax
25a: 8d 55 f8 lea 0xfffffff8(%ebp),%edx
25d: 8b 5d 08 mov 0x8(%ebp),%ebx
260: 89 14 24 mov %edx,(%esp)
263: 89 44 24 0c mov %eax,0xc(%esp)
267: 8b 45 0c mov 0xc(%ebp),%eax
26a: 89 44 24 08 mov %eax,0x8(%esp)
26e: 31 c0 xor %eax,%eax
270: 89 44 24 04 mov %eax,0x4(%esp)
274: e8 fc ff ff ff call GetWindowAttributes
279: 8b 45 f8 mov 0xfffffff8(%ebp),%eax
27c: 89 03 mov %eax,(%ebx)
27e: 89 d8 mov %ebx,%eax
280: 8b 5d fc mov 0xfffffffc(%ebp),%ebx
283: 83 ec 04 sub $0x4,%esp
286: c9 leave
287: c2 04 00 ret $0x4
28a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
00000290 XCBGetWindowAttributesReply:
290: 55 push %ebp
291: 89 e5 mov %esp,%ebp
293: 5d pop %ebp
294: e9 fc ff ff ff jmp XCBWaitForReply
299: 8d b4 26 00 00 00 00 lea 0x0(%esi),%esi
I believe this code, which I forced GCC to output using asm directives, is equivalent aside from the lack of a frame pointer:
00000000 XCBGetWindowAttributes:
0: 6a 01 push $0x1
2: e9 c9 01 00 00 jmp GetWindowAttributes
00000007 XCBGetWindowAttributesBlind:
7: 6a 00 push $0x0
9: e9 c2 01 00 00 jmp GetWindowAttributes
0000000e XCBGetWindowAttributesReply:
e: e9 fc ff ff ff jmp XCBWaitForReply
13: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
19: 8d bc 27 00 00 00 00 lea 0x0(%edi),%edi
If I could get the compiler to place these functions immediately before the GetWindowAttributes function, then the two-byte short form of the x86 jmp instruction (an opcode plus an 8-bit displacement) would suffice for the first two functions. Then all three functions would fit neatly into a 16-byte cache line.
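The byte counting behind that claim can be sketched as a few lines of C. The instruction sizes come from the hand-written listing (6a is a two-byte push of an 8-bit immediate, e9 is the five-byte near jmp) plus the standard two-byte short jmp (opcode eb). The function names stub_bytes and fits_in_line are made up for this illustration.

```c
/* Instruction sizes in bytes on i386. */
enum {
    PUSH_IMM8 = 2,   /* 6a ib: push an 8-bit immediate    */
    JMP_REL8  = 2,   /* eb cb: short jmp, 8-bit offset    */
    JMP_REL32 = 5,   /* e9 cd: near jmp, 32-bit offset    */
    LINE_SIZE = 16   /* the cache line size assumed above */
};

/* Total size of the three stubs. The two GetWindowAttributes wrappers are
 * push + jmp; the Reply wrapper is a lone near jmp to XCBWaitForReply.
 * short_jumps selects the two-byte jmp for the first two wrappers only. */
int stub_bytes(int short_jumps)
{
    int jmp = short_jumps ? JMP_REL8 : JMP_REL32;
    return 2 * (PUSH_IMM8 + jmp) + JMP_REL32;
}

/* Nonzero if that many bytes fit in one line, assuming the first stub
 * starts on a line boundary. */
int fits_in_line(int bytes)
{
    return bytes <= LINE_SIZE;
}
```

With near jumps the stubs take 19 bytes, which matches the offsets 0x0 through 0x13 in the listing. With short jumps for the first two they take 13, which is under 16.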
Expecting this sort of output seems reasonable to me. These are all tail calls, from functions that don’t do anything scary like alloca. In the Get*Reply case, GCC very nearly generates the output I want, but it insists on setting up the frame pointer and then immediately tearing it down again.
The code implementing XCB’s protocol stubs is highly regular and quite simple. Maybe I should just write a version of c-client.xsl that directly generates assembly targeting particular architectures. :-)