I try not to hand-optimize code, preferring to let the compiler do it. Usually this works out fine: sometimes I can see how to shorten the code a little, but the compiler may be picking faster instructions, and it’s best to leave the output alone.

But sometimes the output is absurdly bad. I’m experimenting with the following functions in XCB:

XCBGetWindowAttributesCookie
XCBGetWindowAttributes (XCBConnection *c,
                        XCBWINDOW      window)
{
    return GetWindowAttributes(XCB_REQUEST_CHECKED, c, window);
}

XCBGetWindowAttributesCookie
XCBGetWindowAttributesBlind (XCBConnection *c,
                             XCBWINDOW      window)
{
    return GetWindowAttributes(0, c, window);
}

XCBGetWindowAttributesRep *
XCBGetWindowAttributesReply (XCBConnection                 *c,
                             XCBGetWindowAttributesCookie   cookie,
                             XCBGenericError              **e)
{
    return (XCBGetWindowAttributesRep *) XCBWaitForReply(c, cookie.sequence, e);
}

GCC 4.0.3 on i386/Linux produces the following code with -O2:

00000210 XCBGetWindowAttributes:
     210:       55                      push   %ebp
     211:       89 e5                   mov    %esp,%ebp
     213:       53                      push   %ebx
     214:       83 ec 14                sub    $0x14,%esp
     217:       8b 45 10                mov    0x10(%ebp),%eax
     21a:       8d 55 f8                lea    0xfffffff8(%ebp),%edx
     21d:       8b 5d 08                mov    0x8(%ebp),%ebx
     220:       89 14 24                mov    %edx,(%esp)
     223:       89 44 24 0c             mov    %eax,0xc(%esp)
     227:       8b 45 0c                mov    0xc(%ebp),%eax
     22a:       89 44 24 08             mov    %eax,0x8(%esp)
     22e:       b8 01 00 00 00          mov    $0x1,%eax
     233:       89 44 24 04             mov    %eax,0x4(%esp)
     237:       e8 fc ff ff ff          call   GetWindowAttributes
     23c:       8b 45 f8                mov    0xfffffff8(%ebp),%eax
     23f:       89 03                   mov    %eax,(%ebx)
     241:       89 d8                   mov    %ebx,%eax
     243:       8b 5d fc                mov    0xfffffffc(%ebp),%ebx
     246:       83 ec 04                sub    $0x4,%esp
     249:       c9                      leave
     24a:       c2 04 00                ret    $0x4
     24d:       8d 76 00                lea    0x0(%esi),%esi

00000250 XCBGetWindowAttributesBlind:
     250:       55                      push   %ebp
     251:       89 e5                   mov    %esp,%ebp
     253:       53                      push   %ebx
     254:       83 ec 14                sub    $0x14,%esp
     257:       8b 45 10                mov    0x10(%ebp),%eax
     25a:       8d 55 f8                lea    0xfffffff8(%ebp),%edx
     25d:       8b 5d 08                mov    0x8(%ebp),%ebx
     260:       89 14 24                mov    %edx,(%esp)
     263:       89 44 24 0c             mov    %eax,0xc(%esp)
     267:       8b 45 0c                mov    0xc(%ebp),%eax
     26a:       89 44 24 08             mov    %eax,0x8(%esp)
     26e:       31 c0                   xor    %eax,%eax
     270:       89 44 24 04             mov    %eax,0x4(%esp)
     274:       e8 fc ff ff ff          call   GetWindowAttributes
     279:       8b 45 f8                mov    0xfffffff8(%ebp),%eax
     27c:       89 03                   mov    %eax,(%ebx)
     27e:       89 d8                   mov    %ebx,%eax
     280:       8b 5d fc                mov    0xfffffffc(%ebp),%ebx
     283:       83 ec 04                sub    $0x4,%esp
     286:       c9                      leave
     287:       c2 04 00                ret    $0x4
     28a:       8d b6 00 00 00 00       lea    0x0(%esi),%esi

00000290 XCBGetWindowAttributesReply:
     290:       55                      push   %ebp
     291:       89 e5                   mov    %esp,%ebp
     293:       5d                      pop    %ebp
     294:       e9 fc ff ff ff          jmp    XCBWaitForReply
     299:       8d b4 26 00 00 00 00    lea    0x0(%esi),%esi

I believe this code, which I forced GCC to output using asm directives, is equivalent aside from the lack of a frame pointer:

00000000 XCBGetWindowAttributes:
       0:       6a 01                   push   $0x1
       2:       e9 c9 01 00 00          jmp    GetWindowAttributes

00000007 XCBGetWindowAttributesBlind:
       7:       6a 00                   push   $0x0
       9:       e9 c2 01 00 00          jmp    GetWindowAttributes

0000000e XCBGetWindowAttributesReply:
       e:       e9 fc ff ff ff          jmp    XCBWaitForReply
      13:       8d b6 00 00 00 00       lea    0x0(%esi),%esi
      19:       8d bc 27 00 00 00 00    lea    0x0(%edi),%edi

If I could get the compiler to place these functions immediately before the GetWindowAttributes function, then the two-byte short form of the x86 jmp instruction would suffice for the first two functions. All three functions would then take 13 bytes in total and fit neatly into a 16-byte cache line.
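The byte counts behind that claim can be checked with a little arithmetic. This is only a sketch: the instruction sizes are the standard x86 encodings (6a ib for push with an 8-bit immediate, eb cb for the short jmp, e9 cd for the near jmp), while the function names below and the idea of placing GetWindowAttributes at offset 16 are assumptions made for illustration.

```c
/* Encoded sizes, in bytes, of the x86 instructions used by the stubs. */
enum {
    PUSH_IMM8 = 2,  /* 6a ib: push a sign-extended 8-bit immediate */
    JMP_REL8  = 2,  /* eb cb: short jump, 8-bit displacement       */
    JMP_REL32 = 5   /* e9 cd: near jump, 32-bit displacement       */
};

/* Size of the three stubs as listed above: near jumps everywhere. */
int stubs_size_near(void)
{
    return 2 * (PUSH_IMM8 + JMP_REL32) + JMP_REL32;
}

/* Size of the three stubs if the first two can use short jumps. */
int stubs_size_short(void)
{
    return 2 * (PUSH_IMM8 + JMP_REL8) + JMP_REL32;
}

/* A short jump encodes its target as a signed 8-bit displacement
   measured from the end of the jmp instruction itself. */
int reachable_by_short_jmp(int jmp_end, int target)
{
    int displacement = target - jmp_end;
    return displacement >= -128 && displacement <= 127;
}
```

With near jumps the stubs take 19 bytes, which matches the listing (the last jmp ends at offset 0x13). With short jumps they take 13 bytes, and a target at the hypothetical offset 16 is well within reach of both short jumps.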

Expecting this sort of output seems reasonable to me. These are all tail calls, from functions that don’t do anything scary like alloca. In the Get*Reply case, GCC comes close to generating the output I want, but insists on setting up the frame pointer and then immediately tearing it down again.
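The two shapes of wrapper can be compared in isolation with a small test file. This is a sketch under assumptions: the names and function bodies are hypothetical stand-ins, not XCB code, and plain integers replace the struct types. One plausible reason GCC treats the two cases differently on i386 is that the request wrappers pass one more stack argument than they receive, so the callee's argument area does not fit in the space the caller owns, and a bare jmp would leave the return address in the wrong slot. The reply wrapper forwards exactly the arguments it received, so its frame can be reused as is.

```c
#include <stdint.h>

/* Hypothetical stand-in for GetWindowAttributes. The noinline
   attribute (a GCC extension) keeps the call visible in the output. */
__attribute__((noinline))
uint32_t send_request(int checked, uint32_t conn, uint32_t window)
{
    return (conn ^ window) + (uint32_t)checked;
}

/* Hypothetical stand-in for XCBWaitForReply. */
__attribute__((noinline))
uint32_t wait_for_reply(uint32_t conn, uint32_t sequence)
{
    return conn + sequence;
}

/* Same argument list as the callee: the incoming stack arguments are
   already laid out the way the callee expects them. */
uint32_t forward_same_args(uint32_t conn, uint32_t sequence)
{
    return wait_for_reply(conn, sequence);
}

/* One argument more than the wrapper received: with stack-passed
   arguments the callee needs a larger argument area than the caller
   was given. */
uint32_t forward_extra_arg(uint32_t conn, uint32_t window)
{
    return send_request(1, conn, window);
}
```

Compiling this with gcc -m32 -O2 -S and reading the assembly should show whether each wrapper ends in a jmp or a call. Adding -fomit-frame-pointer is worth trying as well, since it is the flag that controls the frame setup seen in the Get*Reply output.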

The code implementing XCB’s protocol stubs is highly regular and quite simple. Maybe I should just write a version of c-client.xsl that directly generates assembly targeting particular architectures. :-)