Optimal optimization

For most of the history of compilers, computers have been slow, and programmer time has been more valuable than program runtime. Compilers could, in theory, transform bulky and slow software into code that’s faster and smaller than any human would write by hand, but the trade-off has rarely been worthwhile. Instead we get just those optimizations that always improve code, plus a few heuristically-guided optimizations that help sometimes but may miss opportunities for improvements.

Some compilers are very explicit about their goals with regard to these trade-offs between compile time and run-time performance. Google’s “Go” compiler aims for fast compile times and is less concerned about optimization, for instance. The recently open-sourced Chez Scheme compiler has the philosophy that “Compile time is important as well as run time, so compiler optimizations are generally expected to pay their own way, i.e., indirectly benefit compiler performance enough to cover the direct cost of the optimization.” These are somewhat extreme examples, but generally compilers lean toward optimizations that don’t require very much computation to apply.

What might a compiler look like if we were willing to solve NP-complete problems in exchange for better generated code? In this post I’ll start with the problem of optimal inlining, and explore other ideas in future posts.

Inlining and its dual, procedural abstraction

One important optimization implemented in many compilers is the ability to inline the body of a function into every place where that function was called.

The immediate benefit of inlining is that a function call requires some extra instructions to set up a stack frame and then undo it when the function returns, so inlining eliminates those instructions and uses slightly less memory bandwidth.

On the other hand, if the function is called more than once, then duplicating the instructions in that function could make the total program size substantially larger. In embedded systems, that could mean having to buy more expensive parts with more storage. On desktop and server CPUs, larger programs are less likely to fit in cache and could be significantly slower than an equivalent small program. So it’s important to decide carefully whether inlining is worthwhile.

Typically, compilers decide whether to inline based on heuristics. If the function is called only once, then it’s pretty safe to just inline it. Otherwise, the compiler probably has some size limit: if the function is fewer than this many instructions, say, then we’ll inline it.

But inlining can have indirect benefits. If some calls to a function pass a constant value as one of its arguments, and the function chooses one of several branches based on that argument, then after inlining the compiler may be able to delete most of the function body as dead code.

We could try to adjust the heuristics to notice this case, but then some even more complicated situation will come up. In general, the only way to be sure you’ve found all the optimization opportunities is to try, for each call site, either inlining it or not. Since inlining one call may expose new opportunities for improving another call, in principle you’d have to try all 2\^n combinations. (I think that may actually be under-counting.) This is a combinatorial search problem.

Let’s look at the dual of this problem for a moment. Imagine the programmer wrote a bunch of copy-pasted code that does similar things over and over. They could have extracted a common pattern out to a new function and called that function each place they needed it, but for whatever reason they didn’t. The resulting program is larger than it could be as a result, but presumably the resulting straight-line code is easily optimized and fast (aside from cache effects).

What if, after optimization, we could discover that all that copy-pasted code is still very similar, and the compiler could choose to extract it to a function in order to reduce code size? Now we have the opposite dilemma: will the cost of creating a stack frame outweigh the code size advantages of extracting a function? But this time, the question is easier to answer. We’ve already applied all the intraprocedural optimizations we possibly could, and extracting a function can’t expose any new opportunities. (At least, none I can think of; counterexamples welcome!) So we can use only local information to make the decision, no search required.

The literature I’ve found on this calls it “procedural abstraction”, and the reason we don’t generally see compilers doing it is that it’s computationally very expensive to find sections of code that are similar enough that they can be extracted to new functions. I’ve found three general approaches:

Compute a fingerprint for each basic block. If several basic blocks have the same fingerprint, they’re candidates for abstraction. This only works if each block has the same instructions, in the same order (but allowing for some variation, such as different register names), so it has low odds of finding matching code. It’s fast, though. See, for example, Johan Runeson’s 2000 master’s thesis, “Code Compression through Procedural Abstraction before Register Allocation”.
Use a suffix trie to identify matching sub-sequences of instructions. This still requires instructions to appear in the same order, but can match instruction sequences that are smaller than a whole basic block. It’s still reasonably fast.
Treat the code as a directed graph and apply an algorithm for “frequent graph-based pattern mining”, such as gSpan. This is NP-complete, but can match instructions regardless of what order they’re in. See, for example, “Graph-Based Procedural Abstraction” by Dreweke et al, in the 2007 Proceedings of the International Symposium on Code Generation and Optimization; or chapter 4 of Neil E. Johnson’s 2004 PhD thesis, “Code size optimization for embedded processors”.

Of course, for this exercise, what’s a little asymptotic complexity? Obviously the hardest option is the only sensible choice.

With a procedural abstraction algorithm in hand, let’s go back to the problem of deciding when inlining is a good idea. Now we can take a conceptually easy approach:

Inline everything and let procedural abstraction sort it out!

(Recursive functions need a little care. If the language semantics allow tail-call optimization, then transform tail recursion into loops where possible first. After that, identify cycles in the call-graph and leave the targets of back-edge calls un-inlined. Those details aside, inline everything else!)

At this point we’ve replaced a small NP-complete problem (where n is the number of function calls) with a much larger NP-complete problem (where n is the number of instructions in the program after inlining every call). But there are two reasons this is interesting:

We’ll get the optimal set of function boundaries for whatever other optimizations we’ve chosen to implement. If we have an intraprocedural optimization (like dead-code elimination) that can take advantage of having a particular call inlined, then it will be inlined. If the other optimizations we’ve implemented don’t fire after inlining, then the function will be outlined again. But in the process we’ll shrink the code further by identifying redundant snippets in the original program.
Since the author of the compiler doesn’t need to think about how inlining might interact with any other optimizations, this optimization is perfectly modular. That’s great for verifying that the implementation is correct, to avoid wrong-code compiler bugs, and great for maintenance as the compiler evolves.

As a side note, I haven’t found any papers yet on applying the procedural abstraction idea to re-rolling loops, but it seems like a natural extension to me. (Please let me know if you’ve seen this elsewhere.) If a block of code does almost the same thing several times in a row with different constants, then we can extract those constants into an array and loop over the array. Whether that’s faster or slower than a series of function calls could be a tricky question, but it’s probably a code-size win.

Next time…

Inlining and procedural abstraction are one example of optimization opportunities which compilers generally leave on the table to avoid excessively long compile times. With all the complexity in a modern compiler it should be no surprise that there are more opportunities!

I have one more post like this planned; perhaps more will occur to me later. I’ll write about the Value State Dependence Graph, and opportunities to improve on optimizations described in papers like Alan C. Lawrence’s 2007 PhD thesis, “Optimizing compilation with the Value State Dependence Graph”, among others. I think I can outline a unified framework for register allocation, code motion, and instruction selection, that would maximize performance for a variety of instruction set architectures with instruction-level parallelism. But we’ll see how that turns out, next time!