Tuesday, May 24, 2016

Optimal optimization

For most of the history of compilers, computers have been slow, and programmer time has been more valuable than program runtime. Compilers could, in theory, transform bulky and slow software into code that's faster and smaller than any human would write by hand, but the trade-off has rarely been worthwhile. Instead we get just those optimizations that always improve code, plus a few heuristically-guided optimizations that help sometimes but may miss opportunities for improvement.

Some compilers are very explicit about their goals with regard to these trade-offs between compile time and run-time performance. Google's "Go" compiler aims for fast compile times and is less concerned about optimization, for instance. The recently open-sourced Chez Scheme compiler has the philosophy that "Compile time is important as well as run time, so compiler optimizations are generally expected to pay their own way, i.e., indirectly benefit compiler performance enough to cover the direct cost of the optimization." These are somewhat extreme examples, but generally compilers lean toward optimizations that don't require very much computation to apply.

What might a compiler look like if we were willing to solve NP-complete problems in exchange for better generated code? In this post I'll start with the problem of optimal inlining, and explore other ideas in future posts.

Inlining and its dual, procedural abstraction

One important optimization implemented in many compilers is inlining: copying the body of a function into every place where that function is called.

The immediate benefit of inlining is that it eliminates call overhead: a function call requires extra instructions to set up a stack frame and then tear it down when the function returns, so inlining removes those instructions and uses slightly less memory bandwidth.

On the other hand, if the function is called more than once, then duplicating the instructions in that function could make the total program size substantially larger. In embedded systems, that could mean having to buy more expensive parts with more storage. On desktop and server CPUs, larger programs are less likely to fit in cache and could be significantly slower than an equivalent small program. So it's important to decide carefully whether inlining is worthwhile.

Typically, compilers decide whether to inline based on heuristics. If the function is called only once, then it's pretty safe to just inline it. Otherwise, the compiler probably has some size limit: if the function is fewer than this many instructions, say, then we'll inline it.

But inlining can have indirect benefits. If some calls to a function pass a constant value as one of its arguments, and the function chooses one of several branches based on that argument, then after inlining the compiler may be able to delete most of the function body as dead code.
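
To make that concrete, here's a hypothetical C snippet (the function names are invented for illustration). The callee branches on its `mode` argument, and one caller passes a constant:

    /* A function that chooses a branch based on one argument. */
    int scale(int x, int mode) {
        if (mode == 0)
            return x * 2;
        else if (mode == 1)
            return x * x;
        else
            return x;
    }

    int caller(int x) {
        return scale(x, 1);   /* `mode` is a known constant here */
    }

    /* After inlining scale() into caller(), constant propagation knows
     * mode == 1, so dead-code elimination can strip the other branches,
     * leaving the equivalent of:
     *
     *     int caller(int x) { return x * x; }
     */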

We could try to adjust the heuristics to notice this case, but then some even more complicated situation will come up. In general, the only way to be sure you've found all the optimization opportunities is to try, for each call site, either inlining it or not. Since inlining one call may expose new opportunities for improving another call, in principle you'd have to try all 2^n combinations. (I think that may actually be under-counting.) This is a combinatorial search problem.
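
For a sense of what that brute-force search would look like, here's a minimal sketch in C. The `estimated_size()` function is a placeholder I've invented; a real implementation would apply the chosen inlining decisions, run the rest of the optimizer, and measure the result:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CALL_SITES 4   /* toy size; real programs have far more */

    /* Placeholder cost model (invented): pretend each inlined call saves a
     * little code.  A real evaluator would clone the IR, inline the calls
     * selected by `mask`, run the other optimizations, and measure size. */
    static unsigned estimated_size(uint32_t mask) {
        unsigned inlined = 0;
        for (uint32_t m = mask; m != 0; m >>= 1)
            inlined += m & 1;
        return 100 - inlined;
    }

    int main(void) {
        uint32_t best_mask = 0;
        unsigned best_size = estimated_size(0);

        /* Try every subset of call sites: bit i set means "inline call i". */
        for (uint32_t mask = 1; mask < (1u << NUM_CALL_SITES); mask++) {
            unsigned size = estimated_size(mask);
            if (size < best_size) {
                best_size = size;
                best_mask = mask;
            }
        }
        printf("best decisions: 0x%x (estimated size %u)\n",
               (unsigned)best_mask, best_size);
        return 0;
    }

With even a few dozen call sites that loop becomes hopeless, which is exactly why compilers fall back on heuristics.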

Let's look at the dual of this problem for a moment. Imagine the programmer wrote a bunch of copy-pasted code that does similar things over and over. They could have extracted a common pattern out to a new function and called that function each place they needed it, but for whatever reason they didn't. The resulting program is larger than it could be as a result, but presumably the resulting straight-line code is easily optimized and fast (aside from cache effects).

What if, after optimization, we could discover that all that copy-pasted code is still very similar, and the compiler could choose to extract it to a function in order to reduce code size? Now we have the opposite dilemma: will the cost of creating a stack frame outweigh the code size advantages of extracting a function? But this time, the question is easier to answer. We've already applied all the intraprocedural optimizations we possibly could, and extracting a function can't expose any new opportunities. (At least, none I can think of; counterexamples welcome!) So we can use only local information to make the decision, no search required.
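
Concretely, the transformation might look something like this hypothetical C example (names invented for illustration): two stretches of straight-line code turn out to be identical apart from their inputs, so they can be pulled out into one helper.

    #include <stdio.h>

    /* Before: repeated, copy-pasted straight-line code. */
    void report_totals(int a, int b, int c, int d) {
        printf("subtotal: %d\n", a + b);
        printf("percent:  %d\n", (a + b) / 10);
        printf("subtotal: %d\n", c + d);
        printf("percent:  %d\n", (c + d) / 10);
    }

    /* After extracting the repeated fragment into a new function, the
     * code is smaller at the cost of two calls. */
    static void report_one(int x, int y) {
        printf("subtotal: %d\n", x + y);
        printf("percent:  %d\n", (x + y) / 10);
    }

    void report_totals_abstracted(int a, int b, int c, int d) {
        report_one(a, b);
        report_one(c, d);
    }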

The literature I've found on this calls it "procedural abstraction", and the reason we don't generally see compilers doing it is that it's computationally very expensive to find sections of code that are similar enough that they can be extracted to new functions. I've found three general approaches:
  1. Compute a fingerprint for each basic block. If several basic blocks have the same fingerprint, they're candidates for abstraction. This only works if each block has the same instructions, in the same order (but allowing for some variation, such as different register names), so it has low odds of finding matching code. It's fast, though; a minimal sketch of this approach appears after the list. See, for example, Johan Runeson's 2000 master's thesis, "Code Compression through Procedural Abstraction before Register Allocation".
  2. Use a suffix trie to identify matching sub-sequences of instructions. This still requires instructions to appear in the same order, but can match instruction sequences that are smaller than a whole basic block. It's still reasonably fast.
  3. Treat the code as a directed graph and apply an algorithm for "frequent graph-based pattern mining", such as gSpan. This is NP-complete, but can match instructions regardless of what order they're in. See, for example, "Graph-Based Procedural Abstraction" by Dreweke et al., in the 2007 Proceedings of the International Symposium on Code Generation and Optimization; or chapter 4 of Neil E. Johnson's 2004 PhD thesis, "Code size optimization for embedded processors".
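
As a small illustration of the first approach, here's a sketch in C that fingerprints a basic block by hashing only its opcode sequence, so blocks that differ only in register names hash identically. The instruction representation is a toy I've invented for the example:

    #include <stdint.h>
    #include <stddef.h>

    /* Toy IR, invented for illustration. */
    enum opcode { OP_LOAD, OP_ADD, OP_MUL, OP_STORE };

    struct insn {
        enum opcode op;
        int dst, src1, src2;   /* register numbers: ignored by the hash */
    };

    /* FNV-1a hash over the opcode sequence only, so two blocks with the
     * same instructions in the same order but different registers become
     * candidates for abstraction. */
    uint64_t fingerprint(const struct insn *block, size_t len) {
        uint64_t h = 0xcbf29ce484222325ull;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint64_t)block[i].op;
            h *= 0x100000001b3ull;
        }
        return h;
    }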

Of course, for this exercise, what's a little asymptotic complexity? Obviously the hardest option is the only sensible choice.

With a procedural abstraction algorithm in hand, let's go back to the problem of deciding when inlining is a good idea. Now we can take a conceptually easy approach:

Inline everything and let procedural abstraction sort it out!

(Recursive functions need a little care. If the language semantics allow tail-call optimization, then transform tail recursion into loops where possible first. After that, identify cycles in the call-graph and leave the targets of back-edge calls un-inlined. Those details aside, inline everything else!)
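
Here's one way that cycle-breaking step might look, sketched in C over a toy adjacency-matrix call graph (the representation is invented for illustration): a depth-first search marks any call whose target is already on the DFS stack, and those calls are the ones left un-inlined.

    #include <stdbool.h>

    #define MAX_FUNCS 64

    /* Toy call graph: calls[f][g] means "function f calls function g". */
    static bool calls[MAX_FUNCS][MAX_FUNCS];
    static bool back_edge[MAX_FUNCS][MAX_FUNCS];  /* leave these un-inlined */

    static bool visited[MAX_FUNCS];
    static bool on_stack[MAX_FUNCS];

    /* DFS from function f; a call to something already on the DFS stack
     * closes a cycle, so mark it as a back edge. */
    static void dfs(int f, int nfuncs) {
        visited[f] = true;
        on_stack[f] = true;
        for (int g = 0; g < nfuncs; g++) {
            if (!calls[f][g])
                continue;
            if (on_stack[g])
                back_edge[f][g] = true;
            else if (!visited[g])
                dfs(g, nfuncs);
        }
        on_stack[f] = false;
    }

    void mark_back_edges(int nfuncs) {
        for (int f = 0; f < nfuncs; f++)
            if (!visited[f])
                dfs(f, nfuncs);
    }

Which calls get marked depends on the traversal order, so this is only one of several valid ways to break the cycles.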

At this point we've replaced a small NP-complete problem (where n is the number of function calls) with a much larger NP-complete problem (where n is the number of instructions in the program after inlining every call). But there are two reasons this is interesting:
  • We'll get the optimal set of function boundaries for whatever other optimizations we've chosen to implement. If we have an intraprocedural optimization (like dead-code elimination) that can take advantage of having a particular call inlined, then it will be inlined. If the other optimizations we've implemented don't fire after inlining, then the function will be outlined again. But in the process we'll shrink the code further by identifying redundant snippets in the original program.
  • Since the author of the compiler doesn't need to think about how inlining might interact with any other optimizations, this optimization is perfectly modular. That's great for verifying that the implementation is correct, to avoid wrong-code compiler bugs, and great for maintenance as the compiler evolves.

As a side note, I haven't found any papers yet on applying the procedural abstraction idea to re-rolling loops, but it seems like a natural extension to me. (Please let me know if you've seen this elsewhere.) If a block of code does almost the same thing several times in a row with different constants, then we can extract those constants into an array and loop over the array. Whether that's faster or slower than a series of function calls could be a tricky question, but it's probably a code-size win.
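
For instance, a hypothetical re-rolling might turn this (invented) repeated code into a table-driven loop:

    /* Before: the same statement repeated with different constants. */
    void set_defaults(int *cfg) {
        cfg[0] = 10;
        cfg[1] = 25;
        cfg[2] = 40;
        cfg[3] = 55;
    }

    /* After re-rolling: the varying constants move into an array and the
     * repeated statement becomes the loop body. */
    void set_defaults_rerolled(int *cfg) {
        static const int defaults[] = { 10, 25, 40, 55 };
        for (int i = 0; i < 4; i++)
            cfg[i] = defaults[i];
    }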

Next time...

Inlining and procedural abstraction are one example of the optimization opportunities that compilers generally leave on the table to avoid excessively long compile times. With all the complexity in a modern compiler, it should be no surprise that there are more opportunities!

I have one more post like this planned; perhaps more will occur to me later. I'll write about the Value State Dependence Graph, and opportunities to improve on optimizations described in papers like Alan C. Lawrence's 2007 PhD thesis, "Optimizing compilation with the Value State Dependence Graph", among others. I think I can outline a unified framework for register allocation, code motion, and instruction selection, that would maximize performance for a variety of instruction set architectures with instruction-level parallelism. But we'll see how that turns out, next time!

6 comments:

  1. How do you decide which edge is the "back edge"? For instance, you might have mutually recursive functions and callers of both. Choosing a different call to not inline would likely produce different optimizations.

    Regarding procedural abstraction: have you found any practical tools for detecting and helping to extract potential new functions on an existing codebase?

    1. That's the same problem as an irreducible control flow graph. There are ways to handle them in traditional loop optimizations; I guess you could do the same for a call graph.

    2. For what it's worth, I wrote about reducibility in flow graphs a few weeks ago. I think it's an interesting observation that tail calls can describe control-flow graphs which aren't reducible. If someone were to implement optimizations the way I'm describing, I think it would absolutely be worthwhile to transform arbitrary graphs of tail calls into control flow graphs, even if the result is not reducible. But remember you can only do that for tail calls, because non-tail recursive calls need to keep both the old and new call frame.

      I think the answer to Josh's question is simpler, though. Although I wrote it poorly, what I was trying to say is that I think you should just not inline functions that have multiple callers and are part of a cycle. So in Josh's example, neither function would be inlined as each one has at least two callers: namely, the other function, as well as someplace else in the call graph.

      That said, being part of a cycle shouldn't be enough to prevent inlining. If A calls B calls C calls A, and there are other calls to A and C but none to B, then B should still get inlined into A, I think.

      To Josh's other question: I have not seen anyone applying procedural abstraction at the source level. The papers I've found only apply it on IR or machine-specific assembly. And I'm not sure if the source for any of those implementations is available, let alone usable on real projects.

      I suppose you could use debugging symbols to find the line numbers corresponding to instructions that you're merging, and let the user figure out how to do the function extraction. Alternately, I suspect applying an algorithm like gSpan to an abstract syntax tree might work well enough. I think this is a research problem, though, sadly.

  2. Correct me if I am wrong.

    A back edge and its head, together with all the basic blocks in between, define a natural loop: no edge from outside leads into the loop blocks except through the loop head.

    If there are multiple entries into a loop body, it is an irreducible loop, and optimizations are normally skipped on irreducible loops.

  3. Nice article! But saying compilers don't do this kind of thing "to avoid excessively long compile times" is understating the problem. Exhaustive inlining will create an exponential number of copies of a function in the worst case. Measurements on Java programs showed that call graphs can have "10^14 acyclic paths or more":

    http://suif.stanford.edu/~jwhaley/papers/pldi04.pdf (see the abstract)

    It would take some serious cleverness or fundamental advances to even represent the exhaustively inlined program in a manner amenable to doing subsequent procedural abstraction. That said, it would be interesting to do something like this as a limit study on smaller programs, to get some idea of how much performance is being left on the table.

    1. Well said, Manu, and thank you for the reference. I suppose saying "excessively long compile times" is like saying "the heat death of the universe is a ways off yet". :-) I would love to see a limit study done like you propose!

      Regarding cleverness of representation: I have a feeling, which I can't justify, that maybe you can do the analysis needed for procedural abstraction before inlining, then merge and update analysis results efficiently during inlining and subsequent optimizations. I could imagine that not only the final function extraction might be efficient, but also that optimizations could be applied to all copies of a code fragment at once by taking advantage of the same analysis. I haven't thought through this enough to see how to do that, though.

      Anyway, thanks for the insightful comment.
