Monday, June 19, 2017

Search-based compiler code generation

I've been thinking about something for a while now, and promised a year ago in my Optimal Optimization post that I would write it up.

One of the difficulties in writing a programming language compiler is that there is a tension between making the pieces of the compiler modular and maintainable, versus making the compiler produce good code. The biggest cause of this tension is over "phase ordering", and I'll defer to one of the many academic papers on the subject to explain it:
Current optimizing compilers typically contain several different optimization phases. Each phase attempts to apply a series of transformations, each of which consists of a sequence of changes that preserves the semantic behavior of the program while typically improving its efficiency. Many of these optimization phases use and share resources (such as machine registers), and also need specific conditions in the code to be applicable. As a result, optimization phases interact with each other by enabling and disabling opportunities for other phases to be applied. (Prasad Kulkarni et al, "Practical Exhaustive Optimization Phase Order Exploration and Evaluation")
Researchers have explored a variety of ways to determine, for a given piece of source code, which passes should be applied and in what order. The above paper, and several others by the same authors, describe efficient ways to search over all possible phase orders. Many people have applied genetic algorithms or machine learning techniques to this problem as well; for an interesting example plus good citations to earlier work, see Sameer Kulkarni & John Cavazos, "Mitigating the Compiler Optimization Phase-Ordering Problem using Machine Learning".

But you might ask, why do we divide the compiler's work into phases in the first place? It's in support of our old software engineering friend: abstraction. It's much easier to reason about a particular transformation if we don't have to think about all of the others at the same time.

Still, could we avoid the phase order problem entirely by structuring the compiler in some other way?

Strongly normalizing intermediate representations

Alan C. Lawrence, in his 2007 PhD thesis ("Optimizing compilation with the Value State Dependence Graph"), claimed a maxim of compiler folklore is that "The most optimizing compiler is the most normalizing compiler," explaining it this way:
Normalization refers to where many different source programs result in the same machine code after compilation—that is, where the same machine code is produced regardless of which form was written by the programmer. … This allows the programmer to select from such representations according to which is the easiest to comprehend or to fit his ideas or program into, and leave the compiler to select the output according to efficiency concerns.
Traditionally, compilers have represented the structure of a function using a Control Flow Graph (CFG) composed of "basic blocks" containing sequential instructions, plus control-flow edges between basic blocks that describe the different paths the program may follow.

The CFG representation locks in details of evaluation order that aren't inherent in the original program, so many optimizations that are designed for CFGs just re-order instructions in various ways. Different orderings may have different performance or code size implications, so the challenge is to identify the best ordering for your particular goals.

By contrast, the Value State Dependence Graph (VSDG) does not force instructions into a sequential order, but instead keeps track of data dependencies and serialization requirements using directed edges in a graph. As a result, a VSDG describes all equivalent orderings of the same code in a single, normalized, form. (Other program representations have done that for parts of the code that are free of side effects, but VSDG extends it to control flow and memory access.)

Lawrence later noted:
Many optimizations, which would be separate passes in a CFG compiler, seem trivial in a VSDG framework, due to the VSDG’s normalizing properties. That is, they automatically (or implicitly) occur merely by representing a program as a VSDG.
However, this does not entirely avoid the problem: rather, it merely delays the decision until the IR is transformed into some form which does make the distinction, namely until the sequentialization stage. Thus, sequentialization wraps up many such optimizations together, and concentrates their difficulty into one place. However, this approach does seem helpful in addressing phase-order problems…

Although a single VSDG can represent a large number of seemingly-different CFGs, it is possible to reduce the number of equivalent CFGs just by adding extra serializing edges. The edges initially present in the graph represent true dependencies: A must be evaluated before B, either because B needs the result of A, or because A and B have side effects that must not be re-ordered. Artificially adding more dependencies imposes a specific evaluation order for the remaining computations.

The VSDG sequentialization stage can be defined as introducing serializing edges until exactly one CFG remains. Producing a sequentialization optimized for speed is NP-complete, but because the only choices are over where to add serializing edges, the VSDG is a convenient way to express whatever constraints, heuristics, or cost functions you want.

Although Lawrence's thesis discusses sequentialization extensively, if you want to know more on that subject I recommend James Stanier's 2011 PhD thesis ("Removing and Restoring Control Flow with the Value State Dependence Graph") as well as Bahmann et al's 2015 paper, "Perfect Reconstructability of Control Flow from Demand Dependence Graphs".

Combined Register Allocation and Code Motion

Neil Johnson, in chapter 6 of his 2004 PhD thesis ("Code Size Optimization for Embedded Processors"), proposed an elegant algorithm for combining a particular pair of traditionally-separate optimization passes using the VSDG, which makes a nice example of both the phase ordering problem and the ways VSDG can help solve it.

Register allocation is the process of assigning temporary values to specific hardware registers. Sometimes there are too many temporary values that the program needs to hang on to at the same time, and they can't all fit in registers at once. When that happens, there are traditionally two choices: either "spill" some of the registers to memory so they can be used for something else until the older values are needed again; or forget some temporary values and recompute them later. Both options result in a larger and slower program.

Code motion is a loop optimization that moves an expression out of the body of a loop if that expression will evaluate to the same value on every iteration of the loop. The only way this transformation could make the expression be evaluated more times than in the original program is if the loop never executes, so this is almost always an improvement.

However, code motion increases pressure on the register allocator, because there are more temporary values that need to be available throughout the execution of the loop. Under some circumstances, the program will actually be faster if it recomputes the loop-invariant expression every time it's needed, because that may free up registers and avoid expensive spills to memory.

Johnson uses the VSDG to implement a greedy algorithm that only performs code motion when it won't interfere with register allocation. His isn't the first algorithm to combine these two passes, but using VSDG makes his approach particularly easy to understand. I recommend reading his paper if you'd like to know more.

Recasting optimization as a search problem

Although several of the VSDG papers mention combinatorial search as a potential solution to various challenges, I have not found any signs that anyone has actually tried search with the VSDG. I'd be interested to see Johnson's algorithm, for example, revisited with an exhaustive search in place of the greedy policy, and an analysis of the compile-time cost versus run-time and code-size benefit of making optimal choices.

There is history of using search in various ways for code optimization:
  • I already cited "Practical Exhaustive Optimization Phase Order Exploration and Evaluation" as an example of using search to identify an optimal phase order for a traditional compiler on a given program.
  • Samuel Sanseri's 2001 Master's thesis ("Toward An Optimizing JIT Compiler For IA-64") demonstrated that for VLIW architectures, search-based instruction selection can produce more efficient code than GCC's hand-tuned heuristics. (It also promised to demonstrate that it could do so with fast enough compile time to be usable in a JIT setting, though I can't find evidence in the paper that compile time was actually measured. Still, I think the claim is plausible.)
  • João Dias' 2008 PhD thesis ("Automatically Generating the Back End of a Compiler Using Declarative Machine Descriptions") demonstrated using search to automatically generate the instruction selector in the backend of a compiler. Although compiler frontend techniques haven't changed dramatically since lex and yacc were written in the 1970s, instruction selection is still done today using extensive hand-written heuristics. Dias' approach replaces those heuristics with pattern-matching rules generated before compiling the compiler itself, so the slow search time is amortized across all the programs compiled for the same architecture.
These projects demonstrate a range of applications of combinatorial search covering both offline analysis used to generate better compilers, and online analysis to dynamically make better optimization decisions during compilation.

However, I have yet to find anyone who has tried using search to decide whether to apply a optimization/transformation at each individual location where it's valid. Since applying an optimization in a particular spot might prevent applying other optimizations in that spot, the current greedy methods may produce suboptimal results no matter which order they run optimization phases in.

Here's a rough outline of an approach that I think would be interesting to try:
  1. Construct a VSDG, and aggressively normalize it using equivalents of common subexpression elimination and partial redundancy elimination. These are not guaranteed to be a good idea, but they can be selectively undone later using node splitting if necessary. Also apply optimizations that are guaranteed to be improvements, such as constant folding, algebraic identities, and dead node elimination. Furthermore, some optimizations, like constant propagation and copy propagation, are implicit in VSDG construction (e.g. if there's no "copy" node available).
  2. Begin a search over the space of VSDGs that may be formed from the initial graph using any sequence of these operations:
    • first, split some subset of nodes whose results are demanded multiple times;
    • then, fully cover the graph with target machine instructions;
    • and finally, add serializing edges and spill registers until the graph corresponds to a single CFG and all edges can be assigned to registers without conflicts.
  3. Evaluate each candidate CFG according to the current optimization criteria, such as smallest code size or fastest execution time.
Although this still looks like a multi-phase optimization pipeline, the key difference is that we allow the compiler to backtrack over any decision that might turn out to be a bad idea. There may be multiple reasonable choices for nodes to split, multiple ways to select target machine instructions that compute the same results with different tradeoffs, and multiple ways to sequentialize and insert spills. I'd like to know what happens if you effectively check all combinations of those choices.

Of course, the details matter if the result is ever going to be used in a serious compiler. I have some questions about how to handle certain things, like pointers or break statements or exceptions, in a VSDG; but ignoring those concerns, the search also needs to produce reasonable answers quickly or most people will reject it. So a practical implementation needs good heuristics for value-ordering and for branch-and-bound pruning. Given those, I suspect Limited Discrepancy Search would be a good search strategy. (Sanseri reported good results with these choices in his thesis on IA-64 instruction selection, which seems like a good confirmation for my intuition.)

Anyone want to give this a try? I don't have high hopes that anyone reading my blog has spare time for such a project, but there's no harm in asking, right?

Delegating search to SMT solvers

As a side note, I also want to highlight various efforts to use SMT solvers to improve optimization. Effectively, such projects delegate the search to the SMT solver, which implements a good set of general-purpose heuristics. This is a smart approach, although querying an SMT solver is still a fairly expensive thing to do, so people have resisted doing it during normal compilation. As a result, these tools are primarily used offline, to identify improvements that should be made to a compiler, rather than online during compilation.
  • Sebastian Buchwald's "Optgen: A Generator for Local Optimizations" generates all possible peephole optimizations smaller than a given size limit, and in the paper he reported finding that GCC, LLVM, and the Intel C Compiler were all missing some optimizations that Optgen found.
  • Souper examines output from LLVM's optimization passes and looks for patterns that can be shown using SMT queries to have better alternatives. I can't find any published papers about this, but see John Regehr's blog for some reported results, such as "A Few Synthesizing Superoptimizer Results" from 2015. 
  • Lopez et al built Alive, a tool to prove LLVM peephole optimizations correct or generate counterexamples explaining why they're wrong; see "Provably Correct Peephole Optimizations with Alive".

Conclusions

  • Compilers are structured with many independent optimization passes because they're hard enough to get right that way, and worse if you try to combine them; but keeping them separate leads to suboptimal code.
  • Choosing a different intermediate representation in the compiler changes which parts of the compiler are hard to write. That may enable more sophisticated compilers by pushing complexity to places where it is easier to manage.
  • Since optimizations interact in complicated ways, I'd like to see more research into different ways to apply combinatorial search to find the best combination of optimizations for a given piece of code. I've proposed one approach that I haven't seen attempted yet.

Responses

Edit: Thanks to CalChris on Hacker News for the link to Unison, a recent LLVM pass that performs combined instruction scheduling and register allocation via constraint solving. See Lozano et al, "Register Allocation and Instruction Scheduling in Unison".