Tuesday, May 24, 2016

Optimal optimization

For most of the history of compilers, computers have been slow, and programmer time has been more valuable than program runtime. Compilers could, in theory, transform bulky and slow software into code that's faster and smaller than any human would write by hand, but the trade-off has rarely been worthwhile. Instead we get just those optimizations that always improve code, plus a few heuristically-guided optimizations that help sometimes but may miss opportunities for improvements.

Some compilers are very explicit about their goals with regard to these trade-offs between compile time and run-time performance. Google's "Go" compiler aims for fast compile times and is less concerned about optimization, for instance. The recently open-sourced Chez Scheme compiler has the philosophy that "Compile time is important as well as run time, so compiler optimizations are generally expected to pay their own way, i.e., indirectly benefit compiler performance enough to cover the direct cost of the optimization." These are somewhat extreme examples, but generally compilers lean toward optimizations that don't require very much computation to apply.

What might a compiler look like if we were willing to solve NP-complete problems in exchange for better generated code? In this post I'll start with the problem of optimal inlining, and explore other ideas in future posts.

Inlining and its dual, procedural abstraction

One important optimization implemented in many compilers is the ability to inline the body of a function into every place where that function is called.

The immediate benefit of inlining is that a function call requires some extra instructions to set up a stack frame and then undo it when the function returns, so inlining eliminates those instructions and uses slightly less memory bandwidth.

On the other hand, if the function is called more than once, then duplicating the instructions in that function could make the total program size substantially larger. In embedded systems, that could mean having to buy more expensive parts with more storage. On desktop and server CPUs, larger programs are less likely to fit in cache and could be significantly slower than an equivalent small program. So it's important to decide carefully whether inlining is worthwhile.

Typically, compilers decide whether to inline based on heuristics. If the function is called only once, then it's pretty safe to just inline it. Otherwise, the compiler probably has some size limit: if the function is fewer than this many instructions, say, then we'll inline it.

But inlining can have indirect benefits. If some calls to a function pass a constant value as one of its arguments, and the function chooses one of several branches based on that argument, then after inlining the compiler may be able to delete most of the function body as dead code.
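
To make that concrete, here's a small hypothetical example (in Rust, just for concreteness; the function and names are invented). Once the call below is inlined, the compiler can see that `mode` is a constant, fold the branch, and throw the other arms away:

```rust
// Hypothetical helper: picks a behavior based on its `mode` argument.
fn scale(mode: u8, x: i32) -> i32 {
    match mode {
        0 => x,
        1 => x * 2,
        _ => x * x,
    }
}

fn caller(x: i32) -> i32 {
    // This call site always passes mode = 1. After inlining, constant
    // propagation turns the `match` into `x * 2` and dead-code elimination
    // deletes the other arms, so no trace of `scale` remains here.
    scale(1, x)
}

fn main() {
    println!("{}", caller(21)); // prints 42
}
```

Whether inlining pays off at a call site like this depends on information that a fixed size threshold never sees.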

We could try to adjust the heuristics to notice this case, but then some even more complicated situation will come up. In general, the only way to be sure you've found all the optimization opportunities is to try each call site both ways, inlined and not. Since inlining one call may expose new opportunities for improving another call, in principle you'd have to try all 2^n combinations. (I think that may actually be under-counting.) This is a combinatorial search problem.

Let's look at the dual of this problem for a moment. Imagine the programmer wrote a bunch of copy-pasted code that does similar things over and over. They could have extracted a common pattern out to a new function and called that function each place they needed it, but for whatever reason they didn't. The resulting program is larger than it needs to be, but presumably the straight-line code is easily optimized and fast (aside from cache effects).

What if, after optimization, we could discover that all that copy-pasted code is still very similar, and the compiler could choose to extract it to a function in order to reduce code size? Now we have the opposite dilemma: will the cost of creating a stack frame outweigh the code size advantages of extracting a function? But this time, the question is easier to answer. We've already applied all the intraprocedural optimizations we possibly could, and extracting a function can't expose any new opportunities. (At least, none I can think of; counterexamples welcome!) So we can use only local information to make the decision, no search required.

The literature I've found on this calls it "procedural abstraction", and the reason we don't generally see compilers doing it is that it's computationally very expensive to find sections of code that are similar enough that they can be extracted to new functions. I've found three general approaches:
  1. Compute a fingerprint for each basic block. If several basic blocks have the same fingerprint, they're candidates for abstraction. This only works if each block has the same instructions, in the same order (but allowing for some variation, such as different register names), so it has low odds of finding matching code. It's fast, though. See, for example, Johan Runeson's 2000 master's thesis, "Code Compression through Procedural Abstraction before Register Allocation". (A small sketch of this approach follows the list.)
  2. Use a suffix trie to identify matching sub-sequences of instructions. This still requires instructions to appear in the same order, but can match instruction sequences that are smaller than a whole basic block. It's still reasonably fast.
  3. Treat the code as a directed graph and apply an algorithm for "frequent graph-based pattern mining", such as gSpan. This is NP-complete, but can match instructions regardless of what order they're in. See, for example, "Graph-Based Procedural Abstraction" by Dreweke et al, in the 2007 Proceedings of the International Symposium on Code Generation and Optimization; or chapter 4 of Neil E. Johnson's 2004 PhD thesis, "Code size optimization for embedded processors".
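
To give a flavor of the first approach, here's a minimal sketch in Rust, using a toy instruction representation I made up; it fingerprints each block by its opcode sequence alone, so blocks that differ only in which registers they use land in the same bucket:

```rust
use std::collections::HashMap;

// Toy instruction: an opcode plus register operands (invented for this sketch).
struct Insn {
    opcode: String,
    regs: Vec<u32>,
}

// Fingerprint a basic block by its opcode sequence only, deliberately ignoring
// the register operands, so register-allocation differences don't matter.
fn fingerprint(block: &[Insn]) -> String {
    block.iter().map(|i| i.opcode.as_str()).collect::<Vec<_>>().join(";")
}

// Bucket blocks by fingerprint; any bucket with two or more blocks holds
// candidates for procedural abstraction.
fn abstraction_candidates(blocks: &[Vec<Insn>]) -> Vec<Vec<usize>> {
    let mut buckets: HashMap<String, Vec<usize>> = HashMap::new();
    for (i, block) in blocks.iter().enumerate() {
        buckets.entry(fingerprint(block)).or_default().push(i);
    }
    buckets.into_values().filter(|b| b.len() > 1).collect()
}

fn main() {
    let insn = |op: &str, regs: &[u32]| Insn { opcode: op.to_string(), regs: regs.to_vec() };
    let blocks = vec![
        vec![insn("load", &[1]), insn("add", &[1, 2]), insn("store", &[2])],
        vec![insn("load", &[4]), insn("add", &[4, 5]), insn("store", &[5])], // same shape
        vec![insn("mul", &[1, 1])],
    ];
    println!("{:?}", abstraction_candidates(&blocks)); // [[0, 1]]
}
```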

Of course, for this exercise, what's a little asymptotic complexity? Obviously the hardest option is the only sensible choice.

With a procedural abstraction algorithm in hand, let's go back to the problem of deciding when inlining is a good idea. Now we can take a conceptually easy approach:

Inline everything and let procedural abstraction sort it out!

(Recursive functions need a little care. If the language semantics allow tail-call optimization, then transform tail recursion into loops where possible first. After that, identify cycles in the call-graph and leave the targets of back-edge calls un-inlined. Those details aside, inline everything else!)
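
For the cycle-detection step, a minimal sketch might look like the following (in Rust, assuming the call graph has already been built as an adjacency list indexed by function; all names here are mine, not from any particular compiler). It marks the call edges that close a cycle, which are exactly the calls the "inline everything" pass should leave alone:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Visit {
    Unvisited,
    InProgress,
    Done,
}

// Depth-first search that records "back edges": calls whose target is an
// ancestor still on the DFS stack, i.e. calls that participate in recursion.
fn dfs(f: usize, calls: &[Vec<usize>], state: &mut [Visit], back: &mut Vec<(usize, usize)>) {
    state[f] = Visit::InProgress;
    for &callee in &calls[f] {
        match state[callee] {
            Visit::InProgress => back.push((f, callee)), // closes a cycle: keep this call
            Visit::Unvisited => dfs(callee, calls, state, back),
            Visit::Done => {}
        }
    }
    state[f] = Visit::Done;
}

// Return every call edge that must stay a real call; everything else is
// fair game for unconditional inlining.
fn calls_to_keep(calls: &[Vec<usize>]) -> Vec<(usize, usize)> {
    let mut state = vec![Visit::Unvisited; calls.len()];
    let mut back = Vec::new();
    for f in 0..calls.len() {
        if state[f] == Visit::Unvisited {
            dfs(f, calls, &mut state, &mut back);
        }
    }
    back
}

fn main() {
    // Function 0 calls 1; functions 1 and 2 call each other (mutual recursion).
    let calls = vec![vec![1], vec![2], vec![1]];
    println!("{:?}", calls_to_keep(&calls)); // [(2, 1)]
}
```

Any call edge not returned by calls_to_keep gets inlined unconditionally.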

At this point we've replaced a small NP-complete problem (where n is the number of function calls) with a much larger NP-complete problem (where n is the number of instructions in the program after inlining every call). But there are two reasons this is interesting:
  • We'll get the optimal set of function boundaries for whatever other optimizations we've chosen to implement. If we have an intraprocedural optimization (like dead-code elimination) that can take advantage of having a particular call inlined, then it will be inlined. If the other optimizations we've implemented don't fire after inlining, then the function will be outlined again. But in the process we'll shrink the code further by identifying redundant snippets in the original program.
  • Since the author of the compiler doesn't need to think about how inlining might interact with any other optimizations, this optimization is perfectly modular. That's great for verifying that the implementation is correct, to avoid wrong-code compiler bugs, and great for maintenance as the compiler evolves.
As a side note, I haven't found any papers yet on applying the procedural abstraction idea to re-rolling loops, but it seems like a natural extension to me. (Please let me know if you've seen this elsewhere.) If a block of code does almost the same thing several times in a row with different constants, then we can extract those constants into an array and loop over the array. Whether that's faster or slower than a series of function calls could be a tricky question, but it's probably a code-size win.
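
As a purely hypothetical source-level illustration of what re-rolling could do (a real implementation would operate on the compiler's intermediate representation, not on source like this):

```rust
// Before: the same operation copy-pasted with different constants.
fn report_unrolled(x: f64) {
    println!("{}", x * 1.5);
    println!("{}", x * 2.5);
    println!("{}", x * 4.0);
}

// After re-rolling: the varying constants move into an array and one loop
// replaces the repeated code, trading a little bookkeeping for code size.
fn report_rerolled(x: f64) {
    for scale in [1.5, 2.5, 4.0] {
        println!("{}", x * scale);
    }
}

fn main() {
    report_unrolled(2.0);
    report_rerolled(2.0);
}
```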

Next time...

Inlining and procedural abstraction are one example of optimization opportunities which compilers generally leave on the table to avoid excessively long compile times. With all the complexity in a modern compiler it should be no surprise that there are more opportunities!

I have one more post like this planned; perhaps more will occur to me later. I'll write about the Value State Dependence Graph, and opportunities to improve on optimizations described in papers like Alan C. Lawrence's 2007 PhD thesis, "Optimizing compilation with the Value State Dependence Graph", among others. I think I can outline a unified framework for register allocation, code motion, and instruction selection, that would maximize performance for a variety of instruction set architectures with instruction-level parallelism. But we'll see how that turns out, next time!

Thursday, May 19, 2016

Perspectives on commit history

I've just read three blog posts talking about what policy open source projects should adopt with regard to what the Linux kernel community calls "the perfect patch".
All three are writing with the perspective of extensive experience using git for collaboration on open source projects. They fully understand the alternatives, and I respect their positions.

I'd like to offer a fourth position that's close to Joey's, with a slightly different emphasis: If you want new contributors to your project, never tell them to rebase. Don't ask them to "git rebase -i", don't ask them to "git pull --rebase", none of that.

Similarly, if you're teaching people how to use git for the first time, please do not teach them anything about rebase, except perhaps to say, "There's this thing called rebase. Don't use it until you understand 'git reflog' and have a pretty good sense of git's object model."

My problem with rebase is that, as Joey wrote, it "works well except when it doesn't, when some detail airbrushed out of the only remaining history turns out to be important." Worse, the tools for fixing rebase mistakes require deep understanding of how git works.

If you have a merge commit that went wrong (perhaps there are merge conflict markers still in it, or conflicts were resolved incorrectly) then you can check out the pre-merge version on a branch, re-do the merge, and use "git merge -s ours" to replace the effects of the bad merge. It's a bit tedious, but the procedure is reliable, and someone more experienced can do it for you if necessary. If you have somebody else's rebased commits that had merge conflicts that they resolved incorrectly, there's nothing you can do. If they're your own commits then you can dig around in your reflog and hope you can find the pre-rebase branch head, but that takes much more effort than if you simply merged.

To Matthew's point, squashing branches instead of merging them has many of the same problems, plus the separate problems he called out in his post. So I agree with him that squashing pull requests is a poor choice as well.

A side effect of this advice is that the Linux kernel style of demanding perfect patches is impractical if you want newcomers to contribute, unless you're willing to put in a lot of work on mentoring. (Which is one reason why Outreachy is a wonderful program!)

I sympathize with the kernel community and others who've adopted similar policies. As Joey pointed out, using "git bisect" to track down a bug requires a lot more work if your history contains lots of commits which don't compile or fail for unrelated reasons—but as he also pointed out, adding a "--merges-only" option to bisect would make that much less of an issue, and we shouldn't design our processes around the UX failures of our tools if we can just fix the tools instead.

A more compelling concern is that code review, which is still one of the best tools we have for software quality, is more difficult if you're trying to review a messy series of patches. So evaluate your costs and benefits. Does your project actually do code review? If not, don't worry about clean patches, because you have bigger problems. Even if you do code review, is it really that hard to evaluate the history of how your contributors actually made their changes? The Linux kernel has tens of millions of lines of code, making code review inherently challenging even with perfect patches. Is your project honestly anywhere near that scale?

It may be that whenever Josh Triplett gets around to releasing his current project (which I've seen nothing public on yet), we'll be able to have a different conversation around the ergonomics of managing a patch series. Until then, relaxing OCD requirements around what your commit history looks like in gitk or what your individual diffs contain allows you to accept contributions from more people and get them ramped up on improving your project. For most projects, that's worth more than anything else.

Monday, May 16, 2016

Angular velocity and quaternion derivatives

OK, this post is all about math which frustrated me for weeks, a year and a half ago. I've forgotten most of the details since then and I'm not fully convinced I understood them in the first place, but the question has come up again in discussing sensor fusion at the Portland State Aerospace Society (PSAS), so I'm going to try to re-discover the details here.

At the time I was working on a high-assurance quadcopter autopilot (SMACCMPilot), but it turns out that quadcopters have many of the same kinds of sensors as rockets, and similar sensor fusion problems. (I talked about sensor fusion for rockets in my first post here and I'm sure it'll come up again.)

I have the impression that this topic is one where people who've actually been trained in aerospace think the answer is obvious. Consider the response I got when I tried to ask the author of a Kalman filter I wanted to re-use, "What representation is expected for gyro measurements?"

I have no formal aerospace training and I still don't think it's obvious at all. Being told to "See standard texts on strapdown nav" was not helpful—especially since I have no idea which texts are "standard", even if one of the best sources I eventually found did turn out to be a book titled "Strapdown Inertial Navigation Technology".

The question

The short, technical, version of the question: How do you use a three-axis rate gyro to update a quaternion that represents your vehicle's current orientation? (If you already know what this question means you can skip to the next section.)

Quadcopters and rockets both usually have "rate gyros", which are sensors that will tell you how fast your vehicle is spinning. You might interpret a rate gyro's output as a number measured in degrees per second. (The actual units don't matter for anything except your sanity, of course. Degrees per second is probably what you'll find in your sensor's datasheet; radians per second may be more convenient when working with trig functions; cycles per second might be best when presenting numbers to humans.)

Each gyro measures rotation around a single axis; you need three of them to measure complete 3D angular velocity. In aerospace the three rotation axes are usually called "roll", "pitch", and "yaw", when measured with respect to the vehicle itself.

[Figure: the six degrees of freedom of a vehicle, showing the roll, pitch, and yaw rotation axes]

But for many purposes we don't care how fast the vehicle is rotating. What we want to know is which direction the vehicle is pointing, with respect to some convenient frame of reference, such as North/East/Down (NED) or Earth-Centered/Earth-Fixed (ECEF).

It might seem like a natural way to represent that orientation would be as an angle around each of the three axes. That representation is referred to as the Euler angle representation. There's a problem with that representation, called "gimbal lock", which has been written about extensively elsewhere so I won't cover it here. Euler angles are also weird because rotations aren't commutative so you have to pick a convention for which order to apply the rotations in, and there are literally a dozen choices. Ick.

Because of the problems with the Euler angle representation, aerospace systems usually use either a 3x3 rotation matrix, or a unit quaternion. Quaternions have several advantages over the matrix representation, including that they're smaller (4 numbers instead of 9); and that numerical precision problems due to floating-point inaccuracy are easier to fix up on quaternions.

Quaternions have the disadvantage that they are utterly baffling, but let's not let that stop us, eh?

If we assume that we know precisely what direction the vehicle was pointing when it powered on, and we have perfect measurements of how fast it's been rotating since then, then conceptually, we can just add up all those rotations to figure out which direction the vehicle is pointing now. In practice it isn't so easy because we don't have perfect information about the initial orientation nor about the rotations since then, but dealing with those problems is what sensor fusion is all about.

Sensor fusion algorithms such as the Kalman filter augment a model which assumes perfect information to turn it into one that can take advantage of imperfect information instead. So in this post I'll pretend like we do have perfect information. Using the model we'll derive here in a sensor fusion implementation is left as an exercise for the reader. (Or you can study my Haskell sensor fusion example, or Paul Riseborough's Matlab/C++ version whose example I was trying to follow. Just be careful with the latter as I found plenty of implementation bugs in it, and be careful with the former because I make no guarantees about having the right model.)

Putting all that together: how do we do the "adding up" of a series of 3-axis rotation measurements taken from rate gyros, given a unit quaternion representation for orientation?

Why am I confused?

When asking for help understanding Paul Riseborough's filter, I wrote:
[The filter] constructs deltaQuat by assuming that correctedDelAng is in an axis-angle representation, where the magnitude of the Vector3f is the angle, and the direction of the vector is the axis of rotation.

I expected that any gyro measurement would instead be represented as Euler angles, that is, three orthogonal measurements of rotation rate. That's not algebraically equivalent to the axis-angle representation, as far as I can tell, except when any two of the components are 0.

Recently one of the students in PSAS asked me:
I thought what I needed to do was store three vectors for local pitch/yaw/roll directions, in terms of my local tangent plane coordinate frame. Then, when I get data from the rate gyros, I can convert the rotations into quaternions using the local pitch/yaw/roll axis vectors and then compute the updated local axis vectors.

But that seems wrong because I know rotations are non-commutative, and I'd have to choose what order to perform the rotations in. So I think there's something I'm missing.

I think these are basically the same confusion. It turns out that this is really easy mathematically, but physically speaking, I still don't understand why!

The answer

Short version:
  1. If your current orientation is represented by a quaternion q, and you define an update quaternion p by making the vector part of p equal the gyro measurement vector while the scalar part is 0, then the derivative of q is 0.5q·p.
  2. Discretized, that means that the new value of q is equal to the old value of q, times a quaternion derived from treating the gyro measurement (scaled by the length of the time step) as a 3-vector axis/angle representation, where the magnitude of that vector is the rotation angle, and its direction (the vector normalized to unit length) is the axis around which the vehicle is rotating.
  3. Normalizing the gyro measurement vector to get that axis requires dividing each element by the vector's magnitude. But if the vehicle isn't rotating then the magnitude is 0. An equivalent version without any division by 0 comes from taking the Taylor series expansion of the axis/angle formula, which also allows you to decide how much precision you care about in your approximation of sine and cosine. (A code sketch covering points 1 and 2 follows this list.)
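
Here's that sketch, in Rust with names I made up. It assumes the gyro reading is a body-frame angular-rate vector in radians per second, uses the convention that the scalar part of a quaternion comes first, and sidesteps point 3 by falling back to the identity quaternion when the rotation is negligible:

```rust
#[derive(Clone, Copy, Debug)]
struct Quat { w: f64, x: f64, y: f64, z: f64 }

// Hamilton product: compose two rotations.
fn quat_mul(a: Quat, b: Quat) -> Quat {
    Quat {
        w: a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        x: a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        y: a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        z: a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
    }
}

// One integration step: treat sigma = gyro * dt as an axis-angle rotation,
// turn it into a unit quaternion, and multiply it onto the old orientation.
fn integrate_gyro(q: Quat, gyro: [f64; 3], dt: f64) -> Quat {
    let sigma = [gyro[0] * dt, gyro[1] * dt, gyro[2] * dt];
    let angle = (sigma[0] * sigma[0] + sigma[1] * sigma[1] + sigma[2] * sigma[2]).sqrt();
    let r = if angle < 1e-12 {
        Quat { w: 1.0, x: 0.0, y: 0.0, z: 0.0 } // not rotating: identity
    } else {
        let (s, c) = ((angle / 2.0).sin() / angle, (angle / 2.0).cos());
        Quat { w: c, x: sigma[0] * s, y: sigma[1] * s, z: sigma[2] * s }
    };
    quat_mul(q, r) // new orientation; renormalize periodically in practice
}

fn main() {
    let q0 = Quat { w: 1.0, x: 0.0, y: 0.0, z: 0.0 };
    // 90 degrees per second about the body z axis, for a 0.01 second step.
    println!("{:?}", integrate_gyro(q0, [0.0, 0.0, std::f64::consts::FRAC_PI_2], 0.01));
}
```
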
My main sources for these answers are the book "Strapdown Inertial Navigation Technology", 2nd edition, by Titterton and Weston; and a nearly-unreadable PDF titled "Quaternion Differentiation".

Both sources assert point #1, but I think that the PDF gives a derivation justifying it. I'm not entirely sure, to be honest, because although I'm convinced of every algebra step in that paper, I'm not sure what I can conclude from them, or why I should believe their initial assumptions. It's really poorly written.

Point #2 is derived from point #1 in the book, in section 11.2.5, on pages 319-320. The method is basically the same as that shown in the Wikipedia article on discretization:
  1. Re-write the differential equation so that it's a matrix multiplication instead of a quaternion multiplication, because the discretization procedure works on matrices, not quaternions. They do this by replacing the right-hand side with 0.5W·q, where W is the matrix representation of the conjugate of p. (I think they take the conjugate because they've moved it to the left of q instead of the right? I'm not sure.) Note that q still has the same 4 elements, only now we're pretending it's any old 4-vector instead of a quaternion.
  2. Assume that the discrete time steps are short enough that the angular velocity is basically constant. This is the "zero-order hold" assumption, according to Wikipedia.
  3. Integrate W over the current time-step, which is easy given zero-order hold: just multiply all the elements by T, the length of the time-step. (The book doesn't explain this step but it's really important that you multiply by delta-t somewhere!)
  4. Compute the matrix exponential of 0.5·T·W, and multiply the result by the old q to get the new q.
  5. But we want quaternions instead of matrices, so at this point the book turns that matrix exponential back into a quaternion through magic. A similar development was given in section 11.2.1 on pages 312-313, so they handwave it here. I'll handwave it too, but if you squint at the power series definition of matrix exponential you can see how it's like the Taylor series expansion of the sine and cosine terms in the axis-angle to quaternion formula. Specifically, if we define σ as the gyro measurement vector times the length of the time step, then the rotation axis is the direction of σ (σ divided by its magnitude), and the rotation angle is the magnitude of σ. If we call the resulting quaternion r, then the new q is the old q times r. (Notice we've magically flipped back to multiplying on the right! I can't explain that. I bet there's another conjugate hiding in the hand-waving.) The resulting update is written out as formulas just below this list.
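
Under my reading of those steps, writing ω for the gyro measurement vector and T for the length of the time step (scalar part of each quaternion listed first; sign and ordering conventions vary between sources), the discretized update works out to something like:

```latex
\sigma = \omega T, \qquad
r = \left(\cos\frac{|\sigma|}{2},\; \frac{\sigma}{|\sigma|}\sin\frac{|\sigma|}{2}\right), \qquad
q_{\mathrm{new}} = q_{\mathrm{old}} \cdot r
```

which is exactly the axis-angle recipe from point #2 of the previous section.
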
OK. That was a lot of magic, but I don't feel too terrible about it because (a) at least lots of sources seem to agree on the basic steps and (b) I find them convincing enough that these are principled steps and not just stuff somebody made up that seemed to work.

The book goes on to derive point #3 from point #2. I don't want to go into detail on the Taylor series expansion of sine and cosine in this post, but if it helps, I wrote a Haskell function which computes an arbitrary-order approximation of the axis-angle to quaternion formula. In addition to avoiding any division by 0, this definition is nice because it has no square-roots, only three multiplies, plus one division and one addition per order, and it computes both sine and cosine at the same time. On non-x86 CPUs, which don't typically have hardware sine/cosine instructions, I think this is about as efficient an algorithm as you can get.
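
I won't reproduce the Haskell here, but here is a rough Rust sketch of the same trick as I understand it: both cos(θ/2) and sin(θ/2)/θ are even functions of θ, so each can be summed as a series in θ², and θ² is just the squared magnitude of σ, so there's no square root and no division by zero anywhere:

```rust
// Rough sketch: build the axis-angle update quaternion (scalar part first)
// from sigma = gyro measurement * time step, using truncated Taylor series.
// `order` is how many series terms to add beyond the constant term.
fn update_quat(sigma: [f64; 3], order: usize) -> [f64; 4] {
    // x = (|sigma| / 2)^2, available without taking a square root.
    let x = (sigma[0] * sigma[0] + sigma[1] * sigma[1] + sigma[2] * sigma[2]) / 4.0;

    // Accumulate cos(theta/2) and sin(theta/2)/theta together as series in x.
    let mut cos_term = 1.0; // k = 0 term of cos(theta/2)
    let mut sin_term = 0.5; // k = 0 term of sin(theta/2)/theta
    let mut cos_sum = cos_term;
    let mut sin_sum = sin_term;
    for k in 1..=order {
        let two_k = (2 * k) as f64;
        cos_term *= -x / ((two_k - 1.0) * two_k);
        sin_term *= -x / (two_k * (two_k + 1.0));
        cos_sum += cos_term;
        sin_sum += sin_term;
    }
    [cos_sum, sigma[0] * sin_sum, sigma[1] * sin_sum, sigma[2] * sin_sum]
}

fn main() {
    // A small rotation about the body z axis, with 4 extra series terms.
    println!("{:?}", update_quat([0.0, 0.0, 0.02], 4));
    // Not rotating at all: exactly the identity quaternion, no division by 0.
    println!("{:?}", update_quat([0.0, 0.0, 0.0], 4));
}
```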

Conclusion

Just like when I was first banging my head against this question, in today's post I have not managed to justify the answers here from first principles of physics or anything. I decided then, and I'm still OK with this now, that I had dug far enough to be satisfied that there is some underlying reason why this is the right math.

But if anyone wants to explain where that initial formula of 0.5q·p comes from, please leave a comment!

Monday, May 9, 2016

How to eliminate "goto" statements

I wanted to write a program that would translate C programs into Rust programs, at least for some subset of C programs which are somehow "simple enough". You can follow along with my attempt in my GitHub repository for "Corrode". It isn't all that interesting yet.

One reason I haven't gotten very far on it is that along the way I got distracted by a question. C supports "goto" statements. Rust does not. Can programs with "goto" statements be turned into programs without them?

This turns out to be an interesting question, which people have been studying since at least a 1966 paper by Böhm and Jacopini. Wikipedia's "Structured program theorem" article gives good background on the question and its history. (If you want a formal treatment, try the 2008 paper, "The Böhm–Jacopini Theorem Is False, Propositionally".)

The short answer is that yes, it is always possible to rewrite a program to not use "goto" statements. But you might not like the result.

Impact

Why have so many people studied this question and for so long? I wouldn't blame you if you thought it sounded like a point of mere academic curiosity. But I think we can divide the continued interest into two categories.

The first category is code style, quality, and expressiveness: If we design our programming languages to force programmers to use "structured" programming style instead of goto statements, are we preventing them from writing useful programs? The Böhm-Jacopini theorem shows that the answer is "no": this kind of restriction does not change what programs can be written, in principle.

That said, we have many language features which are not strictly necessary, and if you would like to see just how bad of an idea it is to get rid of them, try programming in some of the "Turing tarpit" class of esoteric programming languages. In this post I'm not going to take a position on whether "goto" should be allowed, because I think the answer is complicated; but the designers of Rust have chosen a position in that debate, and that's all that matters for my current project.

The second category is in program analysis and optimization algorithms, which often only work on structured programs. According to several of the papers I've read while researching this question, modern compilers may just refuse to optimize functions that contain unstructured control flow. It would be nice to know whether we should avoid "goto" statements in performance-critical code, or whether compilers could be made smarter!

As a specific example, a few years ago I was chatting with a developer working on the Mesa 3D Graphics Library about their compiler for the OpenGL Shading Language. This developer told me that on GPUs that were common at the time, the only control flow instructions were for structured control flow. The hardware understood "if" statements and "while" loops, and if the compiler couldn't translate a shader into those primitives, it either rejected the program or it generated much slower code; I forget which. (I think this had something to do with GPUs using a single instruction pointer for multiple threads of execution.) So a good answer to this question might have had a significant impact on video game performance at the time! Unfortunately, I think by then there wasn't much interest in performance tuning for that generation of hardware, so Mesa's compiler never implemented the known algorithms.

My current interest is mostly in the first category, because I would like to transform C programs into Rust programs without destroying the structure of the program. In an ideal world, the output would resemble a program that a person might reasonably have written by hand. But I think the second category is also fascinating, particularly since program optimization can have a big impact.

Approaches

Wikipedia describes a trivial, but terrible, approach: wrap the entire function in a "while"-loop, and use a variable to keep track of which part of the function should run next. The extra variable then plays the same role as the program counter in a CPU, which shows that any program you could run on a CPU can be transformed this way. But as well-known computer scientist Donald Knuth pointed out, this transformation destroys the original structure of the program, so this is not a satisfying answer.
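
Here's a made-up example of what that transformation looks like, written in Rust since that's where my translated programs need to end up. Each enum variant stands in for one label in the original goto-style control flow, and the state variable plays the role of the program counter:

```rust
// Hypothetical example of the "program counter variable" transformation.
// Original control flow, in goto style:
//   A: if (n <= 0) goto DONE;
//   B: total += n; n -= 1; goto A;
fn sum_down(mut n: i32) -> i32 {
    #[derive(Clone, Copy, PartialEq)]
    enum Label { A, B, Done }

    let mut total = 0;
    let mut next = Label::A;
    while next != Label::Done {
        next = match next {
            Label::A => if n <= 0 { Label::Done } else { Label::B },
            Label::B => {
                total += n;
                n -= 1;
                Label::A
            }
            Label::Done => Label::Done, // unreachable, kept for exhaustiveness
        };
    }
    total
}

fn main() {
    println!("{}", sum_down(5)); // 15
}
```

It computes the same answer, but the loop structure the programmer had in mind has been flattened into a dispatch table, which is exactly Knuth's complaint.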

The textbook "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman, usually referred to as "the dragon book" due to its cover art, is often used as the textbook for compilers courses. (I've heard people recommend against using it as an introductory text, but it's the only general-purpose compilers book I've actually read, so I'll cite it here...) In chapter 10, "Code Optimization", it describes an algorithm called "T1-T2 analysis", which identifies any control flow that can be described using only while and if statements, without goto. This algorithm works even on functions written entirely using conditional goto statements, as long as an equivalent while program exists. Such functions are called "reducible".
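
For flavor, here's a rough sketch in Rust (with a made-up edge-list graph representation) of the reducibility test at the heart of that analysis: repeatedly remove self-loops (T1) and merge any node that has a single predecessor into that predecessor (T2); the flow graph is reducible exactly when it collapses to a single node. This is only the yes/no test, not the full analysis from the book:

```rust
use std::collections::HashSet;

// Sketch of a reducibility test via the T1 and T2 transformations.
// Node 0 is the entry; `edges` are (from, to) pairs of basic blocks.
fn is_reducible(num_nodes: usize, edges: &[(usize, usize)]) -> bool {
    let mut succ: Vec<HashSet<usize>> = vec![HashSet::new(); num_nodes];
    let mut pred: Vec<HashSet<usize>> = vec![HashSet::new(); num_nodes];
    let mut alive: HashSet<usize> = (0..num_nodes).collect();
    for &(a, b) in edges {
        succ[a].insert(b);
        pred[b].insert(a);
    }
    loop {
        let mut changed = false;
        let nodes: Vec<usize> = alive.iter().copied().collect();
        for &n in &nodes {
            // T1: delete a self-loop on n.
            if succ[n].remove(&n) {
                pred[n].remove(&n);
                changed = true;
            }
            // T2: if n (not the entry) has exactly one predecessor p, fold n into p.
            if n != 0 && pred[n].len() == 1 && !pred[n].contains(&n) {
                let p = *pred[n].iter().next().unwrap();
                let outs: Vec<usize> = succ[n].iter().copied().collect();
                for &s in &outs {
                    // n's outgoing edges become p's outgoing edges.
                    succ[p].insert(s);
                    pred[s].remove(&n);
                    pred[s].insert(p);
                }
                succ[p].remove(&n);
                succ[n].clear();
                pred[n].clear();
                alive.remove(&n);
                changed = true;
            }
        }
        if !changed {
            return alive.len() == 1;
        }
    }
}

fn main() {
    // A loop with a single entry point: reducible.
    println!("{}", is_reducible(3, &[(0, 1), (1, 2), (2, 1), (1, 0)]));
    // The classic two-entry loop: irreducible.
    println!("{}", is_reducible(3, &[(0, 1), (0, 2), (1, 2), (2, 1)]));
}
```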

If a function is irreducible, then T1-T2 analysis identifies some parts of the function that are causing trouble. One way to make those parts work without goto statements is to duplicate them, so that the code at the target of the goto now appears at every place which originally branched to it. This is called "node splitting", and sometimes labeled "T3" as it's a step you can take between applying the T1 and T2 steps of T1-T2 analysis. Node splitting can lead to a function which is exponentially larger than the original, though (see "Folklore Confirmed: Reducible Flow Graphs are Exponentially Larger"). Rather a lot of research has gone into splitting as few nodes as possible (see "Controlled Node Splitting" and "Handling Irreducible Loops: Optimized Node Splitting vs. DJ-Graphs", for two examples).

In fact there are several more ways to deal with irreducible flow graphs, and I recommend reading chapter 6 of Reverse Compilation Techniques, Cristina Cifuentes' PhD thesis, if you would like a much more complete overview with plenty of citations.

Conclusion

Maybe we shouldn't care about irreducible flow graphs. It turns out that the ways people actually use "goto" statements almost always still lead to reducible flow graphs, according to James Stanier and Des Watson's 2010 paper, "A study of irreducibility in C programs". They examined 15 open source projects which together contained 1,709 "goto" statements. They found that out of 10,427 functions, only 5 had irreducible control flow! They also found that the code generators "flex" and "bison", despite generating plenty of "goto" statements, never generated irreducible functions for the eight grammars they tested.

So for my project, I'll probably just use the T1-T2 approach once I encounter some interesting C source that's using goto statements, and until then I'll do naive syntax-driven transformations.

But finding what people have written about this topic has been a fun diversion!