Monday, May 9, 2016

How to eliminate "goto" statements

I wanted to write a program that would translate C programs into Rust programs, at least for some subset of C programs which are somehow "simple enough". You can follow along with my attempt in my GitHub repository for "Corrode". It isn't all that interesting yet.

One reason I haven't gotten very far on it is that along the way I got distracted by a question. C supports "goto" statements. Rust does not. Can programs with "goto" statements be turned into programs without them?

This turns out to be an interesting question, which people have been studying since at least a 1966 paper by Böhm and Jacopini. Wikipedia's "Structured program theorem" article gives good background on the question and its history. (If you want a formal treatment, try the 2008 paper, "The Böhm–Jacopini Theorem Is False, Propositionally".)

The short answer is that yes, it is always possible to rewrite a program to not use "goto" statements. But you might not like the result.

Impact

Why have so many people studied this question and for so long? I wouldn't blame you if you thought it sounded like a point of mere academic curiosity. But I think we can divide the continued interest into two categories.

The first category is code style, quality, and expressiveness: If we design our programming languages to force programmers to use "structured" programming style instead of goto statements, are we preventing them from writing useful programs? The Böhm-Jacopini theorem shows that the answer is "no": this kind of restriction does not change what programs can be written, in principle.

That said, we have many language features which are not strictly necessary, and if you would like to see just how bad of an idea it is to get rid of them, try programming in some of the "Turing tarpit" class of esoteric programming languages. In this post I'm not going to take a position on whether "goto" should be allowed, because I think the answer is complicated; but the designers of Rust have chosen a position in that debate, and that's all that matters for my current project.

The second category is in program analysis and optimization algorithms, which often only work on structured programs. According to several of the papers I've read while researching this question, modern compilers may just refuse to optimize functions that contain unstructured control flow. It would be nice to know if either we should avoid "goto" statements in performance critical code, or if compilers could be made smarter!

As a specific example, a few years ago I was chatting with a developer working on the Mesa 3D Graphics Library about their compiler for the OpenGL Shading Language. This developer told me that on GPUs that were common at the time, the only control flow instructions were for structured control flow. The hardware understood "if" statements and "while" loops, and if the compiler couldn't translate a shader into those primitives, it either rejected the program or it generated much slower code; I forget which. (I think this had something to do with GPUs using a single instruction pointer for multiple threads of execution.) So a good answer to this question might have had a significant impact on video game performance at the time! Unfortunately, I think by then there wasn't much interest in performance tuning for that generation of hardware, so Mesa's compiler never implemented the known algorithms.

My current interest is mostly in the first category, because I would like to transform C programs into Rust programs without destroying the structure of the program. In an ideal world, the output would resemble a program that a person might reasonably have written by hand. But I think the second category is also fascinating, particularly since program optimization can have a big impact.

Approaches

Wikipedia describes a trivial, but terrible, approach: wrap the entire function in a "while"-loop, and use a variable to keep track of which part of the function should run next. The extra variable then plays the same role as the program counter in a CPU, which shows that any program you could run on a CPU can be transformed this way. But as well-known computer scientist Donald Knuth pointed out, this transformation destroys the original structure of the program, so this is not a satisfying answer.

The textbook "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman, usually referred to as "the dragon book" due to its cover art, is often used as the textbook for compilers courses. (I've heard people recommend against using it as an introductory text, but it's the only general-purpose compilers book I've actually read, so I'll cite it here...) In chapter 10, "Code Optimization", it describes an algorithm called "T1 - T2 Analysis", which identifies any control flow that can be described using only while and if statements, without goto. This algorithm works even on functions written entirely using conditional goto statements, as long as an equivalent while program exists. Such functions are called "reducible".

If a function is irreducible, then T1-T2 analysis identifies some parts of the function that are causing trouble. One way to make those parts work without goto statements is to duplicate them, so that the code at the target of the goto now appears at every place which originally branched to it. This is called "node splitting", and sometimes labeled "T3" as it's a step you can take between applying the T1 and T2 steps of T1-T2 analysis. Node splitting can lead to a function which is exponentially larger than the original, though (see "Folklore Confirmed: Reducible Flow Graphs are Exponentially Larger"). Rather a lot of research has gone into splitting as few nodes as possible (see "Controlled Node Splitting" and "Handling Irreducible Loops: Optimized Node Splitting vs. DJ-Graphs", for two examples).

In fact there are several more ways to deal with irreducible flow graphs, and I recommend reading chapter 6 of Reverse Compilation Techniques, Cristina Cifuentes' PhD thesis, if you would like a much more complete overview with plenty of citations.

Conclusion

Maybe we shouldn't care about irreducible flow graphs. It turns out that the ways people actually use "goto" statements almost always still lead to reducible flow graphs, according to James Stanier and Des Watson's 2010 paper, "A study of irreducibility in C programs". They examined 15 open source projects which together contained 1,709 "goto" statements. They found that out of 10,427 functions, only 5 had irreducible control flow! They also found that the code generators "flex" and "bison", despite generating plenty of "goto" statements, never generated irreducible functions for the eight grammars they tested.

So for my project, I'll probably just use the T1-T2 approach once I encounter some interesting C source that's using goto statements, and until then I'll do naive syntax-driven transformations.

But finding what people have written about this topic has been a fun diversion!

No comments:

Post a Comment