Wednesday, November 16, 2016

Go To Statement Considered (Mostly) Harmless

While this post's title is a mash-up of "Hitchhiker's Guide to the Galaxy" with Dijkstra's 1968 letter, "Go To Statement Considered Harmful", today I have a bunch of Corrode-related news to share, not just commentary on "goto". But let's start there, because I'm super proud of this...

Translating C's "switch" and "goto" to Rust

Six months ago I wrote 'How to eliminate "goto" statements', about this question:
C supports "goto" statements. Rust does not. Can programs with "goto" statements be turned into programs without them?
Over the last two weeks I've finally gotten to write code tackling this problem. (Thanks again to Mozilla for supporting this project! I could not afford to dedicate the time this took without their support.)

The result: Corrode now understands all forms of C statements except for the GCC extensions for "local labels", "computed goto", and inline assembly. The last holdouts were "goto" statements, obviously, as well as "switch" statements, which in C can lead to similarly complex control flow.

I'm so excited, because this is a huge milestone! Lots of C code uses "switch" statements, and regardless of Dijkstra's opinions, lots of C code uses "goto" as well. None of the remaining GCC extensions to statement syntax are used widely enough to have nearly as much impact.

The implementation looks very little like I expected it would six months ago. General implementation strategy:
Gee, translating "goto" statements is hard. I know! I'll turn everything into "goto" statements, and then turn the resulting mess back into structured Rust. That'll be easier!
If you're interested in the details, check out the re-structuring implementation. It's a 12-page paper describing the algorithms and invariants, with code interspersed in literate programming style, like the rest of Corrode. (And if you find any of the explanations confusing, I would love to receive pull requests making the documentation easier to understand!)

It is still possible to write C programs using "goto" or "switch" that Corrode can't translate to Rust yet, but they're less common than you might expect. They fall into two categories:
  • If it's possible to enter a loop at two or more different places, the control flow is called "irreducible" and Corrode can't translate it, yet, because it fundamentally can't be represented in Rust without duplicating code or introducing new variables. The classic example of this is "Duff's Device". Many algorithms in compilers have trouble with irreducible control flow, so Corrode is not alone in this. But in "A study of irreducibility in C programs" from 2010, Stanier and Watson found that less than 0.05% of functions had irreducible control flow, across 15 large open source projects. So this is not a big concern, although I have thoughts on how to handle it anyway.
  • The more common case involves certain "goto fail"-style control flow. A surprisingly wide range of "goto fail"-style control flow actually works just fine when Corrode translates it to Rust. I haven't figured out yet how to characterize which cases don't work yet.
One more challenge with the new implementation: It produces very noisy Rust.
  • The output is full of extra "break" and "continue" statements in places they aren't strictly necessary.
  • Every loop has a label, in case there are any multi-level exits that need to refer to it, and every "break" or "continue" uses the appropriate loop label even if it doesn't need to.
  • Corrode no longer generates "while" loops, just infinite "loop" statements with a conditional "break" at the beginning.
All three of those problems should be fixable, but I needed to get the basic algorithm right first before figuring out how to produce more readable Rust.

Progress translating CVS to Rust

Last time I talked about why CVS is my Corrode current case study. Between the above work on goto/switch and a bunch of smaller fixes, the "Rustification" of CVS is going pretty well!

Corrode is now able to produce Rust output of some sort for about 64% of the C source files in CVS. (I'm not counting C source for other platforms that doesn't get built on my system; I get 91 .rs files when I build CVS with Corrode, compared to 143 .o files in the build overall.)

Of the remaining errors causing Corrode to fail on CVS source files, roughly one-third are due to over-complicated control flow. Another third are due to various uses of C-style union types; see Corrode issue #22 for background. (I need to work out how to merge the partial implementation of unions from pull request #63!)

The rest are mostly things that will probably never work in Rust, but these usually can be manually fixed in the C source before running Corrode. Some examples:
  • Enum-typed global variables without an explicit initializer are a problem because C's default zero-initialization policy doesn't make sense in Rust, where enum types aren't fully isomorphic to integers. I suspect I'll edit the CVS source to provide initializers for these.
  • Rust doesn't allow declaring functions that take variable numbers of arguments, but the CVS code base declares several. These functions just have to stay in C, or re-write the callers to not use a varargs-style API.
  • One CVS header declares a function pointer that takes exactly 3 arguments, but actually expects the corresponding function to take a variable number of arguments. C allows passing more arguments than declared, but Rust doesn't. I expect that changing the declaration to explicitly accept a variable number of arguments will fix the immediate problem, although then of course I'll run into the varargs problem from the previous point.
  • Rust doesn't support C's `alloca` function for variable-sized stack allocations. I guess I'll probably carefully replace it with `malloc`/`free` pairs in the right places, and then try to clean up the resulting Rust.
Once Corrode has generated a pile of Rust source files, then we get to find out how many of them the Rust compiler will actually accept. At the moment, the answer is about 44%. Here's what goes wrong with the other 56%:
  • Nearly 75% of the errors in the generated Rust are "use of possibly uninitialized variable" complaints.
    • This is often because some uninitialized variable is passed by reference to a function that only uses it as an out-parameter, which means it's safe, but the compiler can't tell that. I think the right option is to refactor functions with out-parameters to return tuples in Rust, optionally with a C compatibility wrapper if the function still needs to be callable from non-Rust functions.
    • Another class of "uninitialized variable" errors is because the C program assigned to all of the fields of a struct individually, but Rust can't conclude that the struct itself is initialized. This case should be fixable by using C99 structure literals in the pre-Corrode source, or hacking up the Rust output to use struct constructors.
  • 16% of error messages are due to issue #77, which is that Corrode doesn't translate nullable function pointers in a way that the Rust compiler will accept. (I had assumed I could treat them the same as raw pointers, which are nullable and allow arbitrary casts, but nope; in Rust, function pointers are strongly type-safe and memory-safe.)
  • 3.3% are due to Rust not supporting as many kinds of constant expressions as C does. In particular, I think it ought to be possible to evaluate if-expressions, calls to "std::mem::size_of", and calls to "as_ptr" on types such as "str", so long as their arguments or operands are all constant. (I haven't checked exactly what current nightly Rust allows, yet, but I don't think this has changed.)
  • 2.3% are due to issue #92, which is that Rust doesn't allow a parameter or variable binding to have the same name as a static variable that's in scope. I'm not sure I quite understand why Rust does this, so if somebody could explain what's going on in that issue I'd appreciate it!
The last few issues are pretty evenly divided into these categories:
  • The new algorithm for handling arbitrary control flow doesn't yet try to place variable declarations correctly, so a variable that was in scope in the original C may not be in scope in the generated Rust. It happens to almost always work out correctly anyway, so I'm not too worried about it yet, but I'm tracking it in issue #93.
  • Corrode sometimes translates uses of C-style unions that it isn't actually able to handle correctly, but the error isn't detected until the Rust compiler tries to figure out what to do with it. This will be fixed by just fully implementing unions.
  • When dealing with enum types, Corrode currently adheres a little too closely to the letter of the C standard, and aggressively int-promotes enum values. Several errors would be fixed if we leave values as enum type as long as possible.
Over time, I'd love to automate as many of these fixes as possible. But in the short term I'll probably just manually fix each issue so I can find out what other challenges await.

Next up

Now that I have a fairly substantial chunk of generated Rust to play with, I'm going to start scrutinizing it to find interesting patterns. Based on what I find, I'll write up advice for people who want to turn unsafe C-style Rust into safe, idiomatic Rust, and if we're lucky, somebody will be able to automate some of that advice.

Until then, I invite you to:
  • read the implementation of the C translator and the control-flow graph algorithms;
  • try running Corrode on your own programs and see how well it does;
  • and contribute documentation, bug reports, or code if you can!