Friday, October 28, 2016

Corrode update: support from Mozilla, and new features

It's been a while since I've written about Corrode; I've just been quietly working on it and merging contributions from several other people. I have a few things I'd like to mention today.

Funding from Mozilla

The big news: Mozilla is providing funding for me to work on Corrode for the next few months! I'm super excited to get to work on some of the trickier challenges in translating C to Rust, such as eliminating "goto" statements. More on that later, but first, let me tell you about the specific goal that Mozilla is providing support for.

One challenge for computing today is bit-rot in software that nobody wants to maintain any more, but that's still useful to some community. (This is a problem for both open-source and proprietary software, but let's focus on the open-source side as there are fewer roadblocks for a person who's motivated to do something about it.) When users discover bugs and security flaws in unmaintained software, they're often just out of luck.

Mozilla has an interest in finding better ways for people to build software, whether that's through better programming tools like Rust, or more effective ways of organizing an open-source community. So the experiment I'm working on is to see if it makes sense to "rescue" unmaintained open source projects that were written in C by semi-automatically translating them to Rust.

To find an interesting case study, I looked through Debian's list of orphaned packages to find projects that have a reasonable amount of C source. I especially wanted to find a project with a network component, so if there are security flaws they may be remotely exploitable, because Rust's safety guarantees have more impact there. Then I cross-checked Debian's "Popularity Contest", an opt-in census of which packages are installed on Debian systems, to pick a package that's relatively widely-installed despite being unmaintained.

The package I settled on: CVS, the venerable "Concurrent Versions System". The first release of CVS was on November 19, 1990, almost 26 years ago! (This is not the oldest codebase I've worked on, but it's close.) The last upstream release of CVS was in 2008, even though it had a security vulnerability discovered in 2012. This is clearly 50k lines of unmaintained C.

CVS was largely supplanted by Subversion a decade ago, which has largely been supplanted by git today. Anyone choosing CVS for a new project would get asked some pointed questions. But there are still tons of open source projects where their history is only available via CVS; you can find plenty of examples hosted at sites like Sourceforge or Savannah. And according to the Popularity Contest statistics, 6.5% of Debian users still have CVS installed. So there's value in keeping the CVS implementation alive.

I'm currently working on improving Corrode to be able to translate as much of CVS as I can manage. After that I'll use the generated Rust source to study how much manual work is needed to clean up Corrode's output, and see if there are opportunities for automating more. This effort will wrap up by the end of the year, and hopefully I'll have plenty to report by then.

Thanks so much to Mozilla for supporting this work!

Recent changes

I want to take a moment to thank all the people who have contributed to Corrode so far. I've merged pull requests from: Alec Theriault, Nathan Bergey, Vickenty Fesunov, Fabian Zaiser, Jeff Waugh, Jeremie Jost, Nabil Hassein, Amin Bandali, Getty Ritter, Robert Grosse, Sean Jensen-Grey, and Taylor Cramer. Thanks also to everyone who has filed bug reports, or commented on them to clarify details about Rust or C!

It's now possible to use Corrode as a replacement for GCC, with the corrode-cc script from the Corrode source tree. There are still a lot of bugs, but it's good enough that I'm usually testing Corrode now by running `make CC=corrode-cc` in various source trees, including CVS and musl-libc.

All C control-flow statements are implemented now, thanks to Fabian finishing off do-while loops... with the significant exceptions of switch and goto statements, which I'm working on now in a new 'cfg' branch. (See the first post I wrote about Corrode for background; I've been looking forward to this part for months!)

I've fixed a bunch of small issues that either made Corrode give up during translation due to not understanding the input, or made it generate invalid Rust output. The biggest change was to only translate declarations that are actually needed for the current translation unit. C header files are full of all sorts of junk, but if you don't actually reference a declaration then Corrode doesn't need to report any translation errors for that declaration. Fixing this let me translate my first 14 source files in the Linux kernel, so that was pretty exciting!

One change motivated specifically by my work on CVS: Corrode now handles K&R-style function definitions. Yes, CVS still uses pre-C89 style code in some places, while using C99 features in other places. That'll happen in a project of this age.

One interesting bug was around calling through function pointers stored in struct fields. In Rust, `s.f()` calls a trait method named "f" on "s". To call a function pointed to by a field named "f", you need parentheses, like so: `(s.f)()`. The pretty-printer for Corrode's Rust AST now inserts these parentheses as needed.

Another pretty-printer bug is something I tried to squash before, but hopefully got right this time: blocks in expressions. Rust's parser effectively automatically inserts semicolons after blocks under certain circumstances, so programmers don't need to follow up every if-statement or for-loop with a semicolon. If we want to generate something like this:
if c { t } else { f } as f32;
(which is a weird thing to do, but cases like this keep coming up) then we need to wrap the if-statement in parentheses to keep Rust's parser from treating the end of the else-branch as the end of the statement. However, we don't want to insert parentheses when they aren't needed, because sometimes Rust will actually warn that there are unnecessary parentheses! Fortunately I think I have parentheses in exactly the right places now.

And, because it was driving me mad, I fixed indentation for pretty-printing if-else ladders. They used to zig-zag further to the right, and sometimes some branches would all be on one line while other branches were split across multiple lines. Fixing this had no effect on correctness of the output but it's substantially more readable now!

Coming up next...

I'm preparing a new repository for the CVS source code that will include scripts to let anyone reproduce my progress on that translation. I'm having a little trouble scripting some of the steps I do, so that's taking longer than I hoped, but it'll happen at some point.

A significant fraction of source files in CVS contain either switch or goto statements so my major focus short-term is on translating those control-flow constructs correctly. If you'd like to help, check out the discussion around goto and switch.

I'll post progress reports like this regularly. Keep an eye out!