Friday, October 28, 2016

Corrode update: support from Mozilla, and new features

It's been a while since I've written about Corrode; I've just been quietly working on it and merging contributions from several other people. I have a few things I'd like to mention today.

Funding from Mozilla

The big news: Mozilla is providing funding for me to work on Corrode for the next few months! I'm super excited to get to work on some of the trickier challenges in translating C to Rust, such as eliminating "goto" statements. More on that later, but first, let me tell you about the specific goal that Mozilla is providing support for.

One challenge for computing today is bit-rot in software that nobody wants to maintain any more, but that's still useful to some community. (This is a problem for both open-source and proprietary software, but let's focus on the open-source side as there are fewer roadblocks for a person who's motivated to do something about it.) When users discover bugs and security flaws in unmaintained software, they're often just out of luck.

Mozilla has an interest in finding better ways for people to build software, whether that's through better programming tools like Rust, or more effective ways of organizing an open-source community. So the experiment I'm working on is to see if it makes sense to "rescue" unmaintained open source projects that were written in C by semi-automatically translating them to Rust.

To find an interesting case study, I looked through Debian's list of orphaned packages to find projects that have a reasonable amount of C source. I especially wanted a project with a network component, where any security flaws may be remotely exploitable, because that's where Rust's safety guarantees have the most impact. Then I cross-checked Debian's "Popularity Contest", an opt-in census of which packages are installed on Debian systems, to pick a package that's relatively widely installed despite being unmaintained.

The package I settled on: CVS, the venerable "Concurrent Versions System". The first release of CVS was on November 19, 1990, almost 26 years ago! (This is not the oldest codebase I've worked on, but it's close.) The last upstream release of CVS was in 2008, even though it had a security vulnerability discovered in 2012. This is clearly 50k lines of unmaintained C.

CVS was largely supplanted by Subversion a decade ago, which in turn has largely been supplanted by git today. Anyone choosing CVS for a new project would get asked some pointed questions. But there are still tons of open source projects whose history is only available via CVS; you can find plenty of examples hosted at sites like Sourceforge or Savannah. And according to the Popularity Contest statistics, 6.5% of Debian users still have CVS installed. So there's value in keeping the CVS implementation alive.

I'm currently working on improving Corrode to be able to translate as much of CVS as I can manage. After that I'll use the generated Rust source to study how much manual work is needed to clean up Corrode's output, and see if there are opportunities for automating more. This effort will wrap up by the end of the year, and hopefully I'll have plenty to report by then.

Thanks so much to Mozilla for supporting this work!

Recent changes

I want to take a moment to thank all the people who have contributed to Corrode so far. I've merged pull requests from: Alec Theriault, Nathan Bergey, Vickenty Fesunov, Fabian Zaiser, Jeff Waugh, Jeremie Jost, Nabil Hassein, Amin Bandali, Getty Ritter, Robert Grosse, Sean Jensen-Grey, and Taylor Cramer. Thanks also to everyone who has filed bug reports, or commented on them to clarify details about Rust or C!

It's now possible to use Corrode as a replacement for GCC, with the corrode-cc script from the Corrode source tree. There are still a lot of bugs, but it's good enough that I'm usually testing Corrode now by running `make CC=corrode-cc` in various source trees, including CVS and musl-libc.

All C control-flow statements are implemented now, thanks to Fabian finishing off do-while loops... with the significant exceptions of switch and goto statements, which I'm working on now in a new 'cfg' branch. (See the first post I wrote about Corrode for background; I've been looking forward to this part for months!)
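For readers wondering what a do-while loop can look like in Rust, which has no direct equivalent: here is a minimal hand-written sketch. It is not necessarily what Corrode emits, and the function name `count_digits` is made up for illustration. The idea is to run the body inside `loop` and test the condition at the end.

```rust
// C original, for comparison:
//
//   do { n /= 10; digits++; } while (n != 0);
//
// The body must run at least once, so the condition is checked
// at the bottom of an unconditional `loop`.
fn count_digits(mut n: u32) -> u32 {
    let mut digits = 0;
    loop {
        n /= 10;
        digits += 1;
        // Leave the loop when the C condition `n != 0` is false.
        if n == 0 {
            break;
        }
    }
    digits
}
```

Because the check comes last, `count_digits(0)` still counts one digit, which is exactly the behavior that distinguishes do-while from a plain while loop.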

I've fixed a bunch of small issues that either made Corrode give up during translation due to not understanding the input, or made it generate invalid Rust output. The biggest change was to only translate declarations that are actually needed for the current translation unit. C header files are full of all sorts of junk, but if you don't actually reference a declaration then Corrode doesn't need to report any translation errors for that declaration. Fixing this let me translate my first 14 source files in the Linux kernel, so that was pretty exciting!
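The "only what's needed" idea amounts to a reachability computation over declarations. The following is a rough sketch of that idea in Rust, not Corrode's actual implementation; the `deps` map (declaration name to the names it references) and the function name are assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical model: every declaration is identified by name, and
// `deps` lists the other declarations that each one refers to.
// Starting from the definitions in the current translation unit
// (`roots`), collect everything reachable; anything else can be
// skipped without reporting translation errors for it.
fn needed_declarations<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> HashSet<&'a str> {
    let mut needed = HashSet::new();
    let mut worklist: Vec<&'a str> = roots.to_vec();
    while let Some(name) = worklist.pop() {
        // `insert` returns false if the name was already visited.
        if needed.insert(name) {
            if let Some(refs) = deps.get(name) {
                worklist.extend(refs.iter().copied());
            }
        }
    }
    needed
}
```

With a map where `main` references `parse` and `parse` references `token_t`, a header-only declaration that nothing references never lands in the result, so it never gets a chance to fail translation.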

One change motivated specifically by my work on CVS: Corrode now handles K&R-style function definitions. Yes, CVS still uses pre-C89 style code in some places, while using C99 features in other places. That'll happen in a project of this age.
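In case K&R-style definitions are unfamiliar, the comment below shows one next to a hand-written Rust equivalent. The function `max_of` is invented for this example, and the Rust signature is only a sketch of a plausible translation, not a copy of Corrode's output.

```rust
use std::os::raw::c_int;

// K&R-style (pre-C89) definition, for comparison:
//
//   int max_of(a, b)
//       int a;
//       int b;
//   {
//       return a > b ? a : b;
//   }
//
// The parameter types come from the declaration list between the
// parameter names and the body, not from inside the parentheses.
// Once those types are matched up with the names, the function can
// be written like any other.
pub extern "C" fn max_of(a: c_int, b: c_int) -> c_int {
    if a > b { a } else { b }
}
```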

One interesting bug was around calling through function pointers stored in struct fields. In Rust, `s.f()` calls a method named "f" on "s". To call a function pointed to by a field named "f", you need parentheses, like so: `(s.f)()`. The pretty-printer for Corrode's Rust AST now inserts these parentheses as needed.
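Here is a small self-contained illustration of the difference. The `Handler` struct and its contents are hypothetical, chosen so that a field and a method share the name `f`.

```rust
// A struct with a function-pointer field, like `int (*f)(int)` in C.
struct Handler {
    f: fn(i32) -> i32,
}

impl Handler {
    // A method that happens to share the field's name.
    fn f(&self, x: i32) -> i32 {
        x + 1
    }
}

fn double(x: i32) -> i32 {
    x * 2
}

fn call_both(s: &Handler, x: i32) -> (i32, i32) {
    // Method-call syntax resolves to the method `Handler::f`.
    let via_method = s.f(x);
    // Parenthesized field access calls the pointer stored in `f`.
    let via_field = (s.f)(x);
    (via_method, via_field)
}
```

Calling `call_both(&Handler { f: double }, 10)` returns `(11, 20)`: the two spellings reach different functions. If `Handler` had no method named `f`, the unparenthesized `s.f(x)` would not compile at all.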

Another pretty-printer bug is something I tried to squash before, but hopefully got right this time: blocks in expressions. Rust's parser effectively inserts semicolons automatically after blocks under certain circumstances, so programmers don't need to follow up every if-statement or for-loop with a semicolon. If we want to generate something like this:
if c { t } else { f } as f32;
(which is a weird thing to do, but cases like this keep coming up) then we need to wrap the if-statement in parentheses to keep Rust's parser from treating the end of the else-branch as the end of the statement. However, we don't want to insert parentheses when they aren't needed, because sometimes Rust will actually warn that there are unnecessary parentheses! Fortunately I think I have parentheses in exactly the right places now.
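To make the two situations concrete, here is a short sketch with made-up function names. In the first function the `if` starts a statement, so the parentheses are required; in the second it follows `let x =`, where the parser already expects an expression and parses the cast without them.

```rust
fn cast_in_statement_position(c: bool, t: i32, f: i32) -> f32 {
    // Without the parentheses, the parser would treat the closing
    // brace of the else-branch as the end of the statement and then
    // reject the dangling `as f32`.
    (if c { t } else { f }) as f32
}

fn cast_after_let(c: bool, t: i32, f: i32) -> f32 {
    // Here the `if` is already in expression context, so the cast
    // applies to the whole if-else without extra parentheses.
    let x = if c { t } else { f } as f32;
    x
}
```

Both functions compute the same value; only the position of the `if` within the surrounding code decides whether the parentheses are needed.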

And, because it was driving me mad, I fixed indentation for pretty-printing if-else ladders. They used to zig-zag further to the right, and sometimes some branches would all be on one line while other branches were split across multiple lines. Fixing this had no effect on correctness of the output but it's substantially more readable now!

Coming up next...

I'm preparing a new repository for the CVS source code that will include scripts to let anyone reproduce my progress on that translation. I'm having a little trouble scripting some of the steps I do, so that's taking longer than I hoped, but it'll happen at some point.

A significant fraction of source files in CVS contain either switch or goto statements, so my major short-term focus is on translating those control-flow constructs correctly. If you'd like to help, check out the discussion around goto and switch.

I'll post progress reports like this regularly. Keep an eye out!

4 comments:

  1. Are you familiar with the OpenBSD project's OpenCVS? Its express goal is to be a drop-in replacement for GNU CVS, inasmuch as that does not compromise the system's security. Its development was prompted by published vulnerabilities in GNU CVS.

    The CVS code remains useful as a testbed for Corrode, but CVS effectively is being actively maintained; it sounds like it's a "this needs to be distributed by the package repositories" problem (politics? elbow grease? advertising?) more than a purely technical problem.

    What was the runner-up in the "unmaintained but popular" race? :)

    1. I did not know about OpenCVS! I think I'll continue working on "classic" CVS as my testbed project, but assuming that goes well, maybe the OpenCVS folks would like to try running Corrode over their code base. :-)

      I notice that the OpenCVS project has various warnings plastered all over it (not yet portable beyond OpenBSD, not to be trusted on repositories whose contents you care about, etc.) so either the documentation is out-of-date or there are more reasons it's not yet being used than just lack of awareness.

      I didn't find any other projects that were anywhere near as good for my purposes as CVS, partly because I stopped looking once I found CVS, of course! I considered a few others, but they all shared two problems: 1) I couldn't get them to compile even with GCC, so Corrode had no hope, and 2) hardly anyone cared about them. Those two problems are probably strongly correlated...

    2. Ah, bother. I'm used to OpenBSD doing a bang-up job, but looks like that got merged to trunk and then was left to whimper; it never got to where it replaced GNU CVS.

      I can see that it's been getting some maintenance alongside the rest of the tree, but the project appears to still be using GNU CVS (with modifications, maybe?) as its main driver.

      Projects that can still be compiled and projects that people care about probably do go together, yeah. It's tricky if you need to use an old version of a maintained project for whatever reason, as well.

      Corrode sounds like an interesting tool for learning Rust by autogenerated example if you're a C dev, as well…

    3. Regarding using Corrode as a "tool for learning Rust by autogenerated example if you're a C dev": sure! I've certainly learned a lot about Rust by trying to find equivalents to C constructs that I'm used to, so maybe it'll be useful that way for other people too. On the other hand, it probably isn't a good way to learn to write *good* Rust... ;-)

      I've been fascinated to discover that the opposite direction is more educational: Reading Corrode's output has taught me a lot about what the C source I hand it actually means, which surprised me given how long I've been programming in C. There's a lot implicitly happening behind the scenes in a C compiler, and making that explicit in Rust makes it clear how complicated even a small snippet of C can be.
