Friday, December 9, 2016

How to translate a large C project to Rust

In October, I started working on translating CVS from C to Rust, and today I'd like to answer these questions about how that's going:
  • How does a Corrode-aided porting effort work, in practice?
  • How much progress have I made on CVS, and how hard would it be to do the same for other projects today?

How to translate a new project using Corrode

Here's the process I've followed while working on translating CVS to Rust. You can find the scripts and patches described here in my cvs-rs repository.

0. Does it build?

Before doing anything else, I made sure that I could build CVS from unmodified source using GCC. This is important! If it doesn't work with a standard C compiler, there is absolutely no way Corrode is going to give you good results. Or as Charles Babbage put it:
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
If the project you're translating has a test suite, this is also a good time to check that the test suite passes before you start changing things!

1. Trial run

Next, to get a rough idea of how much work the translation will eventually require, I tried just substituting Corrode and rustc in place of GCC in the build. Most build systems for open source projects make it easy to do that, usually by setting a "CC" variable to the path you want to use as your C compiler. In the case of CVS, I did it this way:
make CC=/path/to/corrode/scripts/corrode-cc
corrode-cc is a wrapper script that does these steps:
  1. Run Corrode to translate one C source file to a standalone Rust module.
  2. Run rustc on that Rust module, asking it to treat it as a dylib-type crate but to only emit an object file.
  3. If either of those steps fails, save the error message and then run GCC to get an object file anyway.
So the result of this step is that we have a bunch of Rust modules lying around, plus a bunch of files recording various error messages. Hopefully the build completed successfully, and if we're lucky, there's even some Rust-compiled code in it, and if we're really lucky, the new binary still works!
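
Here's a rough sketch of that fallback logic as a shell function. It is not the real corrode-cc script: the function name, the TRANSLATE and FALLBACK_CC hooks, the use of cksum as the hash, and the assumption that the translator writes its output next to the input with a .rs extension are all mine, for illustration only.

```shell
# A sketch of the fallback logic only; this is NOT the real corrode-cc.
# TRANSLATE and FALLBACK_CC are hypothetical hooks, so stand-in commands
# can be substituted for Corrode and GCC.
: "${TRANSLATE:=corrode}"
: "${FALLBACK_CC:=gcc}"

# try_rust_then_c SRC.c OUT.o
try_rust_then_c() {
    src=$1
    obj=$2
    rs=${src%.c}.rs          # assumption: the translator writes SRC.rs
    log=$(mktemp) || return 1

    # Steps 1 and 2: translate, then ask rustc for just an object file.
    if "$TRANSLATE" -c "$src" >/dev/null 2>"$log" &&
       rustc --crate-type=dylib --emit=obj -o "$obj" "$rs" 2>>"$log"
    then
        rm -f "$log"
        return 0
    fi

    # Step 3: file the message under a checksum of its own text, so the
    # same error reached from several source files is stored only once.
    sum=$(cksum <"$log" | cut -d' ' -f1)
    mv -f "$log" "errors-$sum"
    "$FALLBACK_CC" -c "$src" -o "$obj"
}
```

A real compiler wrapper has to cope with whatever arguments the build system passes, so treat this as a picture of the control flow rather than something to drop into a build.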

To estimate how much work the translation is going to take, you can look at several factors:
  • How many distinct errors did the build produce? The corrode-cc script writes error messages to files named "errors-<hash>", where the hash is over the error message itself, so if multiple files run into identical errors in some header file that they all include, that error will only show up once. Error messages:
    • may indicate that the project relies on a feature of C that Corrode does not yet handle,
    • or may indicate a bug in Corrode,
    • or may indicate code that Rust can't verify is safe.
  • How many object files (*.o) did the build produce?
  • How many Rust modules (*.rs)? In the best case, there will be one for each object file, but currently Corrode fails to translate a variety of C patterns, and whenever Corrode fails it refuses to generate any output. Those cases may indicate that you'll need to patch the C source to make it easier to translate. This can be tricky.
  • How many of the object files were compiled via Rust? If the corrode-cc wrapper script had to fall back to compiling via GCC, then there are usually small edits you can make to the Corrode-generated Rust to make rustc accept it. This is tedious but generally pretty easy.

I found that the easiest way to check whether an object file was compiled using GCC or Rust was to check if the string ".rs" appears in it:
grep -lF .rs *.o
(Of course this might have false positives, so if you have a better approach please let me know!)
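
The four indicators above can be tallied with a small shell function. This is a minimal sketch: the function names are made up, it only looks at files directly inside one directory, and it inherits the same false-positive caveat as the grep check.

```shell
# count_files DIR PATTERN: number of regular files directly in DIR
# whose names match PATTERN.
count_files() {
    find "$1" -maxdepth 1 -type f -name "$2" | wc -l | tr -d ' '
}

# translation_stats [DIR]: print the four indicators for one directory.
translation_stats() {
    dir=${1:-.}
    # Same heuristic as the grep above: an object file that mentions
    # ".rs" probably came from rustc. False positives are possible.
    via_rust=$(find "$dir" -maxdepth 1 -type f -name '*.o' \
        -exec grep -lF .rs {} + | wc -l | tr -d ' ')
    echo "distinct errors: $(count_files "$dir" 'errors-*')"
    echo "object files: $(count_files "$dir" '*.o')"
    echo "Rust modules: $(count_files "$dir" '*.rs')"
    echo "objects via Rust: $via_rust"
}
```

Running something like "translation_stats src" after the trial build gives one line per indicator, which makes it easy to compare runs as Corrode changes.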

Based on these results you should get some idea how close Corrode alone will get you, and how much manual work you'll need to do to complete the translation.

2. Integrate Rust into the build system

For the CVS case study, I wanted to use Corrode as if it were just another C compiler. So the C source is the canonical implementation, and I patched the build system to compile some source files via Rust instead.

Which source files should you do this to first? Maybe pick just one that worked in step 1.
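
One rough way to shortlist candidates from the step 1 output is to list the C files that got both a Rust module and an object file that looks Rust-compiled. This sketch assumes the object files and generated Rust modules sit next to the C sources, and the function name is hypothetical.

```shell
# rust_ready_sources [DIR]: print each DIR/NAME.c for which the trial
# run left a NAME.rs and a NAME.o that mentions ".rs".
rust_ready_sources() {
    dir=${1:-.}
    for obj in "$dir"/*.o; do
        [ -f "$obj" ] || continue      # the glob matched nothing
        base=${obj%.o}
        if [ -f "$base.c" ] && [ -f "$base.rs" ] && grep -qF .rs "$obj"; then
            echo "$base.c"
        fi
    done
}
```

Anything it prints is only a candidate: the heuristic can be fooled, and a file that compiles via Rust still has to pass the test suite.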

For CVS, this amounted to making the following edits to src/
  • Delete a selected subset of C sources from the cvs_SOURCES variable.
  • Create a RUST_SOURCES variable where, for example, if I removed "checkin.c" from cvs_SOURCES, then I added the corresponding Rust module to RUST_SOURCES.
  • I added these rules:
    CC=corrode %.c
            $(COMPILE) -c $< >/dev/null
    .PHONY: rust-sources
    rust-sources: $(RUST_SOURCES)
    libcvs-rs.a: $(RUST_SOURCES)
            rustc -A bad-style -A unused-mut -g -O \
            -C debug-assertions=on --crate-type=staticlib \
            -o $@ $<
  • Finally, I added libcvs-rs.a to cvs_DEPENDENCIES, and "libcvs-rs.a -ldl -lpthread -lgcc_s -lc -lm -lrt -lutil" to cvs_LDADD.
Also, I created a top-level Rust module in src/ which just re-exports declarations from the translated modules. So if the module translated from checkin.c is in RUST_SOURCES, the top-level module contains a "pub use checkin;" item.

Note that I split out a phony target just for ensuring that all the Rust sources have been built. That allowed me to split the build process into phases that run before and run after Corrode:
  1. Apply patches to the C source as needed that make it easier to translate.
  2. Run "make rust-sources" to auto-generate a rough version of each selected C source file.
  3. Apply additional patches, this time to the generated Rust, as needed to improve the translation.
  4. Run "make" to complete the build.
As Corrode and related tools improve, there should be less need for patches to either the C or the Rust source. If someday we can fully automate this process, then this multi-phase build approach can go away entirely.
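
The four phases can be strung together in a small driver. This is only a sketch: "patches-c" and "patches-rust" are hypothetical directory names for the two quilt series, the run and phased_build names are made up, and it assumes a clean tree where neither series has been applied yet. Setting DRY_RUN prints the commands instead of running them.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# phased_build: run the four phases, stopping at the first failure.
# "patches-c" and "patches-rust" are hypothetical directory names.
phased_build() {
    # 1. Patch the C source to make it easier to translate.
    run env QUILT_PATCHES=patches-c quilt push -a || return
    # 2. Generate a rough Rust version of each selected C file.
    run make rust-sources || return
    # 3. Patch the generated Rust to improve the translation.
    run env QUILT_PATCHES=patches-rust quilt push -a || return
    # 4. Complete the build.
    run make
}
```

For example, "DRY_RUN=1 phased_build" shows the four commands in order without touching the tree.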

I'm using quilt to manage the collections of patches. I have two reasons for doing it this way, and maybe neither will apply to you:
  • I wanted people to be able to learn from the cvs-rs repository, so I'm using the patch series as a way of communicating aspects of the process I'm following. If you're just doing a one-off conversion, you don't necessarily need to document the steps you took along the way.
  • Corrode is still under active development, so I'm frequently re-running it. Recording all the manual changes I'm making in separate patches makes it easier for me to manage my work-in-progress.

3. Translate more, or translate better

With that foundation in place, now comes the fun part: namely, "everything else!"

My current process for translating CVS involves doing either of these two tasks, over and over:
  • Pick a new C source file, move it to the build-via-Rust list, and see if it works.
  • Pick some piece of generated Rust, and see if I can improve it (by making it safer or more idiomatic).
You can keep doing these steps until there's nothing left to do, or you get bored.


So far, I have translated 6.4% of the non-comment, non-blank lines in the src/ subdirectory of CVS, from 10 source files.

Sometimes, translating a thousand-line source file has taken 10 minutes. Other times, I've spent an entire afternoon comparing the generated Rust to the original C without spotting any differences, and yet the Rust version doesn't pass the test suite.

So there's more work to be done on Corrode, to make it reliably convert as many kinds of C source as possible. At this point, I'm going back to improving Corrode for a bit, rather than focusing on translating more of CVS.

Still, if you're interested in trying Corrode, I'd encourage you to try going through at least step 1 on whatever project you think is interesting. See how far you get, and if you find a project where Corrode works well, I would love to hear about it!


    1. Might I suggest using git-series for the patch series management? One upside of it is that the patch series itself is versioned, and also exposed in a way that can be inspected with classic Git tooling. In addition, it offers tooling for managing a cover letter and so forth, which further helps with using it as a pedagogical tool. In addition, it's written in Rust :D

      1. There is some irony in using GIT to port CVS.