Friday, July 8, 2016

Translating C to Rust (and how you can help)

I've been working for a few months on a C to Rust source code translator called "Corrode", and it's starting to come together into something that looks promising!

Corrode is kind of a weird project in that it's really not very much code—less than a thousand lines at its core!—and yet people expect it to be this huge scary thing. Admittedly, it has required a lot more thought per line of code than many programs do, so maybe you should be a little scared of the scope of a project like this.

But...

I'd like to convince you that you can contribute to Corrode's development, and in this post I'll talk about some reasons why it's easier than it sounds, and some of the ways I'm trying to make first contributions as painless as I can.

(If you just want to read up on how Corrode does its magic, or learn about some bizarre corners of C, check out the detailed design documentation.)

Type-directed development

Corrode is implemented in Haskell. Wait, come back! It's no big deal, I swear: Just about everything I wrote has a simple Rust equivalent, just with different syntax. The only reason I wrote Corrode in Haskell is because I couldn't find a complete enough C parser for Rust, and I was already familiar with the language-c parser for Haskell. So I believe that if you can write Rust, you can pick up enough Haskell to work on Corrode.

In most compiler front-ends, the parser's job is to construct an "abstract syntax tree" (AST), and language-c is no different. What does an AST look like? Take a peek at the Language.C.Syntax.AST module. It defines a series of types, where each type represents some piece of C syntax. For example, a CExternalDeclaration represents one thing that can appear at the top-level of a C source file (that is, outside of any function), while a CExpression represents one operator and its operands (like "a + 2").

So something you might wonder about a project like Corrode is: how can we tell when we're done? The neat thing is that these AST types tell us exactly what cases we have to handle before we'll be able to translate all legal C programs. Once we've written down a translation for every constructor of CExpression, for instance, there is no way for language-c to hand us some magic other kind of expression it hadn't previously told us about.

As a result, the process of creating Corrode has been an almost mechanical cycle of finding one of the unimplemented cases, thinking about what the equivalent Rust should be for that case, and writing code to generate that Rust.

Corrode defines a couple of common error-reporting functions, and one of them is called "unimplemented". One quick way to find some place you could contribute is to search for calls to that function and see if any of the missing cases looks like something you know how to tackle. (As of this writing, there are 11 such calls, although some are catch-all handlers for several unimplemented cases at once.)

Open issues tagged "easy"

Of course, some of the cases are not yet implemented because they're actually super difficult to handle correctly. For example, the switch and goto statements are in that category, for reasons I discussed in an earlier post (How to eliminate "goto" statements). Just searching for calls to unimplemented won't tell you whether you've picked one of the tricky cases.

As an alternative, I've configured Corrode's issue tracker with a label called "easy", which I use on issues that I think are a good introduction to hacking on Corrode. I am deliberately avoiding implementing these cases, even though I know how I would implement them, except when they keep me from finishing something else.

While I was a resident at the Recurse Center a few weeks ago (which was a wonderful experience, by the way, and highly recommended), some of the "Recursers" pair-programmed solutions to some of these easy-tagged issues with me. I was so happy to get to merge pull requests from Jeremie Jost (implement alignof) and Nabil Hassein (implement global variables). Jeremie also got a pull request merged in language-c for an issue we discovered along the way. These were all fairly small patches, and you could do this too!

Literate programming

Literate programming turns the usual emphasis of programming on its head. Instead of writing a bunch of machine-readable source code that has human-readable comments interspersed, in literate programming you write human-readable documentation that has some machine-readable source code scattered in it.

For Corrode, there isn't very much source code, but the assumptions and rationale behind the implementation are intricate. I was going to write a detailed design document to help other people get started hacking on this project. But then I realized that I'd want the documentation to refer to specific parts of the implementation, and when reading the implementation I'd want the relevant parts of the documentation to be easy to find.

The literate programming style is perfect for this! Check out the core source file of Corrode... which is just a Markdown document. My goal was to tell the story of why and how, and as you read that story the what is right there for easy cross-referencing.

GitHub renders the Markdown nicely, including syntax highlighting for the code snippets. (And you can use pandoc to generate PDF or other formats from it too.) I also throw in snippets of C or Rust source, which help when explaining some of the quirks but are stripped out along with the rest of the non-Haskell text during compilation.

If you want to use Markdown in your own Literate Haskell source, there are a couple of tricks to it:
  • Configure GHC to pre-process .lhs source files with markdown-unlit, by adding it to your .cabal file's build-depends and adding "-pgmL markdown-unlit" to your ghc-options. (That package's README documents how to turn your README into a test suite where all your code snippets are actually tested at build time, which is a very cool idea, but wasn't quite what I wanted in this case.)
  • Name your literate Haskell source files with a ".md" extension, so GitHub treats them as Markdown instead of as source code. Then create a symlink from "foo.lhs" to "foo.md" so GHC can find the module too.
Of course, I know how all the code I wrote works, so there are likely sections where my explanation of what's going on doesn't actually clarify anything. I would dearly love to get pull requests that improve the documentation without changing the code at all. If you find some of the text confusing, see if you can find a better way to say it! This is an excellent way to gain a deeper understanding of how the code works, too. Huge thanks to Amin Bandali for sending in one documentation pull request already, explaining how to compile and run Corrode. (Oops, I should have thought of that!)

Conclusion

I would very much like to work with you to make Corrode into a tool that many C programmers can use to try switching to Rust. You can help:
  • by improving the documentation, so more people will feel comfortable contributing to and using Corrode;
  • by writing translations for C features which are not yet implemented, especially features I've tagged as "easy";
  • or just by trying it out on your own C source and reporting issues as they come up.
I look forward to seeing your pull requests! :-)

1 comment:

  1. i am also trying to write a transpiler for myself want to learn more. your project will help me a lot. still can you suggest me any resources for further study?

    ReplyDelete