Friday, July 15, 2016

Testing strategies for Corrode

Last week I posted about Corrode, my C-to-Rust translator.

I'm quite proud of Corrode, to be perfectly frank! And apparently other people think it's cool too, since it was on the front page of Hacker News and /r/rust for a solid day, and it spent the subsequent week on GitHub's list of trending projects. The past week has been very exciting, I gotta say!

A moment of reflection, then. There are a few things an open source project really ought to have:
  1. Responsive maintainers. People have been trying Corrode on their own code and filing issues, and a few people have been busily submitting pull requests. I've been responding to these as quickly as I can.
  2. Solid documentation. I think documentation is pretty well covered, considering some of the feedback:
    • "Absolutely blown away by the detail of the documentation."
    • "The documentation is amazing."
    • "That's an impressive use of literate programming."
    • "WOW... This is just amazing. The most important part of the source is highly educative literate haskell."
    • "Who are you and what did you do with Jamey?" —a former co-worker who had to maintain my code after I left
  3. Verification/validation. Um... well... did you see my documentation?
So it's time for a real plan for verifying that Corrode does what it's supposed to. I've explored several strategies so far, which have interesting trade-offs.

Randomized testing with Csmith and C-Reduce

"Csmith is a tool that can generate random C programs that statically and dynamically conform to the C99 standard. It is useful for stress-testing compilers, static analyzers, and other tools that process C code."

Corrode now has a script in scripts/csmith-test which performs the following steps:
  1. Generate a random C program using Csmith, and verify that the program compiles using GCC, as our reference compiler. If either of these fail, we won't discover any interesting Corrode-related bugs with this program, so discard it.
  2. Convert the program to Rust using Corrode, and compile it with rustc. If either of these fail, this may indicate an interesting bug; report it to the user.
  3. Run the GCC-compiled program. If it times out, or exits with an error, or produces output that doesn't look right for a Csmith-generated program, then odds are low that we'll learn anything from the Rust version and we just discard this case.
  4. Run the Rust-compiled program and compare its output to the GCC-compiled version. If they're different, that probably indicates an interesting bug; report it to the user.
Usually it only takes a few seconds to run through that process; but on the other hand, usually these tests all pass :-) so I've been running this script in a loop which stops once an interesting failure is actually found.

Once the script has produced an interestingly-buggy test case, then it tries to delete as much code from the test as possible while preserving the bug that made it interesting. C-Reduce does the heavy lifting for this part, using a variety of heuristics starting with "what happens if I delete this line? No? How about this one?"

Unlike the Csmith test cycle, C-Reduce can easily run for an hour or two trying to find chunks that are OK to delete. At the moment this is my only complaint about C-Reduce: I have so many possible bugs to investigate that spending an hour or two waiting on each one feels like a lot. That said, it's doing a very good job of narrowing down the portions of code actually responsible for bugs, so I guess I'll keep it. :-)

So far I've found and fixed three bugs in Corrode using Corrode's csmith-test script.
  • I applied C99's "usual arithmetic conversions" to the left-shift and right-shift operators, but I should instead have applied the "integer promotions" to both operands and then taken the left-hand side's type. Fixed in commit 3fd9040.
  • I never realized that the type of a C99 integer literal depends not only on whether it has 'U' or 'L' suffixes, but also on whether it's written in decimal (as opposed to octal or hex), and also on whether its value fits in a given type. Fixed in commit 93671a7.
  • Apparently, I find the "usual arithmetic conversions" hard to get right. I've "fixed" them in three commits in a row and Csmith keeps finding new corners for me.
My big problem with the usual arithmetic conversions is that I'm trying to translate C's long to Rust's isize, but in various corner cases that means you can distinguish between Corrode and a standard-conforming C compiler. I may just decide that's a "feature"...
I have not found any cases yet where I could blame rustc for any wrong output, but you never know... it could happen!

Building real software: musl libc

The musl library is a from-scratch implementation of the C standard library, which "is lightweight, fast, simple, free, and strives to be correct in the sense of standards-conformance and safety."

Adam Perry has been manually porting musl to Rust as an exercise to discover "the advantages and pitfalls of a C-to-Rust rewrite project," so I thought it might be a good candidate for testing Corrode.

Of particular importance for my purposes, musl doesn't use system headers, which means that compared with other real C code-bases, musl less frequently uses parts of C that Corrode doesn't support yet.

To test Corrode on musl, I first made the corrode executable accept all the options that GCC/Clang do (more or less), and just pass anything it doesn't understand to the C preprocessor.

With that, you can run "make CC=corrode -k" and get a "foo.rs" file alongside every "foo.c" file that Corrode was successfully able to translate! The "-k" option tells make to keep going as long as possible even when some commands fail.

At the moment, Corrode translates about 10% of the source files in musl, which I feel is pretty good! I was especially excited when I got it to translate bsearch.c and a bunch of the string.h functions, since those do funny things with pointers...

Now, I haven't yet tested whether any of these source files translate correctly, and in fact I know a few of them don't (if they have static local variables, or declare arrays). At the moment all I can tell you is that the output looks mostly reasonable, in my human judgment.

At some point I'd like to integrate Adam Perry's build and test scripts from his manual rewriting effort, which would let me validate that Corrode+rustc generate code that passes musl's test suite.

Software Analysis Workbench

"The Software Analysis Workbench (SAW) provides the ability to formally verify properties of code written in C, Java, and Cryptol. It leverages automated SAT and SMT solvers to make this process as automated as possible, and provides a scripting language, called SAW Script, to enable verification to scale up to more complex systems."

Csmith-generated programs print some computed values just before exiting, and that's all you can observe about them during testing. If there are errors mid-computation but they cancel out by the end, or otherwise have no effect on the final output, Csmith-based testing won't find those.

SAW, by contrast, is designed to show that two functions compute equivalent outputs. In principle, we could use SAW to prove that the code clang generates is equivalent to the code which Corrode+rustc generate on a per-function basis.

Unfortunately, current git master of SAW can't parse the LLVM bitcode which rustc produces. I hear SAW's maintainers are working on that, so I'm hoping this will be an effective option for Corrode testing soon!

Unit tests and randomized property testing

You were probably wondering why I haven't mentioned these yet, I bet?

Csmith was low-hanging fruit: It has proven to be effective at finding bugs in Corrode with very little work on my part. And musl has quickly shown me which C constructs are used by a fairly substantial real-world code base, so I could focus on implementing those.

Unit tests and QuickCheck-style property testing, on the other hand, require a lot more thought from the developer. I just haven't put in that level of thought yet.

As Corrode matures, I think we'll need both focused regression tests and randomized property tests to test at every commit, which can be supplemented by periodic longer-duration Csmith runs.

Conclusions

Corrode doesn't support enough of C yet to translate most real programs, largely due to system headers declaring things like unions and enums even if your program never uses them. But for the subset of C which Corrode does support, at this point I feel pretty good about the correctness of its translation!

No comments:

Post a Comment