Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Jamie Zawinski

Regular expressions are frequently used for tasks where they aren’t a great choice. On the other hand, the regular expression implementations we all use are much less capable than formal-language theory allows, so they go unused in many cases where they could be a great fit.

Motivation: data recovery

Years ago I was working on a project that involved tedious data collection, performed by non-technical people all over the world, when we started getting sporadic reports of data corruption. We gathered the corrupted data files and found that there were 512-byte chunks either re-ordered or just outright missing. We really didn’t want to ask people to just try collecting the same data again, because it was difficult to collect the first time. And we couldn’t reproduce or identify the cause of the data corruption, so we expected we’d see more cases pop up. Instead, I set out to see what we could recover from the corrupted files.

For reasons that had seemed like a good idea to the people who worked on this project before me, the files were “encrypted” using a homebrew cipher. Each file indicated at the beginning which cipher parameters it was encrypted with, and the cipher state changed continuously over the course of the file. So losing some blocks and re-ordering others meant I couldn’t know the cipher state at any given point, and recovering any data seemed impossible.

However: It’s widely understood now that you should never roll your own cryptography, and this project illustrated why. The cipher was complex enough to look like an effective secrecy measure at first glance, but the ciphertext it produced actually leaked a lot of information about the plaintext and the cipher state.

Just looking at ciphertext side-by-side with the corresponding plaintext, I could easily see the same patterns in both. The plaintext had a simple structure that I could easily describe with a regular expression, and in an early experiment I was able to write a related regular expression that could tell ciphertext from these files apart from unrelated data. Despite the apparent complexity of the cipher, it failed these basic tests.

In the end, I was able to write a program that would decipher any 512-byte block in a fraction of a second. It also reported the reconstructed cipher state at the beginning and end of each block, which let me put blocks back into the right order by finding the edges where the cipher states matched. I wrapped the analysis tools I built into a self-contained program that automated recovering as much data as possible, and the project kept using that cipher-cracker for years after I left.

I’m not going to give any details about the actual cipher: I was politely asked not to, and besides it’s not actually that interesting. But I will show you how to recover cipher state from a much simpler cipher when you know the structure of the plaintext, using the same technique I used.

Example: Caesar ciphers over structured plaintext

A Caesar cipher, or shift cipher, is a particularly simple kind of cipher, but it’s still complex enough to illustrate the technique I want to focus on today. It involves “rotating” the alphabet a certain number of positions, where the amount of rotation is a secret shared between the sender and receiver of the message. For example, a shift of 3 means that “A” becomes “D”, “B” becomes “E”, etc. When you get to the end of the alphabet, wrap back around to the beginning: “Y” becomes “B” and “Z” becomes “C”.
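For concreteness, the rotation described above can be sketched in a few lines of Python (the function name is mine):

```python
def caesar_encrypt(plaintext: str, shift: int) -> str:
    """Rotate each letter of A-Z forward by `shift`, wrapping past Z."""
    out = []
    for ch in plaintext:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # pass non-letters through unchanged
    return "".join(out)

print(caesar_encrypt("ABYZ", 3))  # → "DEBC"
```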

Now let’s say that you’ve intercepted a secret message where you know the contents are obscured with a Caesar cipher, but you don’t know the secret rotation amount. If you assume the message was written in, for example, English, then there are well-known ways to break this cipher. One way is to count how often each letter occurs in the ciphertext and guess that the most common letter represents the most frequently used letter in English (perhaps “E” or “T”).
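A frequency-counting attack like that can be sketched as follows (the function name and the single-letter simplification are mine; real attacks compare the whole frequency distribution, not just the top letter):

```python
from collections import Counter

def guess_shift(ciphertext: str, common_plain: str = "E") -> int:
    """Guess the shift by assuming the most frequent ciphertext letter
    stands for the most frequent plaintext letter ("E" by default)."""
    top = Counter(c for c in ciphertext if c.isalpha()).most_common(1)[0][0]
    return (ord(top) - ord(common_plain)) % 26

# "H" dominates this ciphertext; if it stands for "E", the shift was 3.
print(guess_shift("HWHHDWH"))  # → 3
```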

But instead let’s say the message is produced by a machine, and its messages don’t match English letter frequencies. If you know a regular expression which should match the plaintext, you can break this cipher a different way.

Let’s look at some simple cases first.

If the machine only produces messages using a single letter, then the ciphertext must also contain repetitions of a single letter. If the plaintext should match the regular expression A+, and you see that the ciphertext matches D+, then you know that the message was shifted by 3 letters.

Similar reasoning works if the machine only uses a few letters of the alphabet. If the plaintext should match [A-D]+ and the ciphertext contains both the letters “B” and “E”, then the only possible shift is 1: no other shift amount could produce both of those letters.
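That deduction amounts to intersecting, per observed letter, the set of shifts that could explain it. A sketch (the function name is mine):

```python
def valid_shifts(plain_alphabet: str, observed: str) -> set[int]:
    """Shift amounts that could map some letter of `plain_alphabet`
    onto every letter observed in the ciphertext."""
    candidates = set(range(1, 26))  # all 25 nontrivial shifts to start
    for c in observed:
        candidates &= {(ord(c) - ord(p)) % 26 for p in plain_alphabet}
    return candidates

# Plaintext matches [A-D]+; ciphertext contains "B" and "E".
print(valid_shifts("ABCD", "BE"))  # → {1}
```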

Or perhaps the message matches the concatenation of multiple different regular expressions. Take, for instance, [A-D]*E[FG]*. If we knew where the sole “E” was in a message we could look at the corresponding position in the ciphertext and immediately determine which shift was used. But in this case the “E” could be the first letter if there aren’t any “A” through “D” letters at the beginning; or it could be the last letter if there’s no “F” or “G” at the end; or it could be anywhere in between. Still, if we observe a “C” somewhere, and then later a “G”, and after that an “I”, then the shift must have been 2, because those three observations can only be explained if we assume the first came from the [A-D]* subexpression, the second from the E, and the third from the [FG]* subexpression.

In general, you can “encrypt” any regular expression using a given guess about what the shift might be. In the last example above, encrypting [A-D]*E[FG]* with a shift of 2 yields [C-F]*G[HI]*. If the new expression matches the ciphertext, then that guess could be the right answer. If you try every possible shift amount and only one guess matches, then that guess must be the right answer.
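For simple patterns built from literal letters and character classes, “encrypting” the pattern and trying every shift can be sketched with Python’s re module (function names are mine; classes are expanded to explicit letter lists first, because naively shifting a range’s endpoints can produce a reversed range like [Z-C]):

```python
import re

def shift_letter(c: str, k: int) -> str:
    return chr((ord(c) - ord("A") + k) % 26 + ord("A"))

def expand_classes(pattern: str) -> str:
    # Turn X-Y ranges into explicit letter runs, so shifting letters
    # never yields a reversed range. Assumes ranges appear only inside
    # character classes, as in the simple patterns used here.
    def expand(m):
        a, b = m.group(1), m.group(2)
        return "".join(chr(i) for i in range(ord(a), ord(b) + 1))
    return re.sub(r"([A-Z])-([A-Z])", expand, pattern)

def shift_pattern(pattern: str, k: int) -> str:
    return "".join(shift_letter(c, k) if c.isupper() else c
                   for c in expand_classes(pattern))

def possible_shifts(pattern: str, ciphertext: str) -> list[int]:
    return [k for k in range(1, 26)
            if re.fullmatch(shift_pattern(pattern, k), ciphertext)]

print(shift_pattern("[A-D]*E[FG]*", 2))         # → "[CDEF]*G[HI]*"
print(possible_shifts("[A-D]*E[FG]*", "CDGI"))  # → [2]
```

Note that [CDEF]*G[HI]* is just the expanded spelling of [C-F]*G[HI]*.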

Generalization: Weighted regular expressions

Instead of running the regular expression matcher 25 times, once for each possible shift amount in an English-alphabet Caesar cipher, it’d be nice to run the matcher a single time and keep track of which possibilities worked out. That’s especially important if your cipher has a larger state space: 25 possibilities isn’t a big deal, but millions or more might be.

I’m going to show how to do this by adding support for “semiring weights” to a regular expression matcher and then defining an appropriate type of weight. I learned about this approach from a paper called “A Play on Regular Expressions”, published at ICFP 2010. It is, to date, my favorite academic paper of all time, and I highly recommend reading it, but I’m going to give a different presentation of the key ideas here.

To begin with, let’s think about regular expression matching like playing a guessing game. Consider this regular expression:

(AB|C)*A
If the first character of input is an “A”, which “A” is it matching? There are two possibilities: we might be expecting a “B” next, or the end of the input. Let’s mark those two possible positions in this regular expression with a dot:

(A·B|C)*A·
Now if we have another character of input, then we can remove the dot at the end: that possibility didn’t pan out. If that character is a “B”, however, the first dot is still a valid guess and we should move it forward, past the “B” in the pattern. There are three possibilities at that point, and they are the same as the initial state: we’re expecting either of the “A”s or a “C”.

(·AB|·C)*·A
If we observe a “C” at this point, the corresponding dot is the only valid guess. We can discard the other two and advance that dot past the “C” in the pattern:

(AB|C·)*A
In general, imagine keeping track, at each character in the regular expression, of whether the next character of input is allowed to match that character in the pattern. This mark is a boolean value, true or false.

I’m hand-waving over one important detail: When matching input one character at a time and propagating dots forward, you can keep moving each mark along until you reach an actual letter (or character class). At that point you have to stop and save the mark state for the next round of matching, because you can’t tell whether that letter will match until you read the next character of input. But no other variety of pattern needs to store any state between rounds. (Short-circuiting state for these other operators is equivalent to “epsilon closure” in the theory of non-deterministic finite automata.)
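Taking the example pattern to be (AB|C)*A (two “A”s, one followed by a “B”, with “C” as an alternative inside a loop), the guessing game above can be simulated with plain boolean marks. The FOLLOW table below is my own rendering: it pre-computes the “keep moving each mark until you reach a letter” step, so only letter positions store state between rounds:

```python
# Boolean-mark simulation of the example pattern (AB|C)*A.
# Letter positions: 0 = the A in "AB", 1 = B, 2 = C, 3 = the final A.
LETTER = "ABCA"
# FOLLOW[p]: positions a mark can reach after advancing past p, with the
# "keep moving until you hit a letter" (epsilon closure) step applied.
FOLLOW = {
    None: {0, 2, 3},  # start of input: either of the A's, or the C
    0: {1},           # after the first A, we must see a B
    1: {0, 2, 3},     # after B, we're back at the loop start
    2: {0, 2, 3},     # after C, likewise
    3: set(),         # after the final A, only end-of-input is valid
}
ACCEPT = {3}          # a mark leaving position 3 means a complete match

def matches(s: str) -> bool:
    marks = set(FOLLOW[None])
    matched = False   # the empty string doesn't match (AB|C)*A
    for ch in s:
        # keep only the dots whose letter matched, then advance them
        alive = {p for p in marks if LETTER[p] == ch}
        matched = bool(alive & ACCEPT)
        marks = set()
        for p in alive:
            marks |= FOLLOW[p]
    return matched

print(matches("ABCA"), matches("AB"))  # → True False
```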

Caesar weights

Now let’s extend those boolean marks to have a separate boolean flag for each “encrypted” variation of the pattern that we want to match. So instead of one bit per letter in the pattern, we now have 25 bits per letter. The first bit at each letter represents whether the pattern has matched to that point given a Caesar shift of 1; the second bit is for a shift of 2; etc.

Initially, before looking at any input, we should assume that all 25 possible shifts are valid candidates. Over the course of matching we’ll narrow that set.

How should we propagate these Caesar cipher possibility bit-sets around the regular expression while we’re matching input? Let’s consider one regular-expression operator at a time.

For a subexpression which matches a single letter, we can compare the next input character against the pattern and determine which shift amount would match. If the pattern is A and the input is D, then only a shift of 3 can match, so we can construct a bit-set where only the third bit is set. But we know which shifts were valid possibilities for earlier parts of the input, so we can bitwise-AND the prior bit-set with the new one we just constructed. If the third bit was considered valid before, then it’s the only valid option now; otherwise there are no valid shifts for this part of the pattern at this point in the input, and we can discard this possibility entirely.
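In Python, sets of shift amounts can stand in for the bit-sets, with intersection playing the role of bitwise-AND (the function name is mine):

```python
ALL_SHIFTS = set(range(1, 26))  # every candidate shift starts out valid

def shifts_for(pattern_letter: str, input_letter: str) -> set[int]:
    """Which shift would turn this pattern letter into this input letter?"""
    k = (ord(input_letter) - ord(pattern_letter)) % 26
    return {k} if k in ALL_SHIFTS else set()

# Pattern letter "A" against input "D" leaves only shift 3; we narrow
# the incoming candidate set with an intersection (bitwise-AND).
incoming = {3, 7, 12}
print(incoming & shifts_for("A", "D"))  # → {3}
```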

How about the sequence AB? We have a mark point before “A” and another before “B”, and each of those points holds a 25-element bit-set. The marks before “A” get fed through the above process to produce a new bit-set to be used at “B” next time; similarly, the marks before “B” are updated and fed on to whatever follows this subexpression.

Alternation patterns like AB|CD take whatever mark came in from the preceding expression, feed it into both alternative subexpressions, and take whatever comes out and bitwise-OR them. In other words, any candidate shift amount which matched either subpattern is a valid shift amount for the pattern as a whole.

The “star” operator (*), matching zero or more copies of its subpattern, has mark bit-sets coming from its preceding expression and from its subexpression. The subexpression might match if either the preceding expression, or a previous iteration of the subexpression, just finished matching, so we feed the bitwise-OR of those two bit-sets in as input to the subexpression. Similarly, the subsequent expression might match if the preceding expression matches and the star operator matches zero times, or if the subexpression outputs that it matched, so those two bit-sets are also combined with bitwise-OR and fed onward to the subsequent expression.

At this point we’ve only needed four operations for our Caesar weights:

  • An initial value, where all shifts are considered valid
  • Bitwise-AND to narrow our set of guesses when concatenating subexpressions
  • Bitwise-OR to track additional guesses in alternation and “star”
  • A function which, given an input character and an expected character (class), returns the bit-set of valid shifts for that combination

In terms from abstract algebra, the first three resemble a “semiring”. Bitwise-AND works somewhat like multiplication on numbers, while bitwise-OR works somewhat like addition, and the initial value satisfies the same laws as the number “one”. To check whether the regular expression failed to match you also need to check if a value is like the number “zero”; in Caesar weights, that’s an empty bit-set.

I’m not convinced that all of the semiring laws are actually necessary or desirable in this application. For example, requiring commutativity does not seem to be necessary for either operator, and prevents some more sophisticated uses. But I haven’t yet convinced myself of which laws are actually required.
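To make the four operations concrete, here is a sketch of such a matcher in Python (my own rendition, not code from “A Play on Regular Expressions”), instantiated with Caesar weights. Sets of shift amounts again stand in for bit-sets: union is the “addition” and intersection the “multiplication”:

```python
# A regular expression matcher parameterized by semiring-like weights,
# instantiated with Caesar weights (sets of candidate shift amounts).
ZERO = frozenset()             # no valid shifts: a failed match
ONE = frozenset(range(1, 26))  # every shift still possible

class Chr:
    """One plaintext letter (or class); stores a mark between rounds."""
    def __init__(self, letters):
        self.letters, self.mark = letters, ZERO
    empty = ZERO               # a letter can't match the empty string
    @property
    def final(self):
        return self.mark
    def shift(self, incoming, c):
        # shifts that turn one of our plaintext letters into ciphertext c
        here = frozenset((ord(c) - ord(p)) % 26 for p in self.letters)
        self.mark = incoming & here

class Seq:
    def __init__(self, a, b):
        self.a, self.b = a, b
    @property
    def empty(self):
        return self.a.empty & self.b.empty
    @property
    def final(self):
        return (self.a.final & self.b.empty) | self.b.final
    def shift(self, incoming, c):
        into_b = (incoming & self.a.empty) | self.a.final  # old marks!
        self.a.shift(incoming, c)
        self.b.shift(into_b, c)

class Alt:
    def __init__(self, a, b):
        self.a, self.b = a, b
    @property
    def empty(self):
        return self.a.empty | self.b.empty
    @property
    def final(self):
        return self.a.final | self.b.final
    def shift(self, incoming, c):
        self.a.shift(incoming, c)
        self.b.shift(incoming, c)

class Star:
    def __init__(self, a):
        self.a = a
    empty = ONE                # zero repetitions always match
    @property
    def final(self):
        return self.a.final
    def shift(self, incoming, c):
        # feed in marks from before the star OR from a prior iteration
        self.a.shift(incoming | self.a.final, c)

def recover_shifts(pattern, ciphertext):
    # marks persist inside the tree, so build a fresh pattern per call
    if not ciphertext:
        return pattern.empty
    pattern.shift(ONE, ciphertext[0])
    for c in ciphertext[1:]:
        pattern.shift(ZERO, c)
    return pattern.final

# [A-D]*E[FG]* against ciphertext "CDGI" (plaintext "ABEG" shifted by 2)
pat = Seq(Seq(Star(Chr("ABCD")), Chr("E")), Star(Chr("FG")))
print(sorted(recover_shifts(pat, "CDGI")))  # → [2]
```

Each call to shift consumes one ciphertext character; swapping in booleans for the sets (with or/and for union/intersection) recovers an ordinary matcher.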

Related work

With the observation that marks can be moved forward until encountering a literal character in the pattern, the matching algorithm as I’ve described it resembles Glushkov’s NFA construction algorithm, if you squint. “A Play on Regular Expressions” explicitly describes the approach they follow as being Glushkov’s, but it took me years to understand why. I hope my presentation here makes the connection a little more clear.

“A Play on Regular Expressions” extends regular expression matching to context-free grammars by allowing the regular expression to be a lazily-constructed infinite tree. Matt Might et al subsequently wrote a pair of papers doing the same for regular expression matching based on Brzozowski derivatives: “Parsing with Derivatives” at ICFP 2011, and “On the Complexity and Performance of Parsing with Derivatives” at PLDI 2016.

By 2019 I had confused myself with all of the above papers, leading to this post:

tfw you mix up Glushkov’s construction with Brzozowski’s derivatives, amirite?

no? that’s just me?

That post led Magnus Holm to create a neat parser generator for context-free grammars called Glush. That’s probably the most value for the world of any shitpost I’ve written.

The “dot” notation I used to illustrate marks in this post is the same as notation commonly used to describe Earley parsing. I believe, but have not yet demonstrated, that Earley parsing can be extended with semiring weights in the same way that “A Play on Regular Expressions” did for Glushkov’s construction. Such a parser should be in the same complexity class as Earley parsers: linear-time when used on regular languages; quadratic on unambiguous context-free grammars; and worst-case cubic on ambiguous context-free grammars. As a result my hypothesis is that the ideal implementation of a weighted parsing library would be based on the Earley algorithm, subsuming the need for a regex-only library. Proof of this hypothesis is left as an exercise for the reader.

Key tools for improving regular expression matching performance are conversion to a deterministic finite automaton (DFA), followed by DFA minimization. Determinizing and minimizing weighted automata is trickier, making it less obvious how to get optimal performance from a regular expression matcher that has been extended this way. One paper I was reading recently that might help is “Fast Coalgebraic Bisimilarity Minimization” by Jules Jacobs and Thorsten Wißmann; they also have a proof-of-concept Rust implementation called Boa.

Conclusion

It’s possible to parameterize a generic regular expression matcher over semiring-like operations. Simple boolean marks have a trivial implementation of the operations and allow simply checking whether the regular expression matched or not, just like any other regular expression matcher.

But by offering a parameterized interface, a regexp matching library can support some pretty wild use cases that are impractical or impossible with the regexp APIs in wide use today.

I’ve given one example: evaluating a regular expression that was written in terms of plaintext against a ciphertext in order to identify which cipher state can explain the observed ciphertext.

Another example which I won’t go into detail on today is incorporating the common notion of “capture groups” into a simpler matcher algorithm by representing them using appropriate weights. This requires a small extension to the interface to allow weights on epsilon transitions, not just when matching input. For some applications it also requires giving the developer more control over how ambiguous patterns are interpreted, via “greedy” versus “non-greedy” variants of operators, which in turn requires a notion of “match priority” that doesn’t fit in the framework of “A Play on Regular Expressions”.

Fortunately, weighted regular expressions are pretty easy to retrofit into the efficient matching algorithm described in “Regular Expression Matching: the Virtual Machine Approach”, by Russ Cox.

I’ve written a demo implementation of the Pike-style VM approach with capture groups, as described by Cox, in a Rust crate called pikevm. I’ve also written a demo of weighted regular expression matching in a Pike-style VM, in an independent Rust crate called parsink. Combining the two and making them into a robust and usable library is left as future work.

I hope this post has broadened your view of the situations where regular expressions might be a good tool. And I especially hope to see new, more flexible, tools emerge for extracting information from text based on formal language theory.