<h1>Breaking ciphers with regular expressions</h1>
<p>2023-07-04</p>
<blockquote>
<p>Some people, when confronted with a problem, think “I know, I’ll use
regular expressions.” Now they have two problems.</p>
<p>— <a href="http://regex.info/blog/2006-09-15/247">Jamie Zawinski</a></p>
</blockquote>
<p>Regular expressions are frequently used for tasks where they aren’t a
great choice. On the other hand, the implementations of regular
expressions that we all use are much less capable than formal language
theory allows, so they go unused in many cases where they could be a
great fit.</p>
<h1 id="motivation-data-recovery">Motivation: data recovery</h1>
<p>Years ago I was working on a project that involved tedious data
collection, performed by non-technical people all over the world, when
we started getting sporadic reports of data corruption. We gathered the
corrupted data files and found that there were 512-byte chunks either
re-ordered or just outright missing. We really didn’t want to ask people
to just try collecting the same data again, because it was difficult to
collect the first time. And we couldn’t reproduce or identify the cause
of the data corruption, so we expected we’d see more cases pop up.
Instead, I set out to see what we could recover from the corrupted
files.</p>
<p>For reasons which had seemed like a good idea to people previously
working on this project, the files were “encrypted” using a homebrew
cipher. Each file indicated at the beginning which cipher parameters it
was encrypted with, and the cipher state changed continuously over the
course of the file. So losing some blocks and re-ordering others meant I
didn’t know any of the cipher state at any given point, and therefore
recovering any data seemed impossible.</p>
<p>However: It’s widely understood now that you should never roll your own
cryptography, and this project illustrated why. The cipher was complex
enough to look like an effective secrecy measure at first glance, but
the ciphertext it produced actually leaked a lot of information about
the plaintext and the cipher state.</p>
<p>Just looking at ciphertext side-by-side with the corresponding
plaintext, I could easily see the same patterns in both. The plaintext
had a simple structure that I could easily describe with a regular
expression, and in an early experiment, I was able to write a related
regular expression that could distinguish between ciphertext from these
files versus unrelated data. Despite the apparent complexity of the
cipher, it failed these basic tests.</p>
<p>In the end, I was able to write a program that would decipher any
512-byte block in a fraction of a second. It also reported what the
reconstructed cipher state was at the beginning and end of each block,
which allowed me to put blocks back into the right order by finding the
edges where the cipher states matched. I wrapped the analysis tools I
built into a self-contained program that automated the process of
recovering as much data as possible. That project continued using this
cipher-cracker which I built for years after I left.</p>
<p>I’m not going to give any details about the actual cipher: I was
politely asked not to, and besides it’s not actually that interesting.
But I will show you how to recover cipher state from a much simpler
cipher when you know the structure of the plaintext, using the same
technique I used.</p>
<h1 id="example-caesar-ciphers-over-structured-plaintext">Example: Caesar ciphers over structured plaintext</h1>
<p>A <a href="https://en.wikipedia.org/wiki/Caesar_cipher">Caesar cipher</a>, or shift cipher, is a particularly simple kind of
cipher, but it’s still complex enough to illustrate the technique I want
to focus on today. It involves “rotating” the alphabet a certain number
of positions, where the amount of rotation is a secret shared between
the sender and receiver of the message. For example, a shift of 3 means
that “A” becomes “D”, “B” becomes “E”, etc. When you get to the end of
the alphabet, wrap back around to the beginning: “Y” becomes “B” and “Z”
becomes “C”.</p>
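<p>As a concrete sketch, a shift cipher is only a few lines of Python (the function name is mine, not from any library):</p>

```python
def caesar(text, shift):
    """Rotate each letter A-Z by `shift` positions, wrapping Z back to A."""
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isalpha() else c
        for c in text.upper()
    )
```

<p>With a shift of 3, <code>caesar("ATTACK", 3)</code> gives <code>"DWWDFN"</code>, and shifting by another 23 (equivalently, by -3) recovers the plaintext.</p>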
<p>Now let’s say that you’ve intercepted a secret message where you know
the contents are obscured with a Caesar cipher, but you don’t know the
secret rotation amount. If you assume the message was written in, for
example, English, then there are well-known ways to break this cipher.
One way is to count how often each letter occurs in the ciphertext and
guess that the most common letter represents the most frequently used
letter in English (perhaps “E” or “T”).</p>
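<p>That frequency attack is a short Python sketch (the function name is mine, and it assumes the plaintext's most common letter really is the one you guessed):</p>

```python
from collections import Counter

def guess_shift(ciphertext, common="E"):
    """Guess the shift by assuming the most frequent ciphertext letter
    is the encryption of the most frequent plaintext letter."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    top = Counter(letters).most_common(1)[0][0]
    return (ord(top) - ord(common)) % 26
```
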
<p>But instead let’s say the message is produced by a machine, and its
messages don’t match English letter frequencies. If you know a regular
expression which should match the plaintext, you can break this cipher a
different way.</p>
<p>Let’s look at some simple cases first.</p>
<p>If the machine only produces messages using a single letter, then the
ciphertext must also contain repetitions of a single letter. If the
plaintext should match the regular expression <code class="language-plaintext highlighter-rouge">A+</code>, and you see that the
ciphertext matches <code class="language-plaintext highlighter-rouge">D+</code>, then you know that the message was shifted by 3
letters.</p>
<p>Similar reasoning works if the machine only uses a few letters of the
alphabet. If the plaintext should match <code class="language-plaintext highlighter-rouge">[A-D]+</code> and the ciphertext
contains both the letters “B” and “E”, then the only possible shift is
1: no other shift amount could produce both of those letters.</p>
<p>Or perhaps the message matches the concatenation of multiple different
regular expressions. Take, for instance, <code class="language-plaintext highlighter-rouge">[A-D]*E[FG]*</code>. If we knew
where the sole “E” was in a message we could look at the corresponding
position in the ciphertext and immediately determine which shift was
used. But in this case the “E” could be the first letter if there aren’t
any “A” through “D” letters at the beginning; or it could be the last
letter if there’s no “F” or “G” at the end; or it could be anywhere in
between. Still, if we observe a “C” somewhere, and then later a “G”, and
after that an “I”, then the shift must have been 2, because those three
observations can only be explained if we assume the first came from the
<code class="language-plaintext highlighter-rouge">[A-D]*</code> subexpression, the second from the <code class="language-plaintext highlighter-rouge">E</code>, and the third from the
<code class="language-plaintext highlighter-rouge">[FG]*</code> subexpression.</p>
<p>In general, you can “encrypt” any regular expression using a given guess
about what the shift might be. In the last example above, encrypting
<code class="language-plaintext highlighter-rouge">[A-D]*E[FG]*</code> with a shift of 2 yields <code class="language-plaintext highlighter-rouge">[C-F]*G[HI]*</code>. If the new
expression matches the ciphertext, then that guess could be the right
answer. If you try every possible shift amount and only one guess
matches, then that guess must be the right answer.</p>
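<p>Here is a rough Python sketch of that guess-and-check loop (the helper names are mine). It shifts the letters inside the pattern textually, so a character range that wraps past "Z", like <code>[Y-B]</code>, is no longer a valid range; those shifts are simply skipped:</p>

```python
import re

def shift_pattern(pattern, shift):
    """Rotate every uppercase letter in a letters-only regex; operators
    like *, |, [, and ] pass through unchanged."""
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isupper() else c
        for c in pattern
    )

def candidate_shifts(pattern, ciphertext):
    """Return every shift whose encrypted pattern matches the ciphertext.
    Shifts that produce an invalid (wrapped) character range are skipped."""
    matches = []
    for s in range(1, 26):
        try:
            if re.fullmatch(shift_pattern(pattern, s), ciphertext):
                matches.append(s)
        except re.error:
            pass
    return matches
```

<p>For example, <code>shift_pattern("[A-D]*E[FG]*", 2)</code> gives <code>"[C-F]*G[HI]*"</code>, and <code>candidate_shifts("[A-D]*E[FG]*", "CCGI")</code> narrows the answer down to the single shift 2.</p>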
<h1 id="generalization-weighted-regular-expressions">Generalization: Weighted regular expressions</h1>
<p>Instead of running the regular expression matcher 25 separate times,
once for each of the 25 possible shift amounts in an English-alphabet
Caesar cipher, it’d
be nice to run the matcher once and keep track of which possibilities
worked out. That’s especially important if your cipher has a larger state
space. 25 possibilities isn’t a big deal, but millions or more might be.</p>
<p>I’m going to show how to do this by adding support for “semiring
weights” to a regular expression matcher and by then defining an
appropriate type of weight. I learned about this approach from a paper
called “<a href="https://sebfisch.github.io/haskell-regexp/">A Play on Regular Expressions</a>”, which was
published at ICFP 2010. That is, to date, my favorite academic paper of
all time, and I highly recommend reading it, but I’m going to give a
different presentation of the key ideas here.</p>
<p>To begin with, let’s think about regular expression matching like
playing a guessing game. Consider this regular expression:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(AB|CD)*A
</code></pre></div></div>
<p>If the first character of input is an “A”, which “A” is it matching?
There are two possibilities: we might be expecting a “B” next, or the
end of the input. Let’s mark those two possible positions in this
regular expression with a dot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(A·B|CD)*A·
</code></pre></div></div>
<p>Now if we have another character of input, then we can remove the dot at
the end: that possibility didn’t pan out. If that character is a
“B”, however, the first dot is still a valid guess and we should move it
forward, past the “B” in the pattern. There are three possibilities at
that point, and they are the same as the initial state: we’re expecting
either of the “A”s or a “C”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(·AB|·CD)*·A
</code></pre></div></div>
<p>If we observe a “C” at this point, the corresponding dot is the only
valid guess. We can discard the other two and advance that dot past the
“C” in the pattern:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(AB|C·D)*A
</code></pre></div></div>
<p>In general, imagine keeping track, at each character in the regular
expression, of whether the next character of input is allowed to match
that character in the pattern. This mark is a boolean value, true or
false.</p>
<p>I’m hand-waving over one important detail: When matching input one
character at a time and propagating dots forward, you can keep moving
each mark along until you reach an actual letter (or character class).
At that point you have to stop and save the mark state for the next
round of matching, because you can’t tell whether that letter will match
until you read the next character of input. But no other variety of
pattern needs to store any state between rounds. (Short-circuiting state
for these other operators is equivalent to “epsilon closure” in the
theory of non-deterministic finite automata.)</p>
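<p>To make the guessing game concrete, here is a small Python simulation of boolean marks for the pattern <code>(AB|CD)*A</code>. The follow sets are written out by hand rather than derived from a parser, with the epsilon closure already baked in:</p>

```python
# Letter positions in (AB|CD)*A, numbered left to right.
LETTER = {1: "A", 2: "B", 3: "C", 4: "D", 5: "A"}
# Where a mark moves after each position matches; position 0 is the start.
FOLLOW = {0: {1, 3, 5}, 1: {2}, 2: {1, 3, 5}, 3: {4}, 4: {1, 3, 5}, 5: set()}

def matches(text):
    marks = set(FOLLOW[0])   # dots before "A", "C", and the final "A"
    accepted = False
    for ch in text:
        # keep only the marks whose letter matches this input character
        hit = {p for p in marks if LETTER[p] == ch}
        marks = set().union(*(FOLLOW[p] for p in hit)) if hit else set()
        accepted = 5 in hit  # did the final "A" just match?
    return accepted
```

<p><code>matches("ABCDA")</code> is true, while <code>matches("ABC")</code> is false: after the "C", the only live mark sits before the "D".</p>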
<h2 id="caesar-weights">Caesar weights</h2>
<p>Now let’s extend those boolean marks to have a separate boolean flag for
each “encrypted” variation of the pattern that we want to match. So
instead of one bit per letter in the pattern, we now have 25 bits per
letter. The first bit at each letter represents whether the pattern has
matched to that point given a Caesar shift of 1; the second bit is for a
shift of 2; etc.</p>
<p>Initially, before looking at any input, we should assume that all 25
possible shifts are valid candidates. Over the course of matching we’ll
narrow that set.</p>
<p>How should we propagate these Caesar cipher possibility bit-sets around
the regular expression while we’re matching input? Let’s consider one
regular-expression operator at a time.</p>
<p>For a subexpression which matches a single letter, we can compare the
next input character against the pattern and determine which shift
amount would match. If the pattern is <code class="language-plaintext highlighter-rouge">A</code> and the input is <code class="language-plaintext highlighter-rouge">D</code>, then
only a shift of 3 can match, so we can construct a bit-set where only
the third bit is set. But we know which shifts were valid possibilities
for earlier parts of the input, so we can bitwise-AND the prior bit-set
with the new one we just constructed. If the third bit was considered
valid before, then it’s the only valid option now; otherwise there are
no valid shifts for this part of the pattern at this point in the input,
and we can discard this possibility entirely.</p>
<p>How about the sequence <code class="language-plaintext highlighter-rouge">AB</code>? We have a mark point before “A” and another
before “B”, and each of those points holds a 25-element bit-set. The
marks before “A” get fed through the above process to produce a new
bit-set to be used at “B” next time; similarly, the marks before “B”
are updated and fed on to whatever follows this subexpression.</p>
<p>Alternation patterns like <code class="language-plaintext highlighter-rouge">AB|CD</code> take whatever mark came in from the
preceding expression, feed it into both alternative subexpressions, and
take whatever comes out and bitwise-OR them. In other words, any
candidate shift amount which matched either subpattern is a valid shift
amount for the pattern as a whole.</p>
<p>The “star” operator (<code class="language-plaintext highlighter-rouge">*</code>), matching zero or more copies of its
subpattern, has mark bit-sets coming from its preceding expression and
from its subexpression. The subexpression might match if either the
preceding expression, or a previous iteration of the subexpression, just
finished matching, so we feed the bitwise-OR of those two bit-sets in as
input to the subexpression. Similarly, the subsequent expression might
match if the preceding expression matches and the star operator matches
zero times, or if the subexpression outputs that it matched, so those
two bit-sets are also combined with bitwise-OR and fed onward to the
subsequent expression.</p>
<p>At this point we’ve only needed four operations for our Caesar weights:</p>
<ul>
<li>An initial value, where all shifts are considered valid</li>
<li>Bitwise-AND to narrow our set of guesses when concatenating
subexpressions</li>
<li>Bitwise-OR to track additional guesses in alternation and “star”</li>
<li>A function which, given an input character and an expected character
(class), returns the bit-set of valid shifts for that combination</li>
</ul>
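<p>Those four operations are enough to sketch a weighted matcher in Python. Here the pattern <code>[A-D]*E[FG]*</code> is again written out by hand as letter positions and follow sets (all names are mine), and each mark is a 25-bit integer of surviving shift guesses:</p>

```python
ALL = (1 << 26) - 2                  # bits 1..25: every nonzero shift

def shift_mask(plain_class, cipher_char):
    """Bit s is set iff shifting some letter of plain_class by s
    yields cipher_char."""
    m = 0
    for c in plain_class:
        s = (ord(cipher_char) - ord(c)) % 26
        if s:                        # a shift of 0 means "not encrypted"
            m |= 1 << s
    return m

# The pattern [A-D]*E[FG]* as letter positions and follow sets,
# with the epsilon closure baked in; "acc" means "accept".
CLASS = {1: "ABCD", 2: "E", 3: "FG"}
FOLLOW = {0: {1, 2}, 1: {1, 2}, 2: {3, "acc"}, 3: {3, "acc"}}

def valid_shifts(ciphertext):
    marks = {p: ALL for p in FOLLOW[0]}   # all shifts possible at first
    accepted = 0
    for ch in ciphertext:
        # narrow each mark with bitwise-AND against this character...
        out = {p: marks.get(p, 0) & shift_mask(CLASS[p], ch) for p in CLASS}
        # ...then push the survivors forward, joining with bitwise-OR
        marks, accepted = {}, 0
        for p, m in out.items():
            for q in FOLLOW[p]:
                if q == "acc":
                    accepted |= m
                else:
                    marks[q] = marks.get(q, 0) | m
    return [s for s in range(1, 26) if accepted >> s & 1]
```

<p><code>valid_shifts("CCGI")</code> returns <code>[2]</code>, matching the reasoning above: a "C", then a "G", then an "I" can only be explained by a shift of 2.</p>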
<p>In terms from abstract algebra, the first three resemble a
“<a href="https://en.wikipedia.org/wiki/Semiring">semiring</a>”. Bitwise-AND works somewhat like multiplication on
numbers, while bitwise-OR works somewhat like addition, and the initial
value satisfies the same laws as the number “one”. To check whether the
regular expression failed to match you also need to check if a value is
like the number “zero”; in Caesar weights, that’s an empty bit-set.</p>
<p>I’m not convinced that all of the semiring laws are actually necessary
or desirable in this application. For example, requiring commutativity
does not seem to be necessary for either operator, and prevents some
more sophisticated uses. But I haven’t yet convinced myself of which
laws are actually required.</p>
<h1 id="related-work">Related work</h1>
<p>With the observation that marks can be moved forward until encountering
a literal character in the pattern, the matching algorithm as I’ve
described it resembles <a href="https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm">Glushkov’s NFA construction
algorithm</a>, if you squint. “A Play on Regular Expressions”
explicitly describes the approach they follow as being Glushkov’s, but
it took me years to understand why. I hope my presentation here makes
the connection a little more clear.</p>
<p>“A Play on Regular Expressions” extends regular expression matching to
context-free grammars by allowing the regular expression to be a
lazily-constructed infinite tree. Matt Might <em>et al</em> subsequently wrote
a pair of papers doing the same for regular expression matching based on
<a href="https://en.wikipedia.org/wiki/Brzozowski_derivative">Brzozowski derivatives</a>: “<a href="https://matt.might.net/papers/might2011derivatives.pdf">Parsing with Derivatives</a>” at ICFP
2011, and “<a href="https://arxiv.org/abs/1604.04695">On the Complexity and Performance of Parsing with
Derivatives</a>” at PLDI 2016.</p>
<p>By 2019 I had confused myself with all of the above papers, leading to
this <a href="https://toot.cat/@jamey/101393064450390819">post</a>:</p>
<blockquote>
<p>tfw you mix up Glushkov’s construction with Brzozowski’s derivatives,
amirite?</p>
<p>no? that’s just me?</p>
</blockquote>
<p>That post led Magnus Holm to create a neat parser generator for
context-free grammars called <a href="https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler">Glush</a>. That’s probably the most value
for the world of any shitpost I’ve written.</p>
<p>The “dot” notation I used to illustrate marks in this post is the same
as notation commonly used to describe <a href="https://loup-vaillant.fr/tutorials/earley-parsing/">Earley parsing</a>. I believe, but
have not yet demonstrated, that Earley parsing can be extended with
semiring weights in the same way that “A Play on Regular Expressions”
did for Glushkov’s construction. Such a parser should be in the same
complexity class as Earley parsers: linear-time when used on regular
languages; quadratic on unambiguous context-free grammars; and
worst-case cubic on ambiguous context-free grammars. As a result my
hypothesis is that the ideal implementation of a weighted parsing
library would be based on the Earley algorithm, subsuming the need for a
regex-only library. Proof of this hypothesis is left as an exercise for
the reader.</p>
<p>Key tools for improving matching performance in regular expressions are
conversion to a <a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton">deterministic finite automaton</a> (DFA), followed by
DFA minimization. Determinizing and minimizing weighted automata is
trickier, making it less obvious how to get optimal performance from a
regular expression matcher that has been extended this way. One paper I
was reading recently which might help is “<a href="https://arxiv.org/abs/2204.12368">Fast Coalgebraic Bisimilarity
Minimization</a>” by Jules Jacobs and Thorsten Wißmann.
Additionally, they have a proof of concept Rust implementation called
<a href="https://github.com/julesjacobs/boa">Boa</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>It’s possible to parameterize a generic regular expression matcher over
semiring-like operations. Simple boolean marks have a trivial
implementation of the operations and allow simply checking whether the
regular expression matched or not, just like any other regular
expression matcher.</p>
<p>But by offering a parameterized interface, a regexp matching library can
support some pretty wild use cases that are impractical or impossible
with the regexp APIs in wide use today.</p>
<p>I’ve given one example: evaluating a regular expression that was written
in terms of plaintext against a ciphertext in order to identify which
cipher state can explain the observed ciphertext.</p>
<p>Another example which I won’t go into detail on today is incorporating
the common notion of “capture groups” into a simpler matcher algorithm
by representing them using appropriate weights. This requires a small
extension to the interface to allow weights on epsilon transitions, not
just when matching input. For some applications it also requires giving
the developer more control over how ambiguous patterns are interpreted,
via “greedy” versus “non-greedy” variants of operators, which in turn
requires a notion of “match priority” that doesn’t fit in the framework
of “A Play on Regular Expressions”.</p>
<p>Fortunately, weighted regular expressions are pretty easy to retrofit
into the efficient matching algorithm described in “<a href="https://swtch.com/~rsc/regexp/regexp2.html">Regular Expression
Matching: the Virtual Machine Approach</a>”, by Russ Cox.</p>
<p>I’ve written a demo implementation of the Pike-style VM approach with
capture groups, as described by Cox, in a Rust crate called <a href="https://github.com/jameysharp/pikevm">pikevm</a>.
I’ve also written a demo of weighted regular expression matching in a
Pike-style VM, in an independent Rust crate called <a href="https://github.com/jameysharp/parsink">parsink</a>.
Combining the two and making them into a robust and usable library is
left as future work.</p>
<p>I hope this post has broadened your view of the situations where regular
expressions might be a good tool. And I especially hope to see new, more
flexible, tools emerge for extracting information from text based on
formal language theory.</p>
<p>Jamey Sharp</p>
<h1>Z3 adventures and fast Boolean matching</h1>
<p>2021-09-05</p>
<p>Over the past few weeks I’ve been fiddling with <a href="https://github.com/Z3Prover/z3">Z3</a>, a theorem prover
and <a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem">SAT</a>/<a href="https://en.wikipedia.org/wiki/Satisfiability_modulo_theories">SMT</a> solver from Microsoft Research. Z3 makes it pretty
easy to write down some NP-complete problem instance and then push a
button to get the answer. (Well, hopefully. I mean, they’re NP-complete
problems, it’s hard.) It’s also open source (released under an MIT
license), and, as I’ve learned, it has a delightful maintainer in
Nikolaj Bjørner.</p>
<p>Today I’m going to first briefly tell the story of my recent adventures
with Z3; then the bulk of this post will be a deep dive into testing
equivalency of Boolean functions, and the experiments those adventures
have motivated me to try.</p>
<h1 id="z3-adventures">Z3 adventures</h1>
<p>At first I was just playing with Z3’s Python bindings, but I started
generating problem instances where Z3 took much longer and used much
more memory than if I just exhaustively tested every possible solution,
so I dug deeper into what it was doing. It turns out that Z3’s algebraic
simplifier was actually making my particular bit-vector problems way
more complicated, and bypassing that to just use the so-called
“bit-blaster” and SAT solver got me results in reasonable amounts of
time.</p>
<p>However, although the bit-blaster had support for bitwise OR and bitwise
NOT operations on bit-vectors, it was missing support for bitwise AND.
People wouldn’t usually notice because the simplifier rewrites AND in
terms of combinations of NOT with OR, but since I was trying to avoid
the simplifier I hit this unusual corner.</p>
<p>I had to dig into the source code to figure out that the error I was
getting was about the AND operator, and in the process I found that it
didn’t look too hard to fix. So I wrote and submitted my first two Z3
pull requests—one to add the four lines of code needed, and
another to improve the error message in case somebody else ran into a
different missing operator. Nikolaj merged both patches impressively
promptly.</p>
<p>But that experience got me looking more closely at how Z3’s bit-blaster
works, and I found some unreachable code in its implementation of
multiplication. I started investigating how hardware engineers design
digital circuits for multiplication, and came up with a patch that
completely rewrote the bit-blaster’s multiplier. Naturally, there’s a
lot I don’t understand about how Z3 works and so I had some bugs, but
Nikolaj was again very helpful and responsive.</p>
<p>When Nikolaj tested my patch’s performance across the SMT-LIB benchmark
suite for “quantifier-free bit-vector” (<code class="language-plaintext highlighter-rouge">QF_BV</code>) problems, it showed an
average 20% speedup across those benchmarks, which I was pretty excited
about! But it also made some of the benchmarks slower by orders of
magnitude, so Nikolaj politely encouraged me to dig into it more
carefully and offered tips that got me unstuck several times.</p>
<p>On some of the benchmarks that got much worse with my multiplier, after
a lot of work it turned out that the only significant difference between
the two implementations was that mine had some extra NOT gates floating
around inside the tree of Boolean expressions. They were logically
equivalent, in that pairs of these extra NOT gates would effectively
cancel each other out, but that was pushing Z3 out of its comfort zone.</p>
<p>At that point I was back to looking at Z3’s simplifier. I made a small
change to how it handled NOT gates mixed into trees of XOR gates, such
as the bulk of the adders in a multiplier, and suddenly those benchmarks
were fast again.</p>
<p>Nikolaj wanted to see the effect of that change independently of the
rest of my changes and wrote a much more comprehensive patch. That
change by itself made the <code class="language-plaintext highlighter-rouge">QF_BV</code> benchmarks an average of almost <em>40%</em>
faster, improving some benchmarks by an order of magnitude and reducing
the number of benchmarks that couldn’t finish within 10 minutes by 10%.</p>
<p>It turned out that with the simplifier fixed, my rewritten multiplier
was actually <em>slower</em>, on average, than the old version, but it happened
to generate cases the old simplifier was good at in enough benchmarks to
bring up the average.</p>
<p>I’ll be honest, I was disappointed, because my multiplier was <em>pretty
cool</em>, if you think multiplication is cool. But as consolation prizes
go, I’m definitely satisfied by getting to say that I found a 40%
speedup opportunity in a tool that’s already as tuned for performance as
Z3 is. I’m also very pleased with the experience of working with Nikolaj
on this. We did fix up the unreachable code I’d originally spotted, too.</p>
<p>And then Nikolaj nerd-sniped me by suggesting that there could be even
bigger gains from replacing parts of the algebraic simplifier with
something based on “cut-set enumeration”, which I’d never heard of. I’ll
summarize it as splitting a complicated Boolean function into every
possible combination of simpler functions and deciding which of those
combinations is best. So now I’ve spent the last week reading up on that
topic and “logic synthesis”, and thinking very hard about Boolean truth
tables.</p>
<p>The truth table stuff is what I want to share next, partly because I’m
still thinking about the cut-set stuff. But mostly, I needed to know how
much a lookup table keyed by Boolean functions can be compressed before I
could fully evaluate whether anyone’s methods for splitting a complicated
Boolean function into many simpler ones were worth using.</p>
<h1 id="finding-functionally-equivalent-boolean-circuits">Finding functionally equivalent Boolean circuits</h1>
<p>There are 2^N rows in any truth table for a circuit with N inputs. That
means that the number of possible truth tables for circuits with N
inputs is 2^(2^N). So there are 256 possible 3-input truth tables,
65,536 4-input truth tables, and more than 4 billion 5-input truth
tables. These numbers grow way too quickly.</p>
<p>Fortunately, a huge fraction of those truth tables are equivalent up to
a few simple transformations, called “NPN” classes: Negating some of the
inputs, Permuting the inputs, and Negating the output. The 65,536
4-input truth tables only have 222 NPN equivalence classes, so if you
have a procedure which can take any truth table and tell you which NPN
class it’s in, then you’ve dramatically reduced the number of cases you
have to handle.</p>
<p>For example, I’m looking into precomputing and storing optimal Boolean
circuits implementing each of those 222 representative 4-input truth
tables, even if doing the same for all 65,536 possible truth tables
takes too much space.</p>
<p>Actually, I claim there are only 208 “interesting” 4-input truth tables;
the other 14 ignore one or more of the inputs, so they’re equivalent to
a smaller truth table. I want to always use the smallest truth table I
can to represent each function. That said, you can store all sizes in
the same lookup table by adding ignored inputs back on the left, if you
want to.</p>
<p>The paper I’ve found most helpful in understanding these ideas is <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.642.9381">“Fast
Boolean Matching Based on NPN Classification”</a> by Zheng Huang <em>et
al</em>. (I found it from the citations in <a href="https://msoeken.github.io/papers/2016_fpl.pdf">“Fast Hierarchical NPN
Classification”</a> by Petkovska <em>et al</em>, which is great but
assumes you understand the previous work in this area first.)</p>
<p>The problem is that for a given N-input truth table, there are
N!*2^(N+1) different ways to negate and permute it. To identify which
NPN class a truth table belongs to, you need some deterministic
procedure that always picks the same representative truth table among
the equivalent possibilities. For larger N, checking all those
possibilities gets expensive.</p>
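<p>For small N that exhaustive check is still easy to write down. This Python brute force (mine, and deliberately naive) canonicalizes 3-input truth tables by trying all 3!*2^4 = 96 transformations of each, and finds 14 NPN classes among the 256 tables, the 3-input analogue of the 222 classes for 4 inputs:</p>

```python
from itertools import permutations

N = 3
ROWS = 1 << N                       # 8 rows in a 3-input truth table
MASK = (1 << ROWS) - 1              # all 2^N result bits

def transform(tt, perm, neg_in, neg_out):
    """Apply an input permutation, a set of input negations, and an
    optional output negation to truth table tt (an integer of 2^N bits)."""
    out = 0
    for row in range(ROWS):
        src = 0                     # source row index under this transform
        for i in range(N):
            if ((row >> i) & 1) ^ ((neg_in >> i) & 1):
                src |= 1 << perm[i]
        if (tt >> src) & 1:
            out |= 1 << row
    return out ^ (MASK if neg_out else 0)

def npn_canonical(tt):
    """The smallest truth table reachable by any NPN transformation."""
    return min(transform(tt, perm, neg_in, neg_out)
               for perm in permutations(range(N))
               for neg_in in range(1 << N)
               for neg_out in (0, 1))

classes = {npn_canonical(tt) for tt in range(1 << ROWS)}
```

<p>As a sanity check, 3-input AND (0x80) and OR (0xFE) land in the same class: negate all three inputs of AND and then negate its output, and you get OR.</p>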
<h1 id="heuristics">Heuristics</h1>
<p>The good news is that there are fast heuristics which are almost as good
as checking all the possibilities. Instead of identifying a single
representative truth table for every element of the NPN class, different
truth tables within the class may lead to any of a few different
“semi-canonical” representatives. Among the interesting 4-input truth
tables, where there are 208 NPN classes, the heuristics in “Fast Boolean
Matching” identify 1,630 representatives, and I found easy tweaks that
reduce that to 812 representatives.</p>
<p>But I didn’t have any idea why any of the heuristics should work until I
spent several days fiddling with my own implementation of them, so
here’s what I learned.</p>
<h2 id="truth-tables-as-numbers">Truth tables as numbers</h2>
<p>First, note that they’re interpreting each truth table as a number. If
you already know how this works, bear with me; I’m going to be explicit
about the parts I had to keep thinking through.</p>
<p>Truth tables are a way to describe the behavior of a Boolean function,
where you write out every combination of the possible inputs as well as
what the output of the circuit is for each combination. So “r = if C
then T else F” has this truth table:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C T F | r
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1
</code></pre></div></div>
<p>We can treat the result column as a single unsigned number written out
in binary, where the inputs are used as an index into the bits of the
number. If the all-zeros input is treated as the least-significant bit
of the number, then this if-then-else table is represented by 0xCA. I
believe the paper used the opposite convention, although it wasn’t
explicitly stated, so they would represent this table as 0x53.</p>
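<p>Spelling that out in Python for the if-then-else table above, with row index C*4 + T*2 + F and the all-zeros row in the least-significant bit:</p>

```python
def ite(c, t, f):
    return t if c else f

tt = 0
for row in range(8):
    c, t, f = (row >> 2) & 1, (row >> 1) & 1, row & 1
    if ite(c, t, f):
        tt |= 1 << row          # all-zeros row is the least-significant bit

assert tt == 0xCA
# The opposite bit-order convention just reverses the 8 result bits:
rev = sum(((tt >> i) & 1) << (7 - i) for i in range(8))
assert rev == 0x53
```
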
<p>Thinking of truth tables as natural numbers means we can impose a total
order on them, and that suggests one way we could define the
representative for an NPN class: pick the lowest-numbered truth table
from that class. (Picking the highest number works too, or any minimal
or maximal element of some other total order, so long as we do it the
same way consistently.)</p>
<p>These two choices affect each other, in that the notion of which truth
table is numerically less than another one depends on how you turn it
into a number. I got mediocre results until I started mentally swapping
“negative cofactor” and “positive cofactor” everywhere I saw them in the
paper, which is why I think they picked the opposite bit-order
convention compared to the one that makes sense to me.</p>
<h2 id="output-polarity">Output polarity</h2>
<p>The first heuristic decides whether to negate the output of the truth
table. It does this by choosing whichever way around has fewer 1s in the
result column of the truth table.</p>
<p>If we want to pick numerically smaller truth tables, to a first
approximation that means we want as many 0s in the most-significant bits
as we can get. Negating and permuting the inputs allows us to rearrange
what order the rows of the truth table are in, but negating the output
is the only way to change how many 1s and 0s there are total. So it
makes sense to try to minimize the total number of 1s first.</p>
<p>If the truth table has the same number of 0s as 1s, such as in the
if-then-else table above, then the paper says something I don’t quite
understand about trying both choices. I’ve chosen instead to break ties
by picking whichever of the two options has the minimum numeric value.</p>
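<p>Here's a sketch of this heuristic in C; the numeric-value tie-break is my variant, and the names are my own:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits; compilers with __builtin_popcountll can use that
 * instead. */
int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Output-polarity heuristic for a table over `inputs` inputs: keep
 * whichever of the table and its complement has fewer 1s, breaking
 * ties by taking the numerically smaller of the two. */
uint64_t fix_output_polarity(uint64_t table, int inputs) {
    int rows = 1 << inputs;
    uint64_t all = rows == 64 ? ~(uint64_t)0 : ((uint64_t)1 << rows) - 1;
    uint64_t flipped = ~table & all;  /* clear bits beyond the table */
    int ones = popcount64(table);
    if (ones * 2 < rows) return table;
    if (ones * 2 > rows) return flipped;
    return table < flipped ? table : flipped;
}
```

<p>On the if-then-else table the 1s and 0s tie, so the tie-break picks the complement 0x35 over 0xCA.</p>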
<h2 id="input-polarity">Input polarity</h2>
<p>The next heuristic decides, for each input, whether it’s better to
negate that input or leave it alone. It does this by choosing whichever
way around has fewer 1s in the “positive cofactor” than in the “negative
cofactor”. (Actually the paper has that the other way around, but see
above.)</p>
<p>I didn’t know what “cofactor” meant. I’ll go into more detail below in
the “implementation details” section, but the essential idea here is:
Count how many rows have a 1 in the result when input K is 1, versus how
many when the same input is 0. These are the 1s in the positive
cofactor or the negative cofactor, respectively, with respect to input K.
Negating an input swaps its positive and negative cofactors.</p>
<p>Again, we’re trying to get numerically-lower tables, so the plan here is
to move some of the 1s from the most-significant bits toward the
least-significant bits.</p>
<p>The handy thing about this heuristic is that negating an input doesn’t
change the number of ones in the cofactors of any other input, so we can
apply this heuristic on the inputs in any order and get the same result.</p>
<p>If an input has the same number of 1s in both cofactors, the paper says
to leave that input unchanged. I tried that, but then I also tried
breaking ties by picking whichever option has the minimum numeric value.
I found that this tie-breaker cut the number of semi-canonical
representatives in half, which is awesome for something that’s so quick
to check.</p>
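<p>A sketch of this heuristic in C, using the cofactor bitmasks and the input-negation trick that are covered in more detail in the “Implementation details” section below (the names and the numeric tie-break are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Bitmask of the rows where input k is 0, for a table over `inputs`
 * inputs (input 0 is the right-most column). */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)   /* add ignored inputs to the left */
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* Negate input k by swapping its negative and positive cofactors. */
uint64_t negate_input(uint64_t table, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(k, inputs);
    int offset = 1 << k;
    return ((table & mask) << offset) | ((table >> offset) & mask);
}

/* Input-polarity heuristic: for each input, make sure its positive
 * cofactor has no more 1s than its negative cofactor, breaking ties
 * by numeric value (my variant). */
uint64_t fix_input_polarity(uint64_t table, int inputs) {
    for (int k = 0; k < inputs; k++) {
        uint64_t mask = neg_cofactor_mask(k, inputs);
        int neg = popcount64(table & mask);
        int pos = popcount64((table >> (1 << k)) & mask);
        uint64_t flipped = negate_input(table, k, inputs);
        if (pos > neg || (pos == neg && flipped < table))
            table = flipped;
    }
    return table;
}
```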
<h2 id="input-permutation">Input permutation</h2>
<p>The final heuristic picks an order for the inputs, once again by
comparing the number of 1s in the cofactors. In the previous step we
established that every input has no more 1s in its positive cofactor
than in its negative cofactor.</p>
<p>Now we sort the inputs so that the input with the fewest 1s in its
positive cofactor is the left-most input. (Again, the paper says to use
the negative cofactor instead, due to that different choice of bit-order
conventions.)</p>
<p>Similar to the previous step, swapping inputs doesn’t affect the
cofactors of any other inputs, so this can be done in any order while
getting the same result.</p>
<p>This one is hardest for me to explain. Let’s bring back a 3-input truth
table and look at the positive cofactors visually:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
</code></pre></div></div>
<p>The left-most input has its positive cofactor entirely in the bottom
half of the table. The middle input has its positive cofactor split into
two groups of rows, where the bottom half is still at the very bottom of
the table, but the other part is higher up. By contrast, the right-most
input spreads its positive cofactor as far across the table as
possible.</p>
<p>We’re trying to get numerically-smaller truth tables, which means moving
1s from the most-significant (bottom) rows toward the least-significant
(top) rows. So any input which has a lot of 1s in its positive cofactor
should be more toward the right, thereby spreading those 1s toward the
top of the table.</p>
<p>That said: After the second heuristic, if there are a lot of 1s in the
positive cofactor, then there are at least as many in the negative
cofactor, and moving a column to the right spreads <em>those</em> 1s toward the
bottom of the table. So figuring out whether this helps is complicated.</p>
<p>So when some inputs have the same number of 1s in their positive
cofactors, I break ties by sorting in <em>decreasing</em> order by number of 1s
in the negative cofactors. On the 4-input truth tables, this improved
the results from 868 semi-canonical classes to 812, and I expect it has
more effect on larger tables.</p>
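<p>Here's one way this sort might look in C, built from the input-swap procedure described in the “Implementation details” section below. Because swapping two inputs doesn't disturb the other inputs' cofactor counts, a simple bubble sort over adjacent inputs converges (the names are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Bitmask of the rows where input k is 0. */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* Swap inputs j and k, where j is the left-most of the two (j > k). */
uint64_t swap_inputs(uint64_t table, int j, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(j, inputs) & ~neg_cofactor_mask(k, inputs);
    int offset = (1 << j) - (1 << k);
    uint64_t tmp = (table ^ (table >> offset)) & mask;
    return table ^ (tmp | (tmp << offset));
}

/* 1s in input k's positive and negative cofactors. */
int pos_ones(uint64_t table, int k, int inputs) {
    return popcount64((table >> (1 << k)) & neg_cofactor_mask(k, inputs));
}
int neg_ones(uint64_t table, int k, int inputs) {
    return popcount64(table & neg_cofactor_mask(k, inputs));
}

/* Sort inputs so the fewest positive-cofactor 1s end up left-most,
 * breaking ties by putting more negative-cofactor 1s to the left. */
uint64_t sort_inputs(uint64_t table, int inputs) {
    for (int pass = 0; pass < inputs; pass++)
        for (int k = 0; k + 1 < inputs; k++) {
            int pl = pos_ones(table, k + 1, inputs), pr = pos_ones(table, k, inputs);
            int nl = neg_ones(table, k + 1, inputs), nr = neg_ones(table, k, inputs);
            if (pl > pr || (pl == pr && nl < nr))
                table = swap_inputs(table, k + 1, k, inputs);
        }
    return table;
}
```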
<h2 id="evaluation">Evaluation</h2>
<p>These three heuristics don’t quite get us to the original goal of a
minimum numeric value truth table. One example illustrating this is
that, if the first two steps left a 1 in the most-significant (bottom)
row, there is no permutation of the inputs which will move that row.</p>
<p>Still, in my experiments with 3-input truth tables, there are 10 NPN
classes that don’t ignore any inputs, and only 13 semi-canonical classes
after applying these three heuristics, so the heuristics are almost
perfect at that size. On 4-input truth tables, there are 208 interesting
NPN classes and 812 semi-canonical classes, which is still small enough
to be a reasonable table-lookup. I didn't write my prototype very
efficiently, so I haven't yet had the patience to exhaustively check
larger truth tables.</p>
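<p>At the 3-input size these counts can be cross-checked by brute force: canonicalize every 8-bit truth table by trying all 96 combinations of input permutation, input negation, and output negation, and count the distinct results. A sketch (exhaustive, so only practical for tiny tables; the names are mine):</p>

```c
#include <assert.h>

/* All 6 permutations of three inputs. */
const int perms[6][3] = {
    {0,1,2},{0,2,1},{1,0,2},{1,2,0},{2,0,1},{2,1,0}
};

/* Apply one NPN transformation to a 3-input truth table: permute the
 * inputs, negate the inputs selected by `flips`, and negate the
 * output if `out` is set. */
unsigned transform(unsigned t, const int perm[3], int flips, int out) {
    unsigned r = 0;
    for (int row = 0; row < 8; row++) {
        int src = 0;
        for (int i = 0; i < 3; i++)
            if ((row ^ flips) & (1 << perm[i]))
                src |= 1 << i;
        if ((t >> src) & 1)
            r |= 1u << row;
    }
    return out ? r ^ 0xFF : r;
}

/* The canonical (numerically smallest) member of t's NPN class. */
unsigned npn_canonical(unsigned t) {
    unsigned best = 0xFF;
    for (int p = 0; p < 6; p++)
        for (int flips = 0; flips < 8; flips++)
            for (int out = 0; out < 2; out++) {
                unsigned c = transform(t, perms[p], flips, out);
                if (c < best) best = c;
            }
    return best;
}

int depends_on_all_inputs(unsigned t) {
    const unsigned masks[3] = { 0x55, 0x33, 0x0F };
    for (int k = 0; k < 3; k++)
        if (((t ^ (t >> (1 << k))) & masks[k]) == 0)
            return 0;
    return 1;
}

/* Count the NPN classes of 3-input functions that use every input. */
int count_npn_classes(void) {
    unsigned char seen[256] = {0};
    int classes = 0;
    for (unsigned t = 0; t < 256; t++) {
        if (!depends_on_all_inputs(t)) continue;
        unsigned c = npn_canonical(t);
        if (!seen[c]) { seen[c] = 1; classes++; }
    }
    return classes;
}
```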
<h1 id="implementation-details">Implementation details</h1>
<p>If you want to go implement this idea, here’s a bunch of stuff I figured
out, about the details that academic papers leave out as exercises for
the reader.</p>
<p>There are various useful operations we can do on truth tables
represented as numbers.</p>
<p>Note that I number inputs starting from 0 on the right, so in the
if-then-else truth table, F is input 0, T is input 1, and C is input 2.</p>
<h2 id="add-ignored-inputs">Add ignored inputs</h2>
<p>One operation is to extend an N-input truth table to an N+1-input truth
table, where the new input is ignored. For example, the 1-input truth
table for “NOT A” is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A | !A
0 | 1
1 | 0
</code></pre></div></div>
<p>If we introduce a new ignored input “B” to the left of A, then the truth
table for “NOT A” becomes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B A | !A
0 0 | 1
0 1 | 0
1 0 | 1
1 1 | 0
</code></pre></div></div>
<p>This is just duplicating the original truth table. On the numeric
representation, this can be computed efficiently as:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table</span> <span class="o">|</span> <span class="p">(</span><span class="n">table</span> <span class="o"><<</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">N</span><span class="p">))</span>
</code></pre></div></div>
<p>Applying this operation repeatedly allows adding any number of ignored
inputs.</p>
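<p>As a one-line helper in C (the name is mine):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Extend a table over n inputs to n+1 inputs, with the new left-most
 * input ignored, by duplicating the table into the upper half. */
uint64_t add_ignored_input(uint64_t table, int n) {
    return table | (table << ((uint64_t)1 << n));
}
```

<p>Applied to the 1-input “NOT A” table 0b01, this produces 0b0101, matching the table above.</p>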
<h2 id="evaluate-cofactors">Evaluate cofactors</h2>
<p>The “negative cofactor” of a Boolean function, with respect to input K,
is the subset of that function’s outputs where input K is false.
Similarly, the positive cofactor is when input K is true.</p>
<p>A bitmask that is 1 everywhere input K is 0 can be quickly computed as
the truth table for the circuit “NOT K”, which helps in looking at the
negative cofactors of a truth table. (The reverse, for the positive
cofactors, is the bitwise-NOT of that bitmask, or equivalently the
bitmask shifted left by 2^K bits, whichever is more convenient to
compute.)</p>
<p>If K is the left-most of the N inputs in this truth table—and
assuming arithmetic is being done in two’s-complement, as all current
computers do—then its negative cofactor bitmask can be efficiently
computed as:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">))</span> <span class="o">-</span> <span class="mi">1</span>
</code></pre></div></div>
<p>If K is not the left-most input, then build the bitmask for a truth
table where K <em>is</em> the left-most input, and then follow the above
procedure to add ignored inputs for each input to K’s left.</p>
<p>We’re going to use these negative cofactor bitmasks a lot, so as a
practical matter it may be worth precomputing these masks for every size
of truth table you plan to deal with, and every input at each size.</p>
<p>Actually, as long as you’re careful to ignore bits outside of your
current size of truth table, you can use any mask for size N with
smaller truth tables, so you may just need one bitmask for each possible
input at your largest table size.</p>
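<p>A sketch of such a precomputed table in C, assuming tables are stored in 64-bit words (so up to 6 inputs):</p>

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INPUTS 6  /* a 6-input table exactly fills a 64-bit word */

/* neg_mask[k] is 1 in every row where input k is 0, for tables of up
 * to MAX_INPUTS inputs; smaller tables can share these masks as long
 * as bits beyond the table are ignored. */
uint64_t neg_mask[MAX_INPUTS];

void init_neg_masks(void) {
    for (int k = 0; k < MAX_INPUTS; k++) {
        uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
        for (int n = k + 1; n < MAX_INPUTS; n++)
            m |= m << ((uint64_t)1 << n);
        neg_mask[k] = m;
    }
}
```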
<h2 id="count-1s">Count 1s</h2>
<p>Counting the number of bits which are set within a word is a common
enough operation, sometimes called “population count” or “popcount”,
that x86 CPUs have an instruction specifically for this purpose. There
are also <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">efficient bitwise-parallel methods</a> which, like many
“bit twiddling” hacks, are quite puzzling until you figure out the
trick.</p>
<p>Anyway, counting the 1s in a cofactor of an input is just a matter of
bitwise-AND between the truth table and the appropriate bitmask, then
applying a suitable popcount implementation.</p>
<h2 id="swap-subsequences-of-rows">Swap subsequences of rows</h2>
<p>Several of the more complex things we can do to a truth table can be
described as re-ordering the table’s rows. In general this would be a
tedious thing to do to our numeric representation: the usual bitwise and
arithmetic operators don’t allow swapping bits in arbitrary
permutations.</p>
<p>Happily, we’re only going to need certain kinds of re-ordering, such as
“swap rows 1 and 3 with rows 4 and 6,” and we can do that efficiently.</p>
<p>This procedure will take two parameters: a bit-mask for the
less-significant bits to swap, and an offset for how many bits to shift
that mask left to find the more-significant bits to swap with. In the
example above, the mask is 0b1010 (identifying rows 3 and 1), and the
offset is 3 (row 1 swaps with 4, and 3 swaps with 6, so both are 3 rows
apart).</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tmp</span> <span class="o">=</span> <span class="p">(</span><span class="n">table</span> <span class="o">^</span> <span class="p">(</span><span class="n">table</span> <span class="o">>></span> <span class="n">offset</span><span class="p">))</span> <span class="o">&</span> <span class="n">mask</span><span class="p">;</span>
<span class="n">table</span> <span class="o">^=</span> <span class="n">tmp</span> <span class="o">|</span> <span class="p">(</span><span class="n">tmp</span> <span class="o"><<</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>
<p>I based this on <a href="https://graphics.stanford.edu/~seander/bithacks.html#SwappingBitsXOR">“Swapping individual bits with XOR”</a> from
Sean Eron Anderson’s page of “Bit Twiddling Hacks”. I’ve extended Sean’s
procedure to handle non-contiguous subsequences of bits, and to hide one
of the offsets in the other two parameters, because it turns out to be
more convenient that way for this.</p>
<h2 id="negate-the-circuits-output">Negate the circuit’s output</h2>
<p>If we want to take the numeric representation of a truth table and get a
table representing the same Boolean function but with the output
inverted, just bitwise-NOT the table. In C-like languages,</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">~</span><span class="n">table</span>
</code></pre></div></div>
<p>If you’re storing truth tables in machine words that are wider than the
size of the table, you may want to clear any bits that aren’t part of
the table after you do this.</p>
<h2 id="negate-the-circuits-inputs">Negate the circuit’s inputs</h2>
<p>We’re going to want to take a truth table and get a table representing
the same Boolean function but with one of the inputs inverted. Another
way to say that is we’re swapping the negative and positive cofactors of
that input. So we can combine two earlier procedures to do this.</p>
<p>In this case, the mask is the negative cofactor bitmask for input K, and
the offset is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span>
</code></pre></div></div>
<p>We could pass that mask and offset to the procedure for swapping rows,
but in this case where every row is moving, it’s probably more efficient
to do this instead:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="n">table</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span> <span class="o"><<</span> <span class="n">offset</span><span class="p">)</span> <span class="o">|</span> <span class="p">((</span><span class="n">table</span> <span class="o">>></span> <span class="n">offset</span><span class="p">)</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="swap-two-of-the-circuits-inputs">Swap two of the circuit’s inputs</h2>
<p>When swapping inputs J and K, the rows where both inputs are equal don’t
change. We just need to swap the rows where J is 0 and K is 1 with the
rows where J is 1 and K is 0.</p>
<p>Assume J is the left-most variable of the two; if it isn’t, run this
procedure with the parameters the other way around.</p>
<p>The mask for this swap is the bitwise-AND of the negative cofactor mask
for J (the rows where J is 0) with the positive cofactor mask for K (the
rows where K is 1).</p>
<p>The offset is the difference between the index of the first 1 in J’s
column and the first 1 in K’s column in the truth table. This is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">J</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">)</span>
</code></pre></div></div>
<p>For 3-input functions like the if-then-else example truth table above,
there are three possible input swaps.</p>
<ul>
<li>Swap C with F: mask is 0b00001010, offset is 3</li>
<li>Swap C with T: mask is 0b00001100, offset is 2</li>
<li>Swap T with F: mask is 0b00100010, offset is 1</li>
</ul>
<p>In general there are N*(N-1)/2 possible input swaps for N inputs.</p>
<p>Note that any permutation of a circuit’s inputs can be described as a
series of pairwise swaps, so applying this procedure multiple times
allows arbitrary permutations.</p>
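<p>As a check on the masks and offsets listed above, here's the earlier row-swap procedure applied to the if-then-else table (0xCA):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Swap the rows selected by `mask` with the rows `offset` above them:
 * the XOR trick from the "swap subsequences of rows" section. */
uint64_t swap_rows(uint64_t table, uint64_t mask, int offset) {
    uint64_t tmp = (table ^ (table >> offset)) & mask;
    return table ^ (tmp | (tmp << offset));
}
```

<p>Swapping C with T turns “if C then T else F” into “if T then C else F”, whose table is 0xE2; swapping T with F gives “if C then F else T”, which is 0xAC.</p>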
<h2 id="combining-truth-tables">Combining truth tables</h2>
<p>Say you have the truth tables for arbitrary circuits A and B, and you
want the truth table for an OR gate with the outputs of A and B as its
inputs.</p>
<p>If A and B have the same inputs in the same order, then simply combining
the numeric representation of their truth tables with bitwise-OR does
the trick. The same works for any bitwise operator such as AND or XOR.
(This also works for operators with three or more inputs, like
if-then-else, but there aren’t any bitwise ternary operators in common
programming languages so that’s not super helpful.)</p>
<p>If A and B have different inputs, though, first you need to fix that.
You can use earlier procedures to add ignored inputs and then reorder
the inputs until both circuits match. This is probably easiest if you
impose a total order on your inputs and keep all your truth tables’
inputs sorted according to that order.</p>
<h2 id="detecting-and-removing-ignored-inputs">Detecting and removing ignored inputs</h2>
<p>Every additional input doubles the size of a truth table, so it’s useful
to remove any input from a Boolean function so long as doing that
doesn’t change the function’s behavior. This is an especially good idea
after you’ve combined truth tables as above, because many Boolean
algebraic simplifications fall out automatically. For example,</p>
<ul>
<li>
<p>When the input circuits compute <code class="language-plaintext highlighter-rouge">X XOR Y</code> and <code class="language-plaintext highlighter-rouge">X XOR Z</code>, combining
their truth tables with XOR produces the same table as <code class="language-plaintext highlighter-rouge">Y XOR Z</code>, with
X ignored.</p>
</li>
<li>
<p>When the input circuits compute <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">NOT X</code>, combining them with
AND produces the constant FALSE, and combining with OR produces the
constant TRUE; either way, X is ignored.</p>
</li>
</ul>
<p>A simple way to check for this condition is to try negating some input
using the earlier procedure. If the truth table doesn’t change, then
that input has no effect on the function.</p>
<p>Here’s a slightly more efficient method to check if input K is ignored,
given the negative cofactor bitmask for K:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="n">table</span> <span class="o">^</span> <span class="p">(</span><span class="n">table</span> <span class="o">>></span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">)))</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
</code></pre></div></div>
<p>If input K is ignored, you can swap it with the left-most input, then
discard the most-significant half of the table, which at that point will
be equal to the least-significant half. This is the reverse of the
earlier procedure for adding an ignored variable.</p>
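<p>Putting the test and the removal together in C (a sketch; the names are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask of the rows where input k is 0. */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* True if input k has no effect on the table's output.  Note the
 * parentheses: in C, == binds tighter than &. */
int input_is_ignored(uint64_t table, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(k, inputs);
    return ((table ^ (table >> (1 << k))) & mask) == 0;
}

/* Drop ignored input k: swap it to the left-most position, then keep
 * only the least-significant half of the table. */
uint64_t remove_ignored_input(uint64_t table, int k, int inputs) {
    int j = inputs - 1;  /* the left-most input */
    if (k != j) {
        uint64_t mask = neg_cofactor_mask(j, inputs) &
                        ~neg_cofactor_mask(k, inputs);
        int offset = (1 << j) - (1 << k);
        uint64_t tmp = (table ^ (table >> offset)) & mask;
        table ^= tmp | (tmp << offset);
    }
    return table & (((uint64_t)1 << ((uint64_t)1 << j)) - 1);
}
```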
<h2 id="enumerating-all-npn-classes">Enumerating all NPN classes</h2>
<p>I started investigating this topic partly because I wanted to know how
much I could compress a list of truth tables, and one way to frame that
question is “how many NPN classes are there?” So I needed a way to count
them. But when I learned that finding a perfect representative of an NPN
class can be slow, then I wondered how close a heuristic approximation
can get, so I needed to count those too.</p>
<p>Earlier I mentioned <a href="https://msoeken.github.io/papers/2016_fpl.pdf">“Fast Hierarchical NPN Classification”</a>
by Petkovska <em>et al</em>, which is all about optimizations you can apply if
you’re trying to find the canonical NPN class for many different truth
tables. That’s definitely applicable to the case of counting all NPN
classes, but if you do it exactly as they describe then their method
needs space proportional to the number of possible truth tables, which
is prohibitive for even 5-input tables.</p>
<p>I came up with a different method suited specifically for this counting
task, similar to the Sieve of Eratosthenes for finding prime numbers.
This relies on a fast implementation of the three heuristics.</p>
<p>Since we’re picking the numerically-smallest truth table as the
canonical representative of its NPN class, start with the truth table
with numeric value 0 and check successively larger truth tables.</p>
<p>Skip any truth table that ignores any of its inputs—including
table 0, which is the constant FALSE and ignores all of its inputs. This
is an especially fast test, so do it first. You don’t have to do this,
but the input permutation heuristic as I’ve described it here won’t
notice that the ignored inputs don’t matter, so these cases may have
more semi-canonical representatives than they would if you ran this same
process on the smaller table size with the ignored inputs removed.</p>
<p>Next, check whether this truth table is a semi-canonical class
representative. If it isn’t, then it definitely is not a canonical NPN
class representative, so skip it. The easiest way to check this is to
run the truth table through the heuristics and see whether they return
the same truth table unchanged.</p>
<p>Otherwise, when this truth table <em>is</em> a semi-canonical class
representative, we need to check whether we already saw any smaller
semi-canonical representative in the same class. There are two ways to
do this, depending on whether you’d rather save memory or save time:</p>
<ul>
<li>
<p>We could generate all negations/permutations of this truth table and
check whether any of them are less than the current candidate, which
is slow but makes the algorithm use a constant amount of memory. (I
think?)</p>
</li>
<li>
<p>When we definitely have a canonical NPN class representative, we can
generate all negations/permutations (which by definition are all
greater than or equal to this truth table), filter them down to only
the semi-canonical ones, and add those to a set of known non-canonical
truth tables. Then, a later candidate is canonical if it is not in
that set, and we can remove elements from the set as we pass them.
This is faster but it’s hard to put bounds on its memory usage.</p>
</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>I’ve had a great time exploring how Z3 works and interacting with its
maintainer via GitHub, and I’m learning a lot.</p>
<p>I’ve also learned a ton in the last few days about methods for
identifying Boolean functions that are equivalent to each other. I’ve
read three papers on this subject, and there are… more, so I’ve
probably only scratched the surface.</p>
<p>But just trying to understand those papers required a lot of background
I didn’t have, so in this post I’ve tried to distill all that
information into a summary that might help someone else get started.</p>
<p>In addition, I’ve described some improvements I found that weren’t in
the papers I read, and many bit-twiddling hacks to make a practical
implementation as fast as possible.</p>Jamey SharpHow not to improve video conferencing2021-04-07T12:30:00-07:00https://jamey.thesharps.us/2021/04/07/how-not-to-improve-video-conferencing<p>Over the last year, a lot more people have been using video-conferencing
platforms, such as Zoom, because they couldn’t safely meet in person.</p>
<p>I don’t like siloed platforms like Zoom. I don’t want to devolve into a
rant here, so I’ll just say: I believe that systems should always work
in a federated way if that’s feasible, like how people at <code class="language-plaintext highlighter-rouge">gmail.com</code>
can email people at <code class="language-plaintext highlighter-rouge">outlook.com</code> even though those are operated by
completely different companies. And we’ve had email-like standards
(<a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">SIP</a>) for voice, video, and chat for over two decades; apparently,
cell phone carriers have been using them for most of that time, and the
current iteration (<a href="https://en.wikipedia.org/wiki/Voice_over_LTE">VoLTE</a>) is <a href="https://gsacom.com/paper/volte-update-october-2020-global-status/">actively deployed in 104
countries/territories</a> around the world. So it’s not
like we don’t know how to make the technology work.</p>
<p>Recently, I tried to counter my own biases and ask: is there a reason
I would consider justified for building a non-federated conferencing service?
In particular, what impact do architectural choices have on costs and
efficiency for the various participants?</p>
<p>While considering these questions, I dug through a number of research
papers on cryptography and on networking, to see whether academia can
offer improvements for video-conferencing that industry hasn’t picked up
yet.<sup id="fnref:theory" role="doc-noteref"><a href="#fn:theory" class="footnote">1</a></sup></p>
<p>I learned a lot, but ultimately what I can report today are <em>negative</em>
results. I’m sharing the ideas I considered in the hopes that you may be
inspired to apply some of these research results to some setting where
they’re more appropriate. In addition, I’m sharing the process I
followed to ultimately reject those ideas in the hopes that you may find
the reasoning helpful in your own investigations.</p>
<h1 id="secure-audio-conferencing">Secure audio conferencing</h1>
<p>Audio-only conference calls have an interesting property: it’s possible
to mix everyone’s audio into a single stream. With video calls, each
participant needs to receive a copy of every other participant’s video,
which imposes practical limits on how many people can join a video
conference. But no matter how many people are sending audio, if a
central server handles mixing all the audio streams together, then each
participant’s bandwidth usage is constant.</p>
<p>That sounds like a good justifiable reason to centralize audio-only
conference calls, at least.</p>
<p>But I don’t want that centralized service to be able to listen in on my
calls. So is there a way to send encrypted audio through the server and
let it mix the audio together but without being able to decrypt it?</p>
<p>The answer is yes! In fact while discussing this with some friends, one
of them reminded me that a team at Galois, where I used to work, built a
prototype of exactly this application during their work on
<a href="https://galois.com/project/secure-multi-party-communication/">ShareMonad</a>; they wrote about the experience in Launchbury <em>et al</em>’s
paper, <a href="https://galois.com/wp-content/uploads/2014/08/pub_JL-DWA-TD-EM_ApplicationScaleSecureMPC.pdf">Application-Scale Secure Multiparty Computation</a>.
Their approach requires that you have three servers which you trust
won’t share data with each other, which presumably means they’re
operated by independent entities and maybe in different legal
jurisdictions; as a result, the deployment story seems awfully
complicated. Also I think if they’d used the approach I’m about to
describe, their implementation would have been simple enough to not make
for an interesting academic paper. 😅</p>
<p>Mixing audio fundamentally comes down to taking a sample from each
source and adding them together. Because it’s just addition, there are a
lot of tricks we can play. A variant of the ShareMonad-style approach,
for example, splits a sample into multiple “shares”, where each share is
random but the sum of all the shares equals the original value. So each
client sends a different share of their samples to each server; each
server adds all the shares it received, resulting in one share of the
mixed result; and all the servers send their result shares back to the
clients, which add up the shares to find the final mixed
sample<sup id="fnref:sharemonad" role="doc-noteref"><a href="#fn:sharemonad" class="footnote">2</a></sup>. If all the servers pool together the shares they’ve
been given, then they can learn what the original inputs were; but if
just one server keeps its shares secret, then none of the servers learn
anything about whatever is being discussed on the call.</p>
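<p>A toy sketch of that share-splitting scheme in C, with three servers and arithmetic modulo 2<sup>16</sup>; a real implementation would need a cryptographic random number generator rather than <code>rand()</code>, and one network participant per server:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SERVERS 3

/* Split a sample into SERVERS additive shares: each share alone looks
 * random, but the shares sum to the sample modulo 2^16.  Use a real
 * CSPRNG in practice, not rand(). */
void make_shares(uint16_t sample, uint16_t shares[SERVERS]) {
    uint16_t sum = 0;
    for (int i = 0; i < SERVERS - 1; i++) {
        shares[i] = (uint16_t)rand();
        sum += shares[i];
    }
    shares[SERVERS - 1] = (uint16_t)(sample - sum);
}

/* Mix two clients' samples: each server sums the one share it received
 * from each client, and the clients then sum the servers' results.  No
 * single server ever sees either sample. */
uint16_t mix_two(uint16_t a, uint16_t b) {
    uint16_t sa[SERVERS], sb[SERVERS], mixed = 0;
    make_shares(a, sa);
    make_shares(b, sb);
    for (int i = 0; i < SERVERS; i++) {
        uint16_t server_result = (uint16_t)(sa[i] + sb[i]);
        mixed += server_result;
    }
    return mixed;
}
```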
<p>Another approach is to use <a href="https://en.wikipedia.org/wiki/Homomorphic_encryption">homomorphic encryption</a>. In an
additively-homomorphic cryptosystem, I can give you some encrypted
numbers, and there’s some procedure you can follow to find the
encryption of the sum of those numbers, even if I don’t give you the
decryption key. The most-used additively homomorphic cryptosystem was
published by <a href="https://en.wikipedia.org/wiki/Paillier_cryptosystem">Paillier</a> in 1999, and simplified and generalized by
<a href="https://en.wikipedia.org/wiki/Damg%C3%A5rd%E2%80%93Jurik_cryptosystem">Damgård and Jurik</a> in 2001. Much of the research since
then has been pursuing “Fully Homomorphic Encryption” which supports not
just addition but also multiplication (or other pairs of operators),
because there aren’t that many applications where just having addition
is enough. But audio conferencing is one of those rare exceptions!</p>
<p>Modern cryptosystems generally have a security parameter, where higher
numbers are expected to be secure against adversaries with more
computing power. In the case of Paillier, if the security parameter is
2048 bits, then the plaintexts it can encrypt are up to 2048 bits and
the resulting ciphertexts are 4096 bits.<sup id="fnref:expansion" role="doc-noteref"><a href="#fn:expansion" class="footnote">3</a></sup> This is important
because we want as high of a security parameter as we can afford, but at
reasonable security levels we can’t do very many encryptions or
decryptions per second, so we need to make the best use we can of each
one.</p>
<p>So we need to pack multiple samples into a single plaintext. If we’re
using 8-bit audio samples, then we can fit about 250 of them into a
2048-bit plaintext<sup id="fnref:max-plaintext" role="doc-noteref"><a href="#fn:max-plaintext" class="footnote">4</a></sup>. That said, if we add two audio
chunks packed this way together, any sample could overflow into the
least significant bit of the next sample. So we need to leave an extra
zero bit in between each sample. In general, if we have <code class="language-plaintext highlighter-rouge">N</code> participants
in the conference, then the number of padding bits we need is the base-2
log of <code class="language-plaintext highlighter-rouge">N</code>, rounded up. If we leave ourselves 10 bits for each 8-bit
sample, then we can have up to four participants and about 200 samples
per plaintext. Note that the server doesn’t need to know which bits are
padding, so the participants can negotiate amongst themselves regarding
the number of samples per plaintext and the number of bits per sample.</p>
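<p>A toy sketch of that packing in C, using a 64-bit word to stand in for the 2048-bit plaintext; with a 10-bit stride and 8-bit samples, sums of up to four streams can't overflow between lanes:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit samples `stride` bits apart.  The stride - 8 padding
 * bits between lanes absorb carries, so up to 2^(stride-8) packed
 * words can be summed without samples bleeding into each other.  A
 * real implementation packs into a 2048-bit Paillier plaintext; a
 * 64-bit word is enough to show the idea. */
uint64_t pack_samples(const uint8_t *samples, int count, int stride) {
    uint64_t packed = 0;
    for (int i = 0; i < count; i++)
        packed |= (uint64_t)samples[i] << (i * stride);
    return packed;
}

/* Pull lane i back out of a (possibly summed) packed word. */
uint64_t unpack_lane(uint64_t packed, int i, int stride) {
    return (packed >> (i * stride)) & (((uint64_t)1 << stride) - 1);
}
```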
<p>Up to this point I’ve assumed that the audio is uncompressed, but
there’s one very common audio compression trick that works just fine in
this setting: <a href="https://en.wikipedia.org/wiki/Modified_discrete_cosine_transform">MDCT</a>. Used by everything from AAC to WMA, the MDCT
transforms a sequence of time-domain audio samples (plus half as many
samples from before and after the current block) into an equal number of
frequency bins. The neat thing is that adding two blocks together and
then transforming to the frequency domain gives the same result as
transforming both blocks and then adding them. So a server performing
Paillier additions doesn’t even need to know whether the plaintexts
contain time-domain or frequency-domain representations. To compress the
transformed blocks, the participants just need to agree on how many bits
to use for each frequency bin. For example, they might choose to use
fewer bits for frequencies outside the normal range of human speech.</p>
<h2 id="estimating-feasibility">Estimating feasibility</h2>
<p>Okay, so let’s estimate how much compute power this plan takes. If we
assume 200 samples per plaintext and a sample rate of 8kHz, that means
each participant needs to do 40 encryptions and 40 decryptions per
second, and the server needs to do <code class="language-plaintext highlighter-rouge">40*(N-1)</code> additions per second for
<code class="language-plaintext highlighter-rouge">N</code> participants. To get a rough idea of how fast Paillier operations
are, I ran the microbenchmarks for the Rust <a href="https://crates.io/crates/paillier-common">paillier-common</a> crate on
my laptop, a 2015 model with a 3.1GHz Intel Core i7 CPU.</p>
<p>In my completely-unscientific testing, one encryption and one decryption
of 2048-bit plaintexts together take about 7ms, which is about 28% of
the 25ms we have before it’s time to process the next 200 samples. So my
laptop could probably handle this scheme, but a phone might not be able
to.</p>
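<p>For reference, the arithmetic behind those numbers (the 7ms figure is just my rough benchmark result from above):</p>

```python
SAMPLE_RATE = 8_000           # samples per second
SAMPLES_PER_PLAINTEXT = 200
CRYPTO_MS = 7.0               # measured encrypt + decrypt time, roughly

plaintexts_per_sec = SAMPLE_RATE / SAMPLES_PER_PLAINTEXT  # 40.0
budget_ms = 1000 / plaintexts_per_sec                     # 25.0 ms per block
share = CRYPTO_MS / budget_ms                             # 0.28, i.e. 28%
```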
<p>The Paillier additions are very fast though, by comparison. My laptop
CPU, which is presumably less powerful than a reasonable server, could
handle something over 4,000 participants simultaneously on one CPU core.</p>
<p>So this seems feasible and I think that’s a bunch of really cool
results, but…</p>
<h2 id="is-this-worth-doing">Is this worth doing?</h2>
<p>No. It’s good enough to send a separate audio stream for each person who
is speaking; that may even be a superior approach. The recipients can
mix the audio into a mono stream for playback, and the streams can be
end-to-end encrypted using bog-standard symmetric-key cryptography even
if they’re routed through a central server.</p>
<p>In a conference call, it isn’t practical to have many people talking
simultaneously, because then nobody can make out what anybody is saying.
As long as the maximum number of simultaneous speakers is small, the
amount of bandwidth required to listen in stays small no matter how many
listeners join.</p>
<p>Anyway, the amount of bandwidth needed for compressed audio is tiny.
High-quality lossy compression for speech needs maybe 40kbps. Given the
parameters I used above, my scheme would require 160kbps in each
direction, so as long as there are fewer than eight simultaneous
speakers, the simpler system is more efficient even with higher quality
audio. Using an MDCT and fewer bits for some frequency bins helps, but
it probably doesn’t make up that big of a difference.</p>
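<p>The 160kbps figure follows from the parameters above, assuming Paillier&#8217;s 2&#215; ciphertext expansion on 2048-bit plaintexts:</p>

```python
PLAINTEXT_BITS = 2048
CIPHERTEXT_BITS = 2 * PLAINTEXT_BITS  # Paillier doubles the plaintext size
BLOCKS_PER_SECOND = 40                # 8kHz / 200 samples per plaintext

kbps = BLOCKS_PER_SECOND * CIPHERTEXT_BITS / 1000
print(kbps)  # 163.84 -- roughly the 160kbps quoted above
```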
<p>I hope it’s clear that none of this means homomorphic encryption
schemes, like the Paillier or Damgård–Jurik cryptosystems, are useless.
Secure multiparty computation, like ShareMonad, isn’t invalidated as a
research field by this result either. But I do think that audio
conferencing isn’t going to be a fruitful application domain for any of
these tools, because its challenges are better addressed using more
straightforward solutions.</p>
<h1 id="efficient-peer-to-peer-video-conferencing">Efficient peer-to-peer video conferencing</h1>
<p>Unlike audio conferencing, there are no shortcuts in video conferencing:
if you want to send video from <code class="language-plaintext highlighter-rouge">N</code> people to some recipients, you need
to send <code class="language-plaintext highlighter-rouge">N</code> times as much video data. Anyway, I just got done arguing

that the audio shortcut of mixing down to one stream isn’t actually
useful, so anything we might say about video applies to some extent to
audio too.</p>
<p>But maybe academic research can point the way to more efficient
distribution of that video data across the internet?</p>
<p>Before we can investigate that question, we have to decide what we’re
trying to optimize.</p>
<p>If you’re coming from the perspective of operating a video-conferencing
platform like Zoom, you might look at the costs associated with routing
all that video through your own servers and networks, and wonder whether
a more peer-to-peer system might save you money.</p>
<p>If your perspective is that you don’t trust video-conferencing platforms to
keep your meetings secure, then you also might wonder whether you can
convince the people you communicate with to use a peer-to-peer system.</p>
<p>On the other hand, if you operate some portion of the public internet,
you might wonder whether there’s a way people could use less of your
network for video-conferencing, leaving more room for whatever it was we
did with the internet before the pandemic.</p>
<p>The bad news is that no matter what, there will be at least <code class="language-plaintext highlighter-rouge">N*(N-1)</code>
video streams bouncing around for a video-conference where all <code class="language-plaintext highlighter-rouge">N</code>
participants have their cameras turned on, because one way or another,
each participant has to receive one video stream from every participant
except themself. If the video is sent to a central server and
redistributed from there, the minimum is <code class="language-plaintext highlighter-rouge">N*N</code> streams.</p>
<p>Note that since every stream has both a sender and a receiver, if you
add up the bandwidth costs for all participants, they come to twice the
number of streams.</p>
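<p>A quick sketch of that stream accounting, for a few conference sizes:</p>

```python
def streams(n, central_server=False):
    """Total video streams in flight for n camera-on participants."""
    # Peer-to-peer: each participant receives n-1 streams.
    # Central server: n uploads in, plus n*(n-1) copies back out.
    return n * n if central_server else n * (n - 1)

for n in (2, 3, 10):
    # participants, p2p streams, server streams, summed p2p bandwidth
    # (each stream has a sender and a receiver, hence the factor of 2)
    print(n, streams(n), streams(n, central_server=True), 2 * streams(n))
```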
<p>The good news is that we can shift the costs of those streams around,
and perhaps reduce the cost of the underlying infrastructure, by
organizing the participants into different structures.</p>
<h2 id="centralized-platforms">Centralized platforms</h2>
<p>The simplest way to engineer a video-conferencing platform is to have
every participant connect to a central server, which copies everyone’s
video stream to everyone else. This is nice for the participants, who
each need enough bandwidth for <code class="language-plaintext highlighter-rouge">N</code> video streams, which is the bare
minimum. On the other hand, that central server faces <code class="language-plaintext highlighter-rouge">N*N</code> video
streams, which places much harder limits on how many participants the
platform can handle in one video conference. I might be able to handle
one thousand simultaneous video streams on my gigabit
fiber<sup id="fnref:tiny-pictures" role="doc-noteref"><a href="#fn:tiny-pictures" class="footnote">5</a></sup>, but then that poor server is facing one <em>million</em>
streams for one conference.</p>
<p>The “one central server” plan also isn’t good for those folks operating
the public internet. Imagine I placed that server in Europe but everyone
in some conference call is in North America. Now there are <code class="language-plaintext highlighter-rouge">N*N</code> streams
crossing the expensive trans-Atlantic cables even though exactly 0 of
them are of interest to anyone on the European side of the ocean.</p>
<p>This isn’t the way any serious platform operates, clearly, but let’s set
aside the centralized platforms for a moment and go to the other
extreme.</p>
<h2 id="everyone-sends-directly-to-everyone-else">Everyone sends directly to everyone else</h2>
<p>The most straightforward way to make a video conferencing system
peer-to-peer is to require every participant to open a direct connection
to every other participant, with each sending the other their own video
feed.</p>
<p>This option is, hands down, the most fair way to distribute the
bandwidth costs: every participant needs enough bandwidth for <code class="language-plaintext highlighter-rouge">2*(N-1)</code>
streams. No more, no less. That’s up to twice what they needed in the
centralized model, but on the other hand there isn’t anybody getting
stuck with the dramatically higher <code class="language-plaintext highlighter-rouge">N*N</code> bandwidth costs.</p>
<p>Fairness isn’t necessarily a good thing, since participants often have
different amounts of bandwidth available. Fairness could mean limiting
the size of a video conference to the number of people that the
least-well-connected participant has bandwidth for.</p>
<p>That said, direct connections open up the option of sending video
encoded at a different bitrate for some recipients, at the cost of more
CPU time spent on video encoding. So that tradeoff might pay off.</p>
<p>This option also may offer the lowest end-to-end latency, since each
stream goes as directly as possible to its recipient. Or it may not, if
some platform builds out a global network optimized for latency and can
bypass networks that are optimized for throughput and cost. It’s fun to
look at e.g. <a href="https://www.peeringdb.com/net/8731">Zoom’s PeeringDB entry</a> and speculate about
what they’re doing with the peering exchange points that they have a
presence in on every continent.</p>
<p>On today’s internet, not everyone can connect directly to everyone else,
due mostly to <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a>. But using direct connections by default and
falling back to platform-provided <a href="https://en.wikipedia.org/wiki/Traversal_Using_Relays_around_NAT">TURN</a> relays still cuts out a lot
of the platform’s costs in the centralized model. In fact, operating a
global network of TURN relays is currently a useful service in its own
right; see, for example, <a href="https://www.twilio.com/stun-turn">Twilio’s Global Network Traversal
Service</a>. I just hope some day we don’t need such services any
more.</p>
<h2 id="overlay-multicast">Overlay multicast</h2>
<p>There are often clusters of participants in different geographic
locations—say, a few in San Jose, some others in Toronto, and
let’s rope in the team in London while we’re at it. Maybe what we should
do is pick one well-connected participant in each cluster who handles
the long-distance video forwarding to one or two other clusters, and
then that representative connects to everyone else in their cluster. If
that’s too much for one of those representatives, then we can split that
cluster into several smaller clusters, and the top-level representative
just has to connect to the mini-cluster representatives.</p>
<p>Any structure like this can be seen as a tree, although in this context
it doesn’t actually matter which node you think of as the root. Since
every node except the root has a single parent, and every participant’s
video is flowing either up or down the tree at every edge, this still
results in <code class="language-plaintext highlighter-rouge">N*(N-1)</code> video streams, just like when everyone sends
directly to each other. The difference is that an individual
participant’s bandwidth requirements are now <code class="language-plaintext highlighter-rouge">N*D</code>, where <code class="language-plaintext highlighter-rouge">D</code> is the
degree of that node, or in other words how many other participants it’s
connected directly to.<sup id="fnref:tree-examples" role="doc-noteref"><a href="#fn:tree-examples" class="footnote">6</a></sup></p>
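<p>As an illustration (my own sketch, using the balanced binary tree shape from the footnote), the per-participant bandwidth in an overlay tree works out like this:</p>

```python
# A hypothetical 7-participant overlay tree: a balanced binary tree.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
n = 7

degree = [0] * n
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Each participant's bandwidth is N*D streams, D being its degree.
bandwidth = [n * d for d in degree]
print(bandwidth)  # [14, 21, 21, 7, 7, 7, 7]

# Summed over everyone this is 2*N*(N-1): each of the N*(N-1) streams
# is counted at both its sender and its receiver.
assert sum(bandwidth) == 2 * n * (n - 1)
```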
<p>This lets us have a kind of “controlled unfairness”. Each leaf in the
tree only connects to its parent, so it needs enough bandwidth for <code class="language-plaintext highlighter-rouge">N</code>
streams, just like in the central server case. The interior nodes can
connect to as many other participants as they can afford. Higher degree
makes the tree flatter, which decreases worst-case end-to-end latency
but, of course, increases bandwidth costs for that participant.</p>
<p>The problem is that participants have incentives that run counter to
doing what’s best for the group as a whole: in this context, they often
want to minimize their own bandwidth costs. We have a lot of choices for
how to structure the tree, but how do we know which ones people will
actually use?</p>
<p>While I was researching this topic, I found a paper called
<a href="http://www.afergan.com/research/papers/rgmo_infocom06.pdf">“Repeated-Game Modeling of Multicast Overlays”</a>, by
Afergan and Sami, published in the proceedings of IEEE INFOCOM 2006.
Their model was that only one source would be broadcasting and the root
of the tree would receive the broadcast first, so they consider an
additional incentive which is to be as close to the root as possible in
order to get the lowest latency. (In fairness, that’s what most people
mean by “multicast”; I’m abusing the terminology a bit because the same
structure works here.) Although that incentive doesn’t apply here, most
of their modeling and reasoning is still valid in this situation too.</p>
<p>I enjoyed reading this paper! It provided a convincing way to reason
about individual incentives, and turned the results into system design
principles for encouraging individuals to cooperate.</p>
<p>There are many other papers on the topic of “overlay multicast” which
suggest different ways to structure these trees, so anyone who wants to
go down this road has lots to work with. But…</p>
<h2 id="is-this-worth-doing-1">Is this worth doing?</h2>
<p>Specifically, should video conferencing avoid having a centralized
platform but also avoid having every participant connect directly to
each other?</p>
<p>This isn’t totally clear, but no, I don’t think so.</p>
<p>The first model, with a central server, is roughly the same as a tree
where the root has all other participants as its direct children. I find
this to be a useful way to think about how centralized platforms can
scale to multiple servers: organize the servers in a tree, and let
conference participants connect (as leaves of the tree) to whichever
server is geographically closest to them.</p>
<p>Keeping the interior nodes of the tree under the control of a
centralized platform means that all the incentives are aligned: the
operators of those nodes care about the sustainability of the network as
a whole, because it’s their network. They also have the ability to place
those nodes wherever they’re most needed, rather than relying on the
geographic distribution of the participants, and they have the ability
to control the quality of the connections between those nodes.</p>
<p>In the peer-to-peer setting, requiring every participant to connect
directly to each other doesn’t impose that much of a bandwidth cost. For
two participants the cost is exactly the same; for three, it’s 33%
higher. As the number of participants grows, the bandwidth cost that the
peer-to-peer participants have to shoulder tends asymptotically toward
twice the cost of using a central server.</p>
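<p>The ratio between the two models is simple enough to tabulate:</p>

```python
def ratio(n):
    """Per-participant cost of full-mesh p2p relative to a central server."""
    return 2 * (n - 1) / n  # 2*(N-1) streams versus N streams

print(ratio(2))    # 1.0  -- identical cost with two participants
print(ratio(3))    # about 1.33, i.e. 33% higher
print(ratio(100))  # 1.98 -- approaching twice the cost
```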
<p>Forming an overlay multicast tree is complicated because, as the
repeated-game paper showed, protocol and policy choices must be made
with the participants’ incentives in mind. Since the only advantage over
direct connections is that some nodes save a little bandwidth by pushing
part of their load onto others, I don’t think that complexity is
justified.</p>
<p>Most of these considerations would go out the window if IP multicast had
been widely deployed on the public internet. If the routing
infrastructure at the IP layer supported sending only one copy of a
stream across a particular link even when multiple recipients want it on
the other end, then I would say everybody should just use that and call
it a day. Every participant’s bandwidth costs would be from just <code class="language-plaintext highlighter-rouge">N</code>
video streams, like in the central server case but without any central
servers, and every video stream would cross any given network link
exactly once. But the key there is that it has to be integrated with the
underlying routing decisions, not bolted on as an application-level
overlay.</p>
<p>sigh… maybe someday we’ll get an internet that does everything,
but I’m not holding my breath.</p>
<h1 id="conclusions">Conclusions</h1>
<p>N-to-N audio/video conferencing is tricky to make scalable, but there’s
been significant demand lately. Given how much academic research hasn’t
found industry adoption, it’s worth checking whether there are hidden
gems that can drive the next big innovation. In this case, the research
I evaluated didn’t yield any of those, which always feels disappointing.</p>
<p>But I hope you’ve found the process I went through interesting, at
least, and perhaps picked up some tricks that you’ll be able to apply to
other challenges you encounter some day.</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:theory" role="doc-endnote">
<p>Cryptography is a fascinating field with all sorts of strange and
wonderful theoretical results. So it seems a shame that out of that
entire field, the only things impacting most people’s life
day-to-day are public-key encryption, symmetric-key encryption, and
secure hash functions. Granted, these three tools <em>have</em> had a huge
impact; all three are necessary to make the little “lock icon” in
web browsers meaningful. (In addition, secure hash functions have a
huge impact on the world’s electricity consumption via proof-of-work
blockchains such as Bitcoin; most tools have both good and bad uses,
after all.)</p>
<p>More generally, there’s a lot of academic research out there that
has not found applications besides publishing more research papers.
That doesn’t mean it was a bad idea to do that research! It may mean
the right application for it just hasn’t appeared yet, or it may
inspire other research that’s more immediately applicable, or it may
just serve to show that people shouldn’t waste time doing that
particular thing again. All of those outcomes are valuable. <a href="#fnref:theory" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sharemonad" role="doc-endnote">
<p>The ShareMonad implementation was made more complicated than this
because they had the servers change between logarithmic and linear
representations as well as clamping the mixed samples back into
valid dynamic range. They also generate a different mix for each
client, excluding that client’s input while mixing all the other
streams together. But all that work can be done on the clients,
either before encrypting their individual samples or after
decrypting the mixed samples. The only work which requires seeing
samples from multiple clients is the addition step, and that part is
easy. <a href="#fnref:sharemonad" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:expansion" role="doc-endnote">
<p>The Damgård–Jurik cryptosystem supports a parameter called <code class="language-plaintext highlighter-rouge">s</code>,
where the plaintext can be <code class="language-plaintext highlighter-rouge">s</code> times as long as the security
parameter, and the ciphertext will be <code class="language-plaintext highlighter-rouge">s+1</code> times as long, for a
ciphertext expansion rate of <code class="language-plaintext highlighter-rouge">(s+1)/s</code>. Paillier is the special case
where <code class="language-plaintext highlighter-rouge">s</code> is 1, so the ciphertext expands by a factor of 2 compared
to the plaintext; higher values of <code class="language-plaintext highlighter-rouge">s</code> are more space-efficient but
a little slower per bit.</p>
<p>This was not clear to me from the Wikipedia articles on the subject,
so to get the whole story I read the paper <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.7811">“A Generalization of
Paillier’s Public-Key System with Applications to Electronic
Voting”</a> by Damgård, Jurik, and Nielsen.
That’s a fascinating paper in its own right, including a bunch of
useful zero-knowledge proofs that are relevant to electronic voting.
It also includes performance measurements for a 750MHz Pentium III,
which were not super helpful for my purposes. <a href="#fnref:expansion" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:max-plaintext" role="doc-endnote">
<p>Although 256 8-bit values fit in 2048 bits, a Paillier plaintext
must be numerically less than the public key, which is the product
of two 1024-bit prime numbers, and is therefore some number smaller
than <code class="language-plaintext highlighter-rouge">2^2048</code>. <a href="#fnref:max-plaintext" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tiny-pictures" role="doc-endnote">
<p>Yes, yes, I know, if I try to display 1,000 video streams on my
screen, they’ll all be so small as to be useless. An alternate title
for this post could have been “Asymptotic complexity isn’t
everything.” <a href="#fnref:tiny-pictures" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tree-examples" role="doc-endnote">
<p>I find it informative to consider some specific shapes that an
overlay multicast tree may take on:</p>
<p>In the extreme case where every node except the last has exactly one
child, the two nodes at the ends (the root and the leaf) have degree
1 and the longest possible delay from each other; everyone else has
degree 2, so they need only slightly more bandwidth than in the case
where everyone connects directly to each other.</p>
<p>If the tree is a balanced binary tree, then about half the
participants have degree 1, while the rest have degree 3, except for
the root which has degree 2. <a href="#fnref:tree-examples" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey Sharp

Over the last year, a lot more people have been using video-conferencing platforms, such as Zoom, because they couldn’t safely meet in person.

XCB: protocol specifications as data
2021-03-25T21:14:41-07:00
https://jamey.thesharps.us/2021/03/25/xcb-protocol-specifications-data

<p>This year is the 20th anniversary of when I started writing <a href="https://xcb.freedesktop.org/">XCB</a>, and
I’ve been thinking about how best to implement network protocols ever
since. I have a lot to say on that topic but in this post I just want to
tell the story of how XCB’s protocol specification language came to be.</p>
<p>This story begins over sushi lunch on the last Thursday of August, 2001.
I’m not sure when <a href="https://keithp.com/">Keith Packard</a> and Professor <a href="https://web.cecs.pdx.edu/~bart/">Bart Massey</a> began
discussing how a C-language client library for the X protocol ought to
work, but Bart had been passing around a paper abstract on the topic for
a couple months by then. Most of XCB’s eventual design principles were
already described in that abstract, even though no code existed yet. In
particular, Bart wrote:</p>
<blockquote>
<p>It is expected that [XCB] will be implemented as directly as possible
from the X Protocol specification. It would be nice to at least
partially automate the generation of the API implementation directly
from the X Protocol document: this concept is currently being
explored.</p>
</blockquote>
<p>The person exploring that concept was me, and that Thursday I was
telling Bart and Keith that I didn’t think I could get it to work. The
existing human-readable documentation on the X protocol followed some
consistent typographical conventions that seemed parsable, but after
writing a small pile of Perl with limited success, I concluded that was
a dead end.</p>
<p>Over lunch, Bart and Keith suggested that I could instead try writing
new protocol specifications with code generation in mind. Specifically,
Bart suggested that I use the <code class="language-plaintext highlighter-rouge">m4</code> text preprocessor to express each
part of the protocol as a macro invocation.</p>
<p>Of the people who have heard of <code class="language-plaintext highlighter-rouge">m4</code>, the most likely response to any
mention of it is some combination of disgust and horror. I think most
people have only experienced it in conjunction with either Sendmail
configuration or <code class="language-plaintext highlighter-rouge">autoconf</code>, and in both cases <code class="language-plaintext highlighter-rouge">m4</code> was used to paper
over complexity by burying it in layers of complex macros. I don’t think
it’s fair to blame <code class="language-plaintext highlighter-rouge">m4</code> for that, though.</p>
<p>Anyway, I didn’t know any of that. So while Keith told Bart over the
weekend that he couldn’t believe Bart would inflict <code class="language-plaintext highlighter-rouge">m4</code> on me, and
while Bart was starting to feel guilty about it, I was naively forging
ahead and making it work.</p>
<p>XCB’s code generation continued to be based on <code class="language-plaintext highlighter-rouge">m4</code> for the next three
years, until <a href="https://github.com/joshtriplett/">Josh Triplett</a> took Bart’s “Open Source UNIX Software
Development” class and replaced all my beautiful <code class="language-plaintext highlighter-rouge">m4</code> macros with a
combination of XML and XSLT. Despite that, we became good friends. In
turn, Josh’s XSLT was later replaced with Python, which is the
combination that XCB still uses now.</p>
<p>Today, XCB consists of about 2,900 lines of hand-written C, and about
4,000 lines of Python that’s used only at build time. The protocol
specifications total almost 19,000 lines of XML. By contrast, libX11
(which XCB was intended to replace) is about 93,000 lines of
hand-written C, not counting the many libraries that implement various
protocol extensions.</p>
<p>There has only been one CVE assigned to XCB, ever: <a href="https://cve.circl.lu/cve/CVE-2013-2064">CVE-2013-2064</a>. By
contrast, libX11 and its extension libraries have had over 50, in some
cases because similar bugs were replicated across many of the libraries.
This highlights the value of the <a href="http://langsec.org/">Language-theoretic Security</a>
(LangSec) philosophy, which I only just learned is a specific thing
people study. (See <a href="http://langsec.org/papers/langsec-cwes-secdev2016.pdf">The Seven Turrets of Babel: A Taxonomy of LangSec
Errors and How to Expunge Them</a> for a comprehensive
overview of security issues in protocol implementations.)</p>
<p>Besides reducing the amount of code to maintain and making the
implementation more resistant to security bugs, making the protocol
descriptions into pure data instead of code enabled new software to be
built that was difficult to write before.</p>
<p>My favorite example is that in 2009, <a href="https://gitlab.com/wireshark/wireshark/-/issues/2981">Wireshark’s protocol dissector for
X was replaced</a> with code auto-generated from XCB’s XML
protocol descriptions. Before that, it could only dissect the core
protocol, but this change allowed it to support all the X protocol
extensions as well. As extensions have been added and changed over time,
Wireshark has only needed to be rebuilt against newer xcbproto, and it
immediately gained support for the new extensions.</p>
<p>Bart’s original stated goal, that we should automate the generation of
the API implementation, really paid off. This was the first project
where I saw the value of machine-readable protocol specifications. I’ve
kept trying to replicate that result ever since, but I’ve never been
completely satisfied with the available options for specification
languages or methodologies.</p>
<p>I intend for this to be the first post in a series about protocol
implementations I’ve worked on, where each one will highlight different
aspects of desirable features for protocol implementers. Then I’ll wrap
up the series with discussion of the connection between network
protocols and an aspect of formal language theory which I recently
became aware of, called “subsequential finite state transducers”.</p>
<p>Until then, I hope you enjoyed this peek behind the curtains of XCB’s
development!</p>Jamey Sharp

This year is the 20th anniversary of when I started writing XCB, and I’ve been thinking about how best to implement network protocols ever since. I have a lot to say on that topic but in this post I just want to tell the story of how XCB’s protocol specification language came to be.

Quickly building minimal Docker containers with Nix
2021-02-02T17:12:00-08:00
https://jamey.thesharps.us/2021/02/02/docker-containers-nix

<p>For several years now I’ve been using <a href="http://dokku.viewdocs.io/dokku/">Dokku</a> on my personal web
server as an easy way to spin up new web apps I want to play around
with. Like the <a href="https://www.heroku.com/">Heroku</a> platform on which it’s modeled, Dokku allows
me to deploy a new version of an app by simply running <code class="language-plaintext highlighter-rouge">git push</code>, but
Dokku is fully open-source and I can self-host it.</p>
<p>I’ve written before that Dokku imposes constraints I don’t like, so as
part of a plan to replace it with something that more exactly meets my
needs, I wrote “<a href="/2020/04/24/most-general-reverse-proxy/">The most general reverse proxy</a>”. That only addressed
a small portion of what Dokku is currently doing for me, though, and I’m
starting to think I need to write my own service manager to capture the
rest; neither systemd nor its competitors seem to have any support for
rolling deployments.</p>
<p>Meanwhile, I’ve also been working on addressing the things I find most
painful about Dokku as I continue using it. And I’ve concluded that what
frustrates me most in routine usage of Dokku is all inherited from
Heroku’s design, by way of the <a href="https://github.com/gliderlabs/herokuish">Herokuish</a> compatibility layer that
Dokku uses by default.</p>
<p>Let’s be clear here: both Dokku and Heroku are great tools, and if you
do any web development, or just want to self-host web apps that others
have developed, you should definitely try them out. But it’s always
worth trying to improve things, and I’ve come up with a bit of a hack
that’s already working much better for my needs.</p>
<h1 id="how-it-works-now">How it works now</h1>
<p>The core of my complaint is the method that Heroku, and by extension
Herokuish, uses to construct an application container image.</p>
<p>Every time I run <code class="language-plaintext highlighter-rouge">git push</code>, the new version of my app needs to be
combined with all its dependencies into a bundle that the hosting
platform can execute. The difficulty is in how I as an app developer
should specify those dependencies in such a way that the platform can
find and install them for me.</p>
<p>Heroku uses two different approaches for this. First, Heroku provides a
base image containing a stock Ubuntu install and a collection of
commonly-used packages, such as compilers and database libraries. (See
the official <a href="https://github.com/heroku/stack-images">Heroku Stack Images</a> for details.) This stack image is
around 1.5GB of stuff that a lot of people need some of the time, but of
course for any given app the vast majority of it is unused.</p>
<p>After that, <a href="https://devcenter.heroku.com/articles/buildpacks">buildpacks</a> provide language-specific autodetection of
dependencies.<sup id="fnref:cnb" role="doc-noteref"><a href="#fn:cnb" class="footnote">1</a></sup> For example:</p>
<ul>
<li>the Python buildpack looks for common Python config files like
<code class="language-plaintext highlighter-rouge">requirements.txt</code> or <code class="language-plaintext highlighter-rouge">Pipfile.lock</code> and installs whatever Python
packages those files specify;</li>
<li>the PHP buildpack looks for <code class="language-plaintext highlighter-rouge">composer.json</code>;</li>
<li>the Node.js buildpack looks for <code class="language-plaintext highlighter-rouge">package.json</code>;</li>
<li>the Ruby buildpack looks for <code class="language-plaintext highlighter-rouge">Gemfile.lock</code>;</li>
</ul>
<p>and so on. Heroku has a collection of officially-supported buildpacks
but other developers can write their own and share them. If one or more
of the <a href="https://elements.heroku.com/buildpacks">7,246 existing buildpacks</a> applies to your
project, this makes getting started really easy: you don’t need to
understand anything about how Heroku works, only how to use your chosen
development environment’s tools. It’s magic!</p>
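<p>A rough sketch of that detection step (the marker files come from the list above; the structure is my simplification, not the actual buildpack API, which runs a per-buildpack <code class="language-plaintext highlighter-rouge">detect</code> script):</p>

```python
import os

# Hypothetical marker files per buildpack, taken from the list above.
BUILDPACK_MARKERS = {
    "python": ["requirements.txt", "Pipfile.lock"],
    "php": ["composer.json"],
    "nodejs": ["package.json"],
    "ruby": ["Gemfile.lock"],
}

def detect(app_dir):
    """Return the buildpacks whose marker files exist in app_dir."""
    return [name for name, markers in BUILDPACK_MARKERS.items()
            if any(os.path.exists(os.path.join(app_dir, m)) for m in markers)]
```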
<p>If neither the stack image nor any of the available buildpacks does what
you need, though, then you fall off a complexity cliff. Suddenly you
need to understand the filesystem layout of a Heroku container, think
about which version and architecture of Ubuntu it’s based on, and dig
deep into what the buildpacks were doing for you up to that point. It’s
<em>too much</em> magic.</p>
<p>Buildpacks are specialized enough that people fall off that complexity
cliff all the time, as evidenced by the fact that there are over seven
thousand buildpacks. That’s an average of two or three new buildpacks
every day since <a href="https://blog.heroku.com/buildpacks">Heroku announced buildpacks in 2012</a>.</p>
<p>That shallowly-buried complexity was already enough to make me want
something better, but what really bothers me every time I run <code class="language-plaintext highlighter-rouge">git push</code>
is how slow the build process is. I think there are two major causes for
this:</p>
<ol>
<li>
<p>It’s entirely sequential, which can’t be fixed without collecting
information about the dependencies between build steps. This is the
usual situation for Docker images in general, not just Heroku.</p>
</li>
<li>
<p>Although buildpacks have always been allowed to cache build products
from one deploy to the next, it’s up to the buildpack author to use
the cache correctly, and that’s <a href="https://blog.heroku.com/speeding-up-sprockets">hard</a> to <a href="https://github.com/heroku/heroku-buildpack-nodejs/issues/1">get
right</a>. I suspect buildpack maintainers who try to
implement really aggressive caching are punished by more bugs and
headaches than it’s worth to them.</p>
</li>
</ol>
<p>Last and probably least, it bothers me that there’s so much unnecessary
stuff in the final deployed image. Programs which are only needed at
build time, such as C compilers and static site generators, remain in
the image when deployed; and then, of course, there’s the 1.5GB of
mostly-unused Ubuntu packages. Because Docker only stores one copy of
the stack image and shares it between all the apps which use it, this is
more of an aesthetic consideration than a real problem. That
said, if your app has a security hole, this is a huge attack surface
which can make exploiting bugs easier.</p>
<p>I have a fix for all of these problems… in exchange for creating
new ones, naturally.</p>
<h1 id="building-minimal-docker-images-quickly-with-nix">Building minimal Docker images quickly with Nix</h1>
<p>I’ve started using the <a href="https://nixos.org/">Nix package manager</a> to build the Docker
images that I deploy with Dokku. Most people aren’t familiar with Nix,
so I’m going to give a little background, but since this blog post isn’t
really about Nix, I’m going to oversimplify a bunch of details. If you
are familiar with Nix, please just ignore the inaccuracies. kthx 😅</p>
<p>Nix provides a unified way to describe the complete build-time and
run-time dependencies of arbitrary software. Instead of writing giant
shell scripts to sequence all your build steps, Nix gives you a
programming language that treats build products as first-class objects.
The result of running a Nix program (a “derivation”) is the dependency
tree of all the steps needed to build the specified software from
scratch. That dependency tree can then be “realized” to run all the
steps and complete the build.</p>
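<p>As a tiny illustration (a made-up example, not from any real
project): interpolating one derivation’s output into another is what
creates an edge in that dependency tree, so Nix knows which builds must
happen first:</p>

```nix
let
  pkgs = import <nixpkgs> {};
  # A build product as a first-class value:
  greeting = pkgs.writeText "greeting.txt" "hello";
in
  # Referencing ${greeting} records a dependency edge; realizing this
  # derivation builds greeting first, then runs the script below.
  pkgs.runCommand "shout" {} ''
    tr a-z A-Z < ${greeting} > $out
  ''
```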
<p>Nix enforces that a derivation’s dependencies are specified correctly,
by building it in an environment where only the requested versions of
those dependencies are available. As a result, Nix can automatically and
reliably determine whether two builds can proceed in parallel and
whether a previous build of a derivation can be reused because its
inputs haven’t changed.</p>
<p>So Nix builds get parallelism and aggressive caching for free. In fact,
Nix has built-in support for distributing builds across, and sharing
caches between, multiple computers.</p>
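<p>Concretely, those knobs live in <code class="language-plaintext highlighter-rouge">nix.conf</code>. The values below are
illustrative, not a recommendation, and the <code class="language-plaintext highlighter-rouge">builders</code> host name is
made up:</p>

```conf
# /etc/nix/nix.conf (illustrative values)
max-jobs = 4                  # build up to four derivations in parallel
cores = 0                     # each build may use every available core
builders = ssh://builder x86_64-linux    # offload builds to another machine
substituters = https://cache.nixos.org/  # reuse pre-built binaries from a cache
```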
<p>Of course, somebody has to write all those Nix recipes, and it’s easiest
if you can just reuse recipes that somebody else already wrote. The
usual source for those is the Nix Packages collection, <a href="https://search.nixos.org/packages">Nixpkgs</a>,
although various communities also provide package “overlays” that extend
Nixpkgs.</p>
<p>But just like having thousands of buildpacks available still doesn’t
cover every use case, sometimes I’ve needed something that wasn’t
packaged.<sup id="fnref:no-examples" role="doc-noteref"><a href="#fn:no-examples" class="footnote">2</a></sup> Compared to writing buildpacks, though, Nix
build recipes have a much more gradual learning curve,<sup id="fnref:learning" role="doc-noteref"><a href="#fn:learning" class="footnote">3</a></sup> and
the <a href="https://nixos.org/manual/nixpkgs/stable/">Nixpkgs manual</a> offers extensive documentation for the many
different helpers that Nixpkgs maintainers have come up with over the
years.</p>
<p>Once you have a recipe that builds your application, Nixpkgs provides an
easy way to turn that into a Docker image. For example, I build this
blog with a Nix expression that looks something like this, where <code class="language-plaintext highlighter-rouge">pkgs</code>
is an instance of Nixpkgs and <code class="language-plaintext highlighter-rouge">config</code> is the result of another
derivation that writes the <code class="language-plaintext highlighter-rouge">lighttpd.conf</code> I want:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"jamey.thesharps.us"</span><span class="p">;</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"</span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd"</span> <span class="s2">"-D"</span> <span class="s2">"-f"</span> <span class="nv">config</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Nix builds run in isolation, so the build itself can’t load the image
into Docker directly; instead, the result of that recipe is a shell
script which constructs the Docker image. To load it into Docker, I run:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="si">$(</span>nix-build<span class="si">)</span> | docker load
</code></pre></div></div>
<p>I generate my lighttpd config file with a recipe like the following,
where <code class="language-plaintext highlighter-rouge">docroot</code> is the result of yet another derivation which runs
Jekyll to build the static HTML of this blog. One neat thing is that the
<code class="language-plaintext highlighter-rouge">checkPhase</code> script will fail the build if lighttpd can’t parse the
configuration, rather than having lighttpd fail to start after I’ve
already deployed the broken configuration.</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">pkgs</span><span class="o">.</span><span class="nv">writeTextFile</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"lighttpd.conf"</span><span class="p">;</span>
<span class="nv">checkPhase</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> PORT=5000 </span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd -tt -f $n</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">text</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> server.document-root = "</span><span class="si">${</span><span class="nv">docroot</span><span class="si">}</span><span class="s2">"</span><span class="err">
</span><span class="s2"> server.port = env.PORT</span><span class="err">
</span><span class="s2"> include "</span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/share/lighttpd/doc/config/conf.d/mime.conf"</span><span class="err">
</span><span class="s2"> etag.use-mtime = "disable"</span><span class="err">
</span><span class="s2"> setenv.set-response-header = ("Last-Modified" => "")</span><span class="err">
</span><span class="s2"> # more settings follow...</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This illustrates one “gotcha” of using Nix, which sets the mtime of all
files to 0. That means mtime is not useful in <code class="language-plaintext highlighter-rouge">ETag</code> or <code class="language-plaintext highlighter-rouge">Last-Modified</code>
headers. Cache validation can still get correct results because all the
inodes should change on every rebuild, so the generated <code class="language-plaintext highlighter-rouge">ETag</code> should
change if any content changes. That invalidates more cache entries than
necessary but at least it’s sufficient for correctness.</p>
<p>Since <code class="language-plaintext highlighter-rouge">config.Cmd</code> references both the lighttpd binary and my config
file, both are automatically included into the image. In addition,
lighttpd needs some libraries, and the config file references the
Jekyll-built HTML, so all of those are included too.</p>
<p>But after the HTML is generated I don’t need Jekyll or Ruby any more, so
those are <em>not</em> added to the image. Only the derivations which are
transitively referenced are included, so there’s nothing extraneous
being added at all. The Docker image for this blog is about 42MB,
instead of 1.5-2GB, and five-sixths of that comes from glibc and
OpenSSL, which aren’t worth getting rid of.<sup id="fnref:closure-size" role="doc-noteref"><a href="#fn:closure-size" class="footnote">4</a></sup></p>
<p>So switching to Nix addresses all the complaints I mentioned earlier: it
has a gradual learning curve rather than a complexity cliff, it builds
images more quickly, and the images it builds are as small as possible.</p>
<h1 id="integrating-nix-with-dokku">Integrating Nix with Dokku</h1>
<p>Now for the part which is currently more of an ugly hack than ready for
production. The first time you <code class="language-plaintext highlighter-rouge">git push</code> to a new Dokku-hosted repo,
Dokku creates a git <code class="language-plaintext highlighter-rouge">pre-receive</code> hook which looks something like this:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">set</span> <span class="nt">-o</span> pipefail
<span class="nb">cat</span> | <span class="nv">DOKKU_ROOT</span><span class="o">=</span><span class="s2">"/home/dokku"</span> dokku git-hook jamey.thesharps.us
</code></pre></div></div>
<p>All the work of building a Docker image using Herokuish is done by
Dokku’s internal <code class="language-plaintext highlighter-rouge">git-hook</code> trigger.</p>
<p>To make <code class="language-plaintext highlighter-rouge">git push</code> trigger a Nix build instead, I replaced that with
this script:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span><span class="p">;</span> <span class="nb">set</span> <span class="nt">-o</span> pipefail<span class="p">;</span>
<span class="nb">export </span><span class="nv">NIX_PATH</span><span class="o">=</span><span class="s2">"/nix/var/nix/profiles/per-user/root/channels"</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"/nix/var/nix/profiles/default/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="nb">export </span><span class="nv">DOKKU_ROOT</span><span class="o">=</span><span class="s2">"/home/dokku"</span>
<span class="nv">app</span><span class="o">=</span>jamey.thesharps.us
<span class="k">while </span><span class="nb">read </span>old new ref<span class="p">;</span> <span class="k">do
if</span> <span class="o">!</span> <span class="nb">test</span> <span class="s2">"</span><span class="nv">$ref</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"refs/heads/master"</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"skipping ref </span><span class="nv">$ref</span><span class="s2">--the deployment branch is master"</span> <span class="o">></span>&2
<span class="k">continue
fi
</span><span class="nv">script</span><span class="o">=</span><span class="si">$(</span>nix-build <span class="nt">--expr</span> <span class="s2">"import </span><span class="se">\"\$</span><span class="s2">{builtins.fetchGit {
url = ./.;
rev = </span><span class="se">\"</span><span class="nv">$new</span><span class="se">\"</span><span class="s2">;
}}/docker.nix</span><span class="se">\"</span><span class="s2"> {}"</span><span class="si">)</span>
<span class="nv">image</span><span class="o">=</span><span class="si">$(</span><span class="nv">$script</span> | docker load | <span class="nb">sed</span> <span class="nt">-n</span> <span class="s1">'$s/^Loaded image: //p'</span><span class="si">)</span>
docker image tag <span class="s2">"</span><span class="nv">$image</span><span class="s2">"</span> dokku/<span class="s2">"</span><span class="nv">$app</span><span class="s2">"</span>:latest
dokku tags:deploy <span class="s2">"</span><span class="nv">$app</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>
<p>This new script does the following steps:</p>
<ol>
<li>
<p>Tell Nix to find and evaluate the <code class="language-plaintext highlighter-rouge">docker.nix</code> file in the
newly-pushed revision. This builds all the dependencies and the
image-generating shell script.</p>
</li>
<li>
<p>Generate the image, load it into Docker, and find out what tag it got
assigned. The same commit might get deployed to multiple different
apps, so it shouldn’t have the app name hard-coded into it, and I’m
letting the Nixpkgs docker tools generate the image tag from a hash
of its dependencies.</p>
</li>
<li>
<p>Apply a new tag to the image using the naming convention that Dokku
expects. For example: <code class="language-plaintext highlighter-rouge">dokku/jamey.thesharps.us:latest</code></p>
</li>
<li>
<p>Finally, tell Dokku to deploy the tagged image.</p>
</li>
</ol>
<p>I’d rather do all this in a Dokku plugin, but I dug around the source
code a bit and didn’t see any way that plugins can provide alternatives
to Herokuish or Dockerfile deployments. If I have enough energy for it,
at some point I’ll open an issue about this on the Dokku repo.</p>
<p>Dokku supports deploying Docker images built on a CI server (as does
Heroku, I gather), so as an alternative I could have set up a CI server
that I <code class="language-plaintext highlighter-rouge">git push</code> to and which deploys the build result into Dokku.</p>
<p>But this hack works for me, and I’ve switched most of the apps on my
personal server over to it.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Switching from Herokuish to Nix while still deploying using Dokku is a
pretty big change. For me, it’s already paid off several times over: I
can concisely express exactly the configuration and dependencies that an
app requires, and I feel much more comfortable deploying now that the
build step is usually seconds instead of minutes.</p>
<p>I’m less happy about the method I’ve used to glue Nix to Dokku. But
Dokku is being actively developed, so I have hope that at some point the
maintainers will introduce more flexible triggers that allow plugins to
implement alternative build methods like this.</p>
<p>I don’t see any reason Heroku or other similar platforms couldn’t offer
Nix builds as an option, for that matter.</p>
<p>Nix may not be the right tool for you. Dokku and Heroku may not meet
your needs either. But I think they’re all tools worth knowing about, at
the very least.</p>
<h1 id="appendix-language-specific-examples">Appendix: Language-specific examples</h1>
<p>The original purpose of the buildpack abstraction was to make it easy to
get a container from source code written in a wide range of programming
languages. So if we replace buildpacks with Nix, how much work does it
take to get started?</p>
<p>The answer varies a lot from one language to the next, but here are a
couple of examples from the deployments I’ve done using Nix recently.</p>
<h2 id="python">Python</h2>
<p>A new tool that’s made deploying my Python projects almost trivial is
<a href="https://github.com/nix-community/poetry2nix">poetry2nix</a>, which gets all of the packaging information it needs
from the <code class="language-plaintext highlighter-rouge">poetry.lock</code> file written by the Python <a href="https://python-poetry.org/">Poetry</a> package
manager. I can run a command like <code class="language-plaintext highlighter-rouge">poetry add --lock gunicorn</code> to
declare a new dependency on a Python package and the Nix build
immediately picks it up, using the exact version that Poetry’s
dependency resolver selected.</p>
<p>If you’re using either the unstable version of Nixpkgs or the overlay
that the poetry2nix developers provide, you should be able to build a
working Docker image from any Poetry-managed project with a recipe like
this one, which I’ve simplified from the <code class="language-plaintext highlighter-rouge">docker.nix</code> in my
<a href="https://github.com/jameysharp/predictable">predictable</a> project:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">app</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry2nix</span><span class="o">.</span><span class="nv">mkPoetryApplication</span> <span class="p">{</span>
<span class="nv">projectDir</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"predictable"</span><span class="p">;</span>
<span class="nv">contents</span> <span class="o">=</span> <span class="p">[</span> <span class="nv">app</span><span class="o">.</span><span class="nv">dependencyEnv</span> <span class="p">];</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"/bin/gunicorn"</span> <span class="s2">"predictable:app"</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The best part is that poetry2nix builds each Python package as a
separate Nix derivation, rather than installing everything into a single
<code class="language-plaintext highlighter-rouge">site-packages</code> directory the way standard Python tools like Pip
do. This means</p>
<ul>
<li>multiple Python packages can build in parallel,</li>
<li>only Python packages which have changed need to be rebuilt,</li>
<li>and each Python package is built in an isolated environment with only
its declared dependencies available, avoiding surprises later from
unspecified dependencies.</li>
</ul>
<p>This tooling also works nicely together with <a href="https://direnv.net/"><code class="language-plaintext highlighter-rouge">direnv</code></a> (which I’ve
written about before in “<a href="/2019/05/29/per-project-postgres/">Per-project Postgres</a>”) for managing a local
development environment. Here’s a simplified version of the <code class="language-plaintext highlighter-rouge">shell.nix</code>
which you can find alongside the above <code class="language-plaintext highlighter-rouge">docker.nix</code>:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">app</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry2nix</span><span class="o">.</span><span class="nv">mkPoetryEnv</span> <span class="p">{</span>
<span class="nv">projectDir</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="nv">editablePackageSources</span><span class="o">.</span><span class="nv">predictable</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">mkShell</span> <span class="p">{</span> <span class="nv">buildInputs</span> <span class="o">=</span> <span class="p">[</span> <span class="nv">app</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry</span> <span class="p">];</span> <span class="p">}</span>
</code></pre></div></div>
<p>Then if I just write <code class="language-plaintext highlighter-rouge">use nix</code> in <code class="language-plaintext highlighter-rouge">.envrc</code> and run <code class="language-plaintext highlighter-rouge">direnv allow</code>, when
I <code class="language-plaintext highlighter-rouge">cd</code> into that project’s working copy, I get a complete development
environment automatically.</p>
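<p>For completeness, the whole <code class="language-plaintext highlighter-rouge">.envrc</code> can be as small as this; the
<code class="language-plaintext highlighter-rouge">watch_file</code> line is optional, but tells direnv to rebuild the
environment whenever the Nix expression itself changes:</p>

```sh
# .envrc
watch_file shell.nix   # reload the environment when shell.nix changes
use nix                # build shell.nix and export its environment
```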
<h2 id="php">PHP</h2>
<p>I’m not really familiar with the PHP ecosystem, but there are web apps I
want to use that are written in PHP, so I dug into the Nixpkgs support
for PHP a bit and ended up with a Nix recipe that looks something like
this:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">lighttpd</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="p">;</span>
<span class="nv">php</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">php</span><span class="p">;</span>
<span class="nv">config</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">writeTextFile</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"lighttpd.conf"</span><span class="p">;</span>
<span class="nv">checkPhase</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> PORT=5000 </span><span class="si">${</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd -tt -f $n</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">text</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> server.port = env.PORT</span><span class="err">
</span><span class="s2"> server.upload-dirs = ("/tmp")</span><span class="err">
</span><span class="s2"> fastcgi.server = (".php" => ((</span><span class="err">
</span><span class="s2"> "bin-path" => "</span><span class="si">${</span><span class="nv">php</span><span class="si">}</span><span class="s2">/bin/php-cgi",</span><span class="err">
</span><span class="s2"> "bin-environment" => (</span><span class="err">
</span><span class="s2"> # Nixpkgs 20.09 wraps bin/php to set</span><span class="err">
</span><span class="s2"> # this but does not wrap bin/php-cgi</span><span class="err">
</span><span class="s2"> "PHP_INI_SCAN_DIR" => "</span><span class="si">${</span><span class="nv">php</span><span class="si">}</span><span class="s2">/lib",</span><span class="err">
</span><span class="s2"> ),</span><span class="err">
</span><span class="s2"> # see lighttpd and php docs for more that should go here</span><span class="err">
</span><span class="s2"> )))</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"phpapp"</span><span class="p">;</span>
<span class="nv">extraCommands</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> # PHP expects to be able to create a lockfile in /tmp</span><span class="err">
</span><span class="s2"> mkdir -m 1777 tmp</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"</span><span class="si">${</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd"</span> <span class="s2">"-D"</span> <span class="s2">"-f"</span> <span class="nv">config</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>There were two non-obvious things I had to do. One is that there’s no
<code class="language-plaintext highlighter-rouge">/tmp</code> in Nixpkgs-generated Docker images by default—in fact,
there’s nothing except <code class="language-plaintext highlighter-rouge">/nix/store/</code>—and both PHP and lighttpd
sometimes need a place to write temporary files. So I used the
<code class="language-plaintext highlighter-rouge">extraCommands</code> option to create that directory. (lighttpd defaults to
<code class="language-plaintext highlighter-rouge">/var/tmp</code> which of course doesn’t exist either, so I configured it to
use the same <code class="language-plaintext highlighter-rouge">/tmp</code> directory instead.)</p>
<p>The other is, I think, a bug in the current release of Nixpkgs, which I
ought to open an issue about. PHP quite naturally doesn’t know how to
find anything in the directory structure that Nix uses, so the Nix
package of PHP auto-generates a <code class="language-plaintext highlighter-rouge">php.ini</code> file with the full path to
each enabled PHP extension. But then, PHP doesn’t know how to find that
<code class="language-plaintext highlighter-rouge">php.ini</code> file either, so the package arranges that the <code class="language-plaintext highlighter-rouge">php</code> command is
actually a shell script that sets <code class="language-plaintext highlighter-rouge">PHP_INI_SCAN_DIR</code> to the right path
before running the real <code class="language-plaintext highlighter-rouge">php</code> binary. However, the package doesn’t wrap
<code class="language-plaintext highlighter-rouge">php-cgi</code> in the same way, so that binary can’t find <code class="language-plaintext highlighter-rouge">php.ini</code> either.
As a workaround, I set the variable in the lighttpd config instead.</p>
<p>There’s one more useful trick I learned. By default, the PHP package
enables a bunch of extensions, which is good for getting started
quickly, but it also pulls in a lot of unnecessary dependencies. Once I
had things working, I replaced the <code class="language-plaintext highlighter-rouge">php = pkgs.php;</code> line above with
this:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">php</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">php</span><span class="o">.</span><span class="nv">buildEnv</span> <span class="p">{</span>
<span class="nv">extensions</span> <span class="o">=</span> <span class="p">{</span> <span class="nv">all</span><span class="p">,</span> <span class="o">...</span> <span class="p">}:</span> <span class="kn">with</span> <span class="nv">all</span><span class="p">;</span> <span class="p">[</span>
<span class="nv">json</span>
<span class="nv">session</span>
<span class="c"># any other extensions you want here</span>
<span class="p">];</span>
<span class="nv">extraConfig</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> upload_max_filesize = 64M</span><span class="err">
</span><span class="s2"> ; more custom php.ini settings follow ...</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:cnb" role="doc-endnote">
<p>Heroku and others are now collaborating on a standard for “<a href="https://buildpacks.io/">Cloud
Native Buildpacks</a>” under the Cloud Native
Computing Foundation. Although Heroku defined the idea, this
discussion applies to a growing number of other platforms. <a href="#fnref:cnb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:no-examples" role="doc-endnote">
<p>I don’t have any good examples of unpackaged dependencies from the
web apps I’ve deployed, though. It seems there was a perfectly
good package of the GeoIP C library at the time that I wrote my own
recipe for it—I have no record of why I didn’t just use
that, but I may simply not have known it was there—and the
Python tooling I’m using now makes the Python package recipes I’ve
written in the past entirely unnecessary. <a href="#fnref:no-examples" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:learning" role="doc-endnote">
<p>Using Nix for deployment means you need to understand at least a
little bit about Nix to get started, while it’s possible to get
started with Heroku without understanding Heroku’s architecture at
all. Personally, I think it’s worth investing a little time up-front
to avoid falling off the complexity cliff the moment I try to do
something just a little bit unusual. You may feel differently, and
that’s okay. <a href="#fnref:learning" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:closure-size" role="doc-endnote">
<p>I could rebuild lighttpd against musl libc and without TLS support
to get my blog’s Docker image down to little more than the size of
the HTML. It isn’t even particularly hard to do. Asking for
<code class="language-plaintext highlighter-rouge">pkgs.pkgsMusl.lighttpd</code> instead of <code class="language-plaintext highlighter-rouge">pkgs.lighttpd</code> does the former,
and the tersest way to remove the OpenSSL dependency is:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">lighttpd</span><span class="o">.</span><span class="nv">overrideAttrs</span> <span class="p">(</span><span class="nv">old</span><span class="p">:</span> <span class="p">{</span> <span class="nv">configureFlags</span> <span class="o">=</span> <span class="p">[];</span> <span class="p">})</span>
</code></pre></div> </div>
<p>The downside is that the CI builds done by the NixOS project don’t
cover these build configurations, which means Nix has to rebuild
them from source for me instead of using pre-built binaries. Since
one of my big goals was to speed up deployment, this isn’t worth
doing.</p>
<p>Rebuilding with musl is especially time-consuming because all the
dependencies, including build-time dependencies, get rebuilt too.
That includes things like perl, so it takes quite a while. <a href="#fnref:closure-size" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey Sharp
For several years now I’ve been using Dokku on my personal web server as an easy way to spin up new web apps I want to play around with. Like the Heroku platform on which it’s modeled, Dokku allows me to deploy a new version of an app by simply running git push, but Dokku is fully open-source and I can self-host it.
Feed reader cache coherency with JMAP and RFC5005
2020-08-06T20:14:59-07:00
https://jamey.thesharps.us/2020/08/06/feed-reader-cache-coherency
<p>RSS feed readers don’t work as well as they should. Today I’d like to
talk about issues from an application and server developer viewpoint;
I’ll save some thoughts about usability in various use cases for another
time.</p>
<p>In this post I’m going to suggest that, from a technical perspective,
the first step to fixing some important issues is to think of feed
readers as being composed of two layers of caches. That helps clarify
why some of these problems have been hard to solve, and at the same time
points toward some existing solutions that we can reuse.</p>
<h1 id="but-first-email">But first: email</h1>
<p>I think the most interesting recent development in email standards is
<a href="https://jmap.io/">JMAP</a>. Despite the similarity of its name with <a href="https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol">IMAP</a>, the widely
used “Internet Message Access Protocol”, JMAP’s <a href="https://tools.ietf.org/html/rfc8620">RFC8620</a> doesn’t
actually say much about email. Instead it describes a generic “JSON Meta
Application Protocol”:</p>
<blockquote>
<p>This document specifies a protocol for clients to efficiently query,
fetch, and modify JSON-based data objects, with support for push
notification of changes and fast resynchronisation and for out-of-
band binary data upload/download.</p>
</blockquote>
<p>A separate standard, <a href="https://tools.ietf.org/html/rfc8621">RFC8621</a>, describes how to apply JMAP to sending
and receiving email, and there are draft proposals for how to sync
calendars and contacts as well.</p>
<h1 id="two-hard-problems">Two hard problems</h1>
<blockquote>
<p>There are only two hard things in Computer Science: cache invalidation
and naming things. —<a href="https://www.karlton.org/2017/12/naming-things-hard/">Phil Karlton</a></p>
</blockquote>
<p>The reason that I think JMAP is particularly interesting is because it
sets out to solve cache invalidation problems for a useful range of
applications—namely, those where:</p>
<ul>
<li>you have a data-set on a server,</li>
<li>which you want to access from multiple clients,</li>
<li>and you want the user experience to be as if the client kept the whole
data set locally,</li>
<li>but those clients may not have enough resources to do so.</li>
</ul>
<p>JMAP defines ways for a client to cache a portion of the server’s data
set, for the server to inform the client when that cache may be out of
date, and for the client to transfer as little data as possible to get
its cache back up to date.</p>
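<p>A minimal sketch of that delta-sync pattern, using the generic
“Foo/changes” method shape from RFC8620 (the account id, state strings,
and object ids here are made up for illustration; a real client would
also issue a “Foo/get” call to fetch the content of changed objects):</p>

```python
# Illustrative sketch of JMAP's delta-sync pattern (RFC 8620 "Foo/changes").
# A client remembers the last state string it saw; to catch up, it asks the
# server only for the ids of objects that changed since that state.

def make_changes_request(account_id, since_state):
    """Build a generic Foo/changes method call (shape per RFC 8620)."""
    return {
        "using": ["urn:ietf:params:jmap:core"],
        "methodCalls": [
            ["Foo/changes",
             {"accountId": account_id, "sinceState": since_state},
             "c0"]
        ],
    }

def apply_changes_response(cache, response_args):
    """Update a local id-to-object cache from a Foo/changes response."""
    for removed_id in response_args["destroyed"]:
        cache.pop(removed_id, None)
    # Created and updated objects still need a Foo/get fetch for their
    # content; here we just report their ids as stale.
    stale = set(response_args["created"]) | set(response_args["updated"])
    return stale, response_args["newState"]
```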
<p>I’ve spent a lot of time working with the X Window System network
protocol, and learned some good techniques for dealing with high-latency
or low-bandwidth networks. From my studies of X I also concluded that
its extension mechanism was key to why protocol version 11 survived from
1987 and is still in widespread use today. In both areas, I’m happy to
see that JMAP’s designers did an excellent job: they eliminated many
round-trip delays and ensured that client and server can mutually agree
on which extensions to use.</p>
<p>The JMAP-based specification for email doesn’t have to do much more than
describe what data should be associated with an email or mailbox or
whatever other objects are appropriate. It inherits from JMAP the
ability to efficiently synchronize subsets of the server’s email store,
allowing clients to work offline and on flaky internet connections.</p>
<p>Other applications could also build on JMAP. While calendar and contacts
specifications are already underway, today I’d like to suggest another
use case.</p>
<h1 id="clientserver-architectures-for-rss-feed-readers">Client/server architectures for RSS feed readers</h1>
<p>In my previous post, “<a href="/2020/08/03/websub-plus-push/">WebSub plus Push</a>”, I complained that
developers who want to offer software that notices when a web page
changes either need to spam the page with “have you changed yet?”
requests or put up application-specific servers.</p>
<p>I’m particularly interested in this for native apps that monitor
RSS<sup id="fnref:RSS" role="doc-noteref"><a href="#fn:RSS" class="footnote">1</a></sup> feeds to inform me when I have new stuff to read, such as a
news article or a chapter of some serialized fiction or a page of a
webcomic.</p>
<p>Some sites I care about have RSS feeds which are <a href="https://forums.spacebattles.com/threads/rss-feeds-arent-getting-cached-at-cloudflare.815573/">largely
unusable</a> because their servers treat the repeated “have you
changed yet?” requests as an attack. And honestly, that’s fair…
except that those sites don’t implement WebSub and desktop/mobile feed
readers couldn’t use it even if they did, so there’s literally no other
option today.</p>
<p>Getting publishers to implement WebSub is its own challenge which I’m
going to set aside for now, because if most feed readers won’t use it
then it’s hard to argue that publishers should put in the effort.</p>
<p>Given that the only way forward is for native apps to require a
dedicated server that’s always reachable from the internet, what should
a client/server architecture look like for a feed reader?</p>
<p>Various web-based feed readers offer APIs intended for mobile and
desktop apps; <a href="https://tt-rss.org/wiki/ApiReference">Tiny Tiny RSS</a> and <a href="https://www.newsblur.com/api">NewsBlur</a>
are two examples. The <a href="https://indieweb.org/">Indieweb</a> community’s answer is <a href="https://indieweb.org/Microsub">Microsub</a>,
and the motivation they give is, I think, fairly typical of the field:</p>
<blockquote>
<p>Microsub provides a standardized way for reader apps to interact with
feeds. By splitting feed parsing and displaying posts into separate
parts, a reader app can focus on presenting posts to the user instead
of also having to parse feeds. A Microsub server manages the list of
people you’re following and collects their posts, and a Microsub app
shows the posts to the user by fetching them from the server.</p>
</blockquote>
<p>I have several issues with these protocols, but today I’m going to focus
on one: None of these protocols do a great job at cache invalidation.
Clients can’t tell which data they already have might still be valid,
and also must poll the server to learn about changes. (There’s been
discussion for Microsub about <a href="https://github.com/indieweb/microsub/issues/20">streaming</a> and
<a href="https://github.com/indieweb/microsub/issues/12">sync</a> but apparently no consensus has been reached.)</p>
<p>As you’ve probably guessed, this is where I suggest that JMAP could be a
good foundation for a client/server API for feed readers.</p>
<h1 id="jmap-for-feed-readers">JMAP for feed readers</h1>
<p>Most of the work needed to define a JMAP-based feed reader API is in
deciding what kinds of objects need to be synchronized and what
information needs to be attached to each kind of object.</p>
<p>Looking at the three APIs I mentioned above, a first cut at the list of
object types is:</p>
<ul>
<li>Category</li>
<li>Feed</li>
<li>Post</li>
</ul>
<p>NewsBlur might add an extension capability for their “intelligence
classifiers”—and I think it’s super important that JMAP
standardized a well thought out extension mechanism to allow for exactly
this kind of situation—but otherwise I think these are the primary
object types in most feed readers.</p>
<p>In existing practice, categories generally just permit setting a name
and an optional parent category, allowing them to be organized into a
tree—or, technically, a forest. The above APIs also provide
read-only information about each category, such as the number of read
and unread posts across all feeds in that category. I’d look at the way
the JMAP email specification models Mailbox objects for a good starting
point here.</p>
<p>Each feed needs to be created by setting a source URL and a category.
Feeds have a variety of additional properties that are generally
read-only because they come from the publisher. Similarly, posts
generally would only be created by the server, from feed entries
retrieved from the feed’s publisher, and the client would only be able
to set state such as “I’ve read this post”.</p>
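<p>As a concrete sketch, the object model might look something like the
following; every type and field name here is my own guess at a starting
point, not taken from any published specification:</p>

```python
# Hypothetical object types for a JMAP-based feed reader API.
# All names here are assumptions of mine, sketched for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Category:
    id: str
    name: str                        # client-writable
    parent_id: Optional[str] = None  # optional parent, so categories form a forest
    unread_count: int = 0            # server-computed, read-only to clients

@dataclass
class Feed:
    id: str
    url: str                         # set by the client at creation
    category_id: Optional[str] = None
    title: str = ""                  # read-only: comes from the publisher

@dataclass
class Post:
    id: str
    feed_id: str
    content: str                     # read-only: created by the server
    is_read: bool = False            # the main client-writable state
```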
<p>But there’s no reason that feeds and posts must be read-only! The exact
same data model could work for posting to your own personal blog or
social media accounts. Both NewsBlur and Tiny Tiny RSS have dedicated
API endpoints for this purpose, and in the Indieweb ecosystem there’s a
complementary specification called <a href="https://www.w3.org/TR/micropub/">Micropub</a>. But JMAP reminds us
that there’s no operational difference between reading a post written by
somebody else versus reading one you wrote on another device.</p>
<p>Microsub delegates the representation of feeds and posts to <a href="https://www.w3.org/TR/jf2/">JF2</a>,
which is probably a good starting place for the similarly JSON-based
JMAP. However, I’d pay careful attention to JF2’s extension mechanisms.
If the JMAP server is going to be transforming RSS and Atom and
Microformats-based feeds into a common representation, it really needs
to pass information it doesn’t understand to the client, so that client
development isn’t blocked waiting for servers to catch up with new
extensions.</p>
<p>The last question a JMAP-based specification needs to answer is: What
search queries do clients need to be able to ask the server to perform
on their behalf? I’d start by looking through the searches offered by
existing APIs and comparing them to the <code class="language-plaintext highlighter-rouge">Mailbox/query</code> and
<code class="language-plaintext highlighter-rouge">Email/query</code> sections of the JMAP Mail specification.</p>
<h1 id="cache-coherency-between-server-and-publisher">Cache coherency between server and publisher</h1>
<p>So far I’ve been talking about how a feed reader application interacts
with an internet server that acts as an agent for the reader, but there
are some related topics to consider in how that server agent interacts
with feed publishers.</p>
<p>People working on RSS-related tools usually assume that only the most
recent posts are important. This is generally true for content such as
news articles or social media posts. But it’s false often enough that
many feed readers will save old posts that they’ve seen, even after
those posts disappear from the publisher’s feed.<sup id="fnref:long-form-content" role="doc-noteref"><a href="#fn:long-form-content" class="footnote">2</a></sup></p>
<p>That doesn’t work very well.</p>
<ul>
<li>
<p>It relies on your feed reader having started to watch the feed before
the oldest post you might ever be interested in disappears.</p>
</li>
<li>
<p>People who follow RSS feeds have just gotten used to sometimes seeing
stale contents, or duplicates of posts they already read.</p>
</li>
<li>
<p>On top of all that, I’ve been told by developers of two different
cloud-based feed readers that this practice of saving the complete
contents of old posts contributes significantly to their storage and
hosting costs.</p>
</li>
</ul>
<p>Fortunately, reframing these as caching problems points us in the right
direction to solve them. We just need reliable mechanisms for:</p>
<ul>
<li>cache invalidation (because the publisher says it changed),</li>
<li>cache eviction (because the client doesn’t have resources to keep it),</li>
<li>and cache reloading (because either invalidation or eviction made a
request miss the cache).</li>
</ul>
<p>All three needs are addressed by <a href="https://tools.ietf.org/html/rfc5005">RFC5005</a>, “Feed Paging and
Archiving”, optionally supplemented by <a href="https://tools.ietf.org/html/rfc6721">RFC6721</a>, “The Atom
<code class="language-plaintext highlighter-rouge">deleted-entry</code> Element”. These specifications are from 2007 and 2012
respectively, and I hate how little-known they are.</p>
<p>Section 2 of RFC5005 covers “Complete Feeds”, where the publisher
asserts that every post in the history of the feed is, in fact, present
in the current feed document. In other words, the publisher is directing
any feed reader which encounters such a feed to discard its cached copy
of any post that was previously included but isn’t any more.</p>
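<p>A complete feed marks itself with a <code class="language-plaintext highlighter-rouge">complete</code>
element from the RFC5005 namespace, so detecting one takes only a few
lines (the feed document below is a minimal made-up example):</p>

```python
import xml.etree.ElementTree as ET

# Namespace defined by RFC 5005 for feed history elements.
FH_NS = "http://purl.org/syndication/history/1.0"

def is_complete_feed(atom_xml):
    """True if the feed declares itself complete (RFC 5005 section 2)."""
    root = ET.fromstring(atom_xml)
    return root.find("{%s}complete" % FH_NS) is not None

# A made-up minimal complete feed:
complete_feed = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:fh="http://purl.org/syndication/history/1.0">
  <title>Example feed</title>
  <fh:complete/>
</feed>"""
```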
<p>However, feeds aren’t usually very interesting unless they have enough
posts that transferring all of them every time anybody checks for
updates would quickly eat up everyone’s bandwidth. So section 4 defines
“Archived Feeds”<sup id="fnref:skip-paged-feeds" role="doc-noteref"><a href="#fn:skip-paged-feeds" class="footnote">3</a></sup>, which allow splitting the feed up
into multiple linked documents. It’s been carefully designed to transfer
reasonably minimal information on changes. One of these days I’ll write
up why every part of that design is necessary, because the RFC itself is
unfortunately terse about rationale.</p>
<p>When a post is added or edited, archived feeds work most efficiently if
the publisher appends that post to the end of the feed, no matter how
long ago its publication date is. Doing the same for deleting a post
requires having something you can append that says “this was deleted”,
rather than just silently making it disappear.</p>
<p>That’s where RFC6721 comes in with its <code class="language-plaintext highlighter-rouge">deleted-entry</code> element. Systems
which don’t already keep track of deleted posts may find it easier to
invalidate all archives going back to the point where a deleted post was
first published, but that does increase the amount of data that has to
be transferred in the (generally rare) situation where a post is
deleted. So implementing <code class="language-plaintext highlighter-rouge">deleted-entry</code> is a good optimization for
larger publishers.</p>
<p>With either complete or archived feeds, a feed reader can freely evict
any information about a feed from its cache, reload it later from the
original feed, and reliably detect when its cache is out of date.</p>
<h1 id="conclusions">Conclusions</h1>
<p>RSS feed reader software could be more robust, less expensive to
operate, and faster at responding to user input by thinking of it simply
as two layers of caches: one cache that’s always publicly reachable on
the internet, and any number of second-tier caches running on the
end-user’s devices. WebSub together with RFC5005, the Feed Paging and
Archiving specification, gives us a foundation for the first layer; JMAP
is a strong contender for the second layer.</p>
<p>My plea is for developers who are working in the RSS feed ecosystem to
implement these standards, and for the rest of you to advocate for these
standards with the developers of your favorite tools.</p>
<p>I’ve written several pieces of software related to RFC5005 which may
help, including:</p>
<ul>
<li>a <a href="https://github.com/jameysharp/wp-fullhistory/">WordPress plugin</a> (see also my <a href="https://core.trac.wordpress.org/ticket/49321">WordPress Core issue</a>)</li>
<li>a <a href="https://github.com/jekyll/jekyll-feed/pull/236">jekyll-feed patch</a></li>
<li>a very rough <a href="http://reader.minilop.net/">prototype feed reader</a></li>
<li><a href="https://github.com/jameysharp/predictable/">predictable</a>, a feed generator for sites with consistent update
schedules</li>
</ul>
<p>This blog is itself using my patched version of jekyll-feed, so you can
also see an example of an archived feed at
<a href="https://jamey.thesharps.us/feed.xml">https://jamey.thesharps.us/feed.xml</a>.</p>
<p>If you’ve read this far, you might also be interested in listening to
these Hacker Public Radio episodes where clacke interviewed me and my
fellow RFC5005 advocate, <a href="https://beesbuzz.biz/">fluffy</a>: <a href="http://hackerpublicradio.org/eps.php?id=3082">part 1</a> and <a href="http://hackerpublicradio.org/eps.php?id=3102">part 2</a>.</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:RSS" role="doc-endnote">
<p><a href="https://en.wikipedia.org/wiki/RSS">RSS</a> is the most widely-recognized of the mechanisms for web
syndication feeds, so in this article I often use that term
generically. Personally, I prefer <a href="https://en.wikipedia.org/wiki/Atom_(Web_standard)">Atom</a>, and other formats like
<a href="https://www.w3.org/TR/jf2/">JF2</a> or the Microformats-based <a href="http://microformats.org/wiki/h-feed">h-feed</a> are valid alternatives
as well. <a href="#fnref:RSS" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:long-form-content" role="doc-endnote">
<p>Of course, there’s also a lot of content out there where it’s
necessary to see old posts in order to understand newer ones. Many
podcasts and webcomics fall in this category, for example. I have a
lot to say about that but it’s a topic for a later post. <a href="#fnref:long-form-content" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:skip-paged-feeds" role="doc-endnote">
<p>I’m skipping over RFC5005’s section 3, “Paged Feeds”, because it
explicitly is not intended for the kind of cache coherency I’m
talking about in this post, and can’t be used for that purpose. The
specification says, “[C]lients SHOULD NOT present paged feeds as
coherent or complete, or make assumptions to that effect,” and,
“Unlike paged feeds, archived feeds enable clients to do this
without losing entries.” <a href="#fnref:skip-paged-feeds" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey SharpRSS feed readers don’t work as well as they should. Today I’d like to talk about issues from an application and server developer viewpoint; I’ll save some thoughts about usability in various use cases for another time.WebSub plus Push2020-08-03T16:20:46-07:002020-08-03T16:20:46-07:00https://jamey.thesharps.us/2020/08/03/websub-plus-push<p>There are a lot of situations where it’d be nice to know when the
contents of a web page change, and the modern way to do that is the
<a href="https://www.w3.org/TR/websub/">WebSub</a> standard, formerly known as <a href="https://github.com/pubsubhubbub/PubSubHubbub">PubSubHubbub</a>.</p>
<p>It’s an elegantly simple system: A page advertises its support by
providing a link to a “hub” that handles sending out change
notifications. A subscriber sends to that hub a URL where notifications
should be delivered.</p>
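<p>The subscriber’s half of that exchange is just a form-encoded POST to
the hub. A sketch of building the request body, using the parameter
names from the W3C WebSub recommendation (the URLs in the usage below
are placeholders):</p>

```python
from urllib.parse import urlencode

def websub_subscribe_body(topic, callback, secret=None, lease_seconds=None):
    """Form-encoded body for a WebSub subscription POST to the hub.

    Parameter names follow the W3C WebSub recommendation; hub.secret and
    hub.lease_seconds are optional.
    """
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
    }
    if secret is not None:
        params["hub.secret"] = secret
    if lease_seconds is not None:
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)
```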
<p>However, this requires the subscriber to have a web server that’s always
reachable from the public internet, which means that desktop and mobile
applications can’t take advantage of it without a dedicated server. Even
for some web applications this might be the only aspect which requires a
server; otherwise they could use much cheaper static-file hosting, or be
delivered via something decentralized such as <a href="https://ipfs.io/">IPFS</a>.</p>
<p>And that’s a shame, because there are only a few options available to
developers and they’re all undesirable:</p>
<ul>
<li>
<p>Ignore WebSub and just poll for changes, spamming web sites with “have
you changed yet?” requests.</p>
</li>
<li>
<p>Require users to have the technical skills and resources to set up
their own web server.</p>
</li>
<li>
<p>Build centralized infrastructure, which requires a different set of
skills than app development, and gets expensive as it becomes popular.</p>
</li>
</ul>
<h1 id="push-notifications">Push notifications</h1>
<p>Standards bodies have been working for years on specifications for how
to deliver notifications to applications in a timely fashion without
requiring the application to be associated with a dedicated server.</p>
<p>The web <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Push API</a> makes this a problem for the browser vendors to
solve, instead of pushing (hah) it onto application developers.
JavaScript can ask the web browser for a push URL, then forward that URL
off to an application server which can post to it any time there’s
something new. The browser arranges that those messages will get back to
the JavaScript service worker… somehow. (The Push API doesn’t
need to specify how the browser communicates with the push server or
even how it finds one to use; browser vendors just provide their own
infrastructure and co-evolve it with their browser products.)</p>
<p>The Push API is built on <a href="https://tools.ietf.org/html/rfc8030">RFC8030</a> which specifies how application
servers should send messages to push servers. While the Push API is
specific to browsers, RFC8030 should in principle be usable for native
mobile and desktop apps as well. In practice, as far as I can tell,
nobody is implementing it for those settings: mobile OSes only provide
proprietary APIs for push notifications, and every desktop app developer
has to just roll their own thing. So for the rest of this article, I’m
going to focus on web apps.</p>
<p>Naturally, everything is harder than I’d like: the push URL expects to
receive messages that are a little different than what a WebSub hub
sends. The details are specified in RFC8030, but in particular:</p>
<ul>
<li>
<p>WebSub sends the entire contents of the changed page, while a push
server may reject messages larger than 4kB.</p>
</li>
<li>
<p>Current push servers require the sender to sign the message with a
public key provided at registration (see <a href="https://tools.ietf.org/html/rfc8292">RFC8292</a>, “VAPID”) but
there’s no provision for doing so in WebSub.</p>
</li>
<li>
<p>WebSub verifies a subscription request by immediately sending a
challenge to the subscriber’s URL, which an RFC8030 server won’t
respond to in the way the WebSub spec requires.</p>
</li>
<li>
<p>Because the Push API is designed for devices which may be offline,
RFC8030 requires specifying a Time-To-Live (TTL) for how long the push
server should keep trying to deliver the message. WebSub has no means
to convey that or any of the other options that RFC8030 allows.</p>
</li>
</ul>
<p>For all those reasons, using WebSub with the Push API still requires a
dedicated server to match each specification’s expectations to the
other. So that’s disappointing.</p>
<h1 id="stateless-impedance-matching">Stateless impedance matching</h1>
<p>If we have to have a separate server, can we at least make it really
cheap to operate? Here are the limits I wanted on this hypothetical
service in order to keep costs down, allow scaling up to any amount of
demand, and enable deployment at any cloud or shared hosting provider:</p>
<ul>
<li>No persistent state: no database, no writable filesystems.</li>
<li>No background or scheduled processing: all work happens while
responding to an HTTP request.</li>
</ul>
<p>I think this is feasible, but I haven’t actually tried it, for reasons
I’ll get into later. First I’ll just outline how I think a lightweight
WebSub proxy would work.</p>
<p>The most complex part is the process for initially setting up
subscriptions:</p>
<ol>
<li>
<p>An application asks the WebSub proxy for its VAPID application server
key, as defined in RFC8292.</p>
</li>
<li>
<p>The application passes the application server key to the Push API to
get a new push URL.</p>
</li>
<li>
<p>The application posts the push URL in an HTTP request to the WebSub
proxy, which doesn’t reply immediately.</p>
</li>
<li>
<p>The proxy signs or HMACs the push URL with its own key, and
constructs a subscription URL containing the signed push URL.</p>
</li>
<li>
<p>The proxy sends the new subscription URL to the application via the
push URL. (This confirms that the application can receive messages
sent to that push URL.)</p>
</li>
<li>
<p>The proxy returns an error if the push URL returned an error;
otherwise it returns an empty response with a “202 Accepted” status
code.</p>
</li>
<li>
<p>Once the application receives the push notification containing its
new subscription URL, it uses that URL in subscription requests to
WebSub hubs.</p>
</li>
<li>
<p>When a hub sends a verification request to the proxy, the proxy
checks that the wrapped push URL was signed by its own key.</p>
</li>
</ol>
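<p>Steps 4 and 8 are what let the proxy stay stateless: instead of
remembering which push URLs it has seen, it can verify an HMAC it minted
earlier. A sketch, with a placeholder key and proxy address:</p>

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder values; a real deployment would load the key from configuration.
PROXY_KEY = b"proxy-secret-key"
PROXY_BASE = "https://websub-proxy.example/notify"

def _sign(push_url):
    mac = hmac.new(PROXY_KEY, push_url.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac).decode().rstrip("=")

def make_subscription_url(push_url):
    """Step 4: wrap the push URL and its HMAC into a subscription URL."""
    return PROXY_BASE + "?" + urlencode({"push": push_url, "sig": _sign(push_url)})

def verify_subscription_url(url):
    """Step 8: return the wrapped push URL if the HMAC checks out, else None."""
    query = parse_qs(urlsplit(url).query)
    push, sig = query["push"][0], query["sig"][0]
    if hmac.compare_digest(sig, _sign(push)):
        return push
    return None
```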
<p>Later, when the hub sends a change notification to the proxy, it again
checks that the wrapped push URL was signed by its own key, and then
passes the notification on to the push URL, signed with its application
server key.</p>
<p>During initial registration, the application could also provide values
for the TTL, Topic, and Urgency headers as defined in RFC8030. The proxy
would encode those into its subscription URL alongside the push URL and
use them when forwarding change notifications.</p>
<p>I think this satisfies all requirements of both specifications while
being just as scalable to tiny single-person installations as it is to
multi-data-center worldwide deployments.</p>
<h1 id="same-origin-strikes-again">Same-origin strikes again</h1>
<p>There’s just one problem, which reveals itself in several different ways.</p>
<p>URLs that are of interest for use with WebSub are, almost by definition,
not at the same origin site as any JavaScript application that would try
to subscribe to them. If they were part of the same site, the
application wouldn’t need WebSub to find out about changes.</p>
<p>JavaScript can’t access cross-origin URLs without the target site opting
in using <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS">CORS</a>, and I doubt any WebSub deployments opt in. That means
the client web app can’t access pages to check whether they support
WebSub, and it can’t reliably send subscription requests either.</p>
<p>We could work around those issues by making the WebSub proxy also be a
CORS proxy, forwarding requests to arbitrary web sites. But that opens
up a variety of security and abuse concerns, and also could make the
proxy significantly more expensive to run.</p>
<p>Alternatively, there are existing CORS proxies out there, though as far
as I can tell none of them support the POST requests necessary for
establishing a subscription at a hub. I’m guessing that’s because of the
same potential for abuse.</p>
<p>On top of that, once you have a subscription going, WebSub delivers the
current contents of the page to you every time it changes… but we
can’t count on being able to deliver that through a push server, which
may have a length limit as low as 4kB. So the application needs to
request a page that’s already been delivered to the proxy, which
suggests that it would be more efficient for the proxy to cache the copy
it received from the hub, and return that cached copy to the application
on demand. But that means having state that persists across multiple
requests, which again makes the proxy more expensive to operate.</p>
<p>And if the proxy is going to be a general CORS proxy or at least cache
WebSub-delivered content, then it might as well be fully stateful and
coalesce multiple subscription requests for the same page into a single
subscription at the hub.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In short:</p>
<ul>
<li>solving the small question of connecting WebSub with Push is
straightforward,</li>
<li>but the same-origin policy effectively negates all the advantages,</li>
<li>so we’re probably better off spending our time building
application-specific servers.</li>
</ul>
<p>Or in other words, the web is a terrible place and makes me cry.</p>
<h1 id="coda">Coda</h1>
<p>While I was digging into this, I did find one other direction the world
could have gone:</p>
<p>In parallel with the early development of PubSubHubbub, <a href="https://tools.ietf.org/html/rfc5989">RFC5989</a>
described a way to use <a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">SIP</a> for notification of web page changes.
Since mobile phones already use SIP for voice call setup, I could
imagine adopting it for mobile push notifications as well. If that was
exposed to applications, standards like RFC5989 would have provided
direct interoperability between mobile apps and the kinds of web sites
that use WebSub today.</p>
<p>But SIP is quite different from HTTP, while WebSub and Push both use
HTTP in mostly straightforward ways. So I suspect RFC5989 faces pretty
significant barriers to adoption, comparatively speaking. At minimum, I
can’t find any evidence of anyone having ever implemented
RFC5989…</p>Jamey SharpThere are a lot of situations where it’d be nice to know when the contents of a web page change, and the modern way to do that is the WebSub standard, formerly known as PubSubHubbub.Authenticating shared web caches2020-06-13T18:33:00-07:002020-06-13T18:33:00-07:00https://jamey.thesharps.us/2020/06/13/authenticating-shared-web-caches<p>I was doing a mental exercise and it went someplace that I think is not
terribly useful but still interesting enough to share.</p>
<p>Say you want to give somebody a copy of a web page that you’ve
previously retrieved, and you want to convince them that what you’re
giving them really did come from the original web server at some point.</p>
<p>This could be useful, for example, with the Internet Archive. If I want
to contribute my own crawl results, how can the Archive verify that I
didn’t forge the pages I’m submitting?</p>
<p>Note that the recipient can’t just fetch the same page themself and
compare it to what you gave them. It may have changed since the time
when you retrieved it, or they may not have access to that particular
document.</p>
<p>What I really want is for the origin server to have digitally signed the
document. There’s an Internet Draft for <a href="https://datatracker.ietf.org/doc/draft-yasskin-http-origin-signed-responses/">Signed HTTP Exchanges</a> which
does exactly that, but for the purposes of this exercise I want to be
able to do it for any web server as it is deployed today, rather than
requiring support for a new protocol.</p>
<p>So how else can I get an existing web server to sign its response for
me?</p>
<h1 id="abusing-tls">Abusing TLS</h1>
<p>If I’m using HTTPS then the server is already sending me a digital
signature that’s publicly verifiable against their TLS certificate, in
the <code class="language-plaintext highlighter-rouge">CertificateVerify</code> message during the TLS handshake. Can I take
advantage of that?</p>
<p>Well, not exactly. That signature only covers the preceding parts of the
handshake. TLS uses an assortment of new symmetric keys for everything
else, all derived from a random key generated through a Diffie-Hellman
ephemeral key exchange.</p>
<p>The TLS handshake does guarantee that as long as neither party has
leaked any of the ephemeral keys, then the client can trust that the
server it’s talking to also possesses the private key corresponding to
the server’s public certificate.</p>
<p>Because the cryptography in use is based on symmetric keys, by
definition both the client and server share all the same keys after the
handshake. That means that either end could forge messages pretending to
be from the other. But that’s okay, because if I receive a message from
you that claims to be from me, I’m not fooled. The worst either end
could do is willfully fool themselves, and there are easier ways to do
that.</p>
<p>Those guarantees go out the window as soon as I try to convince a third
party of the authenticity of these messages, though.</p>
<p>The only way I can think of to do that is to reveal the ephemeral keys,
together with the full encrypted stream. At that point, the recipient
can verify that the entire stream was produced by somebody who had those
keys, and verify that the server’s offered certificate was valid at the
time of the conversation. But they can’t verify that I didn’t forge part
or all of the application data. In fact, they’re now in a position to do
their own forgeries!</p>
<h1 id="trusted-timestamping-proxy">Trusted timestamping proxy</h1>
<p>I think there’s a simple fix for that problem, although it makes the
solution much less useful than I hoped for.</p>
<p>Let’s introduce a simple proxy server. I’ll connect to the origin server
through this proxy. The proxy will log the bytes that are sent back and
forth through it, but it won’t understand them: it’ll only see the
encrypted TLS traffic. At the end of the connection, it will give me a
digital signature over a timestamp plus the complete connection log. I
can check that the signature from the proxy matches what I saw from my
end of the conversation, and discard the log if not.</p>
<p>(Introducing this proxy doesn’t change the security story for TLS: it
already had to be resilient against passive or active attackers sitting
between the client and the server.)</p>
<p>Now I can give you the signed connection log, plus the ephemeral keys
used during the connection.</p>
<p>If you trust that the proxy only produces signatures over traffic that
actually passed through it, then you have enough information to verify
that I’ve given you an authentic conversation with the origin server.</p>
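<p>To make the shape of that attestation concrete, here’s a minimal
sketch in Python. Everything here is hypothetical: the field names are
made up, and HMAC with a shared key stands in for the public-key
signature (e.g. Ed25519) a real proxy would use, purely so the example
is self-contained.</p>

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real proxy would hold an asymmetric
# private key and publish the corresponding public key.
PROXY_KEY = b"proxy-secret-for-illustration"

def attest(connection_log: bytes, timestamp: int) -> dict:
    # The proxy commits to a timestamp plus a digest of the complete
    # encrypted connection log, then signs that commitment.
    blob = json.dumps({
        "timestamp": timestamp,
        "log_sha256": hashlib.sha256(connection_log).hexdigest(),
    }, sort_keys=True).encode()
    sig = hmac.new(PROXY_KEY, blob, hashlib.sha256).hexdigest()
    return {"attestation": blob.decode(), "sig": sig}

a = attest(b"...full encrypted TLS stream...", 1588000000)

# A verifier recomputes the MAC over the attestation bytes; with a
# real signature scheme this step would need only the public key.
ok = hmac.compare_digest(
    a["sig"],
    hmac.new(PROXY_KEY, a["attestation"].encode(), hashlib.sha256).hexdigest())
print(ok)  # → True
```

<p>The recipient would additionally replay the logged TLS records with
the revealed ephemeral keys to recover the plaintext, which the sketch
above doesn’t attempt.</p>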
<h1 id="unless">unless…</h1>
<p>Well, almost. If I can convince the proxy to connect to a server under
my control, that server can complete the handshake with the origin
server, then share the ephemeral keys with me via another channel, and
we can collaborate to conduct a completely forged conversation through
the proxy, getting its signature on traffic that the origin server never
saw.</p>
<p>Part of the process of verifying a TLS certificate involves checking
that the DNS name in the certificate matches the DNS name the client was
trying to connect to. We can’t make the proxy entirely trustworthy, but
we can make it less untrustworthy if it does its best to ensure that the
server it connects to corresponds to the requested DNS name.</p>
<p>At the least this means any DNSSEC signatures should be included in the
proxy’s traffic log, so a recipient can be convinced that the DNS
response came from the same origin.</p>
<p>I don’t know of any way to secure the IP connection, though.</p>
<h1 id="wilder-options">Wilder options</h1>
<p>Personally I feel like I’d be pretty willing to accept evidence from a
trusted timestamping proxy like this, assuming the proxy were operated
by a reasonably neutral party.</p>
<p>If that isn’t enough for you, I suspect there’s a way to use secure
multi-party computation to have the proxy and the client both
participate in the conversation, in such a way that neither is capable
of leaking the ephemeral keys during the conversation.</p>
<p>But this is already more complicated than I hoped for, and doesn’t offer
much advantage over just having people fetch pages from origin servers
themselves. So I’m calling it a day.</p>
<p>Reading the TLS 1.3 specification carefully enough to work all this out
was sure an interesting exercise though!</p>Jamey SharpI was doing a mental exercise and it went someplace that I think is not terribly useful but still interesting enough to share.The most general reverse proxy2020-04-24T15:15:00-07:002020-04-24T15:15:00-07:00https://jamey.thesharps.us/2020/04/24/most-general-reverse-proxy<p>Deploying web applications is hard to do, and there are a bewildering
variety of ways to do it. Today I’d like to confuse the matter further
by proposing yet another approach. As a preview, here are the components
I’ll discuss:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/TLS_termination_proxy">TLS termination</a> and when to not do it</li>
<li>TLS <a href="https://en.wikipedia.org/wiki/Server_Name_Indication">Server Name Indication</a> to route entire connections</li>
<li><a href="https://www.haproxy.org/download/2.1/doc/proxy-protocol.txt">PROXY protocol</a> to forward clients’ addresses without parsing the
request stream</li>
<li>Unix sockets to communicate with backend applications</li>
<li>Unix permissions to isolate applications</li>
</ul>
<h1 id="background">Background</h1>
<p>There is, of course, no one best way to make a web site available to the
world, because people’s requirements vary a lot. For example, maybe your
site is like this blog, which I write in Markdown and compile to static
HTML using Jekyll, and any static hosting service will do. Things get
more complicated if you want to do any server-side computation, or even
just control your response status codes and headers.</p>
<p>Personally, I want to be able to put a new experiment online quickly at
any time, and I want to be able to use any feature of the evolving web
(such as WebSockets or <a href="https://en.wikipedia.org/wiki/HTTP/2_Server_Push">HTTP/2 Server Push</a>), with any kind of
server-side framework, database, or other tooling. To that end, I
currently have a dedicated server hosted by <a href="https://www.hetzner.com/">Hetzner</a>, so that I can
try anything that I’m capable of making work on bare metal.</p>
<p>“Platform as a Service” (PaaS) offerings like <a href="https://www.heroku.com/">Heroku</a> are popular for
quick experiments that might turn into big projects. They impose
constraints on your application that make it easier to scale, at least
for them, and maybe also for you. For example, your application can’t
store any persistent data to its local disk; it has to use an external
service like a database or storage provider. This allows applications to
crash without losing data, and to migrate across physical machines
easily, and to be scaled up by running duplicate copies of identical
code on as many physical computers as necessary.</p>
<p>But I’m not concerned about making my web sites scale; I desperately
hope I <a href="https://medium.com/@jkriss/anti-capitalist-human-scale-software-and-why-it-matters-5936a372b9d">never create the next Google or Facebook</a>. A single
computer these days can serve an awful lot of people so long as you
don’t write code that’s <a href="https://accidentallyquadratic.tumblr.com/">accidentally quadratic</a> or otherwise
unnecessarily resource-intensive.</p>
<p>I don’t care about scaling and I don’t want constraints, so Heroku <em>et
al</em> offer exactly the wrong trade-off for me. But there is one feature I
love from providers like Heroku: As a developer, your deployment process
is little more than recording what software you need installed and then
running <code class="language-plaintext highlighter-rouge">git push</code>.</p>
<p>In order to get that simple workflow on my own server, I’ve been using
<a href="http://dokku.viewdocs.io/dokku/">Dokku</a> for a couple of years. Built on top of the <a href="https://github.com/gliderlabs/herokuish">Herokuish</a>
compatibility layer, Dokku provides a fully open-source single-server
PaaS that can generally run the same application images that Heroku can.
Dokku doesn’t have the breadth of one-click third-party integrations
available that Heroku does, but on the other hand, it allows you to use
arbitrary Docker settings such as persistent volume mounts which don’t
fit Heroku’s scaling model. Anyway, I never wanted to pay for any of those
third-party integrations, so I don’t miss them.</p>
<p>Unfortunately, I’ve found that Dokku still imposes constraints I don’t
like. I believe fixing all of them would push it too far from being
compatible with Heroku, and therefore it wouldn’t be the same project
any more. So I’ve been working out what I want from first principles.</p>
<h1 id="my-requirements">My requirements</h1>
<p>I want to share a single server and a single IP address among multiple
applications, possibly managed by different people, who may all distrust
each other to some degree. For my personal server these would be friends
who I just want to protect from accidental interference, but ideally it
wouldn’t be a big leap from there to real security.</p>
<p>Authorized people should be able to add new applications with minimal or
no involvement from the system administrator, depending on how much
trust the administrator wants to extend; but they had better not be able
to interfere with other people’s applications.</p>
<p>I want to use only standard Unix permissions to control how people with
access to the server can interact with their own and others’
applications.</p>
<p>Sharing an IP address requires multiplexing the HTTP and HTTPS ports
between the different applications, but the reverse proxy which does
that should be as minimal as possible. Minimalism is good for security,
but that isn’t the key reason. If the proxy processes the protocol
stream in any way, that can impose several different kinds of
constraints on backend applications:</p>
<ul>
<li>
<p>Any interpretation it does means that it can only forward traffic
which it understands, keeping applications from taking advantage of
the full features of HTTP/2 or other specifications. For example, as
far as I can tell it is currently impossible to implement <a href="https://tools.ietf.org/html/rfc8030">RFC8030</a>
(Generic Event Delivery Using HTTP Push, part of the <a href="https://www.w3.org/TR/push-api/">Push API</a>) if
<a href="https://nginx.org/">nginx</a> is your reverse proxy, because that server only supports
sending a push promise that it immediately resolves.</p>
</li>
<li>
<p>Imposing arbitrary limits, such as limiting uploaded file size or
providing a “Web Application Firewall”, means an application might
need those limits configured. There is no one-size-fits-all
configuration. That means untrusted people need to be able to change
the configuration of a shared component, which makes it way more
difficult to show that people can’t interfere with the configuration
of each other’s applications.</p>
</li>
</ul>
<p>Each application should be assigned one or more hostnames, which are not
shared with any other application. Sharing at the level of subpaths or
other properties of a request gets substantially more complicated. More
importantly, hostnames are special in HTTPS because they’re provided
early via TLS <a href="https://en.wikipedia.org/wiki/Server_Name_Indication">Server Name Indication</a>, so the system can make
routing decisions without inspecting the HTTP protocol stream at all.</p>
<p>I thought I wanted the proxy to provide <a href="https://en.wikipedia.org/wiki/TLS_termination_proxy">TLS termination</a>, so the
administrator can provide wildcard certificates and DNS records for some
domain. That would allow people to host applications at any subdomain of
that domain without having to configure TLS or DNS themselves. However,
the HTTP/2 specification says that browsers are allowed to send requests
for any domain listed in the server’s certificate over the same
connection. Per <a href="https://httpwg.org/specs/rfc7540.html#rfc.section.9.1.1">section 9.1.1 of RFC7540</a>:</p>
<blockquote>
<p>For example, a certificate with a <code class="language-plaintext highlighter-rouge">subjectAltName</code> of <code class="language-plaintext highlighter-rouge">*.example.com</code>
might permit the use of the same connection for requests to URIs
starting with <code class="language-plaintext highlighter-rouge">https://a.example.com/</code> and <code class="language-plaintext highlighter-rouge">https://b.example.com/</code>.</p>
</blockquote>
<p>That means routing based on SNI is only safe if each certificate is
associated with exactly one backend. So instead I require the backend
application to be responsible for provisioning its own TLS certificates,
probably using ACME/<a href="https://letsencrypt.org/">Let’s Encrypt</a>.</p>
<p>I’d like to have the option of using Docker, KVM, etc, but I also want
the option to <em>not</em> use any container or virtualization technology for a
given application. I want to run an application and all its
dependencies, such as databases, caches, and task queues, entirely under
my own privileges. I want to be able to use any software I can figure
out how to build, and it shouldn’t be necessary to run any of it with
elevated privileges.</p>
<p>It’s nice to support zero-downtime rolling upgrades even on a personal
server used primarily for experiments. I want to start up a new version
of an application and test that it’s working, before atomically
switching outside traffic over to it and gracefully shutting down the
old version. But I want to do that without reloading or reconfiguring
the reverse proxy, because again, otherwise it’s hard to show that
people can’t interfere with each other’s applications.</p>
<h1 id="solution">Solution</h1>
<p>I haven’t found any existing software that I’m fully satisfied with, so
I wrote my own, with a focus on minimizing the amount of code that has
to be trusted. I used <a href="https://www.rust-lang.org/">Rust</a>, for its combination of memory safety,
performance, and event-driven async/await syntax, to write two reverse
proxies:</p>
<ul>
<li><a href="https://github.com/jameysharp/sniproxy-rs">sniproxy</a> for routing TLS connections according to SNI</li>
<li><a href="https://github.com/jameysharp/httpdir">httpdir</a> for routing HTTP/1.1 requests according to Host header</li>
</ul>
<p>They follow very similar designs. Neither program takes any command-line
options, or even a configuration file in a traditional sense.</p>
<p>First off, you have to pass a listening TCP socket as stdin (file
descriptor 0). There are many ways to do that, whether using classic
tools like <code class="language-plaintext highlighter-rouge">inetd</code> or new alternatives like <code class="language-plaintext highlighter-rouge">systemd</code>. If you need to
listen on multiple sockets, run a separate copy of the proxy for each
socket.</p>
<p>This has several advantages:</p>
<ul>
<li>
<p>It keeps the implementation simpler and easier to review because there
are no configuration options to parse.</p>
</li>
<li>
<p>It means that the proxy itself never needs to have elevated
privileges, which are otherwise required to bind to port numbers below
1024.</p>
</li>
<li>
<p>A service manager can hang on to the socket and later pass it to a new
version of the proxy, while allowing the old version to finish any
connections it’s already handling and gracefully exit.</p>
</li>
<li>
<p>If the single-threaded proxy turns out to be CPU-bound, the
administrator can run multiple copies of it listening to the same
socket, and pin each one to a separate CPU core.</p>
</li>
</ul>
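<p>The fd-0 handoff can be sketched in a few lines. Python stands in
for both sides here: the first half simulates what inetd or a service
manager does before exec’ing the proxy, and the second half shows the
proxy adopting the already-bound, already-listening socket without
parsing any configuration.</p>

```python
import os
import socket

# Service-manager side: create and bind the listening socket, then
# arrange for the child process to inherit it as fd 0 (stdin).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(16)
os.dup2(srv.fileno(), 0)

# Proxy side: adopt the inherited socket from fd 0. No bind(), no
# listen(), and therefore no need for privileges to use a low port.
listener = socket.socket(fileno=os.dup(0))
print(listener.getsockname()[0])  # → 127.0.0.1
```

<p>Because the proxy never binds the socket itself, the manager can
hand the very same file descriptor to a replacement process later,
which is what makes the graceful-upgrade story work.</p>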
<p>Second, create a directory to hold your backend configuration,
containing a subdirectory for each hostname you want to serve. I’ll call
this directory <code class="language-plaintext highlighter-rouge">hosts/</code> here. When you start either of these proxies,
this top-level configuration directory must be its current working
directory.</p>
<p>Unix permissions on the <code class="language-plaintext highlighter-rouge">hosts/</code> directory determine who can create,
delete, or rename a host. You could make it writable only by root, but
if you want to be more permissive you can set the “sticky bit” (e.g.
mode 1777) so that, like <code class="language-plaintext highlighter-rouge">/tmp</code>, anyone can create a directory but can
only delete or rename directories they own.</p>
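<p>Setting up that permissive variant is a one-time administrative
step; a small sketch (using a temporary directory so it’s
self-contained):</p>

```python
import os
import stat
import tempfile

# Create a hosts/ directory with /tmp-like semantics: mode 1777 means
# anyone can create a host entry, but the sticky bit ensures only an
# entry's owner (or root) can delete or rename it.
root = tempfile.mkdtemp()
hosts = os.path.join(root, "hosts")
os.mkdir(hosts)
os.chmod(hosts, 0o1777)  # chmod is not masked by the umask

mode = stat.S_IMODE(os.stat(hosts).st_mode)
print(oct(mode))  # → 0o1777
```
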
<p>Hostnames must be represented without a trailing dot (.) and in
lowercase ASCII. So for international domain names, use the
“<a href="https://tools.ietf.org/html/rfc5890#section-2.3.2.1">A-label</a>” form: if you’re hosting a web site at
“<a href="https://🕸💍.ws">🕸💍.ws</a>”, then you’d put the backend configuration for that
site in <code class="language-plaintext highlighter-rouge">hosts/xn--sr8hvo.ws/</code>.</p>
<p>Since you can’t create multiple directory entries with the same name,
the filesystem enforces that all configured hostnames are unique. It
does not enforce that all configured hostnames are legal or valid, but
the proxies will reject an illegal hostname without checking for its
presence in the filesystem, so any misconfigured backends will just be
silently ignored.</p>
<p>In this configuration, a directory like <code class="language-plaintext highlighter-rouge">hosts/jamey.thesharps.us/</code>
should be owned by the user and/or group who is authorized to manage the
application.</p>
<p>Each hostname subdirectory can have these files:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">http-socket</code> (required for httpdir): a Unix domain socket that your
backend application is listening for HTTP/1.1 connections on.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">tls-socket</code> (required for sniproxy): a Unix domain socket that your
backend application is listening for TLS connections on.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">send-proxy-v1</code>: if this file is present, sniproxy will prefix every
connection to this backend with a <a href="https://www.haproxy.org/download/2.1/doc/proxy-protocol.txt">PROXY protocol</a> v1 header.</p>
</li>
</ul>
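<p>For reference, the PROXY protocol v1 header is a single
human-readable line carrying the client’s real source address, sent to
the backend before any application bytes. This sketch follows the
format in haproxy’s proxy-protocol.txt for the TCP4 case (the
addresses are examples):</p>

```python
# Build a PROXY protocol v1 header: protocol, source and destination
# addresses, then source and destination ports, terminated by CRLF.
def proxy_v1_header(src_ip, dst_ip, src_port, dst_port):
    return f"PROXY TCP4 {src_ip} {dst_ip} {src_port} {dst_port}\r\n".encode()

hdr = proxy_v1_header("192.0.2.7", "198.51.100.4", 56324, 443)
print(hdr)  # → b'PROXY TCP4 192.0.2.7 198.51.100.4 56324 443\r\n'
```

<p>A backend that finds <code class="language-plaintext highlighter-rouge">send-proxy-v1</code> useful just needs to read and
parse this one line before handing the rest of the stream to its TLS
implementation.</p>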
<p>Any of these sockets or directories may be symlinks, which allows
atomically switching them to connect to a different socket at any time.
However, those symlinks should be relative and should only point within
the same directory, because ideally the proxy should be run in a chroot
and so may not have access to sockets located elsewhere.</p>
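<p>The atomic cutover looks like this (socket names are hypothetical;
a temporary directory stands in for a real host directory). Because
the proxy resolves the symlink on every new connection, a
<code class="language-plaintext highlighter-rouge">rename()</code> switches traffic to the
new backend without the proxy being involved at all:</p>

```python
import os
import tempfile

host = tempfile.mkdtemp()  # stands in for hosts/example.com/

# Version 1 of the app is listening on app-v1.sock.
os.symlink("app-v1.sock", os.path.join(host, "http-socket"))

# Deploy version 2: create the new link under a temporary name, then
# rename it over the old one. rename() is atomic on POSIX filesystems,
# so every connection sees either the old target or the new one.
os.symlink("app-v2.sock", os.path.join(host, "http-socket.new"))
os.rename(os.path.join(host, "http-socket.new"),
          os.path.join(host, "http-socket"))

print(os.readlink(os.path.join(host, "http-socket")))  # → app-v2.sock
```
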
<p>You can change the configuration without restarting the proxies: they
look up the target socket each time a connection comes in, and don’t
need to know which hostnames to serve in advance. This is especially
important if untrusted users will be managing any hostname
configurations because allowing them to restart or reload the reverse
proxy is risky.</p>
<h1 id="alternatives">Alternatives</h1>
<p>I examined a lot of existing options before giving up and writing my
own.</p>
<p>I’d like to give a special shout-out to <a href="https://git.gdritter.com/melvil/blob/master/README.md">Melvil</a>, a proof of concept
my friend Getty wrote some years ago, as evidence that my ideas are
hardly unique. I independently ended up with a very similar design,
although Getty was trying to put more functionality into the proxy
layer, while I aimed to identify the least configurable thing which
could provide the most flexibility to the backends.</p>
<p>Aside from Melvil, every alternative I found would have required me to
write shell scripts or similar to construct a suitable configuration
file from the directory structure I wanted to use. I was concerned about
the possibility of directory names which aren’t valid hostnames but
might lead to syntactically invalid configuration files if I substituted
them naively into a template. Also, any time some backend’s
configuration changed, I’d have to reconfigure or restart the proxy.</p>
<p>The classic TLS termination proxy is <a href="https://www.stunnel.org/">stunnel</a>, and I thought it
looked promising for a while, but I never quite managed to make enough
sense of its documentation to put together even a minimal proof of
concept. It seems to support a variety of protocols and configuration
options, but it isn’t clear whether they work together in all
combinations. I got particularly confused about whether turning on PROXY
protocol support would affect the incoming connection from the client or
the outgoing connection to the backend. Anyway I eventually decided I
didn’t want TLS termination at all, and that’s all stunnel really does.</p>
<p>All the remaining alternatives that I considered put both TLS and
HTTP/1.1 routing in the same application. I believe this is a task which
should not share any state across connections, so there’s no reason to
handle more than one socket per process. And there are good reasons to
isolate parsers for different protocols from each other: a buggy HTTP
parser shouldn’t be able to leak information from or deny service to
another TLS connection, and <em>vice versa</em>.</p>
<p>Part of what convinced me that routing based purely on SNI could work
was an existing <a href="https://github.com/dlundquist/sniproxy">SNI Proxy</a> project. I read that implementation
carefully and quite like a lot of the ideas in it. I didn’t have a lot
of confidence in the TLS protocol parser though—which says “This
was created based primarily on Wireshark dissection of a TLS handshake
and RFC4366”—and in fact I believe that implementation would break
if a client sent fragmented TLS records. Whether that’s actually a flaw
remains an open question, since well-behaved clients don’t normally do
that during the TLS handshake, but I decided I wanted to accept anything
that the TLS 1.3 specification allows.</p>
<p>I spent the most time fiddling with <a href="https://www.haproxy.org/">haproxy</a>. Eventually I had a
somewhat sketchy configuration that I was pretty happy with, which would
terminate TLS but use only SNI to decide which backend to route the
decrypted stream to. This was the point when I discovered that HTTP/2
specifically allows clients to reuse a connection for a different
hostname, as long as both hostnames appear in the certificate that the
server presented for that connection. Honestly I think aggressive
connection reuse is a good idea, but it made my prototype worthless so I
was pretty grumpy for a bit.</p>
<p>Anyway, terminating TLS meant that haproxy had to negotiate extensions
like ALPN before identifying what backend to connect to, rather than
letting the backend decide what extensions to support. I had to
hard-code things like “this server supports both HTTP/2 and HTTP/1.1”,
and then force the backend to deal with the result. In my experiments I
was quite happy with <a href="https://nghttp2.org/">nghttp2</a>’s <code class="language-plaintext highlighter-rouge">nghttpx</code> for providing transparent
HTTP/2 support to backends that didn’t have it natively. But then I
realized that haproxy could do the same thing, so I folded that into my
haproxy configuration… before abandoning that approach entirely.</p>
<p>One of the first tools I tried was <a href="https://containo.us/traefik/">Traefik</a>, which is designed around
letting some container management system provide the information about
which backends are configured. I thought about using the file backend
and pretending to be a container management system, even for my backends
that weren’t actually in containers. But I couldn’t see how to ensure
that a given hostname was only managed by the user or group I authorized
to do so, unless I wrote a custom plugin, and I didn’t want that level
of complexity.</p>
<p>I also tried coercing <a href="https://nginx.org/">nginx</a> into giving me this kind of delegated
authority over configuration. That seemed almost plausible! Because
nginx <code class="language-plaintext highlighter-rouge">include</code> directives require that the named file be syntactically
valid by itself, if I wrapped the user configuration inside a <code class="language-plaintext highlighter-rouge">location
/</code> section, I could have relatively simple scripts generate templated
configuration that would at least somewhat isolate backends from each
other. Of course there are some kinds of configuration someone might
reasonably want that aren’t allowed inside a <code class="language-plaintext highlighter-rouge">location</code> section, but this
seemed like it would support most of the common needs at least.</p>
<p>But a lot of nginx configuration directives can have global effects. For
example, if caching is turned on, any location can set an arbitrary
cache key, potentially overwriting cache entries for other hosts.
Turning off caching mitigates that particular problem, but then backends
might need to configure their own caching layer, possibly by running
another copy of nginx. If they have to manage a more capable reverse
proxy anyway, why put another one with all that code sitting unused in
front of it?</p>
<p>Anyway, for this approach to work I would need to audit every
configuration directive of every module, and re-audit on every upgrade.
So that’s definitely not an option.</p>
<p>Finally, I also considered <a href="https://caddyserver.com/">Caddy</a> and <a href="https://h2o.examp1e.net/">h2o</a>, but both have a single
configuration file with no clear method for safely templating it from
untrusted configuration fragments, so I didn’t see much advantage to
examining them further for this purpose. They do both look very
promising for backend use, though; h2o was easier to configure as a
quick static file server than anything else I tried.</p>
<h1 id="results">Results</h1>
<p>In total I wrote about 650 lines of Rust source for the two reverse
proxies and I can explain every line of output from running either one
under strace, so I feel I did pretty well toward my minimalism goals. I
spent about two weeks putting these tools together, once I gave up on
the existing alternatives.</p>
<p>70% of that source code is in sniproxy, since it contains my parser for
extracting the SNI extension from the initial ClientHello message. It’s
also a bit more sophisticated about handling signals, graceful
shutdowns, and long hostnames, so httpdir might yet catch up somewhat in
complexity. On the other hand, httpdir compiles to a much larger binary,
because it pulls in an entire HTTP implementation from the <code class="language-plaintext highlighter-rouge">hyper</code>
library.</p>
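<p>To illustrate what that SNI extraction involves, here’s a
simplified Python sketch of the same idea. To be clear, this is not
sniproxy’s actual Rust code, and it takes exactly the shortcut the
post criticizes: it assumes the whole ClientHello arrives in a single
TLS record.</p>

```python
import struct

def sni_from_client_hello(record: bytes):
    # Walk a TLS ClientHello to the extensions block and pull out the
    # server_name (SNI) hostname, per the handshake wire format.
    assert record[0] == 0x16                  # handshake record type
    hs = record[5:]                           # skip 5-byte record header
    assert hs[0] == 0x01                      # ClientHello
    p = 4 + 2 + 32                            # handshake hdr, version, random
    p += 1 + hs[p]                            # session id
    (n,) = struct.unpack_from(">H", hs, p); p += 2 + n   # cipher suites
    p += 1 + hs[p]                            # compression methods
    (n,) = struct.unpack_from(">H", hs, p); p += 2       # extensions length
    end = p + n
    while p < end:
        ext, elen = struct.unpack_from(">HH", hs, p); p += 4
        if ext == 0:                          # server_name extension
            # skip list length (2) and entry type (1), read name length
            (nlen,) = struct.unpack_from(">H", hs, p + 3)
            return hs[p + 5 : p + 5 + nlen].decode("ascii")
        p += elen
    return None

# Build a minimal synthetic ClientHello to exercise the parser.
name = b"jamey.thesharps.us"
entry = b"\x00" + struct.pack(">H", len(name)) + name      # host_name entry
ext = (struct.pack(">HH", 0, len(entry) + 2)               # ext type + length
       + struct.pack(">H", len(entry)) + entry)            # server_name_list
body = (b"\x03\x03" + bytes(32)                            # version + random
        + b"\x00"                                          # empty session id
        + struct.pack(">H", 2) + b"\x13\x01"               # one cipher suite
        + b"\x01\x00"                                      # null compression
        + struct.pack(">H", len(ext)) + ext)
hello = b"\x01" + len(body).to_bytes(3, "big") + body
record = b"\x16\x03\x01" + struct.pack(">H", len(hello)) + hello

print(sni_from_client_hello(record))  # → jamey.thesharps.us
```

<p>The real implementation additionally has to reassemble fragmented
records and bound its buffering against hostile inputs, which is where
most of sniproxy’s extra code goes.</p>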
<p>I haven’t reached the point where I can <code class="language-plaintext highlighter-rouge">git push</code> to deploy my
applications behind these proxies, but at this point the only other
thing an administrator has to provide is a per-user service manager that
starts at system boot. That could be by running <code class="language-plaintext highlighter-rouge">loginctl enable-linger</code>
on a systemd installation, or starting per-user copies of <code class="language-plaintext highlighter-rouge">daemontools</code>
or <code class="language-plaintext highlighter-rouge">supervisord</code> or anything else for each user to add their own
services to.</p>
<p>As long as users have some way to set up long-lived services that get
started again automatically after a reboot, and the ability to add host
configurations for httpdir and sniproxy, it’s possible to build the rest
of the glue needed for a “PaaS” without administrative privileges. I
have some plans for how I intend to do that using <a href="https://nixos.org/nix/">Nix</a>, but I hope to
see other people tackle that part of the challenge in different ways
too, because I’m certain I won’t get that right on the first try.</p>
<p>So please check out <a href="https://github.com/jameysharp/sniproxy-rs">sniproxy</a> and <a href="https://github.com/jameysharp/httpdir">httpdir</a> and see if they’d be
useful to you too!</p>Jamey SharpDeploying web applications is hard to do, and there are a bewildering variety of ways to do it. Today I’d like to confuse the matter further by proposing yet another approach. As a preview, here are the components I’ll discuss:Transit fares are not about money2019-12-12T12:45:00-08:002019-12-12T12:45:00-08:00https://jamey.thesharps.us/2019/12/12/transit-fares-are-not-about-money<p>I live in <a href="https://en.wikipedia.org/wiki/Portland_metropolitan_area">Portland, Oregon</a>, with a metropolitan area of about 2.5
million people. We have a pretty decent mass transit system here, called
<a href="https://en.wikipedia.org/wiki/TriMet">TriMet</a>. I don’t have a lot of experience with other cities’ transit
systems, but I do know a few things about this one.</p>
<p>A few years ago I wondered why TriMet made a particular policy decision,
and when I looked into it I discovered that my intuition about
government budgets was wrong. Today I’d like to walk you through the
things that surprised me, using a case study on what transit riders pay
to ride the bus here.</p>
<p>For 37 years (from 1975 to 2012), TriMet designated a small part of
their service area as “Fareless Square”. This area primarily covered
downtown Portland, which is the hub where most bus and light-rail routes
meet. As long as you were riding only within that area, you didn’t have
to pay anything to ride.</p>
<p>While I was a student at Portland State University in downtown Portland,
Fareless Square was fantastic, both for students like myself and for
guests the university hosted. For example, on the somewhat surreal
occasion when I as an undergraduate student got volunteered to host a
meeting of Larry Wall and the rest of the Perl 6 team, it was trivial
for me to take everyone to a nice restaurant for dinner; we just hopped
on the light-rail and didn’t have to deal with fares for a dozen
out-of-towners.</p>
<p>There were several reasons given for ending Fareless Square, but
officially the big one was that TriMet faced a budget shortfall and
predicted they’d get $2.7 million more in revenue per year if they quit
giving people free rides. (For more detailed history and other concerns
about Fareless Square, see KGW’s article: <a href="https://www.kgw.com/article/news/local/editors-picks/what-was-trimets-fareless-square-and-should-it-come-back/451692563">What was TriMet’s ‘Fareless
Square’ and should it come back?</a>) My memory may be faulty, but I
think I recall the total shortfall was expected to be around $15
million.</p>
<p>Compared to my household budget, those are big numbers! It made sense to
me at the time that a budget shortfall of that magnitude would be a big
deal. I was really unhappy about losing Fareless Square, but I accepted
their claim that this was necessary to solve their budget woes. But was
it really?</p>
<p>Because TriMet is a public agency, their annual budget report is also
public information, and a few years ago in a fit of idle curiosity I
actually read it. I’ve just updated my findings against <a href="https://trimet.org/budget/pdf/2020-approved-budget.pdf">TriMet’s FY
2020 budget</a>, which I encourage you to look at if you have
difficulty sleeping.</p>
<p>About 95% of TriMet’s revenue comes from three sources:</p>
<ul>
<li>60.1% from payroll taxes,</li>
<li>19.6% from federal and state funds,</li>
<li>and only 16.1% from passenger fares.</li>
</ul>
<p>Passenger fares are expected to come to about $112 million this
year, a figure which has not changed significantly since the 2013 fiscal
year. (The budget’s exhibit 2 provides these numbers for every year from
1973 onward, if you’re curious.) And passenger fares are the smallest
portion of their revenue. So it’s safe to say $2.7 million was a drop in
the bucket.</p>
<p>Given these facts, I feel cheated. Fareless Square was awesome and they
took it away for such a paltry amount of money? I’m left believing that
finances were just an excuse they knew people wouldn’t argue with, not
the real motivation. (Again, see the <a href="https://www.kgw.com/article/news/local/editors-picks/what-was-trimets-fareless-square-and-should-it-come-back/451692563">KGW article</a> for other
possible concerns that may have motivated TriMet management.)</p>
<p>When I realized that, I was angry. I wondered, what if TriMet had gone
the other direction? What would happen if we made all of TriMet
fareless? It wouldn’t be the first transit system to provide <a href="https://en.wikipedia.org/wiki/Free_public_transport">free
public transport</a>. There are well-known advantages for rider
satisfaction and system efficiency if everyone can just get on and off
without the hassle of dealing with fares. And of course as a matter of
public policy it’s good for both traffic congestion and air quality if
more people can be encouraged to use mass transit.</p>
<p>There are quite a few important questions that would need to be answered
to make that policy proposal succeed, and I’m not interested in
addressing most of those questions in this post. For example, a fareless
system would have more riders and correspondingly higher costs, and any
funding proposal would need to take that into account. But I want to
continue focusing on things that surprised me when I threw out my
assumptions and actually examined TriMet’s budget.</p>
<p>Given the earlier numbers, we know that today we’d have a $112 million
budget shortfall that we’d have to make up somewhere. But
wait—collecting those fares isn’t free. TriMet has budgeted over $10
million this year for personnel, materials, and services in the “Fare
Revenue & Administrative Services Department”. There are some relevant
costs elsewhere in the budget as well.</p>
<p>So let’s say the actual shortfall is about $100 million.</p>
<p>Here again my intuition was wrong. “Obviously,” I reasoned, “the system
is funded by passenger fares and would collapse without them.” But funds
from local, state, and federal taxes are almost 80% of the budget, and
the fare revenue is only 16%.</p>
<p>For my whole life, a major controversy in Oregon government has been
around the use of property taxes to fund significant parts of state and
local government programs, so my first guess was that must be the taxes
funding TriMet. (If you’re curious about the history of this
controversy, you can read up on <a href="https://oregonencyclopedia.org/articles/measure_5_property_taxes/">Oregon ballot measures 5, 47, and
50</a>.) So based on that assumption, I did the math and worked
out that if Portland Metro area voters would approve a property tax of
34 cents per $1000 of assessed value, that would make up for the lost
fare revenue. But it turns out that hardly any TriMet funding comes from
property taxes. As far as I can tell, it’s at most $3.75 million of the
budget, and probably less than that. Given the controversy and the fact
that property taxes are not a significant part of TriMet’s budget today,
I didn’t consider this possibility further.</p>
<p>Where else could that money come from?</p>
<p>The largest portion of TriMet’s tax revenue comes from payroll and
self-employment taxes levied on employers across the region, amounting
to $382 million over the last year. What if businesses were interested
in getting on board with a fareless plan?</p>
<p>The 2019 <a href="https://trimet.org/taxinfo/">TriMet payroll tax rate</a> is approximately 0.75%.
For comparison, <a href="https://www.irs.gov/taxtopics/tc751">federal payroll taxes for Social Security and
Medicare</a> add up to 7.65% this year in employer contributions,
which is ten times as much. A few years ago when I asked my employer’s
accountant how much they actually paid in TriMet taxes, they considered
it practically a rounding error in their overall budget. Now that I’m
self-employed I pay TriMet taxes myself and they’re tiny for me too,
especially considering the <em>15.3%</em> tax I have to pay for Social Security
and Medicare.</p>
<p>My first instinct when I discovered that TriMet payroll tax revenue was
$382 million was to figure out how much that has to grow to cover the
$100 million of fare revenue, which works out to a 30% increase. But
asking businesses to pay 30% more taxes sounded like something they’d
fight vigorously, so I knew I needed to look at this more carefully.</p>
<p>How much money are we really talking about? Increasing the TriMet
payroll tax by 30% would bring it from around 0.75% to roughly 1%, which
is still not much. We can break this down by how much more an employer
would pay each year per employee based on that employee’s wages:</p>
<ul>
<li>
<p>$60 for full-time minimum wage employees: The <a href="https://www.oregon.gov/boli/WHD/OMW/Pages/Minimum-Wage-Rate-Summary.aspx">Portland Metro area
minimum wage</a> is currently $12.50/hour. Increasing the
TriMet tax to 1% would cost employers about $60 more per full-time
minimum-wage employee per year. In exchange, these employers could
make it easier for both their employees and their customers to get to
their place of business.</p>
</li>
<li>
<p>$150 for employees earning the average salary for the region:
According to Exhibit 7 in TriMet’s budget, “Local Economic Trends”,
the average pay per employee in this region is $62,717 per year. For
each employee earning that much, increasing the TriMet tax to 1% would
cost their employer about an extra $150 per year.</p>
</li>
<li>
<p>Less than the cost of monthly bus passes for employees earning up to
$500,000 annually: Many employers in the region buy TriMet tickets and
monthly passes for their employees, which cost up to $1,200 per year
per employee. If TriMet went fareless, then employers would no longer
need to pay that cost, but they’d pay more in the TriMet tax. Still,
they’d actually save money on every employee who makes less than
$500,000 per year!</p>
</li>
</ul>
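<p>If you want to check my arithmetic, here’s a quick Python sketch of the
estimates above. The rates and salaries come from the sources I cited; the
2,080-hour work year and the use of exactly 1% for the increased rate are my
own simplifying assumptions, so treat the outputs as rough figures:</p>

```python
# Rough sanity check of the per-employee cost estimates in the list above.
# Assumptions (mine, not TriMet's): a full-time year is 2,080 hours, and the
# increased rate is exactly 1% (rounding up from 0.75% * 1.3 = 0.975%).
current_rate = 0.0075   # approximate 2019 TriMet payroll tax rate
proposed_rate = 0.0100  # the rounded "about 1%" figure
delta = proposed_rate - current_rate  # extra tax per dollar of wages

min_wage_salary = 12.50 * 2080  # full-time at the Portland Metro minimum wage
avg_salary = 62717              # Exhibit 7, "Local Economic Trends"
annual_pass_cost = 1200         # upper end of employer-paid annual passes

# These come out near the rounded $60, $150, and $500,000 figures above.
print(f"minimum-wage employee: ${min_wage_salary * delta:.0f}/year extra")
print(f"average-salary employee: ${avg_salary * delta:.0f}/year extra")
print(f"break-even salary vs. a transit pass: ${annual_pass_cost / delta:,.0f}")
```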
<p>Once I broke the costs down this way, I learned that on the scale of a
metropolitan region the size of the Portland area, even numbers like
$100 million just aren’t that big. That was the biggest surprise for me.</p>
<p>Many people won’t even consider a proposal like fareless transit on the
grounds that it sounds too expensive, but as I discovered, our
intuitions about finance mislead us when we think about budgets on the
scale of governments.</p>
<p>I don’t mean to suggest that fareless transit should be a no-brainer,
because these sorts of proposals face many more objections than just
“it’s too expensive.” What I would like you to take away from this story
are these points:</p>
<ul>
<li>
<p>If you would like to persuade someone that a government policy
proposal is a good idea and they object that it would be too
expensive, you may well be able to disabuse them of that notion. But
bear in mind that they are likely to have more concerns hidden behind
that one, and you should work to identify what those concerns are so
you can have a real dialog.</p>
</li>
<li>
<p>If you would like to persuade someone that a government policy
proposal is a <em>bad</em> idea, you should be very cautious about using the
argument that “it’s too expensive,” and make a sincere effort to
estimate what the proposal would cost. Otherwise you may find your
argument quickly torn to shreds. I encourage you to think carefully
about which of your values the proposal offends and make your argument
based on those instead.</p>
</li>
</ul>Jamey SharpI live in Portland, Oregon, with a metropolitan area of about 2.5 million people. We have a pretty decent mass transit system here, called TriMet. I don’t have a lot of experience with other cities’ transit systems, but I do know a few things about this one.