<h1>Breaking ciphers with regular expressions</h1>
<p>2023-07-04</p>
<blockquote>
<p>Some people, when confronted with a problem, think “I know, I’ll use
regular expressions.” Now they have two problems.</p>
<p>— <a href="http://regex.info/blog/2006-09-15/247">Jamie Zawinski</a></p>
</blockquote>
<p>Regular expressions are frequently used for tasks where they aren’t a
great choice. On the other hand, the implementations of regular
expressions that we all use are much less capable than formal language
theory allows, so they go unused in many cases where they could be a
great fit.</p>
<h1 id="motivation-data-recovery">Motivation: data recovery</h1>
<p>Years ago I was working on a project that involved tedious data
collection, performed by non-technical people all over the world, when
we started getting sporadic reports of data corruption. We gathered the
corrupted data files and found that there were 512-byte chunks either
re-ordered or just outright missing. We really didn’t want to ask people
to just try collecting the same data again, because it was difficult to
collect the first time. And we couldn’t reproduce or identify the cause
of the data corruption, so we expected we’d see more cases pop up.
Instead, I set out to see what we could recover from the corrupted
files.</p>
<p>For reasons which had seemed like a good idea to people previously
working on this project, the files were “encrypted” using a homebrew
cipher. Each file indicated at the beginning which cipher parameters it
was encrypted with, and the cipher state changed continuously over the
course of the file. So losing some blocks and re-ordering others meant I
didn’t know any of the cipher state at any given point, and therefore
recovering any data seemed impossible.</p>
<p>However: It’s widely understood now that you should never roll your own
cryptography, and this project illustrated why. The cipher was complex
enough to look like an effective secrecy measure at first glance, but
the ciphertext it produced actually leaked a lot of information about
the plaintext and the cipher state.</p>
<p>Just looking at ciphertext side-by-side with the corresponding
plaintext, I could easily see the same patterns in both. The plaintext
had a simple structure that I could easily describe with a regular
expression, and in an early experiment, I was able to write a related
regular expression that could distinguish between ciphertext from these
files versus unrelated data. Despite the apparent complexity of the
cipher, it failed these basic tests.</p>
<p>In the end, I was able to write a program that would decipher any
512-byte block in a fraction of a second. It also reported what the
reconstructed cipher state was at the beginning and end of each block,
which allowed me to put blocks back into the right order by finding the
edges where the cipher states matched. I wrapped the analysis tools I
built into a self-contained program that automated the process of
recovering as much data as possible. That project continued using this
cipher-cracker which I built for years after I left.</p>
<p>I’m not going to give any details about the actual cipher: I was
politely asked not to, and besides it’s not actually that interesting.
But I will show you how to recover cipher state from a much simpler
cipher when you know the structure of the plaintext, using the same
technique I used.</p>
<h1 id="example-caesar-ciphers-over-structured-plaintext">Example: Caesar ciphers over structured plaintext</h1>
<p>A <a href="https://en.wikipedia.org/wiki/Caesar_cipher">Caesar cipher</a>, or shift cipher, is a particularly simple kind of
cipher, but it’s still complex enough to illustrate the technique I want
to focus on today. It involves “rotating” the alphabet a certain number
of positions, where the amount of rotation is a secret shared between
the sender and receiver of the message. For example, a shift of 3 means
that “A” becomes “D”, “B” becomes “E”, etc. When you get to the end of
the alphabet, wrap back around to the beginning: “Y” becomes “B” and “Z”
becomes “C”.</p>
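<p>As a concrete sketch, a shift cipher is only a few lines of Python (the function name is mine, not from any library):</p>

```python
def caesar(text, shift):
    """Rotate each letter A-Z by `shift` positions, wrapping Z back to A."""
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isalpha() else c
        for c in text.upper()
    )
```

<p>With a shift of 3, <code>caesar("ATTACK", 3)</code> gives <code>"DWWDFN"</code>, and shifting by another 23 (equivalently, by -3) recovers the plaintext.</p>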
<p>Now let’s say that you’ve intercepted a secret message where you know
the contents are obscured with a Caesar cipher, but you don’t know the
secret rotation amount. If you assume the message was written in, for
example, English, then there are well-known ways to break this cipher.
One way is to count how often each letter occurs in the ciphertext and
guess that the most common letter represents the most frequently used
letter in English (perhaps “E” or “T”).</p>
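<p>That frequency attack is a short Python sketch (the function name is mine, and it assumes the plaintext's most common letter really is the one you guessed):</p>

```python
from collections import Counter

def guess_shift(ciphertext, common="E"):
    """Guess the shift by assuming the most frequent ciphertext letter
    is the encryption of the most frequent plaintext letter."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    top = Counter(letters).most_common(1)[0][0]
    return (ord(top) - ord(common)) % 26
```
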
<p>But instead let’s say the message is produced by a machine, and its
messages don’t match English letter frequencies. If you know a regular
expression which should match the plaintext, you can break this cipher a
different way.</p>
<p>Let’s look at some simple cases first.</p>
<p>If the machine only produces messages using a single letter, then the
ciphertext must also contain repetitions of a single letter. If the
plaintext should match the regular expression <code class="language-plaintext highlighter-rouge">A+</code>, and you see that the
ciphertext matches <code class="language-plaintext highlighter-rouge">D+</code>, then you know that the message was shifted by 3
letters.</p>
<p>Similar reasoning works if the machine only uses a few letters of the
alphabet. If the plaintext should match <code class="language-plaintext highlighter-rouge">[A-D]+</code> and the ciphertext
contains both the letters “B” and “E”, then the only possible shift is
1: no other shift amount could produce both of those letters.</p>
<p>Or perhaps the message matches the concatenation of multiple different
regular expressions. Take, for instance, <code class="language-plaintext highlighter-rouge">[A-D]*E[FG]*</code>. If we knew
where the sole “E” was in a message we could look at the corresponding
position in the ciphertext and immediately determine which shift was
used. But in this case the “E” could be the first letter if there aren’t
any “A” through “D” letters at the beginning; or it could be the last
letter if there’s no “F” or “G” at the end; or it could be anywhere in
between. Still, if we observe a “C” somewhere, and then later a “G”, and
after that an “I”, then the shift must have been 2, because those three
observations can only be explained if we assume the first came from the
<code class="language-plaintext highlighter-rouge">[A-D]*</code> subexpression, the second from the <code class="language-plaintext highlighter-rouge">E</code>, and the third from the
<code class="language-plaintext highlighter-rouge">[FG]*</code> subexpression.</p>
<p>In general, you can “encrypt” any regular expression using a given guess
about what the shift might be. In the last example above, encrypting
<code class="language-plaintext highlighter-rouge">[A-D]*E[FG]*</code> with a shift of 2 yields <code class="language-plaintext highlighter-rouge">[C-F]*G[HI]*</code>. If the new
expression matches the ciphertext, then that guess could be the right
answer. If you try every possible shift amount and only one guess
matches, then that guess must be the right answer.</p>
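<p>Here is a rough Python sketch of that guess-and-check loop (the helper names are mine). It shifts the letters inside the pattern textually, so a character range that wraps past "Z", like <code>[Y-B]</code>, is no longer a valid range; those shifts are simply skipped:</p>

```python
import re

def shift_pattern(pattern, shift):
    """Rotate every uppercase letter in a letters-only regex; operators
    like *, |, [, and ] pass through unchanged."""
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isupper() else c
        for c in pattern
    )

def candidate_shifts(pattern, ciphertext):
    """Return every shift whose encrypted pattern matches the ciphertext.
    Shifts that produce an invalid (wrapped) character range are skipped."""
    matches = []
    for s in range(1, 26):
        try:
            if re.fullmatch(shift_pattern(pattern, s), ciphertext):
                matches.append(s)
        except re.error:
            pass
    return matches
```

<p>For example, <code>shift_pattern("[A-D]*E[FG]*", 2)</code> gives <code>"[C-F]*G[HI]*"</code>, and <code>candidate_shifts("[A-D]*E[FG]*", "CCGI")</code> narrows the answer down to the single shift 2.</p>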
<h1 id="generalization-weighted-regular-expressions">Generalization: Weighted regular expressions</h1>
<p>Instead of running the regular expression matcher 25 separate times,
once for each of the 25 possible shift amounts in an English-alphabet
Caesar cipher, it’d
be nice to run the matcher once and keep track of which possibilities
worked out. That’s especially important if your cipher has a larger state
space. 25 possibilities isn’t a big deal, but millions or more might be.</p>
<p>I’m going to show how to do this by adding support for “semiring
weights” to a regular expression matcher and by then defining an
appropriate type of weight. I learned about this approach from a paper
called “<a href="https://sebfisch.github.io/haskell-regexp/">A Play on Regular Expressions</a>”, which was
published at ICFP 2010. That is, to date, my favorite academic paper of
all time, and I highly recommend reading it, but I’m going to give a
different presentation of the key ideas here.</p>
<p>To begin with, let’s think about regular expression matching like
playing a guessing game. Consider this regular expression:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(AB|CD)*A
</code></pre></div></div>
<p>If the first character of input is an “A”, which “A” is it matching?
There are two possibilities: we might be expecting a “B” next, or the
end of the input. Let’s mark those two possible positions in this
regular expression with a dot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(A·B|CD)*A·
</code></pre></div></div>
<p>Now if we have another character of input, then we can remove the dot at
the end: that possibility didn’t pan out. If that character is a
“B”, however, the first dot is still a valid guess and we should move it
forward, past the “B” in the pattern. There are three possibilities at
that point, and they are the same as the initial state: we’re expecting
either of the “A”s or a “C”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(·AB|·CD)*·A
</code></pre></div></div>
<p>If we observe a “C” at this point, the corresponding dot is the only
valid guess. We can discard the other two and advance that dot past the
“C” in the pattern:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(AB|C·D)*A
</code></pre></div></div>
<p>In general, imagine keeping track, at each character in the regular
expression, of whether the next character of input is allowed to match
that character in the pattern. This mark is a boolean value, true or
false.</p>
<p>I’m hand-waving over one important detail: When matching input one
character at a time and propagating dots forward, you can keep moving
each mark along until you reach an actual letter (or character class).
At that point you have to stop and save the mark state for the next
round of matching, because you can’t tell whether that letter will match
until you read the next character of input. But no other variety of
pattern needs to store any state between rounds. (Short-circuiting state
for these other operators is equivalent to “epsilon closure” in the
theory of non-deterministic finite automata.)</p>
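<p>To make the guessing game concrete, here is a small Python simulation of boolean marks for the pattern <code>(AB|CD)*A</code>. The follow sets are written out by hand rather than derived from a parser, with the epsilon closure already baked in:</p>

```python
# Letter positions in (AB|CD)*A, numbered left to right.
LETTER = {1: "A", 2: "B", 3: "C", 4: "D", 5: "A"}
# Where a mark moves after each position matches; position 0 is the start.
FOLLOW = {0: {1, 3, 5}, 1: {2}, 2: {1, 3, 5}, 3: {4}, 4: {1, 3, 5}, 5: set()}

def matches(text):
    marks = set(FOLLOW[0])   # dots before "A", "C", and the final "A"
    accepted = False
    for ch in text:
        # keep only the marks whose letter matches this input character
        hit = {p for p in marks if LETTER[p] == ch}
        marks = set().union(*(FOLLOW[p] for p in hit)) if hit else set()
        accepted = 5 in hit  # did the final "A" just match?
    return accepted
```

<p><code>matches("ABCDA")</code> is true, while <code>matches("ABC")</code> is false: after the "C", the only live mark sits before the "D".</p>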
<h2 id="caesar-weights">Caesar weights</h2>
<p>Now let’s extend those boolean marks to have a separate boolean flag for
each “encrypted” variation of the pattern that we want to match. So
instead of one bit per letter in the pattern, we now have 25 bits per
letter. The first bit at each letter represents whether the pattern has
matched to that point given a Caesar shift of 1; the second bit is for a
shift of 2; etc.</p>
<p>Initially, before looking at any input, we should assume that all 25
possible shifts are valid candidates. Over the course of matching we’ll
narrow that set.</p>
<p>How should we propagate these Caesar cipher possibility bit-sets around
the regular expression while we’re matching input? Let’s consider one
regular-expression operator at a time.</p>
<p>For a subexpression which matches a single letter, we can compare the
next input character against the pattern and determine which shift
amount would match. If the pattern is <code class="language-plaintext highlighter-rouge">A</code> and the input is <code class="language-plaintext highlighter-rouge">D</code>, then
only a shift of 3 can match, so we can construct a bit-set where only
the third bit is set. But we know which shifts were valid possibilities
for earlier parts of the input, so we can bitwise-AND the prior bit-set
with the new one we just constructed. If the third bit was considered
valid before, then it’s the only valid option now; otherwise there are
no valid shifts for this part of the pattern at this point in the input,
and we can discard this possibility entirely.</p>
<p>How about the sequence <code class="language-plaintext highlighter-rouge">AB</code>? We have a mark point before “A” and another
before “B”, and each of those points holds a 25-element bit-set. The
marks before “A” get fed through the above process to produce a new
bit-set to be used at “B” next time; similarly, the marks before “B”
are updated and fed on to whatever follows this subexpression.</p>
<p>Alternation patterns like <code class="language-plaintext highlighter-rouge">AB|CD</code> take whatever mark came in from the
preceding expression, feed it into both alternative subexpressions, and
take whatever comes out and bitwise-OR them. In other words, any
candidate shift amount which matched either subpattern is a valid shift
amount for the pattern as a whole.</p>
<p>The “star” operator (<code class="language-plaintext highlighter-rouge">*</code>), matching zero or more copies of its
subpattern, has mark bit-sets coming from its preceding expression and
from its subexpression. The subexpression might match if either the
preceding expression, or a previous iteration of the subexpression, just
finished matching, so we feed the bitwise-OR of those two bit-sets in as
input to the subexpression. Similarly, the subsequent expression might
match if the preceding expression matches and the star operator matches
zero times, or if the subexpression outputs that it matched, so those
two bit-sets are also combined with bitwise-OR and fed onward to the
subsequent expression.</p>
<p>At this point we’ve only needed four operations for our Caesar weights:</p>
<ul>
<li>An initial value, where all shifts are considered valid</li>
<li>Bitwise-AND to narrow our set of guesses when concatenating
subexpressions</li>
<li>Bitwise-OR to track additional guesses in alternation and “star”</li>
<li>A function which, given an input character and an expected character
(class), returns the bit-set of valid shifts for that combination</li>
</ul>
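<p>Those four operations are enough to sketch a weighted matcher in Python. Here the pattern <code>[A-D]*E[FG]*</code> is again written out by hand as letter positions and follow sets (all names are mine), and each mark is a 25-bit integer of surviving shift guesses:</p>

```python
ALL = (1 << 26) - 2                  # bits 1..25: every nonzero shift

def shift_mask(plain_class, cipher_char):
    """Bit s is set iff shifting some letter of plain_class by s
    yields cipher_char."""
    m = 0
    for c in plain_class:
        s = (ord(cipher_char) - ord(c)) % 26
        if s:                        # a shift of 0 means "not encrypted"
            m |= 1 << s
    return m

# The pattern [A-D]*E[FG]* as letter positions and follow sets,
# with the epsilon closure baked in; "acc" means "accept".
CLASS = {1: "ABCD", 2: "E", 3: "FG"}
FOLLOW = {0: {1, 2}, 1: {1, 2}, 2: {3, "acc"}, 3: {3, "acc"}}

def valid_shifts(ciphertext):
    marks = {p: ALL for p in FOLLOW[0]}   # all shifts possible at first
    accepted = 0
    for ch in ciphertext:
        # narrow each mark with bitwise-AND against this character...
        out = {p: marks.get(p, 0) & shift_mask(CLASS[p], ch) for p in CLASS}
        # ...then push the survivors forward, joining with bitwise-OR
        marks, accepted = {}, 0
        for p, m in out.items():
            for q in FOLLOW[p]:
                if q == "acc":
                    accepted |= m
                else:
                    marks[q] = marks.get(q, 0) | m
    return [s for s in range(1, 26) if accepted >> s & 1]
```

<p><code>valid_shifts("CCGI")</code> returns <code>[2]</code>, matching the reasoning above: a "C", then a "G", then an "I" can only be explained by a shift of 2.</p>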
<p>In terms from abstract algebra, the first three resemble a
“<a href="https://en.wikipedia.org/wiki/Semiring">semiring</a>”. Bitwise-AND works somewhat like multiplication on
numbers, while bitwise-OR works somewhat like addition, and the initial
value satisfies the same laws as the number “one”. To check whether the
regular expression failed to match you also need to check if a value is
like the number “zero”; in Caesar weights, that’s an empty bit-set.</p>
<p>I’m not convinced that all of the semiring laws are actually necessary
or desirable in this application. For example, requiring commutativity
does not seem to be necessary for either operator, and prevents some
more sophisticated uses. But I haven’t yet convinced myself of which
laws are actually required.</p>
<h1 id="related-work">Related work</h1>
<p>With the observation that marks can be moved forward until encountering
a literal character in the pattern, the matching algorithm as I’ve
described it resembles <a href="https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm">Glushkov’s NFA construction
algorithm</a>, if you squint. “A Play on Regular Expressions”
explicitly describes the approach they follow as being Glushkov’s, but
it took me years to understand why. I hope my presentation here makes
the connection a little more clear.</p>
<p>“A Play on Regular Expressions” extends regular expression matching to
context-free grammars by allowing the regular expression to be a
lazily-constructed infinite tree. Matt Might <em>et al</em> subsequently wrote
a pair of papers doing the same for regular expression matching based on
<a href="https://en.wikipedia.org/wiki/Brzozowski_derivative">Brzozowski derivatives</a>: “<a href="https://matt.might.net/papers/might2011derivatives.pdf">Parsing with Derivatives</a>” at ICFP
2011, and “<a href="https://arxiv.org/abs/1604.04695">On the Complexity and Performance of Parsing with
Derivatives</a>” at PLDI 2016.</p>
<p>By 2019 I had confused myself with all of the above papers, leading to
this <a href="https://toot.cat/@jamey/101393064450390819">post</a>:</p>
<blockquote>
<p>tfw you mix up Glushkov’s construction with Brzozowski’s derivatives,
amirite?</p>
<p>no? that’s just me?</p>
</blockquote>
<p>That post led Magnus Holm to create a neat parser generator for
context-free grammars called <a href="https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler">Glush</a>. That’s probably the most value
for the world of any shitpost I’ve written.</p>
<p>The “dot” notation I used to illustrate marks in this post is the same
as notation commonly used to describe <a href="https://loup-vaillant.fr/tutorials/earley-parsing/">Earley parsing</a>. I believe, but
have not yet demonstrated, that Earley parsing can be extended with
semiring weights in the same way that “A Play on Regular Expressions”
did for Glushkov’s construction. Such a parser should be in the same
complexity class as Earley parsers: linear-time when used on regular
languages; quadratic on unambiguous context-free grammars; and
worst-case cubic on ambiguous context-free grammars. As a result my
hypothesis is that the ideal implementation of a weighted parsing
library would be based on the Earley algorithm, subsuming the need for a
regex-only library. Proof of this hypothesis is left as an exercise for
the reader.</p>
<p>Key tools for improving matching performance in regular expressions are
conversion to a <a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton">deterministic finite automaton</a> (DFA), followed by
DFA minimization. Determinizing and minimizing weighted automata is
trickier, making it less obvious how to get optimal performance from a
regular expression matcher that has been extended this way. One paper I
was reading recently which might help is “<a href="https://arxiv.org/abs/2204.12368">Fast Coalgebraic Bisimilarity
Minimization</a>” by Jules Jacobs and Thorsten Wißmann.
Additionally, they have a proof of concept Rust implementation called
<a href="https://github.com/julesjacobs/boa">Boa</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>It’s possible to parameterize a generic regular expression matcher over
semiring-like operations. Simple boolean marks have a trivial
implementation of the operations and allow simply checking whether the
regular expression matched or not, just like any other regular
expression matcher.</p>
<p>But by offering a parameterized interface, a regexp matching library can
support some pretty wild use cases that are impractical or impossible
with the regexp APIs in wide use today.</p>
<p>I’ve given one example: evaluating a regular expression that was written
in terms of plaintext against a ciphertext in order to identify which
cipher state can explain the observed ciphertext.</p>
<p>Another example which I won’t go into detail on today is incorporating
the common notion of “capture groups” into a simpler matcher algorithm
by representing them using appropriate weights. This requires a small
extension to the interface to allow weights on epsilon transitions, not
just when matching input. For some applications it also requires giving
the developer more control over how ambiguous patterns are interpreted,
via “greedy” versus “non-greedy” variants of operators, which in turn
requires a notion of “match priority” that doesn’t fit in the framework
of “A Play on Regular Expressions”.</p>
<p>Fortunately, weighted regular expressions are pretty easy to retrofit
into the efficient matching algorithm described in “<a href="https://swtch.com/~rsc/regexp/regexp2.html">Regular Expression
Matching: the Virtual Machine Approach</a>”, by Russ Cox.</p>
<p>I’ve written a demo implementation of the Pike-style VM approach with
capture groups, as described by Cox, in a Rust crate called <a href="https://github.com/jameysharp/pikevm">pikevm</a>.
I’ve also written a demo of weighted regular expression matching in a
Pike-style VM, in an independent Rust crate called <a href="https://github.com/jameysharp/parsink">parsink</a>.
Combining the two and making them into a robust and usable library is
left as future work.</p>
<p>I hope this post has broadened your view of the situations where regular
expressions might be a good tool. And I especially hope to see new, more
flexible, tools emerge for extracting information from text based on
formal language theory.</p>
<p>Jamey Sharp</p>
<h1>Z3 adventures and fast Boolean matching</h1>
<p>2021-09-05</p>
<p>Over the past few weeks I’ve been fiddling with <a href="https://github.com/Z3Prover/z3">Z3</a>, a theorem prover
and <a href="https://en.wikipedia.org/wiki/Boolean_satisfiability_problem">SAT</a>/<a href="https://en.wikipedia.org/wiki/Satisfiability_modulo_theories">SMT</a> solver from Microsoft Research. Z3 makes it pretty
easy to write down some NP-complete problem instance and then push a
button to get the answer. (Well, hopefully. I mean, they’re NP-complete
problems, it’s hard.) It’s also open source (released under an MIT
license), and, as I’ve learned, it has a delightful maintainer in
Nikolaj Bjørner.</p>
<p>Today I’m going to first briefly tell the story of my recent adventures
with Z3; then the bulk of this post will be a deep dive into testing
equivalency of Boolean functions, and the experiments those adventures
have motivated me to try.</p>
<h1 id="z3-adventures">Z3 adventures</h1>
<p>At first I was just playing with Z3’s Python bindings, but I started
generating problem instances where Z3 took much longer and used much
more memory than if I just exhaustively tested every possible solution,
so I dug deeper into what it was doing. It turns out that Z3’s algebraic
simplifier was actually making my particular bit-vector problems way
more complicated, and bypassing that to just use the so-called
“bit-blaster” and SAT solver got me results in reasonable amounts of
time.</p>
<p>However, although the bit-blaster had support for bitwise OR and bitwise
NOT operations on bit-vectors, it was missing support for bitwise AND.
People wouldn’t usually notice because the simplifier rewrites AND in
terms of combinations of NOT with OR, but since I was trying to avoid
the simplifier I hit this unusual corner.</p>
<p>I had to dig into the source code to figure out that the error I was
getting was about the AND operator, and in the process I found that it
didn’t look too hard to fix. So I wrote and submitted my first two Z3
pull requests—one to add the four lines of code needed, and
another to improve the error message in case somebody else ran into a
different missing operator. Nikolaj merged both patches impressively
promptly.</p>
<p>But that experience got me looking more closely at how Z3’s bit-blaster
works, and I found some unreachable code in its implementation of
multiplication. I started investigating how hardware engineers design
digital circuits for multiplication, and came up with a patch that
completely rewrote the bit-blaster’s multiplier. Naturally, there’s a
lot I don’t understand about how Z3 works and so I had some bugs, but
Nikolaj was again very helpful and responsive.</p>
<p>When Nikolaj tested my patch’s performance across the SMT-LIB benchmark
suite for “quantifier-free bit-vector” (<code class="language-plaintext highlighter-rouge">QF_BV</code>) problems, it showed an
average 20% speedup across those benchmarks, which I was pretty excited
about! But it also made some of the benchmarks slower by orders of
magnitude, so Nikolaj politely encouraged me to dig into it more
carefully and offered tips that got me unstuck several times.</p>
<p>On some of the benchmarks that got much worse with my multiplier, after
a lot of work it turned out that the only significant difference between
the two implementations was that mine had some extra NOT gates floating
around inside the tree of Boolean expressions. They were logically
equivalent, in that pairs of these extra NOT gates would effectively
cancel each other out, but that was pushing Z3 out of its comfort zone.</p>
<p>At that point I was back to looking at Z3’s simplifier. I made a small
change to how it handled NOT gates mixed into trees of XOR gates, such
as the bulk of the adders in a multiplier, and suddenly those benchmarks
were fast again.</p>
<p>Nikolaj wanted to see the effect of that change independently of the
rest of my changes and wrote a much more comprehensive patch. That
change by itself made the <code class="language-plaintext highlighter-rouge">QF_BV</code> benchmarks an average of almost <em>40%</em>
faster, improving some benchmarks by an order of magnitude and reducing
the number of benchmarks that couldn’t finish within 10 minutes by 10%.</p>
<p>It turned out that with the simplifier fixed, my rewritten multiplier
was actually <em>slower</em>, on average, than the old version, but it happened
to generate cases the old simplifier was good at in enough benchmarks to
bring up the average.</p>
<p>I’ll be honest, I was disappointed, because my multiplier was <em>pretty
cool</em>, if you think multiplication is cool. But as consolation prizes
go, I’m definitely satisfied by getting to say that I found a 40%
speedup opportunity in a tool that’s already as tuned for performance as
Z3 is. I’m also very pleased with the experience of working with Nikolaj
on this. We did fix up the unreachable code I’d originally spotted, too.</p>
<p>And then Nikolaj nerd-sniped me by suggesting that there could be even
bigger gains from replacing parts of the algebraic simplifier with
something based on “cut-set enumeration”, which I’d never heard of. I’ll
summarize it as splitting a complicated Boolean function into every
possible combination of simpler functions and deciding which of those
combinations is best. So now I’ve spent the last week reading up on that
topic and “logic synthesis”, and thinking very hard about Boolean truth
tables.</p>
<p>The truth table stuff is what I want to share next, partly because I’m
still thinking about the cut-set stuff. But mostly, I needed to know how
much a lookup table keyed by Boolean functions can be compressed before I
could fully evaluate whether anyone’s methods for splitting a complicated
Boolean function into many simpler ones were worth using.</p>
<h1 id="finding-functionally-equivalent-boolean-circuits">Finding functionally equivalent Boolean circuits</h1>
<p>There are 2^N rows in any truth table for a circuit with N inputs. That
means that the number of possible truth tables for circuits with N
inputs is 2^(2^N). So there are 256 possible 3-input truth tables,
65,536 4-input truth tables, and more than 4 billion 5-input truth
tables. These numbers grow way too quickly.</p>
<p>Fortunately, a huge fraction of those truth tables are equivalent up to
a few simple transformations, called “NPN” classes: Negating some of the
inputs, Permuting the inputs, and Negating the output. The 65,536
4-input truth tables only have 222 NPN equivalence classes, so if you
have a procedure which can take any truth table and tell you which NPN
class it’s in, then you’ve dramatically reduced the number of cases you
have to handle.</p>
<p>For example, I’m looking into precomputing and storing optimal Boolean
circuits implementing each of those 222 representative 4-input truth
tables, even if doing the same for all 65,536 possible truth tables
takes too much space.</p>
<p>Actually, I claim there are only 208 “interesting” 4-input truth tables;
the other 14 ignore one or more of the inputs, so they’re equivalent to
a smaller truth table. I want to always use the smallest truth table I
can to represent each function. That said, you can store all sizes in
the same lookup table by adding ignored inputs back on the left, if you
want to.</p>
<p>The paper I’ve found most helpful in understanding these ideas is <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.642.9381">“Fast
Boolean Matching Based on NPN Classification”</a> by Zheng Huang <em>et
al</em>. (I found it from the citations in <a href="https://msoeken.github.io/papers/2016_fpl.pdf">“Fast Hierarchical NPN
Classification”</a> by Petkovska <em>et al</em>, which is great but
assumes you understand the previous work in this area first.)</p>
<p>The problem is that for a given N-input truth table, there are
N!*2^(N+1) different ways to negate and permute it. To identify which
NPN class a truth table belongs to, you need some deterministic
procedure that always picks the same representative truth table among
the equivalent possibilities. For larger N, checking all those
possibilities gets expensive.</p>
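<p>For small N that exhaustive check is still easy to write down. This Python brute force (mine, and deliberately naive) canonicalizes 3-input truth tables by trying all 3!*2^4 = 96 transformations of each, and finds 14 NPN classes among the 256 tables, the 3-input analogue of the 222 classes for 4 inputs:</p>

```python
from itertools import permutations

N = 3
ROWS = 1 << N                       # 8 rows in a 3-input truth table
MASK = (1 << ROWS) - 1              # all 2^N result bits

def transform(tt, perm, neg_in, neg_out):
    """Apply an input permutation, a set of input negations, and an
    optional output negation to truth table tt (an integer of 2^N bits)."""
    out = 0
    for row in range(ROWS):
        src = 0                     # source row index under this transform
        for i in range(N):
            if ((row >> i) & 1) ^ ((neg_in >> i) & 1):
                src |= 1 << perm[i]
        if (tt >> src) & 1:
            out |= 1 << row
    return out ^ (MASK if neg_out else 0)

def npn_canonical(tt):
    """The smallest truth table reachable by any NPN transformation."""
    return min(transform(tt, perm, neg_in, neg_out)
               for perm in permutations(range(N))
               for neg_in in range(1 << N)
               for neg_out in (0, 1))

classes = {npn_canonical(tt) for tt in range(1 << ROWS)}
```

<p>As a sanity check, 3-input AND (0x80) and OR (0xFE) land in the same class: negate all three inputs of AND and then negate its output, and you get OR.</p>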
<h1 id="heuristics">Heuristics</h1>
<p>The good news is that there are fast heuristics which are almost as good
as checking all the possibilities. Instead of identifying a single
representative truth table for every element of the NPN class, different
truth tables within the class may lead to any of a few different
“semi-canonical” representatives. Among the interesting 4-input truth
tables, where there are 208 NPN classes, the heuristics in “Fast Boolean
Matching” identify 1,630 representatives, and I found easy tweaks that
reduce that to 812 representatives.</p>
<p>But I didn’t have any idea why any of the heuristics should work until I
spent several days fiddling with my own implementation of them, so
here’s what I learned.</p>
<h2 id="truth-tables-as-numbers">Truth tables as numbers</h2>
<p>First, note that they’re interpreting each truth table as a number. If
you already know how this works, bear with me; I’m going to be explicit
about the parts I had to keep thinking through.</p>
<p>Truth tables are a way to describe the behavior of a Boolean function,
where you write out every combination of the possible inputs as well as
what the output of the circuit is for each combination. So “r = if C
then T else F” has this truth table:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C T F | r
0 0 0 | 0
0 0 1 | 1
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1
</code></pre></div></div>
<p>We can treat the result column as a single unsigned number written out
in binary, where the inputs are used as an index into the bits of the
number. If the all-zeros input is treated as the least-significant bit
of the number, then this if-then-else table is represented by 0xCA. I
believe the paper used the opposite convention, although it wasn’t
explicitly stated, so they would represent this table as 0x53.</p>
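<p>Spelling that out in Python for the if-then-else table above, with row index C*4 + T*2 + F and the all-zeros row in the least-significant bit:</p>

```python
def ite(c, t, f):
    return t if c else f

tt = 0
for row in range(8):
    c, t, f = (row >> 2) & 1, (row >> 1) & 1, row & 1
    if ite(c, t, f):
        tt |= 1 << row          # all-zeros row is the least-significant bit

assert tt == 0xCA
# The opposite bit-order convention just reverses the 8 result bits:
rev = sum(((tt >> i) & 1) << (7 - i) for i in range(8))
assert rev == 0x53
```
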
<p>Thinking of truth tables as natural numbers means we can impose a total
order on them, and that suggests one way we could define the
representative for an NPN class: pick the lowest-numbered truth table
from that class. (Picking the highest number works too, or any minimal
or maximal element of some other total order, so long as we do it the
same way consistently.)</p>
<p>These two choices affect each other, in that the notion of which truth
table is numerically less than another one depends on how you turn it
into a number. I got mediocre results until I started mentally swapping
“negative cofactor” and “positive cofactor” everywhere I saw them in the
paper, which is why I think they picked the opposite bit-order
convention compared to the one that makes sense to me.</p>
<h2 id="output-polarity">Output polarity</h2>
<p>The first heuristic decides whether to negate the output of the truth
table. It does this by choosing whichever way around has fewer 1s in the
result column of the truth table.</p>
<p>If we want to pick numerically smaller truth tables, to a first
approximation that means we want as many 0s in the most-significant bits
as we can get. Negating and permuting the inputs allows us to rearrange
what order the rows of the truth table are in, but negating the output
is the only way to change how many 1s and 0s there are total. So it
makes sense to try to minimize the total number of 1s first.</p>
<p>If the truth table has the same number of 0s as 1s, such as in the
if-then-else table above, then the paper says something I don’t quite
understand about trying both choices. I’ve chosen instead to break ties
by picking whichever of the two options has the minimum numeric value.</p>
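<p>Here's a sketch of this heuristic in C; the numeric-value tie-break is my variant, and the names are my own:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits; compilers with __builtin_popcountll can use that
 * instead. */
int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Output-polarity heuristic for a table over `inputs` inputs: keep
 * whichever of the table and its complement has fewer 1s, breaking
 * ties by taking the numerically smaller of the two. */
uint64_t fix_output_polarity(uint64_t table, int inputs) {
    int rows = 1 << inputs;
    uint64_t all = rows == 64 ? ~(uint64_t)0 : ((uint64_t)1 << rows) - 1;
    uint64_t flipped = ~table & all;  /* clear bits beyond the table */
    int ones = popcount64(table);
    if (ones * 2 < rows) return table;
    if (ones * 2 > rows) return flipped;
    return table < flipped ? table : flipped;
}
```

<p>On the if-then-else table the 1s and 0s tie, so the tie-break picks the complement 0x35 over 0xCA.</p>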
<h2 id="input-polarity">Input polarity</h2>
<p>The next heuristic decides, for each input, whether it’s better to
negate that input or leave it alone. It does this by choosing whichever
way around has fewer 1s in the “positive cofactor” than in the “negative
cofactor”. (Actually the paper has that the other way around, but see
above.)</p>
<p>I didn’t know what “cofactor” meant. I’ll go into more detail below in
the “implementation details” section, but the essential idea here is:
Count how many rows have a 1 in the result when input K is 1, versus how
many when the same input is 0. These are the 1s in the positive
cofactor or the negative cofactor, respectively, with respect to input K.
Negating an input swaps its positive and negative cofactors.</p>
<p>Again, we’re trying to get numerically-lower tables, so the plan here is
to move some of the 1s from the most-significant bits toward the
least-significant bits.</p>
<p>The handy thing about this heuristic is that negating an input doesn’t
change the number of ones in the cofactors of any other input, so we can
apply this heuristic on the inputs in any order and get the same result.</p>
<p>If an input has the same number of 1s in both cofactors, the paper says
to leave that input unchanged. I tried that, but then I also tried
breaking ties by picking whichever option has the minimum numeric value.
I found that this tie-breaker cut the number of semi-canonical
representatives in half, which is awesome for something that’s so quick
to check.</p>
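<p>A sketch of this heuristic in C, using the cofactor bitmasks and the input-negation trick that are covered in more detail in the “Implementation details” section below (the names and the numeric tie-break are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Bitmask of the rows where input k is 0, for a table over `inputs`
 * inputs (input 0 is the right-most column). */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)   /* add ignored inputs to the left */
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* Negate input k by swapping its negative and positive cofactors. */
uint64_t negate_input(uint64_t table, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(k, inputs);
    int offset = 1 << k;
    return ((table & mask) << offset) | ((table >> offset) & mask);
}

/* Input-polarity heuristic: for each input, make sure its positive
 * cofactor has no more 1s than its negative cofactor, breaking ties
 * by numeric value (my variant). */
uint64_t fix_input_polarity(uint64_t table, int inputs) {
    for (int k = 0; k < inputs; k++) {
        uint64_t mask = neg_cofactor_mask(k, inputs);
        int neg = popcount64(table & mask);
        int pos = popcount64((table >> (1 << k)) & mask);
        uint64_t flipped = negate_input(table, k, inputs);
        if (pos > neg || (pos == neg && flipped < table))
            table = flipped;
    }
    return table;
}
```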
<h2 id="input-permutation">Input permutation</h2>
<p>The final heuristic picks an order for the inputs, once again by
comparing the number of 1s in the cofactors. In the previous step we
established that every input has no more 1s in its positive cofactor
than in its negative cofactor.</p>
<p>Now we sort the inputs so that the input with the fewest 1s in its
positive cofactor is the left-most input. (Again, the paper says to use
the negative cofactor instead, due to that different choice of bit-order
conventions.)</p>
<p>Similar to the previous step, swapping inputs doesn’t affect the
cofactors of any other inputs, so this can be done in any order while
getting the same result.</p>
<p>This one is hardest for me to explain. Let’s bring back a 3-input truth
table and look at the positive cofactors visually:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
</code></pre></div></div>
<p>The left-most input has its positive cofactor entirely in the bottom
half of the table. The middle input has its positive cofactor split into
two groups of rows, where the bottom half is still at the very bottom of
the table, but the other part is higher up. By contrast, the right-most
input spreads its positive cofactor as far across the table as
possible.</p>
<p>We’re trying to get numerically-smaller truth tables, which means moving
1s from the most-significant (bottom) rows toward the least-significant
(top) rows. So any input which has a lot of 1s in its positive cofactor
should be more toward the right, thereby spreading those 1s toward the
top of the table.</p>
<p>That said: After the second heuristic, if there are a lot of 1s in the
positive cofactor, then there are at least as many in the negative
cofactor, and moving a column to the right spreads <em>those</em> 1s toward the
bottom of the table. So figuring out whether this helps is complicated.</p>
<p>So when some inputs have the same number of 1s in their positive
cofactors, I break ties by sorting in <em>decreasing</em> order by number of 1s
in the negative cofactors. On the 4-input truth tables, this improved
the results from 868 semi-canonical classes to 812, and I expect it has
more effect on larger tables.</p>
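<p>Here's one way this sort might look in C, built from the input-swap procedure described in the “Implementation details” section below. Because swapping two inputs doesn't disturb the other inputs' cofactor counts, a simple bubble sort over adjacent inputs converges (the names are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Bitmask of the rows where input k is 0. */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* Swap inputs j and k, where j is the left-most of the two (j > k). */
uint64_t swap_inputs(uint64_t table, int j, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(j, inputs) & ~neg_cofactor_mask(k, inputs);
    int offset = (1 << j) - (1 << k);
    uint64_t tmp = (table ^ (table >> offset)) & mask;
    return table ^ (tmp | (tmp << offset));
}

/* 1s in input k's positive and negative cofactors. */
int pos_ones(uint64_t table, int k, int inputs) {
    return popcount64((table >> (1 << k)) & neg_cofactor_mask(k, inputs));
}
int neg_ones(uint64_t table, int k, int inputs) {
    return popcount64(table & neg_cofactor_mask(k, inputs));
}

/* Sort inputs so the fewest positive-cofactor 1s end up left-most,
 * breaking ties by putting more negative-cofactor 1s to the left. */
uint64_t sort_inputs(uint64_t table, int inputs) {
    for (int pass = 0; pass < inputs; pass++)
        for (int k = 0; k + 1 < inputs; k++) {
            int pl = pos_ones(table, k + 1, inputs), pr = pos_ones(table, k, inputs);
            int nl = neg_ones(table, k + 1, inputs), nr = neg_ones(table, k, inputs);
            if (pl > pr || (pl == pr && nl < nr))
                table = swap_inputs(table, k + 1, k, inputs);
        }
    return table;
}
```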
<h2 id="evaluation">Evaluation</h2>
<p>These three heuristics don’t quite get us to the original goal of a
minimum numeric value truth table. One example illustrating this is
that, if the first two steps left a 1 in the most-significant (bottom)
row, there is no permutation of the inputs which will move that row.</p>
<p>Still, in my experiments with 3-input truth tables, there are 10 NPN
classes that don’t ignore any inputs, and only 13 semi-canonical classes
after applying these three heuristics, so the heuristics are almost
perfect at that size. On 4-input truth tables, there are 208 interesting
NPN classes and 812 semi-canonical classes, which is still small enough
to be a reasonable table-lookup. I didn't write my prototype very
efficiently, so I haven't yet had the patience to exhaustively check
larger truth tables.</p>
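<p>At the 3-input size these counts can be cross-checked by brute force: canonicalize every 8-bit truth table by trying all 96 combinations of input permutation, input negation, and output negation, and count the distinct results. A sketch (exhaustive, so only practical for tiny tables; the names are mine):</p>

```c
#include <assert.h>

/* All 6 permutations of three inputs. */
const int perms[6][3] = {
    {0,1,2},{0,2,1},{1,0,2},{1,2,0},{2,0,1},{2,1,0}
};

/* Apply one NPN transformation to a 3-input truth table: permute the
 * inputs, negate the inputs selected by `flips`, and negate the
 * output if `out` is set. */
unsigned transform(unsigned t, const int perm[3], int flips, int out) {
    unsigned r = 0;
    for (int row = 0; row < 8; row++) {
        int src = 0;
        for (int i = 0; i < 3; i++)
            if ((row ^ flips) & (1 << perm[i]))
                src |= 1 << i;
        if ((t >> src) & 1)
            r |= 1u << row;
    }
    return out ? r ^ 0xFF : r;
}

/* The canonical (numerically smallest) member of t's NPN class. */
unsigned npn_canonical(unsigned t) {
    unsigned best = 0xFF;
    for (int p = 0; p < 6; p++)
        for (int flips = 0; flips < 8; flips++)
            for (int out = 0; out < 2; out++) {
                unsigned c = transform(t, perms[p], flips, out);
                if (c < best) best = c;
            }
    return best;
}

int depends_on_all_inputs(unsigned t) {
    const unsigned masks[3] = { 0x55, 0x33, 0x0F };
    for (int k = 0; k < 3; k++)
        if (((t ^ (t >> (1 << k))) & masks[k]) == 0)
            return 0;
    return 1;
}

/* Count the NPN classes of 3-input functions that use every input. */
int count_npn_classes(void) {
    unsigned char seen[256] = {0};
    int classes = 0;
    for (unsigned t = 0; t < 256; t++) {
        if (!depends_on_all_inputs(t)) continue;
        unsigned c = npn_canonical(t);
        if (!seen[c]) { seen[c] = 1; classes++; }
    }
    return classes;
}
```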
<h1 id="implementation-details">Implementation details</h1>
<p>If you want to go implement this idea, here’s a bunch of stuff I figured
out, about the details that academic papers leave out as exercises for
the reader.</p>
<p>There are various useful operations we can do on truth tables
represented as numbers.</p>
<p>Note that I number inputs starting from 0 on the right, so in the
if-then-else truth table, F is input 0, T is input 1, and C is input 2.</p>
<h2 id="add-ignored-inputs">Add ignored inputs</h2>
<p>One operation is to extend an N-input truth table to an N+1-input truth
table, where the new input is ignored. For example, the 1-input truth
table for “NOT A” is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A | !A
0 | 1
1 | 0
</code></pre></div></div>
<p>If we introduce a new ignored input “B” to the left of A, then the truth
table for “NOT A” becomes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>B A | !A
0 0 | 1
0 1 | 0
1 0 | 1
1 1 | 0
</code></pre></div></div>
<p>This is just duplicating the original truth table. On the numeric
representation, this can be computed efficiently as:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">table</span> <span class="o">|</span> <span class="p">(</span><span class="n">table</span> <span class="o"><<</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">N</span><span class="p">))</span>
</code></pre></div></div>
<p>Applying this operation repeatedly allows adding any number of ignored
inputs.</p>
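<p>As a one-line helper in C (the name is mine):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Extend a table over n inputs to n+1 inputs, with the new left-most
 * input ignored, by duplicating the table into the upper half. */
uint64_t add_ignored_input(uint64_t table, int n) {
    return table | (table << ((uint64_t)1 << n));
}
```

<p>Applied to the 1-input “NOT A” table 0b01, this produces 0b0101, matching the table above.</p>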
<h2 id="evaluate-cofactors">Evaluate cofactors</h2>
<p>The “negative cofactor” of a Boolean function, with respect to input K,
is the subset of that function’s outputs where input K is false.
Similarly, the positive cofactor is when input K is true.</p>
<p>A bitmask that is 1 everywhere input K is 0 can be quickly computed as
the truth table for the circuit “NOT K”, which helps in looking at the
negative cofactors of a truth table. (The reverse, for the positive
cofactors, is the bitwise-NOT of that bitmask, or equivalently the
bitmask shifted left by 2^K bits, whichever is more convenient to
compute.)</p>
<p>If K is the left-most of the N inputs in this truth table—and
assuming arithmetic is being done in two’s-complement, as all current
computers do—then its negative cofactor bitmask can be efficiently
computed as:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">))</span> <span class="o">-</span> <span class="mi">1</span>
</code></pre></div></div>
<p>If K is not the left-most input, then build the bitmask for a truth
table where K <em>is</em> the left-most input, and then follow the above
procedure to add ignored inputs for each input to K’s left.</p>
<p>We’re going to use these negative cofactor bitmasks a lot, so as a
practical matter it may be worth precomputing these masks for every size
of truth table you plan to deal with, and every input at each size.</p>
<p>Actually, as long as you’re careful to ignore bits outside of your
current size of truth table, you can use any mask for size N with
smaller truth tables, so you may just need one bitmask for each possible
input at your largest table size.</p>
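<p>A sketch of such a precomputed table in C, assuming tables are stored in 64-bit words (so up to 6 inputs):</p>

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INPUTS 6  /* a 6-input table exactly fills a 64-bit word */

/* neg_mask[k] is 1 in every row where input k is 0, for tables of up
 * to MAX_INPUTS inputs; smaller tables can share these masks as long
 * as bits beyond the table are ignored. */
uint64_t neg_mask[MAX_INPUTS];

void init_neg_masks(void) {
    for (int k = 0; k < MAX_INPUTS; k++) {
        uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
        for (int n = k + 1; n < MAX_INPUTS; n++)
            m |= m << ((uint64_t)1 << n);
        neg_mask[k] = m;
    }
}
```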
<h2 id="count-1s">Count 1s</h2>
<p>Counting the number of bits which are set within a word is a common
enough operation, sometimes called “population count” or “popcount”,
that x86 CPUs have an instruction specifically for this purpose. There
are also <a href="https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel">efficient bitwise-parallel methods</a> which, like many
“bit twiddling” hacks, are quite puzzling until you figure out the
trick.</p>
<p>Anyway, counting the 1s in a cofactor of an input is just a matter of
bitwise-AND between the truth table and the appropriate bitmask, then
applying a suitable popcount implementation.</p>
<h2 id="swap-subsequences-of-rows">Swap subsequences of rows</h2>
<p>Several of the more complex things we can do to a truth table can be
described as re-ordering the table’s rows. In general this would be a
tedious thing to do to our numeric representation: the usual bitwise and
arithmetic operators don’t allow swapping bits in arbitrary
permutations.</p>
<p>Happily, we’re only going to need certain kinds of re-ordering, such as
“swap rows 1 and 3 with rows 4 and 6,” and we can do that efficiently.</p>
<p>This procedure will take two parameters: a bit-mask for the
less-significant bits to swap, and an offset for how many bits to shift
that mask left to find the more-significant bits to swap with. In the
example above, the mask is 0b1010 (identifying rows 3 and 1), and the
offset is 3 (row 1 swaps with 4, and 3 swaps with 6, so both are 3 rows
apart).</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tmp</span> <span class="o">=</span> <span class="p">(</span><span class="n">table</span> <span class="o">^</span> <span class="p">(</span><span class="n">table</span> <span class="o">>></span> <span class="n">offset</span><span class="p">))</span> <span class="o">&</span> <span class="n">mask</span><span class="p">;</span>
<span class="n">table</span> <span class="o">^=</span> <span class="n">tmp</span> <span class="o">|</span> <span class="p">(</span><span class="n">tmp</span> <span class="o"><<</span> <span class="n">offset</span><span class="p">);</span>
</code></pre></div></div>
<p>I based this on <a href="https://graphics.stanford.edu/~seander/bithacks.html#SwappingBitsXOR">“Swapping individual bits with XOR”</a> from
Sean Eron Anderson’s page of “Bit Twiddling Hacks”. I’ve extended Sean’s
procedure to handle non-contiguous subsequences of bits, and to hide one
of the offsets in the other two parameters, because it turns out to be
more convenient that way for this.</p>
<h2 id="negate-the-circuits-output">Negate the circuit’s output</h2>
<p>If we want to take the numeric representation of a truth table and get a
table representing the same Boolean function but with the output
inverted, just bitwise-NOT the table. In C-like languages,</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">~</span><span class="n">table</span>
</code></pre></div></div>
<p>If you’re storing truth tables in machine words that are wider than the
size of the table, you may want to clear any bits that aren’t part of
the table after you do this.</p>
<h2 id="negate-the-circuits-inputs">Negate the circuit’s inputs</h2>
<p>We’re going to want to take a truth table and get a table representing
the same Boolean function but with one of the inputs inverted. Another
way to say that is we’re swapping the negative and positive cofactors of
that input. So we can combine two earlier procedures to do this.</p>
<p>In this case, the mask is the negative cofactor bitmask for input K, and
the offset is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span>
</code></pre></div></div>
<p>We could pass that mask and offset to the procedure for swapping rows,
but in this case where every row is moving, it’s probably more efficient
to do this instead:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="n">table</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span> <span class="o"><<</span> <span class="n">offset</span><span class="p">)</span> <span class="o">|</span> <span class="p">((</span><span class="n">table</span> <span class="o">>></span> <span class="n">offset</span><span class="p">)</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="swap-two-of-the-circuits-inputs">Swap two of the circuit’s inputs</h2>
<p>When swapping inputs J and K, the rows where both inputs are equal don’t
change. We just need to swap the rows where J is 0 and K is 1 with the
rows where J is 1 and K is 0.</p>
<p>Assume J is the left-most variable of the two; if it isn’t, run this
procedure with the parameters the other way around.</p>
<p>The mask for this swap is the bitwise-AND of the negative cofactor mask
for J (the rows where J is 0) with the positive cofactor mask for K (the
rows where K is 1).</p>
<p>The offset is the difference between the index of the first 1 in J’s
column and the first 1 in K’s column in the truth table. This is:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">J</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">)</span>
</code></pre></div></div>
<p>For 3-input functions like the if-then-else example truth table above,
there are three possible input swaps.</p>
<ul>
<li>Swap C with F: mask is 0b00001010, offset is 3</li>
<li>Swap C with T: mask is 0b00001100, offset is 2</li>
<li>Swap T with F: mask is 0b00100010, offset is 1</li>
</ul>
<p>In general there are N*(N-1)/2 possible input swaps for N inputs.</p>
<p>Note that any permutation of a circuit’s inputs can be described as a
series of pairwise swaps, so applying this procedure multiple times
allows arbitrary permutations.</p>
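<p>As a check on the masks and offsets listed above, here's the earlier row-swap procedure applied to the if-then-else table (0xCA):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Swap the rows selected by `mask` with the rows `offset` above them:
 * the XOR trick from the "swap subsequences of rows" section. */
uint64_t swap_rows(uint64_t table, uint64_t mask, int offset) {
    uint64_t tmp = (table ^ (table >> offset)) & mask;
    return table ^ (tmp | (tmp << offset));
}
```

<p>Swapping C with T turns “if C then T else F” into “if T then C else F”, whose table is 0xE2; swapping T with F gives “if C then F else T”, which is 0xAC.</p>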
<h2 id="combining-truth-tables">Combining truth tables</h2>
<p>Say you have the truth tables for arbitrary circuits A and B, and you
want the truth table for an OR gate with the outputs of A and B as its
inputs.</p>
<p>If A and B have the same inputs in the same order, then simply combining
the numeric representation of their truth tables with bitwise-OR does
the trick. The same works for any bitwise operator such as AND or XOR.
(This also works for operators with three or more inputs, like
if-then-else, but there aren’t any bitwise ternary operators in common
programming languages so that’s not super helpful.)</p>
<p>If A and B have different inputs, though, first you need to fix that.
You can use earlier procedures to add ignored inputs and then reorder
the inputs until both circuits match. This is probably easiest if you
impose a total order on your inputs and keep all your truth tables’
inputs sorted according to that order.</p>
<h2 id="detecting-and-removing-ignored-inputs">Detecting and removing ignored inputs</h2>
<p>Every additional input doubles the size of a truth table, so it’s useful
to remove any input from a Boolean function so long as doing that
doesn’t change the function’s behavior. This is an especially good idea
after you’ve combined truth tables as above, because many Boolean
algebraic simplifications fall out automatically. For example,</p>
<ul>
<li>
<p>When the input circuits compute <code class="language-plaintext highlighter-rouge">X XOR Y</code> and <code class="language-plaintext highlighter-rouge">X XOR Z</code>, combining
their truth tables with XOR produces the same table as <code class="language-plaintext highlighter-rouge">Y XOR Z</code>, with
X ignored.</p>
</li>
<li>
<p>When the input circuits compute <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">NOT X</code>, combining them with
AND produces the constant FALSE, and combining with OR produces the
constant TRUE; either way, X is ignored.</p>
</li>
</ul>
<p>A simple way to check for this condition is to try negating some input
using the earlier procedure. If the truth table doesn’t change, then
that input has no effect on the function.</p>
<p>Here’s a slightly more efficient method to check if input K is ignored,
given the negative cofactor bitmask for K:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">((</span><span class="n">table</span> <span class="o">^</span> <span class="p">(</span><span class="n">table</span> <span class="o">>></span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="n">K</span><span class="p">)))</span> <span class="o">&</span> <span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
</code></pre></div></div>
<p>If input K is ignored, you can swap it with the left-most input, then
discard the most-significant half of the table, which at that point will
be equal to the least-significant half. This is the reverse of the
earlier procedure for adding an ignored variable.</p>
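<p>Putting the test and the removal together in C (a sketch; the names are mine):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bitmask of the rows where input k is 0. */
uint64_t neg_cofactor_mask(int k, int inputs) {
    uint64_t m = ((uint64_t)1 << ((uint64_t)1 << k)) - 1;
    for (int n = k + 1; n < inputs; n++)
        m |= m << ((uint64_t)1 << n);
    return m;
}

/* True if input k has no effect on the table's output.  Note the
 * parentheses: in C, == binds tighter than &. */
int input_is_ignored(uint64_t table, int k, int inputs) {
    uint64_t mask = neg_cofactor_mask(k, inputs);
    return ((table ^ (table >> (1 << k))) & mask) == 0;
}

/* Drop ignored input k: swap it to the left-most position, then keep
 * only the least-significant half of the table. */
uint64_t remove_ignored_input(uint64_t table, int k, int inputs) {
    int j = inputs - 1;  /* the left-most input */
    if (k != j) {
        uint64_t mask = neg_cofactor_mask(j, inputs) &
                        ~neg_cofactor_mask(k, inputs);
        int offset = (1 << j) - (1 << k);
        uint64_t tmp = (table ^ (table >> offset)) & mask;
        table ^= tmp | (tmp << offset);
    }
    return table & (((uint64_t)1 << ((uint64_t)1 << j)) - 1);
}
```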
<h2 id="enumerating-all-npn-classes">Enumerating all NPN classes</h2>
<p>I started investigating this topic partly because I wanted to know how
much I could compress a list of truth tables, and one way to frame that
question is “how many NPN classes are there?” So I needed a way to count
them. But when I learned that finding a perfect representative of an NPN
class can be slow, then I wondered how close a heuristic approximation
can get, so I needed to count those too.</p>
<p>Earlier I mentioned <a href="https://msoeken.github.io/papers/2016_fpl.pdf">“Fast Hierarchical NPN Classification”</a>
by Petkovska <em>et al</em>, which is all about optimizations you can apply if
you’re trying to find the canonical NPN class for many different truth
tables. That’s definitely applicable to the case of counting all NPN
classes, but if you do it exactly as they describe then their method
needs space proportional to the number of possible truth tables, which
is prohibitive for even 5-input tables.</p>
<p>I came up with a different method suited specifically for this counting
task, similar to the Sieve of Eratosthenes for finding prime numbers.
This relies on a fast implementation of the three heuristics.</p>
<p>Since we’re picking the numerically-smallest truth table as the
canonical representative of its NPN class, start with the truth table
with numeric value 0 and check successively larger truth tables.</p>
<p>Skip any truth table that ignores any of its inputs—including
table 0, which is the constant FALSE and ignores all of its inputs. This
is an especially fast test, so do it first. You don’t have to do this,
but the input permutation heuristic as I’ve described it here won’t
notice that the ignored inputs don’t matter, so these cases may have
more semi-canonical representatives than they would if you ran this same
process on the smaller table size with the ignored inputs removed.</p>
<p>Next, check whether this truth table is a semi-canonical class
representative. If it isn’t, then it definitely is not a canonical NPN
class representative, so skip it. The easiest way to check this is to
run the truth table through the heuristics and see whether they return
the same truth table unchanged.</p>
<p>Otherwise, when this truth table <em>is</em> a semi-canonical class
representative, we need to check whether we already saw any smaller
semi-canonical representative in the same class. There are two ways to
do this, depending on whether you’d rather save memory or save time:</p>
<ul>
<li>
<p>We could generate all negations/permutations of this truth table and
check whether any of them are less than the current candidate, which
is slow but makes the algorithm use a constant amount of memory. (I
think?)</p>
</li>
<li>
<p>When we definitely have a canonical NPN class representative, we can
generate all negations/permutations (which by definition are all
greater than or equal to this truth table), filter them down to only
the semi-canonical ones, and add those to a set of known non-canonical
truth tables. Then, a later candidate is canonical if it is not in
that set, and we can remove elements from the set as we pass them.
This is faster but it’s hard to put bounds on its memory usage.</p>
</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>I’ve had a great time exploring how Z3 works and interacting with its
maintainer via GitHub, and I’m learning a lot.</p>
<p>I’ve also learned a ton in the last few days about methods for
identifying Boolean functions that are equivalent to each other. I’ve
read three papers on this subject, and there are… more, so I’ve
probably only scratched the surface.</p>
<p>But just trying to understand those papers required a lot of background
I didn’t have, so in this post I’ve tried to distill all that
information into a summary that might help someone else get started.</p>
<p>In addition, I’ve described some improvements I found that weren’t in
the papers I read, and many bit-twiddling hacks to make a practical
implementation as fast as possible.</p>Jamey SharpHow not to improve video conferencing2021-04-07T12:30:00-07:00https://jamey.thesharps.us/2021/04/07/how-not-to-improve-video-conferencing<p>Over the last year, a lot more people have been using video-conferencing
platforms, such as Zoom, because they couldn’t safely meet in person.</p>
<p>I don’t like siloed platforms like Zoom. I don’t want to devolve into a
rant here, so I’ll just say: I believe that systems should always work
in a federated way if that’s feasible, like how people at <code class="language-plaintext highlighter-rouge">gmail.com</code>
can email people at <code class="language-plaintext highlighter-rouge">outlook.com</code> even though those are operated by
completely different companies. And we’ve had email-like standards
(<a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">SIP</a>) for voice, video, and chat for over two decades; apparently,
cell phone carriers have been using them for most of that time, and the
current iteration (<a href="https://en.wikipedia.org/wiki/Voice_over_LTE">VoLTE</a>) is <a href="https://gsacom.com/paper/volte-update-october-2020-global-status/">actively deployed in 104
countries/territories</a> around the world. So it’s not
like we don’t know how to make the technology work.</p>
<p>Recently, I tried to counter my own biases and ask: is there a reason
I would consider justified for building a non-federated conferencing service?
In particular, what impact do architectural choices have on costs and
efficiency for the various participants?</p>
<p>While considering these questions, I dug through a number of research
papers on cryptography and on networking, to see whether academia can
offer improvements for video-conferencing that industry hasn’t picked up
yet.<sup id="fnref:theory" role="doc-noteref"><a href="#fn:theory" class="footnote">1</a></sup></p>
<p>I learned a lot, but ultimately what I can report today are <em>negative</em>
results. I’m sharing the ideas I considered in the hopes that you may be
inspired to apply some of these research results to some setting where
they’re more appropriate. In addition, I’m sharing the process I
followed to ultimately reject those ideas in the hopes that you may find
the reasoning helpful in your own investigations.</p>
<h1 id="secure-audio-conferencing">Secure audio conferencing</h1>
<p>Audio-only conference calls have an interesting property: it’s possible
to mix everyone’s audio into a single stream. With video calls, each
participant needs to receive a copy of every other participant’s video,
which imposes practical limits on how many people can join a video
conference. But no matter how many people are sending audio, if a
central server handles mixing all the audio streams together, then each
participant’s bandwidth usage is constant.</p>
<p>That sounds like a good justifiable reason to centralize audio-only
conference calls, at least.</p>
<p>But I don’t want that centralized service to be able to listen in on my
calls. So is there a way to send encrypted audio through the server and
let it mix the audio together but without being able to decrypt it?</p>
<p>The answer is yes! In fact while discussing this with some friends, one
of them reminded me that a team at Galois, where I used to work, built a
prototype of exactly this application during their work on
<a href="https://galois.com/project/secure-multi-party-communication/">ShareMonad</a>; they wrote about the experience in Launchbury <em>et al</em>’s
paper, <a href="https://galois.com/wp-content/uploads/2014/08/pub_JL-DWA-TD-EM_ApplicationScaleSecureMPC.pdf">Application-Scale Secure Multiparty Computation</a>.
Their approach requires that you have three servers which you trust
won’t share data with each other, which presumably means they’re
operated by independent entities and maybe in different legal
jurisdictions; as a result, the deployment story seems awfully
complicated. Also I think if they’d used the approach I’m about to
describe, their implementation would have been simple enough to not make
for an interesting academic paper. 😅</p>
<p>Mixing audio fundamentally comes down to taking a sample from each
source and adding them together. Because it’s just addition, there are a
lot of tricks we can play. A variant of the ShareMonad-style approach,
for example, splits a sample into multiple “shares”, where each share is
random but the sum of all the shares equals the original value. So each
client sends a different share of their samples to each server; each
server adds all the shares it received, resulting in one share of the
mixed result; and all the servers send their result shares back to the
clients, which add up the shares to find the final mixed
sample<sup id="fnref:sharemonad" role="doc-noteref"><a href="#fn:sharemonad" class="footnote">2</a></sup>. If all the servers pool together the shares they’ve
been given, then they can learn what the original inputs were; but if
just one server keeps its shares secret, then none of the servers learn
anything about whatever is being discussed on the call.</p>
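<p>A toy sketch of that share-splitting scheme in C, with three servers and arithmetic modulo 2<sup>16</sup>; a real implementation would need a cryptographic random number generator rather than <code>rand()</code>, and one network participant per server:</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SERVERS 3

/* Split a sample into SERVERS additive shares: each share alone looks
 * random, but the shares sum to the sample modulo 2^16.  Use a real
 * CSPRNG in practice, not rand(). */
void make_shares(uint16_t sample, uint16_t shares[SERVERS]) {
    uint16_t sum = 0;
    for (int i = 0; i < SERVERS - 1; i++) {
        shares[i] = (uint16_t)rand();
        sum += shares[i];
    }
    shares[SERVERS - 1] = (uint16_t)(sample - sum);
}

/* Mix two clients' samples: each server sums the one share it received
 * from each client, and the clients then sum the servers' results.  No
 * single server ever sees either sample. */
uint16_t mix_two(uint16_t a, uint16_t b) {
    uint16_t sa[SERVERS], sb[SERVERS], mixed = 0;
    make_shares(a, sa);
    make_shares(b, sb);
    for (int i = 0; i < SERVERS; i++) {
        uint16_t server_result = (uint16_t)(sa[i] + sb[i]);
        mixed += server_result;
    }
    return mixed;
}
```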
<p>Another approach is to use <a href="https://en.wikipedia.org/wiki/Homomorphic_encryption">homomorphic encryption</a>. In an
additively-homomorphic cryptosystem, I can give you some encrypted
numbers, and there’s some procedure you can follow to find the
encryption of the sum of those numbers, even if I don’t give you the
decryption key. The most-used additively homomorphic cryptosystem was
published by <a href="https://en.wikipedia.org/wiki/Paillier_cryptosystem">Paillier</a> in 1999, and simplified and generalized by
<a href="https://en.wikipedia.org/wiki/Damg%C3%A5rd%E2%80%93Jurik_cryptosystem">Damgård and Jurik</a> in 2001. Much of the research since
then has been pursuing “Fully Homomorphic Encryption” which supports not
just addition but also multiplication (or other pairs of operators),
because there aren’t that many applications where just having addition
is enough. But audio conferencing is one of those rare exceptions!</p>
<p>Modern cryptosystems generally have a security parameter, where higher
numbers are expected to be secure against adversaries with more
computing power. In the case of Paillier, if the security parameter is
2048 bits, then the plaintexts it can encrypt are up to 2048 bits and
the resulting ciphertexts are 4096 bits.<sup id="fnref:expansion" role="doc-noteref"><a href="#fn:expansion" class="footnote">3</a></sup> This is important
because we want as high of a security parameter as we can afford, but at
reasonable security levels we can’t do very many encryptions or
decryptions per second, so we need to make the best use we can of each
one.</p>
<p>So we need to pack multiple samples into a single plaintext. If we’re
using 8-bit audio samples, then we can fit about 250 of them into a
2048-bit plaintext<sup id="fnref:max-plaintext" role="doc-noteref"><a href="#fn:max-plaintext" class="footnote">4</a></sup>. That said, if we add two audio
chunks packed this way together, any sample could overflow into the
least significant bit of the next sample. So we need to leave an extra
zero bit in between each sample. In general, if we have <code class="language-plaintext highlighter-rouge">N</code> participants
in the conference, then the number of padding bits we need is the base-2
log of <code class="language-plaintext highlighter-rouge">N</code>, rounded up. If we leave ourselves 10 bits for each 8-bit
sample, then we can have up to four participants and about 200 samples
per plaintext. Note that the server doesn’t need to know which bits are
padding, so the participants can negotiate amongst themselves regarding
the number of samples per plaintext and the number of bits per sample.</p>
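<p>A toy sketch of that packing in C, using a 64-bit word to stand in for the 2048-bit plaintext; with a 10-bit stride and 8-bit samples, sums of up to four streams can't overflow between lanes:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit samples `stride` bits apart.  The stride - 8 padding
 * bits between lanes absorb carries, so up to 2^(stride-8) packed
 * words can be summed without samples bleeding into each other.  A
 * real implementation packs into a 2048-bit Paillier plaintext; a
 * 64-bit word is enough to show the idea. */
uint64_t pack_samples(const uint8_t *samples, int count, int stride) {
    uint64_t packed = 0;
    for (int i = 0; i < count; i++)
        packed |= (uint64_t)samples[i] << (i * stride);
    return packed;
}

/* Pull lane i back out of a (possibly summed) packed word. */
uint64_t unpack_lane(uint64_t packed, int i, int stride) {
    return (packed >> (i * stride)) & (((uint64_t)1 << stride) - 1);
}
```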
<p>Up to this point I’ve assumed that the audio is uncompressed, but
there’s one very common audio compression trick that works just fine in
this setting: <a href="https://en.wikipedia.org/wiki/Modified_discrete_cosine_transform">MDCT</a>. Used by everything from AAC to WMA, the MDCT
transforms a sequence of time-domain audio samples (plus half as many
samples from before and after the current block) into an equal number of
frequency bins. The neat thing is that adding two blocks together and
then transforming to the frequency domain gives the same result as
transforming both blocks and then adding them. So a server performing
Paillier additions doesn’t even need to know whether the plaintexts
contain time-domain or frequency-domain representations. To compress the
transformed blocks, the participants just need to agree on how many bits
to use for each frequency bin. For example, they might choose to use
fewer bits for frequencies outside the normal range of human speech.</p>
<h2 id="estimating-feasibility">Estimating feasibility</h2>
<p>Okay, so let’s estimate how much compute power this plan takes. If we
assume 200 samples per plaintext and a sample rate of 8kHz, that means
each participant needs to do 40 encryptions and 40 decryptions per
second, and the server needs to do <code class="language-plaintext highlighter-rouge">40*(N-1)</code> additions per second for
<code class="language-plaintext highlighter-rouge">N</code> participants. To get a rough idea of how fast Paillier operations
are, I ran the microbenchmarks for the Rust <a href="https://crates.io/crates/paillier-common">paillier-common</a> crate on
my laptop, a 2015 model with a 3.1GHz Intel Core i7 CPU.</p>
<p>In my completely-unscientific testing, one encryption and one decryption
of 2048-bit plaintexts together take about 7ms, which is about 28% of
the 25ms we have before it’s time to process the next 200 samples. So my
laptop could probably handle this scheme, but a phone might not be able
to.</p>
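<p>For reference, the arithmetic behind those numbers (the 7ms figure is just my rough benchmark result from above):</p>

```python
SAMPLE_RATE = 8_000           # samples per second
SAMPLES_PER_PLAINTEXT = 200
CRYPTO_MS = 7.0               # measured encrypt + decrypt time, roughly

plaintexts_per_sec = SAMPLE_RATE / SAMPLES_PER_PLAINTEXT  # 40.0
budget_ms = 1000 / plaintexts_per_sec                     # 25.0 ms per block
share = CRYPTO_MS / budget_ms                             # 0.28, i.e. 28%
```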
<p>The Paillier additions are very fast though, by comparison. My laptop
CPU, which is presumably less powerful than a reasonable server, could
handle something over 4,000 participants simultaneously on one CPU core.</p>
<p>So this seems feasible and I think that’s a bunch of really cool
results, but…</p>
<h2 id="is-this-worth-doing">Is this worth doing?</h2>
<p>No. It’s good enough to send a separate audio stream for each person who
is speaking; that may even be a superior approach. The recipients can
mix the audio into a mono stream for playback, and the streams can be
end-to-end encrypted using bog-standard symmetric-key cryptography even
if they’re routed through a central server.</p>
<p>In a conference call, it isn’t practical to have many people talking
simultaneously, because then nobody can make out what anybody is saying.
As long as the maximum number of simultaneous speakers is small, the
amount of bandwidth required to listen in stays small no matter how many
listeners join.</p>
<p>Anyway, the amount of bandwidth needed for compressed audio is tiny.
High-quality lossy compression for speech needs maybe 40kbps. Given the
parameters I used above, my scheme would require 160kbps in each
direction, so as long as there are fewer than eight simultaneous
speakers, the simpler system is more efficient even with higher quality
audio. Using an MDCT and fewer bits for some frequency bins helps, but
it probably doesn’t make up that big of a difference.</p>
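<p>The 160kbps figure follows from the parameters above, assuming Paillier&#8217;s 2&#215; ciphertext expansion on 2048-bit plaintexts:</p>

```python
PLAINTEXT_BITS = 2048
CIPHERTEXT_BITS = 2 * PLAINTEXT_BITS  # Paillier doubles the plaintext size
BLOCKS_PER_SECOND = 40                # 8kHz / 200 samples per plaintext

kbps = BLOCKS_PER_SECOND * CIPHERTEXT_BITS / 1000
print(kbps)  # 163.84 -- roughly the 160kbps quoted above
```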
<p>I hope it’s clear that none of this means homomorphic encryption
schemes, like the Paillier or Damgård–Jurik cryptosystems, are useless.
Secure multiparty computation, like ShareMonad, isn’t invalidated as a
research field by this result either. But I do think that audio
conferencing isn’t going to be a fruitful application domain for any of
these tools, because its challenges are better addressed using more
straightforward solutions.</p>
<h1 id="efficient-peer-to-peer-video-conferencing">Efficient peer-to-peer video conferencing</h1>
<p>Unlike audio conferencing, there are no shortcuts in video conferencing:
if you want to send video from <code class="language-plaintext highlighter-rouge">N</code> people to some recipients, you need
to send <code class="language-plaintext highlighter-rouge">N</code> times as much video data. Anyway, I just got done arguing

that the audio shortcut of mixing down to one stream isn’t actually
useful, so anything we might say about video applies to some extent to
audio too.</p>
<p>But maybe academic research can point the way to more efficient
distribution of that video data across the internet?</p>
<p>Before we can investigate that question, we have to decide what we’re
trying to optimize.</p>
<p>If you’re coming from the perspective of operating a video-conferencing
platform like Zoom, you might look at the costs associated with routing
all that video through your own servers and networks, and wonder whether
a more peer-to-peer system might save you money.</p>
<p>If your perspective is that you don’t trust video-conferencing platforms to
keep your meetings secure, then you also might wonder whether you can
convince the people you communicate with to use a peer-to-peer system.</p>
<p>On the other hand, if you operate some portion of the public internet,
you might wonder whether there’s a way people could use less of your
network for video-conferencing, leaving more room for whatever it was we
did with the internet before the pandemic.</p>
<p>The bad news is that no matter what, there will be at least <code class="language-plaintext highlighter-rouge">N*(N-1)</code>
video streams bouncing around for a video-conference where all <code class="language-plaintext highlighter-rouge">N</code>
participants have their cameras turned on, because one way or another,
each participant has to receive one video stream from every participant
except themself. If the video is sent to a central server and
redistributed from there, the minimum is <code class="language-plaintext highlighter-rouge">N*N</code> streams.</p>
<p>Note that since every stream has both a sender and a receiver, if you
add up the bandwidth costs for all participants, they come to twice the
number of streams.</p>
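<p>A quick sketch of that stream accounting, for a few conference sizes:</p>

```python
def streams(n, central_server=False):
    """Total video streams in flight for n camera-on participants."""
    # Peer-to-peer: each participant receives n-1 streams.
    # Central server: n uploads in, plus n*(n-1) copies back out.
    return n * n if central_server else n * (n - 1)

for n in (2, 3, 10):
    # participants, p2p streams, server streams, summed p2p bandwidth
    # (each stream has a sender and a receiver, hence the factor of 2)
    print(n, streams(n), streams(n, central_server=True), 2 * streams(n))
```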
<p>The good news is that we can shift the costs of those streams around,
and perhaps reduce the cost of the underlying infrastructure, by
organizing the participants into different structures.</p>
<h2 id="centralized-platforms">Centralized platforms</h2>
<p>The simplest way to engineer a video-conferencing platform is to have
every participant connect to a central server, which copies everyone’s
video stream to everyone else. This is nice for the participants, who
each need enough bandwidth for <code class="language-plaintext highlighter-rouge">N</code> video streams, which is the bare
minimum. On the other hand, that central server faces <code class="language-plaintext highlighter-rouge">N*N</code> video
streams, which places much harder limits on how many participants the
platform can handle in one video conference. I might be able to handle
one thousand simultaneous video streams on my gigabit
fiber<sup id="fnref:tiny-pictures" role="doc-noteref"><a href="#fn:tiny-pictures" class="footnote">5</a></sup>, but then that poor server is facing one <em>million</em>
streams for one conference.</p>
<p>The “one central server” plan also isn’t good for those folks operating
the public internet. Imagine I placed that server in Europe but everyone
in some conference call is in North America. Now there are <code class="language-plaintext highlighter-rouge">N*N</code> streams
crossing the expensive trans-Atlantic cables even though exactly 0 of
them are of interest to anyone on the European side of the ocean.</p>
<p>This isn’t the way any serious platform operates, clearly, but let’s set
aside the centralized platforms for a moment and go to the other
extreme.</p>
<h2 id="everyone-sends-directly-to-everyone-else">Everyone sends directly to everyone else</h2>
<p>The most straightforward way to make a video conferencing system
peer-to-peer is to require every participant to open a direct connection
to every other participant, with each sending the other their own video
feed.</p>
<p>This option is, hands down, the most fair way to distribute the
bandwidth costs: every participant needs enough bandwidth for <code class="language-plaintext highlighter-rouge">2*(N-1)</code>
streams. No more, no less. That’s up to twice what they needed in the
centralized model, but on the other hand there isn’t anybody getting
stuck with the dramatically higher <code class="language-plaintext highlighter-rouge">N*N</code> bandwidth costs.</p>
<p>Fairness isn’t necessarily a good thing, since participants often have
different amounts of bandwidth available. Fairness could mean limiting
the size of a video conference to the number of people that the
least-well-connected participant has bandwidth for.</p>
<p>That said, direct connections open up the option of sending video
encoded at a different bitrate for some recipients, at the cost of more
CPU time spent on video encoding. So that tradeoff might pay off.</p>
<p>This option also may offer the lowest end-to-end latency, since each
stream goes as directly as possible to its recipient. Or it may not, if
some platform builds out a global network optimized for latency and can
bypass networks that are optimized for throughput and cost. It’s fun to
look at e.g. <a href="https://www.peeringdb.com/net/8731">Zoom’s PeeringDB entry</a> and speculate about
what they’re doing with the peering exchange points that they have a
presence in on every continent.</p>
<p>On today’s internet, not everyone can connect directly to everyone else,
due mostly to <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a>. But using direct connections by default and
falling back to platform-provided <a href="https://en.wikipedia.org/wiki/Traversal_Using_Relays_around_NAT">TURN</a> relays still cuts out a lot
of the platform’s costs in the centralized model. In fact, operating a
global network of TURN relays is currently a useful service in its own
right; see, for example, <a href="https://www.twilio.com/stun-turn">Twilio’s Global Network Traversal
Service</a>. I just hope some day we don’t need such services any
more.</p>
<h2 id="overlay-multicast">Overlay multicast</h2>
<p>There are often clusters of participants in different geographic
locations—say, a few in San Jose, some others in Toronto, and
let’s rope in the team in London while we’re at it. Maybe what we should
do is pick one well-connected participant in each cluster who handles
the long-distance video forwarding to one or two other clusters, and
then that representative connects to everyone else in their cluster. If
that’s too much for one of those representatives, then we can split that
cluster into several smaller clusters, and the top-level representative
just has to connect to the mini-cluster representatives.</p>
<p>Any structure like this can be seen as a tree, although in this context
it doesn’t actually matter which node you think of as the root. Since
every node except the root has a single parent, and every participant’s
video is flowing either up or down the tree at every edge, this still
results in <code class="language-plaintext highlighter-rouge">N*(N-1)</code> video streams, just like when everyone sends
directly to each other. The difference is that an individual
participant’s bandwidth requirements are now <code class="language-plaintext highlighter-rouge">N*D</code>, where <code class="language-plaintext highlighter-rouge">D</code> is the
degree of that node, or in other words how many other participants it’s
connected directly to.<sup id="fnref:tree-examples" role="doc-noteref"><a href="#fn:tree-examples" class="footnote">6</a></sup></p>
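<p>As an illustration (my own sketch, using the balanced binary tree shape from the footnote), the per-participant bandwidth in an overlay tree works out like this:</p>

```python
# A hypothetical 7-participant overlay tree: a balanced binary tree.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
n = 7

degree = [0] * n
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Each participant's bandwidth is N*D streams, D being its degree.
bandwidth = [n * d for d in degree]
print(bandwidth)  # [14, 21, 21, 7, 7, 7, 7]

# Summed over everyone this is 2*N*(N-1): each of the N*(N-1) streams
# is counted at both its sender and its receiver.
assert sum(bandwidth) == 2 * n * (n - 1)
```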
<p>This lets us have a kind of “controlled unfairness”. Each leaf in the
tree only connects to its parent, so it needs enough bandwidth for <code class="language-plaintext highlighter-rouge">N</code>
streams, just like in the central server case. The interior nodes can
connect to as many other participants as they can afford. Higher degree
makes the tree flatter, which decreases worst-case end-to-end latency
but, of course, increases bandwidth costs for that participant.</p>
<p>The problem is that participants have incentives that run counter to
doing what’s best for the group as a whole: in this context, they often
want to minimize their own bandwidth costs. We have a lot of choices for
how to structure the tree, but how do we know which ones people will
actually use?</p>
<p>While I was researching this topic, I found a paper called
<a href="http://www.afergan.com/research/papers/rgmo_infocom06.pdf">“Repeated-Game Modeling of Multicast Overlays”</a>, by
Afergan and Sami, published in the proceedings of IEEE INFOCOM 2006.
Their model was that only one source would be broadcasting and the root
of the tree would receive the broadcast first, so they consider an
additional incentive which is to be as close to the root as possible in
order to get the lowest latency. (In fairness, that’s what most people
mean by “multicast”; I’m abusing the terminology a bit because the same
structure works here.) Although that incentive doesn’t apply here, most
of their modeling and reasoning is still valid in this situation too.</p>
<p>I enjoyed reading this paper! It provided a convincing way to reason
about individual incentives, and turned the results into system design
principles for encouraging individuals to cooperate.</p>
<p>There are many other papers on the topic of “overlay multicast” which
suggest different ways to structure these trees, so anyone who wants to
go down this road has lots to work with. But…</p>
<h2 id="is-this-worth-doing-1">Is this worth doing?</h2>
<p>Specifically, should video conferencing avoid having a centralized
platform but also avoid having every participant connect directly to
each other?</p>
<p>This isn’t totally clear, but no, I don’t think so.</p>
<p>The first model, with a central server, is roughly the same as a tree
where the root has all other participants as its direct children. I find
this to be a useful way to think about how centralized platforms can
scale to multiple servers: organize the servers in a tree, and let
conference participants connect (as leaves of the tree) to whichever
server is geographically closest to them.</p>
<p>Keeping the interior nodes of the tree under the control of a
centralized platform means that all the incentives are aligned: the
operators of those nodes care about the sustainability of the network as
a whole, because it’s their network. They also have the ability to place
those nodes wherever they’re most needed, rather than relying on the
geographic distribution of the participants, and they have the ability
to control the quality of the connections between those nodes.</p>
<p>In the peer-to-peer setting, requiring every participant to connect
directly to each other doesn’t impose that much of a bandwidth cost. For
two participants the cost is exactly the same; for three, it’s 33%
higher. As the number of participants grows, the bandwidth cost that the
peer-to-peer participants have to shoulder tends asymptotically toward
twice the cost of using a central server.</p>
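<p>The ratio between the two models is simple enough to tabulate:</p>

```python
def ratio(n):
    """Per-participant cost of full-mesh p2p relative to a central server."""
    return 2 * (n - 1) / n  # 2*(N-1) streams versus N streams

print(ratio(2))    # 1.0  -- identical cost with two participants
print(ratio(3))    # about 1.33, i.e. 33% higher
print(ratio(100))  # 1.98 -- approaching twice the cost
```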
<p>Forming an overlay multicast tree is complicated because, as the
repeated-game paper showed, protocol and policy choices must be made
with the participants’ incentives in mind. Since the only advantage over
direct connections is that some nodes save a little bandwidth by pushing
part of their load onto others, I don’t think that complexity is
justified.</p>
<p>Most of these considerations would go out the window if IP multicast had
been widely deployed on the public internet. If the routing
infrastructure at the IP layer supported sending only one copy of a
stream across a particular link even when multiple recipients want it on
the other end, then I would say everybody should just use that and call
it a day. Every participant’s bandwidth costs would be from just <code class="language-plaintext highlighter-rouge">N</code>
video streams, like in the central server case but without any central
servers, and every video stream would cross any given network link
exactly once. But the key there is that it has to be integrated with the
underlying routing decisions, not bolted on as an application-level
overlay.</p>
<p>sigh… maybe someday we’ll get an internet that does everything,
but I’m not holding my breath.</p>
<h1 id="conclusions">Conclusions</h1>
<p>N-to-N audio/video conferencing is tricky to make scalable, but there’s
been significant demand lately. Given how much academic research hasn’t
found industry adoption, it’s worth checking whether there are hidden
gems that can drive the next big innovation. In this case, the research
I evaluated didn’t yield any of those, which always feels disappointing.</p>
<p>But I hope you’ve found the process I went through interesting, at
least, and perhaps picked up some tricks that you’ll be able to apply to
other challenges you encounter some day.</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:theory" role="doc-endnote">
<p>Cryptography is a fascinating field with all sorts of strange and
wonderful theoretical results. So it seems a shame that out of that
entire field, the only things impacting most people’s life
day-to-day are public-key encryption, symmetric-key encryption, and
secure hash functions. Granted, these three tools <em>have</em> had a huge
impact; all three are necessary to make the little “lock icon” in
web browsers meaningful. (In addition, secure hash functions have a
huge impact on the world’s electricity consumption via proof-of-work
blockchains such as Bitcoin; most tools have both good and bad uses,
after all.)</p>
<p>More generally, there’s a lot of academic research out there that
has not found applications besides publishing more research papers.
That doesn’t mean it was a bad idea to do that research! It may mean
the right application for it just hasn’t appeared yet, or it may
inspire other research that’s more immediately applicable, or it may
just serve to show that people shouldn’t waste time doing that
particular thing again. All of those outcomes are valuable. <a href="#fnref:theory" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sharemonad" role="doc-endnote">
<p>The ShareMonad implementation was made more complicated than this
because they had the servers change between logarithmic and linear
representations as well as clamping the mixed samples back into
valid dynamic range. They also generate a different mix for each
client, excluding that client’s input while mixing all the other
streams together. But all that work can be done on the clients,
either before encrypting their individual samples or after
decrypting the mixed samples. The only work which requires seeing
samples from multiple clients is the addition step, and that part is
easy. <a href="#fnref:sharemonad" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:expansion" role="doc-endnote">
<p>The Damgård–Jurik cryptosystem supports a parameter called <code class="language-plaintext highlighter-rouge">s</code>,
where the plaintext can be <code class="language-plaintext highlighter-rouge">s</code> times as long as the security
parameter, and the ciphertext will be <code class="language-plaintext highlighter-rouge">s+1</code> times as long, for a
ciphertext expansion rate of <code class="language-plaintext highlighter-rouge">(s+1)/s</code>. Paillier is the special case
where <code class="language-plaintext highlighter-rouge">s</code> is 1, so the ciphertext expands by a factor of 2 compared
to the plaintext; higher values of <code class="language-plaintext highlighter-rouge">s</code> are more space-efficient but
a little slower per bit.</p>
<p>This was not clear to me from the Wikipedia articles on the subject,
so to get the whole story I read the paper <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.69.7811">“A Generalization of
Paillier’s Public-Key System with Applications to Electronic
Voting”</a> by Damgård, Jurik, and Nielsen.
That’s a fascinating paper in its own right, including a bunch of
useful zero-knowledge proofs that are relevant to electronic voting.
It also includes performance measurements for a 750MHz Pentium III,
which were not super helpful for my purposes. <a href="#fnref:expansion" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:max-plaintext" role="doc-endnote">
<p>Although 256 8-bit values fit in 2048 bits, a Paillier plaintext
must be numerically less than the public key, which is the product
of two 1024-bit prime numbers, and is therefore some number smaller
than <code class="language-plaintext highlighter-rouge">2^2048</code>. <a href="#fnref:max-plaintext" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tiny-pictures" role="doc-endnote">
<p>Yes, yes, I know, if I try to display 1,000 video streams on my
screen, they’ll all be so small as to be useless. An alternate title
for this post could have been “Asymptotic complexity isn’t
everything.” <a href="#fnref:tiny-pictures" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tree-examples" role="doc-endnote">
<p>I find it informative to consider some specific shapes that an
overlay multicast tree may take on:</p>
<p>In the extreme case where every node except the last has exactly one
child, the two nodes at the ends (the root and the leaf) have degree
1 and the longest possible delay from each other; everyone else has
degree 2, so they need only slightly more bandwidth than in the case
where everyone connects directly to each other.</p>
<p>If the tree is a balanced binary tree, then about half the
participants have degree 1, while the rest have degree 3, except for
the root which has degree 2. <a href="#fnref:tree-examples" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey Sharp

Over the last year, a lot more people have been using video-conferencing platforms, such as Zoom, because they couldn’t safely meet in person.

XCB: protocol specifications as data
2021-03-25T21:14:41-07:00
https://jamey.thesharps.us/2021/03/25/xcb-protocol-specifications-data

<p>This year is the 20th anniversary of when I started writing <a href="https://xcb.freedesktop.org/">XCB</a>, and
I’ve been thinking about how best to implement network protocols ever
since. I have a lot to say on that topic but in this post I just want to
tell the story of how XCB’s protocol specification language came to be.</p>
<p>This story begins over sushi lunch on the last Thursday of August, 2001.
I’m not sure when <a href="https://keithp.com/">Keith Packard</a> and Professor <a href="https://web.cecs.pdx.edu/~bart/">Bart Massey</a> began
discussing how a C-language client library for the X protocol ought to
work, but Bart had been passing around a paper abstract on the topic for
a couple months by then. Most of XCB’s eventual design principles were
already described in that abstract, even though no code existed yet. In
particular, Bart wrote:</p>
<blockquote>
<p>It is expected that [XCB] will be implemented as directly as possible
from the X Protocol specification. It would be nice to at least
partially automate the generation of the API implementation directly
from the X Protocol document: this concept is currently being
explored.</p>
</blockquote>
<p>The person exploring that concept was me, and that Thursday I was
telling Bart and Keith that I didn’t think I could get it to work. The
existing human-readable documentation on the X protocol followed some
consistent typographical conventions that seemed parsable, but after
writing a small pile of Perl with limited success, I concluded that was
a dead end.</p>
<p>Over lunch, Bart and Keith suggested that I could instead try writing
new protocol specifications with code generation in mind. Specifically,
Bart suggested that I use the <code class="language-plaintext highlighter-rouge">m4</code> text preprocessor to express each
part of the protocol as a macro invocation.</p>
<p>Of the people who have heard of <code class="language-plaintext highlighter-rouge">m4</code>, the most likely response to any
mention of it is some combination of disgust and horror. I think most
people have only experienced it in conjunction with either Sendmail
configuration or <code class="language-plaintext highlighter-rouge">autoconf</code>, and in both cases <code class="language-plaintext highlighter-rouge">m4</code> was used to paper
over complexity by burying it in layers of complex macros. I don’t think
it’s fair to blame <code class="language-plaintext highlighter-rouge">m4</code> for that, though.</p>
<p>Anyway, I didn’t know any of that. So while Keith told Bart over the
weekend that he couldn’t believe Bart would inflict <code class="language-plaintext highlighter-rouge">m4</code> on me, and
while Bart was starting to feel guilty about it, I was naively forging
ahead and making it work.</p>
<p>XCB’s code generation continued to be based on <code class="language-plaintext highlighter-rouge">m4</code> for the next three
years, until <a href="https://github.com/joshtriplett/">Josh Triplett</a> took Bart’s “Open Source UNIX Software
Development” class and replaced all my beautiful <code class="language-plaintext highlighter-rouge">m4</code> macros with a
combination of XML and XSLT. Despite that, we became good friends. In
turn, Josh’s XSLT was later replaced with Python, which is the
combination that XCB still uses now.</p>
<p>Today, XCB consists of about 2,900 lines of hand-written C, and about
4,000 lines of Python that’s used only at build time. The protocol
specifications total almost 19,000 lines of XML. By contrast, libX11
(which XCB was intended to replace) is about 93,000 lines of
hand-written C, not counting the many libraries that implement various
protocol extensions.</p>
<p>There has only been one CVE assigned to XCB, ever: <a href="https://cve.circl.lu/cve/CVE-2013-2064">CVE-2013-2064</a>. By
contrast, libX11 and its extension libraries have had over 50, in some
cases because similar bugs were replicated across many of the libraries.
This highlights the value of the <a href="http://langsec.org/">Language-theoretic Security</a>
(LangSec) philosophy, which I only just learned is a specific thing
people study. (See <a href="http://langsec.org/papers/langsec-cwes-secdev2016.pdf">The Seven Turrets of Babel: A Taxonomy of LangSec
Errors and How to Expunge Them</a> for a comprehensive
overview of security issues in protocol implementations.)</p>
<p>Besides reducing the amount of code to maintain and making the
implementation more resistant to security bugs, making the protocol
descriptions into pure data instead of code enabled new software to be
built that was difficult to write before.</p>
<p>My favorite example is that in 2009, <a href="https://gitlab.com/wireshark/wireshark/-/issues/2981">Wireshark’s protocol dissector for
X was replaced</a> with code auto-generated from XCB’s XML
protocol descriptions. Before that, it could only dissect the core
protocol, but this change allowed it to support all the X protocol
extensions as well. As extensions have been added and changed over time,
Wireshark has only needed to be rebuilt against newer xcbproto, and it
immediately gained support for the new extensions.</p>
<p>Bart’s original stated goal, that we should automate the generation of
the API implementation, really paid off. This was the first project
where I saw the value of machine-readable protocol specifications. I’ve
kept trying to replicate that result ever since, but I’ve never been
completely satisfied with the available options for specification
languages or methodologies.</p>
<p>I intend for this to be the first post in a series about protocol
implementations I’ve worked on, where each one will highlight different
aspects of desirable features for protocol implementers. Then I’ll wrap
up the series with discussion of the connection between network
protocols and an aspect of formal language theory which I recently
became aware of, called “subsequential finite state transducers”.</p>
<p>Until then, I hope you enjoyed this peek behind the curtains of XCB’s
development!</p>Jamey Sharp

This year is the 20th anniversary of when I started writing XCB, and I’ve been thinking about how best to implement network protocols ever since. I have a lot to say on that topic but in this post I just want to tell the story of how XCB’s protocol specification language came to be.

Quickly building minimal Docker containers with Nix
2021-02-02T17:12:00-08:00
https://jamey.thesharps.us/2021/02/02/docker-containers-nix

<p>For several years now I’ve been using <a href="http://dokku.viewdocs.io/dokku/">Dokku</a> on my personal web
server as an easy way to spin up new web apps I want to play around
with. Like the <a href="https://www.heroku.com/">Heroku</a> platform on which it’s modeled, Dokku allows
me to deploy a new version of an app by simply running <code class="language-plaintext highlighter-rouge">git push</code>, but
Dokku is fully open-source and I can self-host it.</p>
<p>I’ve written before that Dokku imposes constraints I don’t like, so as
part of a plan to replace it with something that more exactly meets my
needs, I wrote “<a href="/2020/04/24/most-general-reverse-proxy/">The most general reverse proxy</a>”. That only addressed
a small portion of what Dokku is currently doing for me, though, and I’m
starting to think I need to write my own service manager to capture the
rest; neither systemd nor its competitors seem to have any support for
rolling deployments.</p>
<p>Meanwhile, I’ve also been working on addressing the things I find most
painful about Dokku as I continue using it. And I’ve concluded that what
frustrates me most in routine usage of Dokku is all inherited from
Heroku’s design, by way of the <a href="https://github.com/gliderlabs/herokuish">Herokuish</a> compatibility layer that
Dokku uses by default.</p>
<p>Let’s be clear here: both Dokku and Heroku are great tools, and if you
do any web development, or just want to self-host web apps that others
have developed, you should definitely try them out. But it’s always
worth trying to improve things, and I’ve come up with a bit of a hack
that’s already working much better for my needs.</p>
<h1 id="how-it-works-now">How it works now</h1>
<p>The core of my complaint is the method that Heroku, and by extension
Herokuish, uses to construct an application container image.</p>
<p>Every time I run <code class="language-plaintext highlighter-rouge">git push</code>, the new version of my app needs to be
combined with all its dependencies into a bundle that the hosting
platform can execute. The difficulty is in how I as an app developer
should specify those dependencies in such a way that the platform can
find and install them for me.</p>
<p>Heroku uses two different approaches for this. First, Heroku provides a
base image containing a stock Ubuntu install and a collection of
commonly-used packages, such as compilers and database libraries. (See
the official <a href="https://github.com/heroku/stack-images">Heroku Stack Images</a> for details.) This stack image is
around 1.5GB of stuff that a lot of people need some of the time, but of
course for any given app the vast majority of it is unused.</p>
<p>After that, <a href="https://devcenter.heroku.com/articles/buildpacks">buildpacks</a> provide language-specific autodetection of
dependencies.<sup id="fnref:cnb" role="doc-noteref"><a href="#fn:cnb" class="footnote">1</a></sup> For example:</p>
<ul>
<li>the Python buildpack looks for common Python config files like
<code class="language-plaintext highlighter-rouge">requirements.txt</code> or <code class="language-plaintext highlighter-rouge">Pipfile.lock</code> and installs whatever Python
packages those files specify;</li>
<li>the PHP buildpack looks for <code class="language-plaintext highlighter-rouge">composer.json</code>;</li>
<li>the Node.js buildpack looks for <code class="language-plaintext highlighter-rouge">package.json</code>;</li>
<li>the Ruby buildpack looks for <code class="language-plaintext highlighter-rouge">Gemfile.lock</code>;</li>
</ul>
<p>and so on. Heroku has a collection of officially-supported buildpacks
but other developers can write their own and share them. If one or more
of the <a href="https://elements.heroku.com/buildpacks">7,246 existing buildpacks</a> applies to your
project, this makes getting started really easy: you don’t need to
understand anything about how Heroku works, only how to use your chosen
development environment’s tools. It’s magic!</p>
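<p>A rough sketch of that detection step (the marker files come from the list above; the structure is my simplification, not the actual buildpack API, which runs a per-buildpack <code class="language-plaintext highlighter-rouge">detect</code> script):</p>

```python
import os

# Hypothetical marker files per buildpack, taken from the list above.
BUILDPACK_MARKERS = {
    "python": ["requirements.txt", "Pipfile.lock"],
    "php": ["composer.json"],
    "nodejs": ["package.json"],
    "ruby": ["Gemfile.lock"],
}

def detect(app_dir):
    """Return the buildpacks whose marker files exist in app_dir."""
    return [name for name, markers in BUILDPACK_MARKERS.items()
            if any(os.path.exists(os.path.join(app_dir, m)) for m in markers)]
```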
<p>If neither the stack image nor any of the available buildpacks does what
you need, though, then you fall off a complexity cliff. Suddenly you
need to understand the filesystem layout of a Heroku container, think
about which version and architecture of Ubuntu it’s based on, and dig
deep into what the buildpacks were doing for you up to that point. It’s
<em>too much</em> magic.</p>
<p>Buildpacks are specialized enough that people fall off that complexity
cliff all the time, as evidenced by the fact that there are over seven
thousand buildpacks. That’s an average of two or three new buildpacks
every day since <a href="https://blog.heroku.com/buildpacks">Heroku announced buildpacks in 2012</a>.</p>
<p>That shallowly-buried complexity was already enough to make me want
something better, but what really bothers me every time I run <code class="language-plaintext highlighter-rouge">git push</code>
is how slow the build process is. I think there are two major causes for
this:</p>
<ol>
<li>
<p>It’s entirely sequential, which can’t be fixed without collecting
information about the dependencies between build steps. This is the
usual situation for Docker images in general, not just Heroku.</p>
</li>
<li>
<p>Although buildpacks have always been allowed to cache build products
from one deploy to the next, it’s up to the buildpack author to use
the cache correctly, and that’s <a href="https://blog.heroku.com/speeding-up-sprockets">hard</a> to <a href="https://github.com/heroku/heroku-buildpack-nodejs/issues/1">get
right</a>. I suspect buildpack maintainers who try to
implement really aggressive caching are punished by more bugs and
headaches than it’s worth to them.</p>
</li>
</ol>
<p>Last and probably least, it bothers me that there’s so much unnecessary
stuff in the final deployed image. Programs which are only needed at
build time, such as C compilers and static site generators, remain in
the image when deployed; and then, of course, there’s the 1.5GB of
mostly-unused Ubuntu packages. Because Docker only stores one copy of
the stack image and shares it between all the apps which use it, this is
more of an aesthetic consideration than a real problem. That
said, if your app has a security hole, this is a huge attack surface
which can make exploiting bugs easier.</p>
<p>I have a fix for all of these problems… in exchange for creating
new ones, naturally.</p>
<h1 id="building-minimal-docker-images-quickly-with-nix">Building minimal Docker images quickly with Nix</h1>
<p>I’ve started using the <a href="https://nixos.org/">Nix package manager</a> to build the Docker
images that I deploy with Dokku. Most people aren’t familiar with Nix,
so I’m going to give a little background, but since this blog post isn’t
really about Nix, I’m going to oversimplify a bunch of details. If you
are familiar with Nix, please just ignore the inaccuracies. kthx 😅</p>
<p>Nix provides a unified way to describe the complete build-time and
run-time dependencies of arbitrary software. Instead of writing giant
shell scripts to sequence all your build steps, Nix gives you a
programming language that treats build products as first-class objects.
The result of running a Nix program (a “derivation”) is the dependency
tree of all the steps needed to build the specified software from
scratch. That dependency tree can then be “realized” to run all the
steps and complete the build.</p>
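<p>As a tiny illustration (a made-up example, not from any real
project): interpolating one derivation’s output into another is what
creates an edge in that dependency tree, so Nix knows which builds must
happen first:</p>

```nix
let
  pkgs = import <nixpkgs> {};
  # A build product as a first-class value:
  greeting = pkgs.writeText "greeting.txt" "hello";
in
  # Referencing ${greeting} records a dependency edge; realizing this
  # derivation builds greeting first, then runs the script below.
  pkgs.runCommand "shout" {} ''
    tr a-z A-Z < ${greeting} > $out
  ''
```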
<p>Nix enforces that a derivation’s dependencies are specified correctly,
by building it in an environment where only the requested versions of
those dependencies are available. As a result, Nix can automatically and
reliably determine whether two builds can proceed in parallel and
whether a previous build of a derivation can be reused because its
inputs haven’t changed.</p>
<p>So Nix builds get parallelism and aggressive caching for free. In fact,
Nix has built-in support for distributing builds across, and sharing
caches between, multiple computers.</p>
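<p>Concretely, those knobs live in <code class="language-plaintext highlighter-rouge">nix.conf</code>. The values below are
illustrative, not a recommendation, and the <code class="language-plaintext highlighter-rouge">builders</code> host name is
made up:</p>

```conf
# /etc/nix/nix.conf (illustrative values)
max-jobs = 4                  # build up to four derivations in parallel
cores = 0                     # each build may use every available core
builders = ssh://builder x86_64-linux    # offload builds to another machine
substituters = https://cache.nixos.org/  # reuse pre-built binaries from a cache
```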
<p>Of course, somebody has to write all those Nix recipes, and it’s easiest
if you can just reuse recipes that somebody else already wrote. The
usual source for those is the Nix Packages collection, <a href="https://search.nixos.org/packages">Nixpkgs</a>,
although various communities also provide package “overlays” that extend
Nixpkgs.</p>
<p>But just like having thousands of buildpacks available still doesn’t
cover every use case, sometimes I’ve needed something that wasn’t
packaged.<sup id="fnref:no-examples" role="doc-noteref"><a href="#fn:no-examples" class="footnote">2</a></sup> Compared to writing buildpacks, though, Nix
build recipes have a much more gradual learning curve,<sup id="fnref:learning" role="doc-noteref"><a href="#fn:learning" class="footnote">3</a></sup> and
the <a href="https://nixos.org/manual/nixpkgs/stable/">Nixpkgs manual</a> offers extensive documentation for the many
different helpers that Nixpkgs maintainers have come up with over the
years.</p>
<p>Once you have a recipe that builds your application, Nixpkgs provides an
easy way to turn that into a Docker image. For example, I build this
blog with a Nix expression that looks something like this, where <code class="language-plaintext highlighter-rouge">pkgs</code>
is an instance of Nixpkgs and <code class="language-plaintext highlighter-rouge">config</code> is the result of another
derivation that writes the <code class="language-plaintext highlighter-rouge">lighttpd.conf</code> I want:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"jamey.thesharps.us"</span><span class="p">;</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"</span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd"</span> <span class="s2">"-D"</span> <span class="s2">"-f"</span> <span class="nv">config</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Nix builds run in isolation, so the build itself can’t load the image
into Docker directly; instead, the result of that recipe is a shell
script which constructs the Docker image. To load it into Docker, I run:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="si">$(</span>nix-build<span class="si">)</span> | docker load
</code></pre></div></div>
<p>I generate my lighttpd config file with a recipe like the following,
where <code class="language-plaintext highlighter-rouge">docroot</code> is the result of yet another derivation which runs
Jekyll to build the static HTML of this blog. One neat thing is that the
<code class="language-plaintext highlighter-rouge">checkPhase</code> script will fail the build if lighttpd can’t parse the
configuration, rather than having lighttpd fail to start after I’ve
already deployed the broken configuration.</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">pkgs</span><span class="o">.</span><span class="nv">writeTextFile</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"lighttpd.conf"</span><span class="p">;</span>
<span class="nv">checkPhase</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> PORT=5000 </span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd -tt -f $n</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">text</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> server.document-root = "</span><span class="si">${</span><span class="nv">docroot</span><span class="si">}</span><span class="s2">"</span><span class="err">
</span><span class="s2"> server.port = env.PORT</span><span class="err">
</span><span class="s2"> include "</span><span class="si">${</span><span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/share/lighttpd/doc/config/conf.d/mime.conf"</span><span class="err">
</span><span class="s2"> etag.use-mtime = "disable"</span><span class="err">
</span><span class="s2"> setenv.set-response-header = ("Last-Modified" => "")</span><span class="err">
</span><span class="s2"> # more settings follow...</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This illustrates one “gotcha” of using Nix, which sets the mtime of all
files to 0. That means mtime is not useful in <code class="language-plaintext highlighter-rouge">ETag</code> or <code class="language-plaintext highlighter-rouge">Last-Modified</code>
headers. Cache validation can still get correct results because all the
inodes should change on every rebuild, so the generated <code class="language-plaintext highlighter-rouge">ETag</code> should
change if any content changes. That invalidates more cache entries than
necessary but at least it’s sufficient for correctness.</p>
<p>Since <code class="language-plaintext highlighter-rouge">config.Cmd</code> references both the lighttpd binary and my config
file, both are automatically included into the image. In addition,
lighttpd needs some libraries, and the config file references the
Jekyll-built HTML, so all of those are included too.</p>
<p>But after the HTML is generated I don’t need Jekyll or Ruby any more, so
those are <em>not</em> added to the image. Only the derivations which are
transitively referenced are included, so there’s nothing extraneous
being added at all. The Docker image for this blog is about 42MB,
instead of 1.5-2GB, and five-sixths of that comes from glibc and
OpenSSL, which aren’t worth getting rid of.<sup id="fnref:closure-size" role="doc-noteref"><a href="#fn:closure-size" class="footnote">4</a></sup></p>
<p>So switching to Nix addresses all the complaints I mentioned earlier: it
has a gradual learning curve rather than a complexity cliff, it builds
images more quickly, and the images it builds are as small as possible.</p>
<h1 id="integrating-nix-with-dokku">Integrating Nix with Dokku</h1>
<p>Now for the part which is currently more of an ugly hack than ready for
production. The first time you <code class="language-plaintext highlighter-rouge">git push</code> to a new Dokku-hosted repo,
Dokku creates a git <code class="language-plaintext highlighter-rouge">pre-receive</code> hook which looks something like this:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span>
<span class="nb">set</span> <span class="nt">-o</span> pipefail
<span class="nb">cat</span> | <span class="nv">DOKKU_ROOT</span><span class="o">=</span><span class="s2">"/home/dokku"</span> dokku git-hook jamey.thesharps.us
</code></pre></div></div>
<p>All the work of building a Docker image using Herokuish is done by
Dokku’s internal <code class="language-plaintext highlighter-rouge">git-hook</code> trigger.</p>
<p>To make <code class="language-plaintext highlighter-rouge">git push</code> trigger a Nix build instead, I replaced that with
this script:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">set</span> <span class="nt">-e</span><span class="p">;</span> <span class="nb">set</span> <span class="nt">-o</span> pipefail<span class="p">;</span>
<span class="nb">export </span><span class="nv">NIX_PATH</span><span class="o">=</span><span class="s2">"/nix/var/nix/profiles/per-user/root/channels"</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"/nix/var/nix/profiles/default/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="nb">export </span><span class="nv">DOKKU_ROOT</span><span class="o">=</span><span class="s2">"/home/dokku"</span>
<span class="nv">app</span><span class="o">=</span>jamey.thesharps.us
<span class="k">while </span><span class="nb">read </span>old new ref<span class="p">;</span> <span class="k">do
if</span> <span class="o">!</span> <span class="nb">test</span> <span class="s2">"</span><span class="nv">$ref</span><span class="s2">"</span> <span class="o">=</span> <span class="s2">"refs/heads/master"</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"skipping ref </span><span class="nv">$ref</span><span class="s2">--the deployment branch is master"</span> <span class="o">></span>&2
<span class="k">continue
fi
</span><span class="nv">script</span><span class="o">=</span><span class="si">$(</span>nix-build <span class="nt">--expr</span> <span class="s2">"import </span><span class="se">\"\$</span><span class="s2">{builtins.fetchGit {
url = ./.;
rev = </span><span class="se">\"</span><span class="nv">$new</span><span class="se">\"</span><span class="s2">;
}}/docker.nix</span><span class="se">\"</span><span class="s2"> {}"</span><span class="si">)</span>
<span class="nv">image</span><span class="o">=</span><span class="si">$(</span><span class="nv">$script</span> | docker load | <span class="nb">sed</span> <span class="nt">-n</span> <span class="s1">'$s/^Loaded image: //p'</span><span class="si">)</span>
docker image tag <span class="s2">"</span><span class="nv">$image</span><span class="s2">"</span> dokku/<span class="s2">"</span><span class="nv">$app</span><span class="s2">"</span>:latest
dokku tags:deploy <span class="s2">"</span><span class="nv">$app</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>
<p>This new script does the following steps:</p>
<ol>
<li>
<p>Tell Nix to find and evaluate the <code class="language-plaintext highlighter-rouge">docker.nix</code> file in the
newly-pushed revision. This builds all the dependencies and the
image-generating shell script.</p>
</li>
<li>
<p>Generate the image, load it into Docker, and find out what tag it got
assigned. The same commit might get deployed to multiple different
apps, so it shouldn’t have the app name hard-coded into it, and I’m
letting the Nixpkgs docker tools generate the image tag from a hash
of its dependencies.</p>
</li>
<li>
<p>Apply a new tag to the image using the naming convention that Dokku
expects. For example: <code class="language-plaintext highlighter-rouge">dokku/jamey.thesharps.us:latest</code></p>
</li>
<li>
<p>Finally, tell Dokku to deploy the tagged image.</p>
</li>
</ol>
<p>I’d rather do all this in a Dokku plugin, but I dug around the source
code a bit and didn’t see any way that plugins can provide alternatives
to Herokuish or Dockerfile deployments. If I have enough energy for it,
at some point I’ll open an issue about this on the Dokku repo.</p>
<p>Dokku supports deploying Docker images built on a CI server (as does
Heroku, I gather), so as an alternative I could have set up a CI server
that I <code class="language-plaintext highlighter-rouge">git push</code> to and which deploys the build result into Dokku.</p>
<p>But this hack works for me, and I’ve switched most of the apps on my
personal server over to it.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Switching from Herokuish to Nix while still deploying using Dokku is a
pretty big change. For me, it’s already paid off several times over: I
can concisely express exactly the configuration and dependencies that an
app requires, and I feel much more comfortable deploying now that the
build step is usually seconds instead of minutes.</p>
<p>I’m less happy about the method I’ve used to glue Nix to Dokku. But
Dokku is being actively developed, so I have hope that at some point the
maintainers will introduce more flexible triggers that allow plugins to
implement alternative build methods like this.</p>
<p>I don’t see any reason Heroku or other similar platforms couldn’t offer
Nix builds as an option, for that matter.</p>
<p>Nix may not be the right tool for you. Dokku and Heroku may not meet
your needs either. But I think they’re all tools worth knowing about, at
the very least.</p>
<h1 id="appendix-language-specific-examples">Appendix: Language-specific examples</h1>
<p>The original purpose of the buildpack abstraction was to make it easy to
get a container from source code written in a wide range of programming
languages. So if we replace buildpacks with Nix, how much work does it
take to get started?</p>
<p>The answer varies a lot from one language to the next, but here are a
couple of examples from the deployments I’ve done using Nix recently.</p>
<h2 id="python">Python</h2>
<p>A new tool that’s made deploying my Python projects almost trivial is
<a href="https://github.com/nix-community/poetry2nix">poetry2nix</a>, which gets all of the packaging information it needs
from the <code class="language-plaintext highlighter-rouge">poetry.lock</code> file written by the Python <a href="https://python-poetry.org/">Poetry</a> package
manager. I can run a command like <code class="language-plaintext highlighter-rouge">poetry add --lock gunicorn</code> to
declare a new dependency on a Python package and the Nix build
immediately picks it up, using the exact version that Poetry’s
dependency resolver selected.</p>
<p>If you’re using either the unstable version of Nixpkgs or the overlay
that the poetry2nix developers provide, you should be able to build a
working Docker image from any Poetry-managed project with a recipe like
this one, which I’ve simplified from the <code class="language-plaintext highlighter-rouge">docker.nix</code> in my
<a href="https://github.com/jameysharp/predictable">predictable</a> project:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">app</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry2nix</span><span class="o">.</span><span class="nv">mkPoetryApplication</span> <span class="p">{</span>
<span class="nv">projectDir</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"predictable"</span><span class="p">;</span>
<span class="nv">contents</span> <span class="o">=</span> <span class="p">[</span> <span class="nv">app</span><span class="o">.</span><span class="nv">dependencyEnv</span> <span class="p">];</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"/bin/gunicorn"</span> <span class="s2">"predictable:app"</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The best part is that poetry2nix builds each Python package as a
separate Nix derivation, rather than installing everything into a single
<code class="language-plaintext highlighter-rouge">site-packages</code> directory the way standard Python tools like Pip
do. This means</p>
<ul>
<li>multiple Python packages can build in parallel,</li>
<li>only Python packages which have changed need to be rebuilt,</li>
<li>and each Python package is built in an isolated environment with only
its declared dependencies available, avoiding surprises later from
unspecified dependencies.</li>
</ul>
<p>This tooling also works nicely together with <a href="https://direnv.net/"><code class="language-plaintext highlighter-rouge">direnv</code></a> (which I’ve
written about before in “<a href="/2019/05/29/per-project-postgres/">Per-project Postgres</a>”) for managing a local
development environment. Here’s a simplified version of the <code class="language-plaintext highlighter-rouge">shell.nix</code>
which you can find alongside the above <code class="language-plaintext highlighter-rouge">docker.nix</code>:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">app</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry2nix</span><span class="o">.</span><span class="nv">mkPoetryEnv</span> <span class="p">{</span>
<span class="nv">projectDir</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="nv">editablePackageSources</span><span class="o">.</span><span class="nv">predictable</span> <span class="o">=</span> <span class="sx">./.</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">mkShell</span> <span class="p">{</span> <span class="nv">buildInputs</span> <span class="o">=</span> <span class="p">[</span> <span class="nv">app</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">poetry</span> <span class="p">];</span> <span class="p">}</span>
</code></pre></div></div>
<p>Then if I just write <code class="language-plaintext highlighter-rouge">use nix</code> in <code class="language-plaintext highlighter-rouge">.envrc</code> and run <code class="language-plaintext highlighter-rouge">direnv allow</code>, when
I <code class="language-plaintext highlighter-rouge">cd</code> into that project’s working copy, I get a complete development
environment automatically.</p>
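<p>For completeness, the whole <code class="language-plaintext highlighter-rouge">.envrc</code> can be as small as this; the
<code class="language-plaintext highlighter-rouge">watch_file</code> line is optional, but tells direnv to rebuild the
environment whenever the Nix expression itself changes:</p>

```sh
# .envrc
watch_file shell.nix   # reload the environment when shell.nix changes
use nix                # build shell.nix and export its environment
```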
<h2 id="php">PHP</h2>
<p>I’m not really familiar with the PHP ecosystem, but there are web apps I
want to use that are written in PHP, so I dug into the Nixpkgs support
for PHP a bit and ended up with a Nix recipe that looks something like
this:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span>
<span class="nv">pkgs</span> <span class="o">=</span> <span class="kr">import</span> <span class="o"><</span><span class="nv">nixpkgs</span><span class="o">></span> <span class="p">{};</span>
<span class="nv">lighttpd</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">lighttpd</span><span class="p">;</span>
<span class="nv">php</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">php</span><span class="p">;</span>
<span class="nv">config</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">writeTextFile</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"lighttpd.conf"</span><span class="p">;</span>
<span class="nv">checkPhase</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> PORT=5000 </span><span class="si">${</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd -tt -f $n</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">text</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> server.port = env.PORT</span><span class="err">
</span><span class="s2"> server.upload-dirs = ("/tmp")</span><span class="err">
</span><span class="s2"> fastcgi.server = (".php" => ((</span><span class="err">
</span><span class="s2"> "bin-path" => "</span><span class="si">${</span><span class="nv">php</span><span class="si">}</span><span class="s2">/bin/php-cgi",</span><span class="err">
</span><span class="s2"> "bin-environment" => (</span><span class="err">
</span><span class="s2"> # Nixpkgs 20.09 wraps bin/php to set</span><span class="err">
</span><span class="s2"> # this but does not wrap bin/php-cgi</span><span class="err">
</span><span class="s2"> "PHP_INI_SCAN_DIR" => "</span><span class="si">${</span><span class="nv">php</span><span class="si">}</span><span class="s2">/lib",</span><span class="err">
</span><span class="s2"> ),</span><span class="err">
</span><span class="s2"> # see lighttpd and php docs for more that should go here</span><span class="err">
</span><span class="s2"> )))</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">};</span>
<span class="kn">in</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">dockerTools</span><span class="o">.</span><span class="nv">streamLayeredImage</span> <span class="p">{</span>
<span class="nv">name</span> <span class="o">=</span> <span class="s2">"phpapp"</span><span class="p">;</span>
<span class="nv">extraCommands</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> # PHP expects to be able to create a lockfile in /tmp</span><span class="err">
</span><span class="s2"> mkdir -m 1777 tmp</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="nv">config</span><span class="o">.</span><span class="nv">Cmd</span> <span class="o">=</span> <span class="p">[</span> <span class="s2">"</span><span class="si">${</span><span class="nv">lighttpd</span><span class="si">}</span><span class="s2">/bin/lighttpd"</span> <span class="s2">"-D"</span> <span class="s2">"-f"</span> <span class="nv">config</span> <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
<p>There were two non-obvious things I had to do. One is that there’s no
<code class="language-plaintext highlighter-rouge">/tmp</code> in Nixpkgs-generated Docker images by default—in fact,
there’s nothing except <code class="language-plaintext highlighter-rouge">/nix/store/</code>—and both PHP and lighttpd
sometimes need a place to write temporary files. So I used the
<code class="language-plaintext highlighter-rouge">extraCommands</code> option to create that directory. (lighttpd defaults to
<code class="language-plaintext highlighter-rouge">/var/tmp</code> which of course doesn’t exist either, so I configured it to
use the same <code class="language-plaintext highlighter-rouge">/tmp</code> directory instead.)</p>
<p>The other is, I think, a bug in the current release of Nixpkgs, which I
ought to open an issue about. PHP quite naturally doesn’t know how to
find anything in the directory structure that Nix uses, so the Nix
package of PHP auto-generates a <code class="language-plaintext highlighter-rouge">php.ini</code> file with the full path to
each enabled PHP extension. But then, PHP doesn’t know how to find that
<code class="language-plaintext highlighter-rouge">php.ini</code> file either, so the package arranges that the <code class="language-plaintext highlighter-rouge">php</code> command is
actually a shell script that sets <code class="language-plaintext highlighter-rouge">PHP_INI_SCAN_DIR</code> to the right path
before running the real <code class="language-plaintext highlighter-rouge">php</code> binary. However, the package doesn’t wrap
<code class="language-plaintext highlighter-rouge">php-cgi</code> in the same way, so that binary can’t find <code class="language-plaintext highlighter-rouge">php.ini</code> either.
As a workaround, I set the variable in the lighttpd config instead.</p>
<p>There’s one more useful trick I learned. By default, the PHP package
enables a bunch of extensions, which is good for getting started
quickly, but it also pulls in a lot of unnecessary dependencies. Once I
had things working, I replaced the <code class="language-plaintext highlighter-rouge">php = pkgs.php;</code> line above with
this:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">php</span> <span class="o">=</span> <span class="nv">pkgs</span><span class="o">.</span><span class="nv">php</span><span class="o">.</span><span class="nv">buildEnv</span> <span class="p">{</span>
<span class="nv">extensions</span> <span class="o">=</span> <span class="p">{</span> <span class="nv">all</span><span class="p">,</span> <span class="o">...</span> <span class="p">}:</span> <span class="kn">with</span> <span class="nv">all</span><span class="p">;</span> <span class="p">[</span>
<span class="nv">json</span>
<span class="nv">session</span>
<span class="c"># any other extensions you want here</span>
<span class="p">];</span>
<span class="nv">extraConfig</span> <span class="o">=</span> <span class="s2">''</span><span class="err">
</span><span class="s2"> upload_max_filesize = 64M</span><span class="err">
</span><span class="s2"> ; more custom php.ini settings follow ...</span><span class="err">
</span><span class="s2"> ''</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:cnb" role="doc-endnote">
<p>Heroku and others are now collaborating on a standard for “<a href="https://buildpacks.io/">Cloud
Native Buildpacks</a>” under the Cloud Native
Computing Foundation. Although Heroku defined the idea, this
discussion applies to a growing number of other platforms. <a href="#fnref:cnb" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:no-examples" role="doc-endnote">
<p>I don’t have any good examples of unpackaged dependencies from the
web apps I’ve deployed, though. It seems there was a perfectly
good package of the GeoIP C library at the time that I wrote my own
recipe for it—I have no record of why I didn’t just use
that, but I may simply not have known it was there—and the
Python tooling I’m using now makes the Python package recipes I’ve
written in the past entirely unnecessary. <a href="#fnref:no-examples" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:learning" role="doc-endnote">
<p>Using Nix for deployment means you need to understand at least a
little bit about Nix to get started, while it’s possible to get
started with Heroku without understanding Heroku’s architecture at
all. Personally, I think it’s worth investing a little time up-front
to avoid falling off the complexity cliff the moment I try to do
something just a little bit unusual. You may feel differently, and
that’s okay. <a href="#fnref:learning" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:closure-size" role="doc-endnote">
<p>I could rebuild lighttpd against musl libc and without TLS support
to get my blog’s Docker image down to little more than the size of
the HTML. It isn’t even particularly hard to do. Asking for
<code class="language-plaintext highlighter-rouge">pkgs.pkgsMusl.lighttpd</code> instead of <code class="language-plaintext highlighter-rouge">pkgs.lighttpd</code> does the former,
and the tersest way to remove the OpenSSL dependency is:</p>
<div class="language-nix highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">lighttpd</span><span class="o">.</span><span class="nv">overrideAttrs</span> <span class="p">(</span><span class="nv">old</span><span class="p">:</span> <span class="p">{</span> <span class="nv">configureFlags</span> <span class="o">=</span> <span class="p">[];</span> <span class="p">})</span>
</code></pre></div> </div>
<p>The downside is that the CI builds done by the NixOS project don’t
cover these build configurations, which means Nix has to rebuild
them from source for me instead of using pre-built binaries. Since
one of my big goals was to speed up deployment, this isn’t worth
doing.</p>
<p>Rebuilding with musl is especially time-consuming because all the
dependencies, including build-time dependencies, get rebuilt too.
That includes things like perl, so it takes quite a while. <a href="#fnref:closure-size" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey Sharp
For several years now I’ve been using Dokku on my personal web server as an easy way to spin up new web apps I want to play around with. Like the Heroku platform on which it’s modeled, Dokku allows me to deploy a new version of an app by simply running git push, but Dokku is fully open-source and I can self-host it.
Feed reader cache coherency with JMAP and RFC5005
2020-08-06T20:14:59-07:00
https://jamey.thesharps.us/2020/08/06/feed-reader-cache-coherency
<p>RSS feed readers don’t work as well as they should. Today I’d like to
talk about issues from an application and server developer viewpoint;
I’ll save some thoughts about usability in various use cases for another
time.</p>
<p>In this post I’m going to suggest that, from a technical perspective,
the first step to fixing some important issues is to think of feed
readers as being composed of two layers of caches. That helps clarify
why some of these problems have been hard to solve, and at the same time
points toward some existing solutions that we can reuse.</p>
<h1 id="but-first-email">But first: email</h1>
<p>I think the most interesting recent development in email standards is
<a href="https://jmap.io/">JMAP</a>. Despite the similarity of its name with <a href="https://en.wikipedia.org/wiki/Internet_Message_Access_Protocol">IMAP</a>, the widely
used “Internet Message Access Protocol”, JMAP’s <a href="https://tools.ietf.org/html/rfc8620">RFC8620</a> doesn’t
actually say much about email. Instead it describes a generic “JSON Meta
Application Protocol”:</p>
<blockquote>
<p>This document specifies a protocol for clients to efficiently query,
fetch, and modify JSON-based data objects, with support for push
notification of changes and fast resynchronisation and for out-of-
band binary data upload/download.</p>
</blockquote>
<p>A separate standard, <a href="https://tools.ietf.org/html/rfc8621">RFC8621</a>, describes how to apply JMAP to sending
and receiving email, and there are draft proposals for how to sync
calendars and contacts as well.</p>
<h1 id="two-hard-problems">Two hard problems</h1>
<blockquote>
<p>There are only two hard things in Computer Science: cache invalidation
and naming things. —<a href="https://www.karlton.org/2017/12/naming-things-hard/">Phil Karlton</a></p>
</blockquote>
<p>The reason that I think JMAP is particularly interesting is because it
sets out to solve cache invalidation problems for a useful range of
applications—namely, those where:</p>
<ul>
<li>you have a data-set on a server,</li>
<li>which you want to access from multiple clients,</li>
<li>and you want the user experience to be as if the client kept the whole
data set locally,</li>
<li>but those clients may not have enough resources to do so.</li>
</ul>
<p>JMAP defines ways for a client to cache a portion of the server’s data
set, for the server to inform the client when that cache may be out of
date, and for the client to transfer as little data as possible to get
its cache back up to date.</p>
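<p>A minimal sketch of that delta-sync pattern, using the generic
“Foo/changes” method shape from RFC8620 (the account id, state strings,
and object ids here are made up for illustration; a real client would
also issue a “Foo/get” call to fetch the content of changed objects):</p>

```python
# Illustrative sketch of JMAP's delta-sync pattern (RFC 8620 "Foo/changes").
# A client remembers the last state string it saw; to catch up, it asks the
# server only for the ids of objects that changed since that state.

def make_changes_request(account_id, since_state):
    """Build a generic Foo/changes method call (shape per RFC 8620)."""
    return {
        "using": ["urn:ietf:params:jmap:core"],
        "methodCalls": [
            ["Foo/changes",
             {"accountId": account_id, "sinceState": since_state},
             "c0"]
        ],
    }

def apply_changes_response(cache, response_args):
    """Update a local id-to-object cache from a Foo/changes response."""
    for removed_id in response_args["destroyed"]:
        cache.pop(removed_id, None)
    # Created and updated objects still need a Foo/get fetch for their
    # content; here we just report their ids as stale.
    stale = set(response_args["created"]) | set(response_args["updated"])
    return stale, response_args["newState"]
```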
<p>I’ve spent a lot of time working with the X Window System network
protocol, and learned some good techniques for dealing with high-latency
or low-bandwidth networks. From my studies of X I also concluded that
its extension mechanism was key to why protocol version 11 survived from
1987 and is still in widespread use today. In both areas, I’m happy to
see that JMAP’s designers did an excellent job: they eliminated many
round-trip delays and ensured that client and server can mutually agree
on which extensions to use.</p>
<p>The JMAP-based specification for email doesn’t have to do much more than
describe what data should be associated with an email or mailbox or
whatever other objects are appropriate. It inherits from JMAP the
ability to efficiently synchronize subsets of the server’s email store,
allowing clients to work offline and on flaky internet connections.</p>
<p>Other applications could also build on JMAP. While calendar and contacts
specifications are already underway, today I’d like to suggest another
use case.</p>
<h1 id="clientserver-architectures-for-rss-feed-readers">Client/server architectures for RSS feed readers</h1>
<p>In my previous post, “<a href="/2020/08/03/websub-plus-push/">WebSub plus Push</a>”, I complained that
developers who want to offer software that notices when a web page
changes either need to spam the page with “have you changed yet?”
requests or put up application-specific servers.</p>
<p>I’m particularly interested in this for native apps that monitor
RSS<sup id="fnref:RSS" role="doc-noteref"><a href="#fn:RSS" class="footnote">1</a></sup> feeds to inform me when I have new stuff to read, such as a
news article or a chapter of some serialized fiction or a page of a
webcomic.</p>
<p>Some sites I care about have RSS feeds which are <a href="https://forums.spacebattles.com/threads/rss-feeds-arent-getting-cached-at-cloudflare.815573/">largely
unusable</a> because their servers treat the repeated “have you
changed yet?” requests as an attack. And honestly, that’s fair…
except that those sites don’t implement WebSub and desktop/mobile feed
readers couldn’t use it even if they did, so there’s literally no other
option today.</p>
<p>Getting publishers to implement WebSub is its own challenge which I’m
going to set aside for now, because if most feed readers won’t use it
then it’s hard to argue that publishers should put in the effort.</p>
<p>Given that the only way forward is for native apps to require a
dedicated server that’s always reachable from the internet, what should
a client/server architecture look like for a feed reader?</p>
<p>Various web-based feed readers offer APIs intended for mobile and
desktop apps; <a href="https://tt-rss.org/wiki/ApiReference">Tiny Tiny RSS</a> and <a href="https://www.newsblur.com/api">NewsBlur</a>
are two examples. The <a href="https://indieweb.org/">Indieweb</a> community’s answer is <a href="https://indieweb.org/Microsub">Microsub</a>,
and the motivation they give is, I think, fairly typical of the field:</p>
<blockquote>
<p>Microsub provides a standardized way for reader apps to interact with
feeds. By splitting feed parsing and displaying posts into separate
parts, a reader app can focus on presenting posts to the user instead
of also having to parse feeds. A Microsub server manages the list of
people you’re following and collects their posts, and a Microsub app
shows the posts to the user by fetching them from the server.</p>
</blockquote>
<p>I have several issues with these protocols, but today I’m going to focus
on one: None of these protocols do a great job at cache invalidation.
Clients can’t tell which data they already have might still be valid,
and also must poll the server to learn about changes. (There’s been
discussion for Microsub about <a href="https://github.com/indieweb/microsub/issues/20">streaming</a> and
<a href="https://github.com/indieweb/microsub/issues/12">sync</a> but apparently no consensus has been reached.)</p>
<p>As you’ve probably guessed, this is where I suggest that JMAP could be a
good foundation for a client/server API for feed readers.</p>
<h1 id="jmap-for-feed-readers">JMAP for feed readers</h1>
<p>Most of the work needed to define a JMAP-based feed reader API is in
deciding what kinds of objects need to be synchronized and what
information needs to be attached to each kind of object.</p>
<p>Looking at the three APIs I mentioned above, a first cut at the list of
object types is:</p>
<ul>
<li>Category</li>
<li>Feed</li>
<li>Post</li>
</ul>
<p>NewsBlur might add an extension capability for their “intelligence
classifiers”—and I think it’s super important that JMAP
standardized a well thought out extension mechanism to allow for exactly
this kind of situation—but otherwise I think these are the primary
object types in most feed readers.</p>
<p>In existing practice, categories generally just permit setting a name
and an optional parent category, allowing them to be organized into a
tree—or, technically, a forest. The above APIs also provide
read-only information about each category, such as the number of read
and unread posts across all feeds in that category. I’d look at the way
the JMAP email specification models Mailbox objects for a good starting
point here.</p>
<p>Each feed needs to be created by setting a source URL and a category.
Feeds have a variety of additional properties that are generally
read-only because they come from the publisher. Similarly, posts
generally would only be created by the server, from feed entries
retrieved from the feed’s publisher, and the client would only be able
to set state such as “I’ve read this post”.</p>
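<p>As a concrete sketch, the object model might look something like the
following; every type and field name here is my own guess at a starting
point, not taken from any published specification:</p>

```python
# Hypothetical object types for a JMAP-based feed reader API.
# All names here are assumptions of mine, sketched for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Category:
    id: str
    name: str                        # client-writable
    parent_id: Optional[str] = None  # optional parent, so categories form a forest
    unread_count: int = 0            # server-computed, read-only to clients

@dataclass
class Feed:
    id: str
    url: str                         # set by the client at creation
    category_id: Optional[str] = None
    title: str = ""                  # read-only: comes from the publisher

@dataclass
class Post:
    id: str
    feed_id: str
    content: str                     # read-only: created by the server
    is_read: bool = False            # the main client-writable state
```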
<p>But there’s no reason that feeds and posts must be read-only! The exact
same data model could work for posting to your own personal blog or
social media accounts. Both NewsBlur and Tiny Tiny RSS have dedicated
API endpoints for this purpose, and in the Indieweb ecosystem there’s a
complementary specification called <a href="https://www.w3.org/TR/micropub/">Micropub</a>. But JMAP reminds us
that there’s no operational difference between reading a post written by
somebody else versus reading one you wrote on another device.</p>
<p>Microsub delegates the representation of feeds and posts to <a href="https://www.w3.org/TR/jf2/">JF2</a>,
which is probably a good starting place for the similarly JSON-based
JMAP. However, I’d pay careful attention to JF2’s extension mechanisms.
If the JMAP server is going to be transforming RSS and Atom and
Microformats-based feeds into a common representation, it really needs
to pass information it doesn’t understand to the client, so that client
development isn’t blocked waiting for servers to catch up with new
extensions.</p>
<p>The last question a JMAP-based specification needs to answer is: What
search queries do clients need to be able to ask the server to perform
on their behalf? I’d start by looking through the searches offered by
existing APIs and comparing them to the <code class="language-plaintext highlighter-rouge">Mailbox/query</code> and
<code class="language-plaintext highlighter-rouge">Email/query</code> sections of the JMAP Mail specification.</p>
<h1 id="cache-coherency-between-server-and-publisher">Cache coherency between server and publisher</h1>
<p>So far I’ve been talking about how a feed reader application interacts
with an internet server that acts as an agent for the reader, but there
are some related topics to consider in how that server agent interacts
with feed publishers.</p>
<p>People working on RSS-related tools usually assume that only the most
recent posts are important. This is generally true for content such as
news articles or social media posts. But it’s false often enough that
many feed readers will save old posts that they’ve seen, even after
those posts disappear from the publisher’s feed.<sup id="fnref:long-form-content" role="doc-noteref"><a href="#fn:long-form-content" class="footnote">2</a></sup></p>
<p>That doesn’t work very well.</p>
<ul>
<li>
<p>It relies on your feed reader having started to watch the feed before
the oldest post you might ever be interested in disappears.</p>
</li>
<li>
<p>People who follow RSS feeds have just gotten used to sometimes seeing
stale contents, or duplicates of posts they already read.</p>
</li>
<li>
<p>On top of all that, I’ve been told by developers of two different
cloud-based feed readers that this practice of saving the complete
contents of old posts contributes significantly to their storage and
hosting costs.</p>
</li>
</ul>
<p>Fortunately, reframing these as caching problems points us in the right
direction to solve them. We just need reliable mechanisms for:</p>
<ul>
<li>cache invalidation (because the publisher says it changed),</li>
<li>cache eviction (because the client doesn’t have resources to keep it),</li>
<li>and cache reloading (because either invalidation or eviction made a
request miss the cache).</li>
</ul>
<p>All three needs are addressed by <a href="https://tools.ietf.org/html/rfc5005">RFC5005</a>, “Feed Paging and
Archiving”, optionally supplemented by <a href="https://tools.ietf.org/html/rfc6721">RFC6721</a>, “The Atom
<code class="language-plaintext highlighter-rouge">deleted-entry</code> Element”. These specifications are from 2007 and 2012
respectively, and I hate how little-known they are.</p>
<p>Section 2 of RFC5005 covers “Complete Feeds”, where the publisher
asserts that every post in the history of the feed is, in fact, present
in the current feed document. In other words, the publisher is directing
any feed reader which encounters such a feed to discard its cached copy
of any post that was previously included but isn’t any more.</p>
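<p>A complete feed marks itself with a <code class="language-plaintext highlighter-rouge">complete</code>
element from the RFC5005 namespace, so detecting one takes only a few
lines (the feed document below is a minimal made-up example):</p>

```python
import xml.etree.ElementTree as ET

# Namespace defined by RFC 5005 for feed history elements.
FH_NS = "http://purl.org/syndication/history/1.0"

def is_complete_feed(atom_xml):
    """True if the feed declares itself complete (RFC 5005 section 2)."""
    root = ET.fromstring(atom_xml)
    return root.find("{%s}complete" % FH_NS) is not None

# A made-up minimal complete feed:
complete_feed = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:fh="http://purl.org/syndication/history/1.0">
  <title>Example feed</title>
  <fh:complete/>
</feed>"""
```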
<p>However, feeds aren’t usually very interesting unless they have enough
posts that transferring all of them every time anybody checks for
updates would quickly eat up everyone’s bandwidth. So section 4 defines
“Archived Feeds”<sup id="fnref:skip-paged-feeds" role="doc-noteref"><a href="#fn:skip-paged-feeds" class="footnote">3</a></sup>, which allow splitting the feed up
into multiple linked documents. It’s been carefully designed to transfer
reasonably minimal information on changes. One of these days I’ll write
up why every part of that design is necessary, because the RFC itself is
unfortunately terse about rationale.</p>
<p>When a post is added or edited, archived feeds work most efficiently if
the publisher appends that post to the end of the feed, no matter how
long ago its publication date is. Doing the same for deleting a post
requires having something you can append that says “this was deleted”,
rather than just silently making it disappear.</p>
<p>That’s where RFC6721 comes in with its <code class="language-plaintext highlighter-rouge">deleted-entry</code> element. Systems
which don’t already keep track of deleted posts may find it easier to
invalidate all archives going back to the point where a deleted post was
first published, but that does increase the amount of data that has to
be transferred in the (generally rare) situation where a post is
deleted. So implementing <code class="language-plaintext highlighter-rouge">deleted-entry</code> is a good optimization for
larger publishers.</p>
<p>With either complete or archived feeds, a feed reader can freely evict
any information about a feed from its cache, reload it later from the
original feed, and reliably detect when its cache is out of date.</p>
<h1 id="conclusions">Conclusions</h1>
<p>RSS feed reader software could be more robust, less expensive to
operate, and faster at responding to user input by thinking of it simply
as two layers of caches: one cache that’s always publicly reachable on
the internet, and any number of second-tier caches running on the
end-user’s devices. WebSub together with RFC5005, the Feed Paging and
Archiving specification, gives us a foundation for the first layer; JMAP
is a strong contender for the second layer.</p>
<p>My plea is for developers who are working in the RSS feed ecosystem to
implement these standards, and for the rest of you to advocate for these
standards with the developers of your favorite tools.</p>
<p>I’ve written several pieces of software related to RFC5005 which may
help, including:</p>
<ul>
<li>a <a href="https://github.com/jameysharp/wp-fullhistory/">WordPress plugin</a> (see also my <a href="https://core.trac.wordpress.org/ticket/49321">WordPress Core issue</a>)</li>
<li>a <a href="https://github.com/jekyll/jekyll-feed/pull/236">jekyll-feed patch</a></li>
<li>a very rough <a href="http://reader.minilop.net/">prototype feed reader</a></li>
<li><a href="https://github.com/jameysharp/predictable/">predictable</a>, a feed generator for sites with consistent update
schedules</li>
</ul>
<p>This blog is itself using my patched version of jekyll-feed, so you can
also see an example of an archived feed at
<a href="https://jamey.thesharps.us/feed.xml">https://jamey.thesharps.us/feed.xml</a>.</p>
<p>If you’ve read this far, you might also be interested in listening to
these Hacker Public Radio episodes where clacke interviewed me and my
fellow RFC5005 advocate, <a href="https://beesbuzz.biz/">fluffy</a>: <a href="http://hackerpublicradio.org/eps.php?id=3082">part 1</a> and <a href="http://hackerpublicradio.org/eps.php?id=3102">part 2</a>.</p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:RSS" role="doc-endnote">
<p><a href="https://en.wikipedia.org/wiki/RSS">RSS</a> is the most widely-recognized of the mechanisms for web
syndication feeds, so in this article I often use that term
generically. Personally, I prefer <a href="https://en.wikipedia.org/wiki/Atom_(Web_standard)">Atom</a>, and other formats like
<a href="https://www.w3.org/TR/jf2/">JF2</a> or the Microformats-based <a href="http://microformats.org/wiki/h-feed">h-feed</a> are valid alternatives
as well. <a href="#fnref:RSS" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:long-form-content" role="doc-endnote">
<p>Of course, there’s also a lot of content out there where it’s
necessary to see old posts in order to understand newer ones. Many
podcasts and webcomics fall in this category, for example. I have a
lot to say about that but it’s a topic for a later post. <a href="#fnref:long-form-content" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:skip-paged-feeds" role="doc-endnote">
<p>I’m skipping over RFC5005’s section 3, “Paged Feeds”, because it
explicitly is not intended for the kind of cache coherency I’m
talking about in this post, and can’t be used for that purpose. The
specification says, “[C]lients SHOULD NOT present paged feeds as
coherent or complete, or make assumptions to that effect,” and,
“Unlike paged feeds, archived feeds enable clients to do this
without losing entries.” <a href="#fnref:skip-paged-feeds" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jamey SharpRSS feed readers don’t work as well as they should. Today I’d like to talk about issues from an application and server developer viewpoint; I’ll save some thoughts about usability in various use cases for another time.WebSub plus Push2020-08-03T16:20:46-07:002020-08-03T16:20:46-07:00https://jamey.thesharps.us/2020/08/03/websub-plus-push<p>There are a lot of situations where it’d be nice to know when the
contents of a web page change, and the modern way to do that is the
<a href="https://www.w3.org/TR/websub/">WebSub</a> standard, formerly known as <a href="https://github.com/pubsubhubbub/PubSubHubbub">PubSubHubbub</a>.</p>
<p>It’s an elegantly simple system: A page advertises its support by
providing a link to a “hub” that handles sending out change
notifications. A subscriber sends to that hub a URL where notifications
should be delivered.</p>
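<p>The subscriber’s half of that exchange is just a form-encoded POST to
the hub. A sketch of building the request body, using the parameter
names from the W3C WebSub recommendation (the URLs in the usage below
are placeholders):</p>

```python
from urllib.parse import urlencode

def websub_subscribe_body(topic, callback, secret=None, lease_seconds=None):
    """Form-encoded body for a WebSub subscription POST to the hub.

    Parameter names follow the W3C WebSub recommendation; hub.secret and
    hub.lease_seconds are optional.
    """
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
    }
    if secret is not None:
        params["hub.secret"] = secret
    if lease_seconds is not None:
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)
```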
<p>However, this requires the subscriber to have a web server that’s always
reachable from the public internet, which means that desktop and mobile
applications can’t take advantage of it without a dedicated server. Even
for some web applications this might be the only aspect which requires a
server; otherwise they could use much cheaper static-file hosting, or be
delivered via something decentralized such as <a href="https://ipfs.io/">IPFS</a>.</p>
<p>And that’s a shame, because there are only a few options available to
developers and they’re all undesirable:</p>
<ul>
<li>
<p>Ignore WebSub and just poll for changes, spamming web sites with “have
you changed yet?” requests.</p>
</li>
<li>
<p>Require users to have the technical skills and resources to set up
their own web server.</p>
</li>
<li>
<p>Build centralized infrastructure, which requires a different set of
skills than app development, and gets expensive as it becomes popular.</p>
</li>
</ul>
<h1 id="push-notifications">Push notifications</h1>
<p>Standards bodies have been working for years on specifications for how
to deliver notifications to applications in a timely fashion without
requiring the application to be associated with a dedicated server.</p>
<p>The web <a href="https://developer.mozilla.org/en-US/docs/Web/API/Push_API">Push API</a> makes this a problem for the browser vendors to
solve, instead of pushing (hah) it onto application developers.
JavaScript can ask the web browser for a push URL, then forward that URL
off to an application server which can post to it any time there’s
something new. The browser arranges that those messages will get back to
the JavaScript service worker… somehow. (The Push API doesn’t
need to specify how the browser communicates with the push server or
even how it finds one to use; browser vendors just provide their own
infrastructure and co-evolve it with their browser products.)</p>
<p>The Push API is built on <a href="https://tools.ietf.org/html/rfc8030">RFC8030</a> which specifies how application
servers should send messages to push servers. While the Push API is
specific to browsers, RFC8030 should in principle be usable for native
mobile and desktop apps as well. In practice, as far as I can tell,
nobody is implementing it for those settings: mobile OSes only provide
proprietary APIs for push notifications, and every desktop app developer
has to just roll their own thing. So for the rest of this article, I’m
going to focus on web apps.</p>
<p>Naturally, everything is harder than I’d like: the push URL expects to
receive messages that are a little different than what a WebSub hub
sends. The details are specified in RFC8030, but in particular:</p>
<ul>
<li>
<p>WebSub sends the entire contents of the changed page, while a push
server may reject messages larger than 4kB.</p>
</li>
<li>
<p>Current push servers require the sender to sign the message with a
public key provided at registration (see <a href="https://tools.ietf.org/html/rfc8292">RFC8292</a>, “VAPID”) but
there’s no provision for doing so in WebSub.</p>
</li>
<li>
<p>WebSub verifies a subscription request by immediately sending a
challenge to the subscriber’s URL, which an RFC8030 server won’t
respond to in the way the WebSub spec requires.</p>
</li>
<li>
<p>Because the Push API is designed for devices which may be offline,
RFC8030 requires specifying a Time-To-Live (TTL) for how long the push
server should keep trying to deliver the message. WebSub has no means
to convey that or any of the other options that RFC8030 allows.</p>
</li>
</ul>
<p>For all those reasons, using WebSub with the Push API still requires a
dedicated server to match each specification’s expectations to the
other. So that’s disappointing.</p>
<h1 id="stateless-impedance-matching">Stateless impedance matching</h1>
<p>If we have to have a separate server, can we at least make it really
cheap to operate? Here are the limits I wanted on this hypothetical
service in order to keep costs down, allow scaling up to any amount of
demand, and enable deployment at any cloud or shared hosting provider:</p>
<ul>
<li>No persistent state: no database, no writable filesystems.</li>
<li>No background or scheduled processing: all work happens while
responding to an HTTP request.</li>
</ul>
<p>I think this is feasible, but I haven’t actually tried it, for reasons
I’ll get into later. First I’ll just outline how I think a lightweight
WebSub proxy would work.</p>
<p>The most complex part is the process for initially setting up
subscriptions:</p>
<ol>
<li>
<p>An application asks the WebSub proxy for its VAPID application server
key, as defined in RFC8292.</p>
</li>
<li>
<p>The application passes the application server key to the Push API to
get a new push URL.</p>
</li>
<li>
<p>The application posts the push URL in an HTTP request to the WebSub
proxy, which doesn’t reply immediately.</p>
</li>
<li>
<p>The proxy signs or HMACs the push URL with its own key, and
constructs a subscription URL containing the signed push URL.</p>
</li>
<li>
<p>The proxy sends the new subscription URL to the application via the
push URL. (This confirms that the application can receive messages
sent to that push URL.)</p>
</li>
<li>
<p>The proxy returns an error if the push URL returned an error;
otherwise it returns an empty response with a “202 Accepted” status
code.</p>
</li>
<li>
<p>Once the application receives the push notification containing its
new subscription URL, it uses that URL in subscription requests to
WebSub hubs.</p>
</li>
<li>
<p>When a hub sends a verification request to the proxy, the proxy
checks that the wrapped push URL was signed by its own key.</p>
</li>
</ol>
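<p>Steps 4 and 8 are what let the proxy stay stateless: instead of
remembering which push URLs it has seen, it can verify an HMAC it minted
earlier. A sketch, with a placeholder key and proxy address:</p>

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder values; a real deployment would load the key from configuration.
PROXY_KEY = b"proxy-secret-key"
PROXY_BASE = "https://websub-proxy.example/notify"

def _sign(push_url):
    mac = hmac.new(PROXY_KEY, push_url.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(mac).decode().rstrip("=")

def make_subscription_url(push_url):
    """Step 4: wrap the push URL and its HMAC into a subscription URL."""
    return PROXY_BASE + "?" + urlencode({"push": push_url, "sig": _sign(push_url)})

def verify_subscription_url(url):
    """Step 8: return the wrapped push URL if the HMAC checks out, else None."""
    query = parse_qs(urlsplit(url).query)
    push, sig = query["push"][0], query["sig"][0]
    if hmac.compare_digest(sig, _sign(push)):
        return push
    return None
```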
<p>Later, when the hub sends a change notification to the proxy, it again
checks that the wrapped push URL was signed by its own key, and then
passes the notification on to the push URL, signed with its application
server key.</p>
<p>During initial registration, the application could also provide values
for the TTL, Topic, and Urgency headers as defined in RFC8030. The proxy
would encode those into its subscription URL alongside the push URL and
use them when forwarding change notifications.</p>
<p>I think this satisfies all requirements of both specifications while
being just as scalable to tiny single-person installations as it is to
multi-data-center worldwide deployments.</p>
<h1 id="same-origin-strikes-again">Same-origin strikes again</h1>
<p>There’s just one problem, which reveals itself in several different ways.</p>
<p>URLs that are of interest for use with WebSub are, almost by definition,
not at the same origin site as any JavaScript application that would try
to subscribe to them. If they were part of the same site, the
application wouldn’t need WebSub to find out about changes.</p>
<p>JavaScript can’t access cross-origin URLs without the target site opting
in using <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS">CORS</a>, and I doubt any WebSub deployments opt in. That means
the client web app can’t access pages to check whether they support
WebSub, and it can’t reliably send subscription requests either.</p>
<p>We could work around those issues by making the WebSub proxy also be a
CORS proxy, forwarding requests to arbitrary web sites. But that opens
up a variety of security and abuse concerns, and also could make the
proxy significantly more expensive to run.</p>
<p>Alternatively, there are existing CORS proxies out there, though as far
as I can tell none of them support the POST requests necessary for
establishing a subscription at a hub. I’m guessing that’s because of the
same potential for abuse.</p>
<p>On top of that, once you have a subscription going, WebSub delivers the
current contents of the page to you every time it changes… but we
can’t count on being able to deliver that through a push server, which
may have a length limit as low as 4kB. So the application needs to
request a page that’s already been delivered to the proxy, which
suggests that it would be more efficient for the proxy to cache the copy
it received from the hub, and return that cached copy to the application
on demand. But that means having state that persists across multiple
requests, which again makes the proxy more expensive to operate.</p>
<p>And if the proxy is going to be a general CORS proxy or at least cache
WebSub-delivered content, then it might as well be fully stateful and
coalesce multiple subscription requests for the same page into a single
subscription at the hub.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In short:</p>
<ul>
<li>solving the small question of connecting WebSub with Push is
straightforward,</li>
<li>but the same-origin policy effectively negates all the advantages,</li>
<li>so we’re probably better off spending our time building
application-specific servers.</li>
</ul>
<p>Or in other words, the web is a terrible place and makes me cry.</p>
<h1 id="coda">Coda</h1>
<p>While I was digging into this, I did find one other direction the world
could have gone:</p>
<p>In parallel with the early development of PubSubHubbub, <a href="https://tools.ietf.org/html/rfc5989">RFC5989</a>
described a way to use <a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">SIP</a> for notification of web page changes.
Since mobile phones already use SIP for voice call setup, I could
imagine adopting it for mobile push notifications as well. If that was
exposed to applications, standards like RFC5989 would have provided
direct interoperability between mobile apps and the kinds of web sites
that use WebSub today.</p>
<p>But SIP is quite different from HTTP, while WebSub and Push both use
HTTP in mostly straightforward ways. So I suspect RFC5989 faces pretty
significant barriers to adoption, comparatively speaking. At minimum, I
can’t find any evidence of anyone having ever implemented
RFC5989…</p>Jamey SharpThere are a lot of situations where it’d be nice to know when the contents of a web page change, and the modern way to do that is the WebSub standard, formerly known as PubSubHubbub.Authenticating shared web caches2020-06-13T18:33:00-07:002020-06-13T18:33:00-07:00https://jamey.thesharps.us/2020/06/13/authenticating-shared-web-caches<p>I was doing a mental exercise and it went someplace that I think is not
terribly useful but still interesting enough to share.</p>
<p>Say you want to give somebody a copy of a web page that you’ve
previously retrieved, and you want to convince them that what you’re
giving them really did come from the original web server at some point.</p>
<p>This could be useful, for example, with the Internet Archive. If I want
to contribute my own crawl results, how can the Archive verify that I
didn’t forge the pages I’m submitting?</p>
<p>Note that the recipient can’t just fetch the same page themself and
compare it to what you gave them. It may have changed since the time
when you retrieved it, or they may not have access to that particular
document.</p>
<p>What I really want is for the origin server to have digitally signed the
document. There’s an Internet Draft for <a href="https://datatracker.ietf.org/doc/draft-yasskin-http-origin-signed-responses/">Signed HTTP Exchanges</a> which
does exactly that, but for the purposes of this exercise I want to be
able to do it for any web server as it is deployed today, rather than
requiring support for a new protocol.</p>
<p>So how else can I get an existing web server to sign its response for
me?</p>
<h1 id="abusing-tls">Abusing TLS</h1>
<p>If I’m using HTTPS then the server is already sending me a digital
signature that’s publicly verifiable against their TLS certificate, in
the <code class="language-plaintext highlighter-rouge">CertificateVerify</code> message during the TLS handshake. Can I take
advantage of that?</p>
<p>Well, not exactly. That signature only covers the preceding parts of the
handshake. TLS uses an assortment of new symmetric keys for everything
else, all derived from a random key generated through a Diffie-Hellman
ephemeral key exchange.</p>
<p>The TLS handshake does guarantee that as long as neither party has
leaked any of the ephemeral keys, then the client can trust that the
server it’s talking to also possesses the private key corresponding to
the server’s public certificate.</p>
<p>Because the cryptography in use is based on symmetric keys, by
definition both the client and server share all the same keys after the
handshake. That means that either end could forge messages pretending to
be from the other. But that’s okay, because if I receive a message from
you that claims to be from me, I’m not fooled. The worst either end
could do is willfully fool themselves, and there are easier ways to do
that.</p>
<p>Those guarantees go out the window as soon as I try to convince a third
party of the authenticity of these messages, though.</p>
<p>The only way I can think of to do that is to reveal the ephemeral keys,
together with the full encrypted stream. At that point, the recipient
can verify that the entire stream was produced by somebody who had those
keys, and verify that the server’s offered certificate was valid at the
time of the conversation. But they can’t verify that I didn’t forge part
or all of the application data. In fact, they’re now in a position to do
their own forgeries!</p>
<h1 id="trusted-timestamping-proxy">Trusted timestamping proxy</h1>
<p>I think there’s a simple fix for that problem, although it makes the
solution much less useful than I hoped for.</p>
<p>Let’s introduce a simple proxy server. I’ll connect to the origin server
through this proxy. The proxy will log the bytes that are sent back and
forth through it, but it won’t understand them: it’ll only see the
encrypted TLS traffic. At the end of the connection, it will give me a
digital signature over a timestamp plus the complete connection log. I
can check that the signature from the proxy matches what I saw from my
end of the conversation, and discard the log if not.</p>
<p>(Introducing this proxy doesn’t change the security story for TLS: it
already had to be resilient against passive or active attackers sitting
between the client and the server.)</p>
<p>Now I can give you the signed connection log, plus the ephemeral keys
used during the connection.</p>
<p>If you trust that the proxy only produces signatures over traffic that
actually passed through it, then you have enough information to verify
that I’ve given you an authentic conversation with the origin server.</p>
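<p>To make the shape of that attestation concrete, here’s a minimal
sketch in Python. Everything here is hypothetical: the field names are
made up, and HMAC with a shared key stands in for the public-key
signature (e.g. Ed25519) a real proxy would use, purely so the example
is self-contained.</p>

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real proxy would hold an asymmetric
# private key and publish the corresponding public key.
PROXY_KEY = b"proxy-secret-for-illustration"

def attest(connection_log: bytes, timestamp: int) -> dict:
    # The proxy commits to a timestamp plus a digest of the complete
    # encrypted connection log, then signs that commitment.
    blob = json.dumps({
        "timestamp": timestamp,
        "log_sha256": hashlib.sha256(connection_log).hexdigest(),
    }, sort_keys=True).encode()
    sig = hmac.new(PROXY_KEY, blob, hashlib.sha256).hexdigest()
    return {"attestation": blob.decode(), "sig": sig}

a = attest(b"...full encrypted TLS stream...", 1588000000)

# A verifier recomputes the MAC over the attestation bytes; with a
# real signature scheme this step would need only the public key.
ok = hmac.compare_digest(
    a["sig"],
    hmac.new(PROXY_KEY, a["attestation"].encode(), hashlib.sha256).hexdigest())
print(ok)  # → True
```

<p>The recipient would additionally replay the logged TLS records with
the revealed ephemeral keys to recover the plaintext, which the sketch
above doesn’t attempt.</p>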
<h1 id="unless">unless…</h1>
<p>Well, almost. If I can convince the proxy to connect to a server under
my control, that server can complete the handshake with the origin
server, then share the ephemeral keys with me via another channel, and
we can collaborate to conduct a completely forged conversation through
the proxy, getting its signature on traffic that the origin server never
saw.</p>
<p>Part of the process of verifying a TLS certificate involves checking
that the DNS name in the certificate matches the DNS name the client was
trying to connect to. We can’t make the proxy entirely trustworthy, but
we can make it less untrustworthy if it does its best to ensure that the
server it connects to corresponds to the requested DNS name.</p>
<p>At the least this means any DNSSEC signatures should be included in the
proxy’s traffic log, so a recipient can be convinced that the DNS
response came from the same origin.</p>
<p>I don’t know of any way to secure the IP connection, though.</p>
<h1 id="wilder-options">Wilder options</h1>
<p>Personally I feel like I’d be pretty willing to accept evidence from a
trusted timestamping proxy like this, assuming the proxy were operated
by a reasonably neutral party.</p>
<p>If that isn’t enough for you, I suspect there’s a way to use secure
multi-party computation to have the proxy and the client both
participate in the conversation, in such a way that neither is capable
of leaking the ephemeral keys during the conversation.</p>
<p>But this is already more complicated than I hoped for, and doesn’t offer
much advantage over just having people fetch pages from origin servers
themselves. So I’m calling it a day.</p>
<p>Reading the TLS 1.3 specification carefully enough to work all this out
was sure an interesting exercise though!</p>Jamey SharpI was doing a mental exercise and it went someplace that I think is not terribly useful but still interesting enough to share.The most general reverse proxy2020-04-24T15:15:00-07:002020-04-24T15:15:00-07:00https://jamey.thesharps.us/2020/04/24/most-general-reverse-proxy<p>Deploying web applications is hard to do, and there are a bewildering
variety of ways to do it. Today I’d like to confuse the matter further
by proposing yet another approach. As a preview, here are the components
I’ll discuss:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/TLS_termination_proxy">TLS termination</a> and when to not do it</li>
<li>TLS <a href="https://en.wikipedia.org/wiki/Server_Name_Indication">Server Name Indication</a> to route entire connections</li>
<li><a href="https://www.haproxy.org/download/2.1/doc/proxy-protocol.txt">PROXY protocol</a> to forward clients’ addresses without parsing the
request stream</li>
<li>Unix sockets to communicate with backend applications</li>
<li>Unix permissions to isolate applications</li>
</ul>
<h1 id="background">Background</h1>
<p>There is, of course, no one best way to make a web site available to the
world, because people’s requirements vary a lot. For example, maybe your
site is like this blog, which I write in Markdown and compile to static
HTML using Jekyll, and any static hosting service will do. Things get
more complicated if you want to do any server-side computation, or even
just control your response status codes and headers.</p>
<p>Personally, I want to be able to put a new experiment online quickly at
any time, and I want to be able to use any feature of the evolving web
(such as WebSockets or <a href="https://en.wikipedia.org/wiki/HTTP/2_Server_Push">HTTP/2 Server Push</a>), with any kind of
server-side framework, database, or other tooling. To that end, I
currently have a dedicated server hosted by <a href="https://www.hetzner.com/">Hetzner</a>, so that I can
try anything that I’m capable of making work on bare metal.</p>
<p>“Platform as a Service” (PaaS) offerings like <a href="https://www.heroku.com/">Heroku</a> are popular for
quick experiments that might turn into big projects. They impose
constraints on your application that make it easier to scale, at least
for them, and maybe also for you. For example, your application can’t
store any persistent data to its local disk; it has to use an external
service like a database or storage provider. This allows applications to
crash without losing data, and to migrate across physical machines
easily, and to be scaled up by running duplicate copies of identical
code on as many physical computers as necessary.</p>
<p>But I’m not concerned about making my web sites scale; I desperately
hope I <a href="https://medium.com/@jkriss/anti-capitalist-human-scale-software-and-why-it-matters-5936a372b9d">never create the next Google or Facebook</a>. A single
computer these days can serve an awful lot of people so long as you
don’t write code that’s <a href="https://accidentallyquadratic.tumblr.com/">accidentally quadratic</a> or otherwise
unnecessarily resource-intensive.</p>
<p>I don’t care about scaling and I don’t want constraints, so Heroku <em>et
al</em> offer exactly the wrong trade-off for me. But there is one feature I
love from providers like Heroku: As a developer, your deployment process
is little more than recording what software you need installed and then
running <code class="language-plaintext highlighter-rouge">git push</code>.</p>
<p>In order to get that simple workflow on my own server, I’ve been using
<a href="http://dokku.viewdocs.io/dokku/">Dokku</a> for a couple of years. Built on top of the <a href="https://github.com/gliderlabs/herokuish">Herokuish</a>
compatibility layer, Dokku provides a fully open-source single-server
PaaS that can generally run the same application images that Heroku can.
Dokku doesn’t have the breadth of one-click third-party integrations
available that Heroku does, but on the other hand, it allows you to use
arbitrary Docker settings such as persistent volume mounts which don’t
fit Heroku’s scaling model. Anyway, I never wanted to pay for any of those
third-party integrations, so I don’t miss them.</p>
<p>Unfortunately, I’ve found that Dokku still imposes constraints I don’t
like. I believe fixing all of them would push it too far from being
compatible with Heroku, and therefore it wouldn’t be the same project
any more. So I’ve been working out what I want from first principles.</p>
<h1 id="my-requirements">My requirements</h1>
<p>I want to share a single server and a single IP address among multiple
applications, possibly managed by different people, who may all distrust
each other to some degree. For my personal server these would be friends
who I just want to protect from accidental interference, but ideally it
wouldn’t be a big leap from there to real security.</p>
<p>Authorized people should be able to add new applications with minimal or
no involvement from the system administrator, depending on how much
trust the administrator wants to extend; but they had better not be able
to interfere with other people’s applications.</p>
<p>I want to use only standard Unix permissions to control how people with
access to the server can interact with their own and others’
applications.</p>
<p>Sharing an IP address requires multiplexing the HTTP and HTTPS ports
between the different applications, but the reverse proxy which does
that should be as minimal as possible. Minimalism is good for security,
but that isn’t the key reason. If the proxy processes the protocol
stream in any way, that can impose several different kinds of
constraints on backend applications:</p>
<ul>
<li>
<p>Any interpretation it does means that it can only forward traffic
which it understands, keeping applications from taking advantage of
the full features of HTTP/2 or other specifications. For example, as
far as I can tell it is currently impossible to implement <a href="https://tools.ietf.org/html/rfc8030">RFC8030</a>
(Generic Event Delivery Using HTTP Push, part of the <a href="https://www.w3.org/TR/push-api/">Push API</a>) if
<a href="https://nginx.org/">nginx</a> is your reverse proxy, because that server only supports
sending a push promise that it immediately resolves.</p>
</li>
<li>
<p>Imposing arbitrary limits, such as limiting uploaded file size or
providing a “Web Application Firewall”, means an application might
need those limits configured. There is no one-size-fits-all
configuration. That means untrusted people need to be able to change
the configuration of a shared component, which makes it way more
difficult to show that people can’t interfere with the configuration
of each other’s applications.</p>
</li>
</ul>
<p>Each application should be assigned one or more hostnames, which are not
shared with any other application. Sharing at the level of subpaths or
other properties of a request gets substantially more complicated. More
importantly, hostnames are special in HTTPS because they’re provided
early via TLS <a href="https://en.wikipedia.org/wiki/Server_Name_Indication">Server Name Indication</a>, so the system can make
routing decisions without inspecting the HTTP protocol stream at all.</p>
<p>I thought I wanted the proxy to provide <a href="https://en.wikipedia.org/wiki/TLS_termination_proxy">TLS termination</a>, so the
administrator can provide wildcard certificates and DNS records for some
domain. That would allow people to host applications at any subdomain of
that domain without having to configure TLS or DNS themselves. However,
the HTTP/2 specification says that browsers are allowed to send requests
for any domain listed in the server’s certificate over the same
connection. Per <a href="https://httpwg.org/specs/rfc7540.html#rfc.section.9.1.1">section 9.1.1 of RFC7540</a>:</p>
<blockquote>
<p>For example, a certificate with a <code class="language-plaintext highlighter-rouge">subjectAltName</code> of <code class="language-plaintext highlighter-rouge">*.example.com</code>
might permit the use of the same connection for requests to URIs
starting with <code class="language-plaintext highlighter-rouge">https://a.example.com/</code> and <code class="language-plaintext highlighter-rouge">https://b.example.com/</code>.</p>
</blockquote>
<p>That means routing based on SNI is only safe if each certificate is
associated with exactly one backend. So instead I require the backend
application to be responsible for provisioning its own TLS certificates,
probably using ACME/<a href="https://letsencrypt.org/">Let’s Encrypt</a>.</p>
<p>I’d like to have the option of using Docker, KVM, etc, but I also want
the option to <em>not</em> use any container or virtualization technology for a
given application. I want to run an application and all its
dependencies, such as databases, caches, and task queues, entirely under
my own privileges. I want to be able to use any software I can figure
out how to build, and it shouldn’t be necessary to run any of it with
elevated privileges.</p>
<p>It’s nice to support zero-downtime rolling upgrades even on a personal
server used primarily for experiments. I want to start up a new version
of an application and test that it’s working, before atomically
switching outside traffic over to it and gracefully shutting down the
old version. But I want to do that without reloading or reconfiguring
the reverse proxy, because again, otherwise it’s hard to show that
people can’t interfere with each other’s applications.</p>
<h1 id="solution">Solution</h1>
<p>I haven’t found any existing software that I’m fully satisfied with, so
I wrote my own, with a focus on minimizing the amount of code that has
to be trusted. I used <a href="https://www.rust-lang.org/">Rust</a>, for its combination of memory safety,
performance, and event-driven async/await syntax, to write two reverse
proxies:</p>
<ul>
<li><a href="https://github.com/jameysharp/sniproxy-rs">sniproxy</a> for routing TLS connections according to SNI</li>
<li><a href="https://github.com/jameysharp/httpdir">httpdir</a> for routing HTTP/1.1 requests according to Host header</li>
</ul>
<p>They follow very similar designs. Neither program takes any command-line
options, or even a configuration file in a traditional sense.</p>
<p>First off, you have to pass a listening TCP socket as stdin (file
descriptor 0). There are many ways to do that, whether using classic
tools like <code class="language-plaintext highlighter-rouge">inetd</code> or new alternatives like <code class="language-plaintext highlighter-rouge">systemd</code>. If you need to
listen on multiple sockets, run a separate copy of the proxy for each
socket.</p>
<p>This has several advantages:</p>
<ul>
<li>
<p>It keeps the implementation simpler and easier to review because there
are no configuration options to parse.</p>
</li>
<li>
<p>It means that the proxy itself never needs to have elevated
privileges, which are otherwise required to bind to port numbers below
1024.</p>
</li>
<li>
<p>A service manager can hang on to the socket and later pass it to a new
version of the proxy, while allowing the old version to finish any
connections it’s already handling and gracefully exit.</p>
</li>
<li>
<p>If the single-threaded proxy turns out to be CPU-bound, the
administrator can run multiple copies of it listening to the same
socket, and pin each one to a separate CPU core.</p>
</li>
</ul>
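<p>The fd-0 handoff can be sketched in a few lines. Python stands in
for both sides here: the first half simulates what inetd or a service
manager does before exec’ing the proxy, and the second half shows the
proxy adopting the already-bound, already-listening socket without
parsing any configuration.</p>

```python
import os
import socket

# Service-manager side: create and bind the listening socket, then
# arrange for the child process to inherit it as fd 0 (stdin).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(16)
os.dup2(srv.fileno(), 0)

# Proxy side: adopt the inherited socket from fd 0. No bind(), no
# listen(), and therefore no need for privileges to use a low port.
listener = socket.socket(fileno=os.dup(0))
print(listener.getsockname()[0])  # → 127.0.0.1
```

<p>Because the proxy never binds the socket itself, the manager can
hand the very same file descriptor to a replacement process later,
which is what makes the graceful-upgrade story work.</p>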
<p>Second, create a directory to hold your backend configuration,
containing a subdirectory for each hostname you want to serve. I’ll call
this directory <code class="language-plaintext highlighter-rouge">hosts/</code> here. When you start either of these proxies,
this top-level configuration directory must be its current working
directory.</p>
<p>Unix permissions on the <code class="language-plaintext highlighter-rouge">hosts/</code> directory determine who can create,
delete, or rename a host. You could make it writable only by root, but
if you want to be more permissive you can set the “sticky bit” (e.g.
mode 1777) so that, like <code class="language-plaintext highlighter-rouge">/tmp</code>, anyone can create a directory but can
only delete or rename directories they own.</p>
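<p>Setting up that permissive variant is a one-time administrative
step; a small sketch (using a temporary directory so it’s
self-contained):</p>

```python
import os
import stat
import tempfile

# Create a hosts/ directory with /tmp-like semantics: mode 1777 means
# anyone can create a host entry, but the sticky bit ensures only an
# entry's owner (or root) can delete or rename it.
root = tempfile.mkdtemp()
hosts = os.path.join(root, "hosts")
os.mkdir(hosts)
os.chmod(hosts, 0o1777)  # chmod is not masked by the umask

mode = stat.S_IMODE(os.stat(hosts).st_mode)
print(oct(mode))  # → 0o1777
```
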
<p>Hostnames must be represented without a trailing dot (.) and in
lowercase ASCII. So for international domain names, use the
“<a href="https://tools.ietf.org/html/rfc5890#section-2.3.2.1">A-label</a>” form: if you’re hosting a web site at
“<a href="https://🕸💍.ws">🕸💍.ws</a>”, then you’d put the backend configuration for that
site in <code class="language-plaintext highlighter-rouge">hosts/xn--sr8hvo.ws/</code>.</p>
<p>Since you can’t create multiple directory entries with the same name,
the filesystem enforces that all configured hostnames are unique. It
does not enforce that all configured hostnames are legal or valid, but
the proxies will reject an illegal hostname without checking for its
presence in the filesystem, so any misconfigured backends will just be
silently ignored.</p>
<p>In this configuration, a directory like <code class="language-plaintext highlighter-rouge">hosts/jamey.thesharps.us/</code>
should be owned by the user and/or group who is authorized to manage the
application.</p>
<p>Each hostname subdirectory can have these files:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">http-socket</code> (required for httpdir): a Unix domain socket that your
backend application is listening for HTTP/1.1 connections on.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">tls-socket</code> (required for sniproxy): a Unix domain socket that your
backend application is listening for TLS connections on.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">send-proxy-v1</code>: if this file is present, sniproxy will prefix every
connection to this backend with a <a href="https://www.haproxy.org/download/2.1/doc/proxy-protocol.txt">PROXY protocol</a> v1 header.</p>
</li>
</ul>
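<p>For reference, the PROXY protocol v1 header is a single
human-readable line carrying the client’s real source address, sent to
the backend before any application bytes. This sketch follows the
format in haproxy’s proxy-protocol.txt for the TCP4 case (the
addresses are examples):</p>

```python
# Build a PROXY protocol v1 header: protocol, source and destination
# addresses, then source and destination ports, terminated by CRLF.
def proxy_v1_header(src_ip, dst_ip, src_port, dst_port):
    return f"PROXY TCP4 {src_ip} {dst_ip} {src_port} {dst_port}\r\n".encode()

hdr = proxy_v1_header("192.0.2.7", "198.51.100.4", 56324, 443)
print(hdr)  # → b'PROXY TCP4 192.0.2.7 198.51.100.4 56324 443\r\n'
```

<p>A backend that finds <code class="language-plaintext highlighter-rouge">send-proxy-v1</code> useful just needs to read and
parse this one line before handing the rest of the stream to its TLS
implementation.</p>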
<p>Any of these sockets or directories may be symlinks, which allows
atomically switching them to connect to a different socket at any time.
However, those symlinks should be relative and should only point within
the same directory, because ideally the proxy should be run in a chroot
and so may not have access to sockets located elsewhere.</p>
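<p>The atomic cutover looks like this (socket names are hypothetical;
a temporary directory stands in for a real host directory). Because
the proxy resolves the symlink on every new connection, a
<code class="language-plaintext highlighter-rouge">rename()</code> switches traffic to the
new backend without the proxy being involved at all:</p>

```python
import os
import tempfile

host = tempfile.mkdtemp()  # stands in for hosts/example.com/

# Version 1 of the app is listening on app-v1.sock.
os.symlink("app-v1.sock", os.path.join(host, "http-socket"))

# Deploy version 2: create the new link under a temporary name, then
# rename it over the old one. rename() is atomic on POSIX filesystems,
# so every connection sees either the old target or the new one.
os.symlink("app-v2.sock", os.path.join(host, "http-socket.new"))
os.rename(os.path.join(host, "http-socket.new"),
          os.path.join(host, "http-socket"))

print(os.readlink(os.path.join(host, "http-socket")))  # → app-v2.sock
```
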
<p>You can change the configuration without restarting the proxies: they
look up the target socket each time a connection comes in, and don’t
need to know which hostnames to serve in advance. This is especially
important if untrusted users will be managing any hostname
configurations because allowing them to restart or reload the reverse
proxy is risky.</p>
<h1 id="alternatives">Alternatives</h1>
<p>I examined a lot of existing options before giving up and writing my
own.</p>
<p>I’d like to give a special shout-out to <a href="https://git.gdritter.com/melvil/blob/master/README.md">Melvil</a>, a proof of concept
my friend Getty wrote some years ago, as evidence that my ideas are
hardly unique. I independently ended up with a very similar design,
although Getty was trying to put more functionality into the proxy
layer, while I aimed to identify the least configurable thing which
could provide the most flexibility to the backends.</p>
<p>Aside from Melvil, every alternative I found would have required me to
write shell scripts or similar to construct a suitable configuration
file from the directory structure I wanted to use. I was concerned about
the possibility of directory names which aren’t valid hostnames but
might lead to syntactically invalid configuration files if I substituted
them naively into a template. Also, any time some backend’s
configuration changed, I’d have to reconfigure or restart the proxy.</p>
<p>The classic TLS termination proxy is <a href="https://www.stunnel.org/">stunnel</a>, and I thought it
looked promising for a while, but I never quite managed to make enough
sense of its documentation to put together even a minimal proof of
concept. It seems to support a variety of protocols and configuration
options, but it isn’t clear whether they work together in all
combinations. I got particularly confused about whether turning on PROXY
protocol support would affect the incoming connection from the client or
the outgoing connection to the backend. Anyway I eventually decided I
didn’t want TLS termination at all, and that’s all stunnel really does.</p>
<p>All the remaining alternatives that I considered put both TLS and
HTTP/1.1 routing in the same application. I believe this is a task which
should not share any state across connections, so there’s no reason to
handle more than one socket per process. And there are good reasons to
isolate parsers for different protocols from each other: a buggy HTTP
parser shouldn’t be able to leak information from or deny service to
another TLS connection, and <em>vice versa</em>.</p>
<p>Part of what convinced me that routing based purely on SNI could work
was an existing <a href="https://github.com/dlundquist/sniproxy">SNI Proxy</a> project. I read that implementation
carefully and quite like a lot of the ideas in it. I didn’t have a lot
of confidence in the TLS protocol parser though—which says “This
was created based primarily on Wireshark dissection of a TLS handshake
and RFC4366”—and in fact I believe that implementation would break
if a client sent fragmented TLS records. Whether that’s actually a flaw
remains an open question, since well-behaved clients don’t normally do
that during the TLS handshake, but I decided I wanted to accept anything
that the TLS 1.3 specification allows.</p>
<p>I spent the most time fiddling with <a href="https://www.haproxy.org/">haproxy</a>. Eventually I had a
somewhat sketchy configuration that I was pretty happy with, which would
terminate TLS but use only SNI to decide which backend to route the
decrypted stream to. This was the point when I discovered that HTTP/2
specifically allows clients to reuse a connection for a different
hostname, as long as both hostnames appear in the certificate that the
server presented for that connection. Honestly I think aggressive
connection reuse is a good idea, but it made my prototype worthless so I
was pretty grumpy for a bit.</p>
<p>Anyway, terminating TLS meant that haproxy had to negotiate extensions
like ALPN before identifying what backend to connect to, rather than
letting the backend decide what extensions to support. I had to
hard-code things like “this server supports both HTTP/2 and HTTP/1.1”,
and then force the backend to deal with the result. In my experiments I
was quite happy with <a href="https://nghttp2.org/">nghttp2</a>’s <code class="language-plaintext highlighter-rouge">nghttpx</code> for providing transparent
HTTP/2 support to backends that didn’t have it natively. But then I
realized that haproxy could do the same thing, so I folded that into my
haproxy configuration… before abandoning that approach entirely.</p>
<p>One of the first tools I tried was <a href="https://containo.us/traefik/">Traefik</a>, which is designed around
letting some container management system provide the information about
which backends are configured. I thought about using the file backend
and pretending to be a container management system, even for my backends
that weren’t actually in containers. But I couldn’t see how to ensure
that a given hostname was only managed by the user or group I authorized
to do so, unless I wrote a custom plugin, and I didn’t want that level
of complexity.</p>
<p>I also tried coercing <a href="https://nginx.org/">nginx</a> into giving me this kind of delegated
authority over configuration. That seemed almost plausible! Because
nginx <code class="language-plaintext highlighter-rouge">include</code> directives require that the named file be syntactically
valid by itself, if I wrapped the user configuration inside a <code class="language-plaintext highlighter-rouge">location
/</code> section, I could have relatively simple scripts generate templated
configuration that would at least somewhat isolate backends from each
other. Of course there are some kinds of configuration someone might
reasonably want that aren’t allowed inside a <code class="language-plaintext highlighter-rouge">location</code> section, but this
seemed like it would support most of the common needs at least.</p>
<p>But a lot of nginx configuration directives can have global effects. For
example, if caching is turned on, any location can set an arbitrary
cache key, potentially overwriting cache entries for other hosts.
Turning off caching mitigates that particular problem, but then backends
might need to configure their own caching layer, possibly by running
another copy of nginx. If they have to manage a more capable reverse
proxy anyway, why put another one with all that code sitting unused in
front of it?</p>
<p>Anyway, for this approach to work I would need to audit every
configuration directive of every module, and re-audit on every upgrade.
So that’s definitely not an option.</p>
<p>Finally, I also considered <a href="https://caddyserver.com/">Caddy</a> and <a href="https://h2o.examp1e.net/">h2o</a>, but both have a single
configuration file with no clear method for safely templating it from
untrusted configuration fragments, so I didn’t see much advantage to
examining them further for this purpose. They do both look very
promising for backend use, though; h2o was easier to configure as a
quick static file server than anything else I tried.</p>
<h1 id="results">Results</h1>
<p>In total I wrote about 650 lines of Rust source for the two reverse
proxies and I can explain every line of output from running either one
under strace, so I feel I did pretty well toward my minimalism goals. I
spent about two weeks putting these tools together, once I gave up on
the existing alternatives.</p>
<p>70% of that source code is in sniproxy, since it contains my parser for
extracting the SNI extension from the initial ClientHello message. It’s
also a bit more sophisticated about handling signals, graceful
shutdowns, and long hostnames, so httpdir might yet catch up somewhat in
complexity. On the other hand, httpdir compiles to a much larger binary,
because it pulls in an entire HTTP implementation from the <code class="language-plaintext highlighter-rouge">hyper</code>
library.</p>
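<p>To illustrate what that SNI extraction involves, here’s a
simplified Python sketch of the same idea. To be clear, this is not
sniproxy’s actual Rust code, and it takes exactly the shortcut the
post criticizes: it assumes the whole ClientHello arrives in a single
TLS record.</p>

```python
import struct

def sni_from_client_hello(record: bytes):
    # Walk a TLS ClientHello to the extensions block and pull out the
    # server_name (SNI) hostname, per the handshake wire format.
    assert record[0] == 0x16                  # handshake record type
    hs = record[5:]                           # skip 5-byte record header
    assert hs[0] == 0x01                      # ClientHello
    p = 4 + 2 + 32                            # handshake hdr, version, random
    p += 1 + hs[p]                            # session id
    (n,) = struct.unpack_from(">H", hs, p); p += 2 + n   # cipher suites
    p += 1 + hs[p]                            # compression methods
    (n,) = struct.unpack_from(">H", hs, p); p += 2       # extensions length
    end = p + n
    while p < end:
        ext, elen = struct.unpack_from(">HH", hs, p); p += 4
        if ext == 0:                          # server_name extension
            # skip list length (2) and entry type (1), read name length
            (nlen,) = struct.unpack_from(">H", hs, p + 3)
            return hs[p + 5 : p + 5 + nlen].decode("ascii")
        p += elen
    return None

# Build a minimal synthetic ClientHello to exercise the parser.
name = b"jamey.thesharps.us"
entry = b"\x00" + struct.pack(">H", len(name)) + name      # host_name entry
ext = (struct.pack(">HH", 0, len(entry) + 2)               # ext type + length
       + struct.pack(">H", len(entry)) + entry)            # server_name_list
body = (b"\x03\x03" + bytes(32)                            # version + random
        + b"\x00"                                          # empty session id
        + struct.pack(">H", 2) + b"\x13\x01"               # one cipher suite
        + b"\x01\x00"                                      # null compression
        + struct.pack(">H", len(ext)) + ext)
hello = b"\x01" + len(body).to_bytes(3, "big") + body
record = b"\x16\x03\x01" + struct.pack(">H", len(hello)) + hello

print(sni_from_client_hello(record))  # → jamey.thesharps.us
```

<p>The real implementation additionally has to reassemble fragmented
records and bound its buffering against hostile inputs, which is where
most of sniproxy’s extra code goes.</p>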
<p>I haven’t reached the point where I can <code class="language-plaintext highlighter-rouge">git push</code> to deploy my
applications behind these proxies, but at this point the only other
thing an administrator has to provide is a per-user service manager that
starts at system boot. That could be by running <code class="language-plaintext highlighter-rouge">loginctl enable-linger</code>
on a systemd installation, or starting per-user copies of <code class="language-plaintext highlighter-rouge">daemontools</code>
or <code class="language-plaintext highlighter-rouge">supervisord</code> or anything else for each user to add their own
services to.</p>
<p>As long as users have some way to set up long-lived services that get
started again automatically after a reboot, and the ability to add host
configurations for httpdir and sniproxy, it’s possible to build the rest
of the glue needed for a “PaaS” without administrative privileges. I
have some plans for how I intend to do that using <a href="https://nixos.org/nix/">Nix</a>, but I hope to
see other people tackle that part of the challenge in different ways
too, because I’m certain I won’t get that right on the first try.</p>
<p>So please check out <a href="https://github.com/jameysharp/sniproxy-rs">sniproxy</a> and <a href="https://github.com/jameysharp/httpdir">httpdir</a> and see if they’d be
useful to you too!</p>Jamey SharpDeploying web applications is hard to do, and there are a bewildering variety of ways to do it. Today I’d like to confuse the matter further by proposing yet another approach. As a preview, here are the components I’ll discuss:Transit fares are not about money2019-12-12T12:45:00-08:002019-12-12T12:45:00-08:00https://jamey.thesharps.us/2019/12/12/transit-fares-are-not-about-money<p>I live in <a href="https://en.wikipedia.org/wiki/Portland_metropolitan_area">Portland, Oregon</a>, with a metropolitan area of about 2.5
million people. We have a pretty decent mass transit system here, called
<a href="https://en.wikipedia.org/wiki/TriMet">TriMet</a>. I don’t have a lot of experience with other cities’ transit
systems, but I do know a few things about this one.</p>
<p>A few years ago I wondered why TriMet made a particular policy decision,
and when I looked into it I discovered that my intuition about
government budgets was wrong. Today I’d like to walk you through the
things that surprised me, using a case study on what transit riders pay
to ride the bus here.</p>
<p>For 37 years (from 1975 to 2012), TriMet designated a small part of
their service area as “Fareless Square”. This area primarily covered
downtown Portland, which is the hub where most bus and light-rail routes
meet. As long as you were riding only within that area, you didn’t have
to pay anything to ride.</p>
<p>While I was a student at Portland State University in downtown Portland,
Fareless Square was fantastic, both for students like myself and for
guests the university hosted. For example, on the somewhat surreal
occasion when I as an undergraduate student got volunteered to host a
meeting of Larry Wall and the rest of the Perl 6 team, it was trivial
for me to take everyone to a nice restaurant for dinner; we just hopped
on the light-rail and didn’t have to deal with fares for a dozen
out-of-towners.</p>
<p>There were several reasons given for ending Fareless Square, but
officially the big one was that TriMet faced a budget shortfall and
predicted they’d get $2.7 million more in revenue per year if they quit
giving people free rides. (For more detailed history and other concerns
about Fareless Square, see KGW’s article: <a href="https://www.kgw.com/article/news/local/editors-picks/what-was-trimets-fareless-square-and-should-it-come-back/451692563">What was TriMet’s ‘Fareless
Square’ and should it come back?</a>) My memory may be faulty, but I
think I recall the total shortfall was expected to be around $15
million.</p>
<p>Compared to my household budget, those are big numbers! It made sense to
me at the time that a budget shortfall of that magnitude would be a big
deal. I was really unhappy about losing Fareless Square, but I accepted
their claim that this was necessary to solve their budget woes. But was
it really?</p>
<p>Because TriMet is a public agency, their annual budget report is also
public information, and a few years ago in a fit of idle curiosity I
actually read it. I’ve just updated my findings against <a href="https://trimet.org/budget/pdf/2020-approved-budget.pdf">TriMet’s FY
2020 budget</a>, which I encourage you to look at if you have
difficulty sleeping.</p>
<p>About 95% of TriMet’s revenue comes from three sources:</p>
<ul>
<li>60.1% from payroll taxes,</li>
<li>19.6% from federal and state funds,</li>
<li>and only 16.1% from passenger fares.</li>
</ul>
<p>Passenger fares are expected to come to about $112 million this
year, a figure which has not changed significantly since the 2013 fiscal
year. (The budget’s exhibit 2 provides these numbers for every year from
1973 onward, if you’re curious.) And passenger fares are the smallest
portion of their revenue. So it’s safe to say $2.7 million was a drop in
the bucket.</p>
<p>Given these facts, I feel cheated. Fareless Square was awesome and they
took it away for such a paltry amount of money? I’m left believing that
finances were just an excuse they knew people wouldn’t argue with, not
the real motivation. (Again, see the <a href="https://www.kgw.com/article/news/local/editors-picks/what-was-trimets-fareless-square-and-should-it-come-back/451692563">KGW article</a> for other
possible concerns that may have motivated TriMet management.)</p>
<p>When I realized that, I was angry. I wondered, what if TriMet had gone
the other direction? What would happen if we made all of TriMet
fareless? It wouldn’t be the first transit system to provide <a href="https://en.wikipedia.org/wiki/Free_public_transport">free
public transport</a>. There are well-known advantages for rider
satisfaction and system efficiency if everyone can just get on and off
without the hassle of dealing with fares. And of course as a matter of
public policy it’s good for both traffic congestion and air quality if
more people can be encouraged to use mass transit.</p>
<p>There are quite a few important questions that would need to be answered
to make that policy proposal succeed, and I’m not interested in
addressing most of those questions in this post. For example, a fareless
system would have more riders and correspondingly higher costs, and any
funding proposal would need to take that into account. But I want to
continue focusing on things that surprised me when I threw out my
assumptions and actually examined TriMet’s budget.</p>
<p>Given the earlier numbers, we know that today we’d have a $112 million
budget shortfall that we’d have to make up somewhere. But
wait—collecting those fares isn’t free. TriMet has budgeted over $10
million this year for personnel, materials, and services in the “Fare
Revenue & Administrative Services Department”. There are some relevant
costs elsewhere in the budget as well.</p>
<p>So let’s say the actual shortfall is about $100 million.</p>
<p>Here again my intuition was wrong. “Obviously,” I reasoned, “the system
is funded by passenger fares and would collapse without them.” But funds
from local, state, and federal taxes are almost 80% of the budget, and
the fare revenue is only 16%.</p>
<p>For my whole life, a major controversy in Oregon government has been
around the use of property taxes to fund significant parts of state and
local government programs, so my first guess was that must be the taxes
funding TriMet. (If you’re curious about the history of this
controversy, you can read up on <a href="https://oregonencyclopedia.org/articles/measure_5_property_taxes/">Oregon ballot measures 5, 47, and
50</a>.) So based on that assumption, I did the math and worked
out that if Portland Metro area voters would approve a property tax of
34 cents per $1000 of assessed value, that would make up for the lost
fare revenue. But it turns out that hardly any TriMet funding comes from
property taxes. As far as I can tell, it’s at most $3.75 million of the
budget, and probably less than that. Given the controversy and the fact
that property taxes are not a significant part of TriMet’s budget today,
I didn’t consider this possibility further.</p>
<p>Where else could that money come from?</p>
<p>The largest portion of TriMet’s tax revenue comes from payroll and
self-employment taxes levied on employers across the region, amounting
to $382 million over the last year. What if businesses were interested
in getting on board with a fareless plan?</p>
<p>The 2019 <a href="https://trimet.org/taxinfo/">TriMet payroll tax rate</a> is approximately 0.75%.
For comparison, <a href="https://www.irs.gov/taxtopics/tc751">federal payroll taxes for Social Security and
Medicare</a> add up to 7.65% this year in employer contributions,
which is ten times as much. A few years ago when I asked my employer’s
accountant how much they actually paid in TriMet taxes, they considered
it practically a rounding error in their overall budget. Now that I’m
self-employed I pay TriMet taxes myself and they’re tiny for me too,
especially considering the <em>15.3%</em> tax I have to pay for Social Security
and Medicare.</p>
<p>My first instinct when I discovered that TriMet payroll tax revenue was
$382 million was to figure out how much that has to grow to cover the
$100 million of fare revenue, which works out to a 30% increase. But
asking businesses to pay 30% more taxes sounded like something they’d
fight vigorously, so I knew I needed to look at this more carefully.</p>
<p>How much money are we really talking about? Increasing the TriMet
payroll tax by 30% would bring it from around 0.75% to roughly 1%, which
is still not much. We can break this down by how much more an employer
would pay each year per employee based on that employee’s wages:</p>
<ul>
<li>
<p>$60 for full-time minimum wage employees: The <a href="https://www.oregon.gov/boli/WHD/OMW/Pages/Minimum-Wage-Rate-Summary.aspx">Portland Metro area
minimum wage</a> is currently $12.50/hour. Increasing the
TriMet tax to 1% would cost employers about $60 more per full-time
minimum-wage employee per year. In exchange, these employers could
make it easier for both their employees and their customers to get to
their place of business.</p>
</li>
<li>
<p>$150 for employees earning the average salary for the region:
According to Exhibit 7 in TriMet’s budget, “Local Economic Trends”,
the average pay per employee in this region is $62,717 per year. For
each employee earning that much, increasing the TriMet tax to 1% would
cost their employer about an extra $150 per year.</p>
</li>
<li>
<p>Less than the cost of monthly bus passes for employees earning up to
$500,000 annually: Many employers in the region buy TriMet tickets and
monthly passes for their employees, which cost up to $1,200 per year
per employee. If TriMet went fareless, then employers would no longer
need to pay that cost, but they’d pay more in the TriMet tax. Still,
they’d actually save money on every employee who makes less than
$500,000 per year!</p>
</li>
</ul>
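<p>If you want to check my arithmetic, here’s a quick Python sketch of the
estimates above. The rates and salaries come from the sources I cited; the
2,080-hour work year and the use of exactly 1% for the increased rate are my
own simplifying assumptions, so treat the outputs as rough figures:</p>

```python
# Rough sanity check of the per-employee cost estimates in the list above.
# Assumptions (mine, not TriMet's): a full-time year is 2,080 hours, and the
# increased rate is exactly 1% (rounding up from 0.75% * 1.3 = 0.975%).
current_rate = 0.0075   # approximate 2019 TriMet payroll tax rate
proposed_rate = 0.0100  # the rounded "about 1%" figure
delta = proposed_rate - current_rate  # extra tax per dollar of wages

min_wage_salary = 12.50 * 2080  # full-time at the Portland Metro minimum wage
avg_salary = 62717              # Exhibit 7, "Local Economic Trends"
annual_pass_cost = 1200         # upper end of employer-paid annual passes

# These come out near the rounded $60, $150, and $500,000 figures above.
print(f"minimum-wage employee: ${min_wage_salary * delta:.0f}/year extra")
print(f"average-salary employee: ${avg_salary * delta:.0f}/year extra")
print(f"break-even salary vs. a transit pass: ${annual_pass_cost / delta:,.0f}")
```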
<p>Once I broke the costs down this way, I learned that on the scale of a
metropolitan region the size of the Portland area, even numbers like
$100 million just aren’t that big. That was the biggest surprise for me.</p>
<p>Many people won’t even consider a proposal like fareless transit on the
grounds that it sounds too expensive, but as I discovered, our
intuitions about finance mislead us when we think about budgets on the
scale of governments.</p>
<p>I don’t mean to suggest that fareless transit should be a no-brainer,
because these sorts of proposals face many more objections than just
“it’s too expensive.” What I would like you to take away from this story
are these points:</p>
<ul>
<li>
<p>If you would like to persuade someone that a government policy
proposal is a good idea and they object that it would be too
expensive, you may well be able to disabuse them of that notion. But
bear in mind that they are likely to have more concerns hidden behind
that one, and you should work to identify what those concerns are so
you can have a real dialog.</p>
</li>
<li>
<p>If you would like to persuade someone that a government policy
proposal is a <em>bad</em> idea, you should be very cautious about using the
argument that “it’s too expensive,” and make a sincere effort to
estimate what the proposal would cost. Otherwise you may find your
argument quickly torn to shreds. I encourage you to think carefully
about which of your values the proposal offends and make your argument
based on those instead.</p>
</li>
</ul>Jamey SharpI live in Portland, Oregon, with a metropolitan area of about 2.5 million people. We have a pretty decent mass transit system here, called TriMet. I don’t have a lot of experience with other cities’ transit systems, but I do know a few things about this one.