Z3 adventures and fast Boolean matching
Over the past few weeks I’ve been fiddling with Z3, a theorem prover and SAT/SMT solver from Microsoft Research. Z3 makes it pretty easy to write down some NPcomplete problem instance and then push a button to get the answer. (Well, hopefully. I mean, they’re NPcomplete problems, it’s hard.) It’s also open source (released under an MIT license), and, as I’ve learned, it has a delightful maintainer in Nikolaj Bjørner.
Today I’m going to first briefly tell the story of my recent adventures with Z3; then the bulk of this post will be a deep dive into testing equivalency of Boolean functions, and the experiments those adventures have motivated me to try.
Z3 adventures
At first I was just playing with Z3’s Python bindings, but I started generating problem instances where Z3 took much longer and used much more memory than if I just exhaustively tested every possible solution, so I dug deeper into what it was doing. It turns out that Z3’s algebraic simplifier was actually making my particular bitvector problems way more complicated, and bypassing that to just use the socalled “bitblaster” and SAT solver got me results in reasonable amounts of time.
However, although the bitblaster had support for bitwise OR and bitwise NOT operations on bitvectors, it was missing support for bitwise AND. People wouldn’t usually notice because the simplifier rewrites AND in terms of combinations of NOT with OR, but since I was trying to avoid the simplifier I hit this unusual corner.
I had to dig into the source code to figure out that the error I was getting was about the AND operator, and in the process I found that it didn’t look too hard to fix. So I wrote and submitted my first two Z3 pull requests—one to add the four lines of code needed, and another to improve the error message in case somebody else ran into a different missing operator. Nikolaj merged both patches impressively promptly.
But that experience got me looking more closely at how Z3’s bitblaster works, and I found some unreachable code in its implementation of multiplication. I started investigating how hardware engineers design digital circuits for multiplication, and came up with a patch that completely rewrote the bitblaster’s multiplier. Naturally, there’s a lot I don’t understand about how Z3 works and so I had some bugs, but Nikolaj was again very helpful and responsive.
When Nikolaj tested my patch’s performance across the SMTLIB benchmark
suite for “quantifierfree bitvector” (QF_BV
) problems, it showed an
average 20% speedup across those benchmarks, which I was pretty excited
about! But it also made some of the benchmarks slower by orders of
magnitude, so Nikolaj politely encouraged me to dig into it more
carefully and offered tips that got me unstuck several times.
On some of the benchmarks that got much worse with my multiplier, after a lot of work it turned out that the only significant difference between the two implementations was that mine had some extra NOT gates floating around inside the tree of Boolean expressions. They were logically equivalent, in that pairs of these extra NOT gates would effectively cancel each other out, but that was pushing Z3 out of its comfort zone.
At that point I was back to looking at Z3’s simplifier. I made a small change to how it handled NOT gates mixed into trees of XOR gates, such as the bulk of the adders in a multiplier, and suddenly those benchmarks were fast again.
Nikolaj wanted to see the effect of that change independently of the
rest of my changes and wrote a much more comprehensive patch. That
change by itself made the QF_BV
benchmarks an average of almost 40%
faster, improving some benchmarks by an order of magnitude and reducing
the number of benchmarks that couldn’t finish within 10 minutes by 10%.
It turned out that with the simplifier fixed, my rewritten multiplier was actually slower, on average, than the old version, but it happened to generate cases the old simplifier was good at in enough benchmarks to bring up the average.
I’ll be honest, I was disappointed, because my multiplier was pretty cool, if you think multiplication is cool. But as consolation prizes go, I’m definitely satisfied by getting to say that I found a 40% speedup opportunity in a tool that’s already as tuned for performance as Z3 is. I’m also very pleased with the experience of working with Nikolaj on this. We did fix up the unreachable code I’d originally spotted, too.
And then Nikolaj nerdsniped me by suggesting that there could be even bigger gains from replacing parts of the algebraic simplifier with something based on “cutset enumeration”, which I’d never heard of. I’ll summarize it as splitting a complicated Boolean function into every possible combination of simpler functions and deciding which of those combinations is best. So now I’ve spent the last week reading up on that topic and “logic synthesis”, and thinking very hard about Boolean truth tables.
The truth table stuff is what I want to share next, partly because I’m still thinking about the cutset stuff. But mostly, I needed to know how much compression is possible of a lookup table keyed by Boolean functions, before I could fully evaluate whether anyone’s methods for splitting a complicated Boolean function into many simple ones seemed worth doing.
Finding functionally equivalent Boolean circuits
There are 2^N rows in any truth table for a circuit with N inputs. That means that the number of possible truth tables for circuits with N inputs is 2^(2^N). So there are 256 possible 3input truth tables, 65,536 4input truth tables, and more than 4 billion 5input truth tables. These numbers grow way too quickly.
Fortunately, a huge fraction of those truth tables are equivalent up to a few simple transformations, called “NPN” classes: Negating some of the inputs, Permuting the inputs, and Negating the output. The 65,536 4input truth tables only have 222 NPN equivalence classes, so if you have a procedure which can take any truth table and tell you which NPN class it’s in, then you’ve dramatically reduced the number of cases you have to handle.
For example, I’m looking into precomputing and storing optimal Boolean circuits implementing each of those 222 representative 4input truth tables, even if doing the same for all 65,536 possible truth tables takes too much space.
Actually, I claim there are only 208 “interesting” 4input truth tables; the other 14 ignore one or more of the inputs, so they’re equivalent to a smaller truth table. I want to always use the smallest truth table I can to represent each function. That said, you can store all sizes in the same lookup table by adding ignored inputs back on the left, if you want to.
The paper I’ve found most helpful in understanding these ideas is “Fast Boolean Matching Based on NPN Classification” by Zheng Huang et al. (I found it from the citations in “Fast Hierarchical NPN Classification” by Petkovska et al, which is great but assumes you understand the previous work in this area first.)
The problem is that for a given Ninput truth table, there are N!*2^(N+1) different ways to negate and permute it. To identify which NPN class a truth table belongs to, you need some deterministic procedure that always picks the same representative truth table among the equivalent possibilities. For larger N, checking all those possibilities gets expensive.
Heuristics
The good news is that there are fast heuristics which are almost as good as checking all the possibilities. Instead of identifying a single representative truth table for every element of the NPN class, different truth tables within the class may lead to any of a few different “semicanonical” representatives. Among the interesting 4input truth tables, where there are 208 NPN classes, the heuristics in “Fast Boolean Matching” identify 1,630 representatives, and I found easy tweaks that reduce that to 812 representatives.
But I didn’t have any idea why any of the heuristics should work until I spent several days fiddling with my own implementation of them, so here’s what I learned.
Truth tables as numbers
First, note that they’re interpreting each truth table as a number. If you already know how this works, bear with me; I’m going to be explicit about the parts I had to keep thinking through.
Truth tables are a way to describe the behavior of a Boolean function, where you write out every combination of the possible inputs as well as what the output of the circuit is for each combination. So “r = if C then T else F” has this truth table:
C T F  r
0 0 0  0
0 0 1  1
0 1 0  0
0 1 1  1
1 0 0  0
1 0 1  0
1 1 0  1
1 1 1  1
We can treat the result column as a single unsigned number written out in binary, where the inputs are used as an index into the bits of the number. If the allzeros input is treated as the leastsignificant bit of the number, then this ifthenelse table is represented by 0xCA. I believe the paper used the opposite convention, although it wasn’t explicitly stated, so they would represent this table as 0x53.
Thinking of truth tables as natural numbers means we can impose a total order on them, and that suggests one way we could define the representative for an NPN class: pick the lowestnumbered truth table from that class. (Picking the highest number works too, or any minimal or maximal element of some other total order, so long as we do it the same way consistently.)
These two choices affect each other, in that the notion of which truth table is numerically less than another one depends on how you turn it into a number. I got mediocre results until I started mentally swapping “negative cofactor” and “positive cofactor” everywhere I saw them in the paper, which is why I think they picked the opposite bitorder convention compared to the one that makes sense to me.
Output polarity
The first heuristic decides whether to negate the output of the truth table. It does this by choosing whichever way around has fewer 1s in the result column of the truth table.
If we want to pick numerically smaller truth tables, to a first approximation that means we want as many 0s in the mostsignificant bits as we can get. Negating and permuting the inputs allows us to rearrange what order the rows of the truth table are in, but negating the output is the only way to change how many 1s and 0s there are total. So it makes sense to try to minimize the total number of 1s first.
If the truth table has the same number of 0s as 1s, such as in the ifthenelse table above, then the paper says something I don’t quite understand about trying both choices. I’ve chosen instead to break ties by picking whichever option has the minimum numeric value.
Input polarity
The next heuristic decides, for each input, whether it’s better to negate that input or leave it alone. It does this by choosing whichever way around has fewer 1s in the “positive cofactor” than in the “negative cofactor”. (Actually the paper has that the other way around, but see above.)
I didn’t know what “cofactor” meant. I’ll go into more detail below in the “implementation details” section, but the essential idea here is: Count how many rows have a 1 in the result when input K is 1, versus how many when the same input is 0. These are the 1s in the positive cofactor, or negative cofactor respectively, with respect to input K. Negating an input swaps its positive and negative cofactors.
Again, we’re trying to get numericallylower tables, so the plan here is to move some of the 1s from the mostsignificant bits toward the leastsignificant bits.
The handy thing about this heuristic is that negating an input doesn’t change the number of ones in the cofactors of any other input, so we can apply this heuristic on the inputs in any order and get the same result.
If an input has the same number of 1s in both cofactors, the paper says to leave that input unchanged. I tried that, but then I also tried breaking ties by picking whichever option has the minimum numeric value. I found that this tiebreaker cut the number of semicanonical representatives in half, which is awesome for something that’s so quick to check.
Input permutation
The final heuristic picks an order for the inputs, once again by comparing the number of 1s in the cofactors. In the previous step we established that every input has no more 1s in its positive cofactor than in its negative cofactor.
Now we sort the inputs so that the input with the fewest 1s in its positive cofactor is the leftmost input. (Again, the paper says to use the negative cofactor instead, due to that different choice of bitorder conventions.)
Similar to the previous step, swapping inputs doesn’t affect the cofactors of any other inputs, so this can be done in any order while getting the same result.
This one is hardest for me to explain. Let’s bring back a 3input truth table and look at the positive cofactors visually:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
The leftmost input has its positive cofactor entirely in the bottom half of the table. The middle input has its positive cofactor split into two groups of rows, where the bottom half is still at the very bottom of the table, but the other part is higher up. By contrast, the rightmost input spreads its positive cofactor as far across the table as possible.
We’re trying to get numericallysmaller truth tables, which means moving 1s from the mostsignificant (bottom) rows toward the leastsignificant (top) rows. So any input which has a lot of 1s in its positive cofactor should be more toward the right, thereby spreading those 1s toward the top of the table.
That said: After the second heuristic, if there are a lot of 1s in the positive cofactor, then there are at least as many in the negative cofactor, and moving a column to the right spreads those 1s toward the bottom of the table. So figuring out whether this helps is complicated.
So when some inputs have the same number of 1s in their positive cofactors, I break ties by sorting in decreasing order by number of 1s in the negative cofactors. On the 4input truth tables, this improved the results from 868 semicanonical classes to 812, and I expect it has more effect on larger tables.
Evaluation
These three heuristics don’t quite get us to the original goal of a minimum numeric value truth table. One example illustrating this is that, if the first two steps left a 1 in the mostsignificant (bottom) row, there is no permutation of the inputs which will move that row.
Still, in my experiments with 3input truth tables, there are 10 NPN classes that don’t ignore any inputs, and only 13 semicanonical classes after applying these three heuristics, so the heuristics are almost perfect at that size. On 4input truth tables, there are 208 interesting NPN classes and 812 semicanonical classes, which is still small enough to be a reasonable tablelookup. I didn’t write my prototype efficiently enough so I haven’t had the patience to exhaustively check larger truth tables yet.
Implementation details
If you want to go implement this idea, here’s a bunch of stuff I figured out, about the details that academic papers leave out as exercises for the reader.
There are various useful operations we can do on truth tables represented as numbers.
Note that I number inputs starting from 0 on the right, so in the ifthenelse truth table, F is input 0, T is input 1, and C is input 2.
Add ignored inputs
One operation is to extend an Ninput truth table to an N+1input truth table, where the new input is ignored. For example, the 1input truth table for “NOT A” is:
A  !A
0  1
1  0
If we introduce a new ignored input “B” to the left of A, then the truth table for “NOT A” becomes:
B A  !A
0 0  1
0 1  0
1 0  1
1 1  0
This is just duplicating the original truth table. On the numeric representation, this can be computed efficiently as:
table  (table << (1 << N))
Applying this operation repeatedly allows adding any number of ignored inputs.
Evaluate cofactors
The “negative cofactor” of a Boolean function, with respect to input K, is the subset of that function’s outputs where input K is false. Similarly, the positive cofactor is when input K is true.
A bitmask that is 1 everywhere input K is 0 can be quickly computed as the truth table for the circuit “NOT K”, which helps in looking at the negative cofactors of a truth table. (The reverse, for the positive cofactors, is the bitwiseNOT of that bitmask, or equivalently the bitmask shifted left by 2^K bits, whichever is more convenient to compute.)
If K is the leftmost of the N inputs in this truth table—and assuming arithmetic is being done in two’scomplement, as all current computers do—then its negative cofactor bitmask can be efficiently computed as:
(1 << (1 << K))  1
If K is not the leftmost input, then build the bitmask for a truth table where K is the leftmost input, and then follow the above procedure to add ignored inputs for each input to K’s left.
We’re going to use these negative cofactor bitmasks a lot, so as a practical matter it may be worth precomputing these masks for every size of truth table you plan to deal with, and every input at each size.
Actually, as long as you’re careful to ignore bits outside of your current size of truth table, you can use any mask for size N with smaller truth tables, so you may just need one bitmask for each possible input at your largest table size.
Count 1s
Counting the number of bits which are set within a word is a common enough operation, sometimes called “population count” or “popcount”, that x86 CPUs have an instruction specifically for this purpose. There are also efficient bitwiseparallel methods which, like many “bit twiddling” hacks, are quite puzzling until you figure out the trick.
Anyway, counting the 1s in a cofactor of an input is just a matter of bitwiseAND between the truth table and the appropriate bitmask, then applying a suitable popcount implementation.
Swap subsequences of rows
Several of the more complex things we can do to a truth table can be described as reordering the table’s rows. In general this would be a tedious thing to do to our numeric representation: the usual bitwise and arithmetic operators don’t allow swapping bits in arbitrary permutations.
Happily, we’re only going to need certain kinds of reordering, such as “swap rows 1 and 3 with rows 4 and 6,” and we can do that efficiently.
This procedure will take two parameters: a bitmask for the lesssignificant bits to swap, and an offset for how many bits to shift that mask left to find the moresignificant bits to swap with. In the example above, the mask is 0b1010 (identifying rows 3 and 1), and the offset is 3 (row 1 swaps with 4, and 3 swaps with 6, so both are 3 rows apart).
tmp = (table ^ (table >> offset)) & mask;
table ^= tmp  (tmp << offset);
I based this on “Swapping individual bits with XOR” from Sean Eron Anderson’s page of “Bit Twiddling Hacks”. I’ve extended Sean’s procedure to handle noncontiguous subsequences of bits, and to hide one of the offsets in the other two parameters, because it turns out to be more convenient that way for this.
Negate the circuit’s output
If we want to take the numeric representation of a truth table and get a table representing the same Boolean function but with the output inverted, just bitwiseNOT the table. In Clike languages,
~table
If you’re storing truth tables in machine words that are wider than the size of the table, you may want to clear any bits that aren’t part of the table after you do this.
Negate the circuit’s inputs
We’re going to want to take a truth table and get a table representing the same Boolean function but with one of the inputs inverted. Another way to say that is we’re swapping the negative and positive cofactors of that input. So we can combine two earlier procedures to do this.
In this case, the mask is the negative cofactor bitmask for input K, and the offset is:
1 << K
We could pass that mask and offset to the procedure for swapping rows, but in this case where every row is moving, it’s probably more efficient to do this instead:
((table & mask) << offset)  ((table >> offset) & mask)
Swap two of the circuit’s inputs
When swapping inputs J and K, the rows where both inputs are equal don’t change. We just need to swap the rows where J is 0 and K is 1 with the rows where J is 1 and K is 0.
Assume J is the leftmost variable of the two; if it isn’t, run this procedure with the parameters the other way around.
The mask for this swap is the bitwiseAND of the negative cofactor mask for J (the rows where J is 0) with the positive cofactor mask for K (the rows where K is 1).
The offset is the difference between the index of the first 1 in J’s column and the first 1 in K’s column in the truth table. This is:
(1 << J)  (1 << K)
For 3input functions like the ifthenelse example truth table above, there are three possible input swaps.
 Swap C with F: mask is 0b00001010, offset is 3
 Swap C with T: mask is 0b00001100, offset is 2
 Swap T with F: mask is 0b00100010, offset is 1
In general there are N*(N1)/2 possible input swaps for N inputs.
Note that any permutation of a circuit’s inputs can be described as a series of pairwise swaps, so applying this procedure multiple times allows arbitrary permutations.
Combining truth tables
Say you have the truth tables for arbitrary circuits A and B, and you want the truth table for an OR gate with the outputs of A and B as its inputs.
If A and B have the same inputs in the same order, then simply combining the numeric representation of their truth tables with bitwiseOR does the trick. The same works for any bitwise operator such as AND or XOR. (This also works for operators with three or more inputs, like ifthenelse, but there aren’t any bitwise ternary operators in common programming languages so that’s not super helpful.)
If A and B have different inputs, though, first you need to fix that. You can use earlier procedures to add ignored inputs and then reorder the inputs until both circuits match. This is probably easiest if you impose a total order on your inputs and keep all your truth tables’ inputs sorted according to that order.
Detecting and removing ignored inputs
Every additional input doubles the size of a truth table, so it’s useful to remove any input from a Boolean function so long as doing that doesn’t change the function’s behavior. This is an especially good idea after you’ve combined truth tables as above, because many Boolean algebraic simplifications fall out automatically. For example,

When the input circuits compute
X XOR Y
andX XOR Z
, combining their truth tables with XOR produces the same table asY XOR Z
, with X ignored. 
When the input circuits compute
X
andNOT X
, combining them with AND produces the constant FALSE, and combining with OR produces the constant TRUE; either way, X is ignored.
A simple way to check for this condition is to try negating some input using the earlier procedure. If the truth table doesn’t change, then that input has no effect on the function.
Here’s a slightly more efficient method to check if input K is ignored, given the negative cofactor bitmask for K:
(table ^ (table >> (1 << K))) & mask == 0
If input K is ignored, you can swap it with the leftmost input, then discard the mostsignificant half of the table, which at that point will be equal to the leastsignificant half. This is the reverse of the earlier procedure for adding an ignored variable.
Enumerating all NPN classes
I started investigating this topic partly because I wanted to know how much I could compress a list of truth tables, and one way to frame that question is “how many NPN classes are there?” So I needed a way to count them. But when I learned that finding a perfect representative of an NPN class can be slow, then I wondered how close a heuristic approximation can get, so I needed to count those too.
Earlier I mentioned “Fast Hierarchical NPN Classification” by Petkovska et al, which is all about optimizations you can apply if you’re trying to find the canonical NPN class for many different truth tables. That’s definitely applicable to the case of counting all NPN classes, but if you do it exactly as they describe then their method needs space proportional to the number of possible truth tables, which is prohibitive for even 5input tables.
I came up with a different method suited specifically for this counting task, similar to the Sieve of Eratosthenes for finding prime numbers. This relies on a fast implementation of the three heuristics.
Since we’re picking the numericallysmallest truth table as the canonical representative of its NPN class, start with the truth table with numeric value 0 and check successively larger truth tables.
Skip any truth table that ignores any of its inputs—including table 0, which is the constant FALSE and ignores all of its inputs. This is an especially fast test, so do it first. You don’t have to do this, but the input permutation heuristic as I’ve described it here won’t notice that the ignored inputs don’t matter, so these cases may have more semicanonical representatives than they would if you ran this same process on the smaller table size with the ignored inputs removed.
Next, check whether this truth table is a semicanonical class representative. If it isn’t, then it definitely is not a canonical NPN class representative, so skip it. The easiest way to check this is to run the truth table through the heuristics and see whether they return the same truth table unchanged.
Otherwise, when this truth table is a semicanonical class representative, we need to check whether we already saw any smaller semicanonical representative in the same class. There are two ways to do this, depending on whether you’d rather save memory or save time:

We could generate all negations/permutations of this truth table and check whether any of them are less than the current candidate, which is slow but makes the algorithm use a constant amount of memory. (I think?)

When we definitely have a canonical NPN class representative, we can generate all negations/permutations (which by definition are all greater than or equal to this truth table), filter them down to only the semicanonical ones, and add those to a set of known noncanonical truth tables. Then, a later candidate is canonical if it is not in that set, and we can remove elements from the set as we pass them. This is faster but it’s hard to put bounds on its memory usage.
Conclusion
I’ve had a great time exploring how Z3 works and interacting with its maintainer via GitHub, and I’m learning a lot.
I’ve also learned a ton in the last few days about methods for identifying Boolean functions that are equivalent to each other. I’ve read three papers on this subject, and there are… more, so I’ve probably only scratched the surface.
But just trying to understand those papers required a lot of background I didn’t have, so in this post I’ve tried to distill all that information into a summary that might help someone else get started.
In addition, I’ve described some improvements I found that weren’t in the papers I read, and many bittwiddling hacks to make a practical implementation as fast as possible.