Please pick a number between 1 and 10, and don’t tell anybody what you picked.
Got it? Great.
Now, I know you chose a number with deep personal meaning to you. Maybe the number you chose could cause you some trouble if other people knew it? If it helps you get into the spirit of this exercise, you can pretend your number represents your salary or the number of hours you sleep per day or the number of showers you take per week.
At this point, everyone reading this post has picked a number, and none of you want to share it. You especially don’t want to give it to Facebook or Cambridge Analytica or any of these big companies that make their money by selling what they know about you.
But you’re probably curious. How many other people picked the same number? How popular were the numbers you didn’t pick? (So long as you can keep your privacy, there are lots of good reasons to contribute to aggregate research, such as medical studies.)
We’ll answer those questions with a histogram. But I’ll walk you through techniques that ensure that you don’t have to reveal your secret number to anyone in the process of computing the histogram, and furthermore that someone who learns the information in the histogram still can’t find out what your secret number was.
There will be nothing new here for experts in differential privacy or secure multiparty computation; instead I’m aiming for a tutorial for programmers who aren’t familiar with these fields. Also, don’t rely on my explanations to implement anything where you really care about security. (This falls under the general advice, “don’t roll your own crypto.”) Rather I just want to spread awareness of several decades of research that are very relevant to us today.
What is “privacy”, really?
You might expect that a histogram is pretty good for keeping individual numbers secret. If a hundred people are counted in the same histogram, how could I prove which bin you’re in?
However: there’s a long history of people publishing data that seemed like it shouldn’t breach anyone’s privacy, only for someone to discover that in conjunction with other information, it wasn’t nearly as private as it seemed. An extreme case here would be if I know the secret number of everyone except you: I can then tell which bin you’re in by elimination. By the same token, if you and I are the only participants in this project, then I can subtract my contribution from the histogram and yours will be all that’s left. Privacy compromises can get much more complicated than either of these two cases, of course.
One well-known example was when Netflix published anonymized movie rental histories for their Netflix Prize competition in 2006. By 2007, researchers had demonstrated how to correlate the anonymized Netflix ratings with public movie ratings from IMDb, successfully re-identifying private movie rating records in the Netflix dataset, with implications for the privacy of people’s sexuality and their political and religious views1.
If you want to know whether a piece of information is going to deprive someone of their privacy, the lesson here is you can’t just reason about what an adversary can learn from the information you’re publishing. You have to consider everything they could possibly have known already.
So now we’ve gone from “this seems fine” to “this seems impossible!”
It turns out, however, that there are general techniques which can provably preserve privacy! The field is called “differential privacy”. Research in this area kicked off when Dwork, McSherry, Nissim, and Smith published “Calibrating Noise to Sensitivity in Private Data Analysis” in 2006, although that paper did not yet use the “differential privacy” terminology.
Differential privacy promises that the results computed from a dataset won’t be significantly different whether your data is in that dataset or not, so you might as well participate. Because a dataset that includes you is essentially indistinguishable from one that doesn’t, it doesn’t matter what the adversary already knows!
There is a trade-off though between privacy and utility. Because the definition of differential privacy assumes the adversary could already know anything and everything, differentially private results must always have random noise added to them to obscure individual contributions to the results. Papers in this area are largely about deciding exactly how little noise you can get away with adding, because the more noise there is, the less useful the result is.
An alternative definition called “Distributional Differential Privacy” (DDP), introduced in Bassily, Groce, Katz, Smith: “Coupled-Worlds Privacy: Exploiting Adversarial Uncertainty in Statistical Data Privacy”, relaxes the assumption about an adversary’s abilities. Instead of assuming they could know everything, we define our assumptions about exactly what the adversary could know in terms of probability distributions. This requires some caution because if we omit the distribution that turns out to reflect some adversary’s actual knowledge, then all our privacy claims are invalid. But as long as we’re careful we can get more useful results by recognizing that real adversaries are not actually all-powerful.
One algorithm given in the original DDP papers is for privacy-preserving histograms, which is exactly what we need here2!
For now, let’s assume you believe I’m a trustworthy person, so you’re willing to give me your secret number and I’ll promise not to reveal it to anyone else. (We’ll improve on this assumption later.)
Once everyone has handed me their secret numbers, I can take the following steps:
- I’ll pick a probability p and a whole number k, based on how many people are participating and on how much privacy loss I’m willing to accept.
- For each participant, I’ll decide at random whether to use their number at all. Every participant needs to have the same probability, p, of being included, and it’s very important that nobody be able to predict these decisions.
- I’ll count how many of the selected participants fall in each bin.
- If any bin has a count less than k, I’ll report that bin as 0 instead.
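Before any cryptography enters the picture, this trusted-curator version is just a few lines of code. Here’s a rough Python sketch; the parameter values at the bottom are made up for illustration, and real choices of p and k should come from the thesis.

```python
import random

def private_histogram(secret_numbers, num_bins, p, k):
    """Sampled histogram: include each participant with probability p,
    then report any bin with fewer than k entries as 0."""
    counts = [0] * num_bins
    for number in secret_numbers:
        # Each inclusion decision is an independent coin flip that
        # nobody should be able to predict.
        if random.random() < p:
            counts[number - 1] += 1  # numbers run from 1 to num_bins
    # Suppress small bins so they can't single anyone out.
    return [c if c >= k else 0 for c in counts]

numbers = [random.randint(1, 10) for _ in range(100)]
histogram = private_histogram(numbers, num_bins=10, p=0.8, k=5)
```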
Section 6.2 of Adam Groce’s PhD thesis, “New Notions and Mechanisms for Statistical Privacy”, contains the proof of the following statement, given that I follow the above procedure:
An adversary cannot learn about an individual, even if the attacker knows that the individual was in the sample. Consequently, the adversary cannot determine if a given individual was in the sample to begin with.
This should be a fairly comforting result. Sadly there are several details in the proof I haven’t been able to follow, so I have to take Groce’s word for it.
I’ll also refer you to the thesis for alternate choices of sampling distributions, as well as how to pick p and k.
Secure multiparty computation
What if you don’t trust me with your secret number? (I’m hurt, but I’ll get over it.)
“Secure Multi-Party Computation” (often abbreviated “MPC”) is a sub-field of cryptography where people who have secrets they aren’t willing to share with each other can nonetheless work together to compute some function combining all those secrets.
In this case we’re going to use MPC techniques so you don’t have to reveal your secret number to anyone. Instead of relying on me to combine everyone’s secrets, you’ll work directly with the other participants to run the above algorithm, and they won’t learn your secret number either.
Researchers have proposed a variety of MPC techniques that are remarkably generic: they can compute anything that could be computed by an arbitrary combination of either boolean logic gates or addition and multiplication. It turns out you can express those in terms of each other, but some algorithms are simpler in boolean logic and others are simpler in arithmetic.
The above histogram algorithm can be expressed fairly clearly in either style, but I think I can better explain a boolean circuit solution than the alternatives based on number theory, so let’s focus on that.
Using boolean circuits
The boolean-circuit style of MPC originates in a 1987 paper called “How to Solve any Multi-Party Protocol Problem” (by Goldreich, Micali, and Wigderson). That paper focused on presenting a constructive proof that all polytime-computable functions can be computed privately. Which is super cool, but they didn’t address questions of efficiency, or a variety of other implementation details one might care about.
Instead let’s look at a paper that follows the above construction but makes specific implementation choices and evaluates performance on a real implementation. Choi et al published “Secure Multi-Party Computation of Boolean Circuits with Applications to Privacy in On-Line Marketplaces” in 2012. As a non-expert in this field, I found their paper relatively easy to follow, which is somewhat unusual in cryptography papers! The paper provides a nice overview together with plenty of citations to papers describing the fundamental building blocks. The authors also published the source code of their implementation, if you want something to play with, although I haven’t tested whether it still builds.
In this framework, every computation is expressed as a circuit composed of AND gates and XOR gates3. If you’re used to writing software, at first glance this feels super limiting. Where are the loops and if-statements and function calls? How do I do math or manipulate text or construct fancy data structures?
Hardware designers, on the other hand, deal with this all the time.
- Function calls are replaced by just copying another instance of the circuit everywhere it’s needed.
- There are tons of known boolean circuits for doing arithmetic.
- If-statements involve computing the result of the conditional expression “c” as well as both the true and false branch results “t” and “f”, then computing “(c AND t) OR (NOT c AND f)”.
And so on. A task that takes just a few lines of code in your favorite programming language might blow up into a giant circuit, but still, many more things are possible than one might expect.
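The if-statement trick above is just a multiplexer. Here it is in Python, treating 0/1 ints as wires:

```python
def mux(c, t, f):
    # "if c then t else f", built from gates alone:
    # (c AND t) OR (NOT c AND f), with NOT c written as 1 - c.
    return (c & t) | ((1 - c) & f)

# Condition true selects the "then" input; false selects the "else" input.
assert mux(1, 1, 0) == 1
assert mux(0, 1, 0) == 0
```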
Once you’ve described your program as a boolean circuit, I encourage you to read either of the papers I cited above to learn how to turn that into a secure computation4. But I’ll note a couple of important details now:
- In this system, every participant splits up their secret inputs into random parts, and assigns one part to each of the other participants. The “real” value is the XOR of all the parts together, but since each participant sees only one of the random parts, nobody can draw any conclusions about what the actual value is.
- Because of the way the parts are distributed, each participant can compute their part of the result of an XOR by simply doing an XOR of their share of the inputs. So XOR operations are super easy.
- However, AND operations require pairwise communications between all participants, as well as cryptographic operations, so they’re relatively expensive.
- At the end of the computation, all the participants’ parts can be combined to get the final answer.
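A plaintext sketch of this XOR secret-sharing scheme for a single bit (no network or cryptography, just the share arithmetic):

```python
import random

def share_bit(secret, n):
    """Split a secret bit into n random-looking shares whose XOR is the secret."""
    shares = [random.randint(0, 1) for _ in range(n - 1)]
    last = secret
    for s in shares:
        last ^= s  # force the XOR of all n shares to equal the secret
    return shares + [last]

def reconstruct_bit(shares):
    result = 0
    for s in shares:
        result ^= s
    return result

# XOR of two shared secrets is computed share-by-share, with no
# communication: each participant XORs the two shares they hold.
a_shares = share_bit(1, 3)
b_shares = share_bit(0, 3)
xor_shares = [a ^ b for a, b in zip(a_shares, b_shares)]
assert reconstruct_bit(xor_shares) == 1 ^ 0
```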
As a warm-up, let’s devise three kinds of circuits, which we’ll then rearrange and combine in various ways to solve the histogram problem.
The first circuit we need adds two 1-bit numbers and a 1-bit carry input, producing their 1-bit sum and a 1-bit carry output. Hardware designers call this circuit a full adder, and if Wikipedia doesn’t satisfy your curiosity you should be able to find plenty of details in any introductory digital circuits textbook. I learned about them from a children’s book, personally, but I may have had a strange childhood5.
A full adder needs 3 XOR gates and 2 AND gates6. Remember, in this style of multi-party computation, XOR is fast while AND is slow.
To add numbers that are N bits wide, the simplest option is to chain together N copies of the full adder circuit. Connect the carry-output for the least-significant bit to the carry-input of the next bit, to form what’s called a ripple-carry adder. This means we need 3·N XOR gates and 2·N AND gates.
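Here’s a sketch of both circuits in Python, operating on 0/1 ints. The usual OR in the carry logic is replaced by XOR, which is safe because the two carry terms can never both be 1 at the same time:

```python
def full_adder(a, b, carry_in):
    # Sum bit: a XOR b XOR carry_in (2 XOR gates).
    # Carry bit: (a AND b) XOR ((a XOR b) AND carry_in)
    # (2 AND gates plus 1 more XOR gate, reusing a XOR b).
    s = a ^ b ^ carry_in
    carry_out = (a & b) ^ ((a ^ b) & carry_in)
    return s, carry_out

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-width little-endian bit lists, dropping the final carry."""
    result, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result
```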
This is a complete boolean circuit in its own right, and at this point we could use Choi et al’s implementation of multi-party computation to securely find the sum of any number of secret inputs.
The next circuit we’ll build compares two N-bit numbers A and B, producing a 1-bit output that is 1 if A<B and 0 otherwise.
- If the most-significant bit of A and B are 0 and 1, respectively, then A is less than B.
- If they’re equal (both 0 or both 1) then we should compare the next most significant bit of each.
- Otherwise, or if all the bits are equal, A is not less than B.
This needs 4 XOR gates and 2 AND gates per bit.
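A sketch of that comparison in Python, scanning from the most-significant bit. This is one possible wiring of the gates; the exact per-bit gate count depends on how you formulate it.

```python
def less_than(a_bits, b_bits):
    """Return 1 if A < B, for equal-width big-endian bit lists."""
    lt, eq = 0, 1
    for a, b in zip(a_bits, b_bits):  # most-significant bit first
        # A < B is decided at this position only if every higher bit
        # matched (eq) and this bit of A is 0 while this bit of B is 1.
        # XOR is safe in place of OR: once lt is set, eq is 0 forever.
        lt = lt ^ (eq & (a ^ 1) & b)
        eq = eq & (a ^ b ^ 1)
    return lt
```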
At this point we can solve variants of Yao’s Millionaires’ Problem, a thought experiment in cryptography where two people want to determine which of them has more money without actually revealing how much they have to each other.
However, every time we’re going to use this comparison circuit in the final histogram circuit, it turns out that either A or B is a public constant, so some of the inputs to these gates are known in advance. That allows us to simplify the circuit to 1 AND gate, and between 0 and 2 XOR gates, per bit7.
Random number circuits
The third basic circuit we need is a little different, because it doesn’t correspond to any physical digital logic circuit. Instead, this one relies on a specific property of Choi et al’s approach to multi-party computation.
Recall that Groce’s private histogram algorithm requires us to decide at random whether to include any given person’s private data in the final output, in such a way that the adversary can’t predict our decision.
So to start with, let’s figure out how to generate a single random bit, equally likely to be a 1 or a 0. None of the participants are allowed to know the value of this bit, but everyone needs to be able to use it in further computation.
Remember that in Choi et al’s approach, a participant’s secret input bit is represented by giving every participant a random bit, constrained such that the XOR of all those bits produces the true value.
So to generate a secret random bit, instead of distributing shares of somebody’s secret input, each participant should just generate their own random bit, and together all those bits will represent a truly random bit8.
We can extend this to generate a random bit that is 1 with probability p. If we concatenate N uniform random bits, we’ll get a uniform random N-bit number. With probability p, that number will be less than p times 2^N, so we can use our earlier less-than circuit to get the desired output.
We just have to pick a large enough N that it can represent p in fixed point with enough precision. To put that another way, p needs to be close to an integer multiple of 1/2^N. On the other hand, since the comparison circuit uses N AND gates and those are expensive, it’s worth picking the smallest possible value of N that lets us represent p precisely enough for our needs.
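Here’s a plaintext model of the whole random-bit construction. In the real protocol every bit would stay secret-shared and the final comparison would be the less-than circuit; this sketch only shows the logic.

```python
import random

def secret_uniform_bit(num_parties):
    """XOR of one locally generated random bit per party. The result is
    uniform as long as at least one party's bit really is uniform."""
    bit = 0
    for _ in range(num_parties):
        bit ^= random.randint(0, 1)
    return bit

def biased_bit(p, n_bits, num_parties):
    """Return 1 with probability ~p: build an n_bits-wide uniform random
    number and compare it to the fixed-point threshold p * 2**n_bits."""
    threshold = round(p * 2 ** n_bits)
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | secret_uniform_bit(num_parties)
    return 1 if r < threshold else 0  # stands in for the less-than circuit
```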
With that background in mind, let’s sketch a circuit for privacy-preserving histograms!
First we have to think about how each participant’s input should be encoded. I can think of two reasonable ways to represent your input: either as the binary encoding of your chosen secret number; or as one bit per histogram bin, where exactly one of the bits is 1. Using one bit per bin simplifies the circuit a little, but means you have to take extra steps to ensure that nobody sneakily puts themselves in multiple bins. And you can do that with “zero-knowledge proofs”, but it’s probably better to use an encoding with no redundancy or illegal states in it.
In advance, everyone agrees on a probability p, a whole number k, and the shape of the circuit we’re about to evaluate. These are public parameters to the algorithm. If you believe the parameters aren’t good enough to protect your privacy, you can refuse to participate without revealing anything about your secret number.
If everyone’s secret number is encoded in binary, then we need to convert to a bit-per-bin representation. Electrical engineers have lots of designs for converting between binary and so-called “one-hot” encoding, but for simplicity just assume we’ll use the less-than circuit a bunch of times.
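For the record, here’s one simple way the conversion could lean on the less-than circuit (a plaintext sketch, with bins indexed from 0). The indicators “x < c” are nested as c grows, so XORing neighbouring indicators leaves a 1 in exactly one position:

```python
def one_hot(x, num_bins):
    """Convert x in 0..num_bins-1 to a one-hot bit list."""
    # indicator[i] = 1 iff x < i + 1; each one stands in for the
    # less-than circuit compared against the public constant i + 1.
    indicators = [1 if x < i + 1 else 0 for i in range(num_bins)]
    # The indicators look like 0...01...1, so XOR of neighbours is 1
    # exactly at the position where they flip.
    hot, prev = [], 0
    for ind in indicators:
        hot.append(ind ^ prev)
        prev = ind
    return hot
```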
Next you all need to decide whose inputs are actually going to get included in this histogram. For each participant, generate a random bit that is 1 with probability p. Then AND that bit with each bin of that participant’s input.
The level of paranoia in our random number circuit may have seemed excessive. But to make the correctness proof hold, we’re relying on the fact that none of the participants ever learn what any of the random numbers actually were, and also that none of them can influence whether somebody gets counted or not, so long as even one participant is honest.
Use a bunch of full-adder circuits to add up the number of 1-bits in each bin.
For each bin, use the less-than circuit to check if k-1 is less than the count in that bin; the result will be 1 if the count is big enough and 0 otherwise. The final result for that bin is the AND of this bit with each of the bits of the count.
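That last step in miniature, on plain ints: the comparison bit gates every bit of the count, modeled here with a 0-or-all-ones mask standing in for the per-bit ANDs.

```python
def suppress_small_bins(counts, k):
    """Report each bin's count, or 0 if the count is below k."""
    result = []
    for count in counts:
        big_enough = 1 if k - 1 < count else 0  # the less-than circuit
        # ANDing this flag into each bit of the count either passes the
        # count through or zeroes it; -big_enough is 0 or an all-ones mask.
        result.append(count & -big_enough)
    return result
```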
Ta-da! Using distributional differential privacy, we saw how to make sure that the histogram output doesn’t reveal more than we intended to. Now we’ve seen how to ensure that the inputs are never revealed to anybody either.
Histograms using arithmetic
Many functions are inconvenient to express as boolean circuits. I mean, you can do it, but the circuits get huge. So I’d like to give a brief sense of what this would look like in an arithmetic setting instead.
Instead of taking AND and XOR as our basic operations, we can choose multiplication and addition over integers. However, like machine arithmetic on CPUs, these operations wrap if the result is bigger than some implementation-specified constant.
Some parts of the histogram algorithm are a lot simpler in this setting. Adding up the count of participants in each bin takes one addition per participant, instead of 2 AND gates per bit per participant. Even better, addition is the cheap operation while multiplication is expensive, so that stage becomes basically free.
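A sketch of additive secret sharing (with a hypothetical modulus of 2^32) showing why that stage is nearly free: adding two shared values requires no communication at all.

```python
import random

MOD = 2 ** 32  # hypothetical wrap-around modulus

def share_int(secret, n):
    """Split a secret into n additive shares summing to it mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def reconstruct_int(shares):
    return sum(shares) % MOD

# Each party just adds the two shares it holds, locally:
a_shares = share_int(5, 3)
b_shares = share_int(7, 3)
sum_shares = [(a + b) % MOD for a, b in zip(a_shares, b_shares)]
assert reconstruct_int(sum_shares) == 12
```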
Other parts are more complicated, because in general you can’t turn addition and multiplication into “return 1 if A<B and 0 otherwise”. Fortunately in this setting we aren’t actually working with arbitrary-precision integers, so there are tricks you can play, such as detecting that a computation overflowed. I’m not going to go into detail here, but there are specific methods for implementing the less-than and random-number circuits we relied on in the previous section. If you’re interested, see “Multiparty Computation for Interval, Equality, and Comparison without Bit-Decomposition Protocol” by Nishide and Ohta for one approach.
Since all the primitives we need exist in this multiparty arithmetic setting, clearly we can implement private histograms this way too! If you’re looking for a challenge, you could go implement this algorithm both ways and find out which one is more efficient, because I honestly don’t know…
Well, the number you chose has remained your secret this whole time, even as you let people learn some aggregate information about a crowd of people that included you.
Recent news reports about businesses like Cambridge Analytica vacuuming up huge amounts of personal information to create creepy profiles of people have been scary wake-up calls regarding our privacy in the era of big data and algorithms. However, there are many positive uses for data collection, ranging from serious medical studies to helping people discover some entertaining thing they’ll enjoy.
With this post, I hope I’ve conveyed that it’s theoretically possible to get the personal and societal benefits of data collection, without the harms due to our personal data being used in ways we didn’t intend.
I’ve only touched on a few specific examples of research from the past several decades, and there’s a lot of work to be done on making these approaches efficient, scalable, and usable. But the foundations are there, so we should reject the corporate story that we have to give up all our privacy to get these benefits.
Narayanan and Shmatikov. “Robust De-anonymization of Large Sparse Datasets” ↩
Much of the differential privacy literature uses sums or counts as examples, because those are easy to analyze—and a histogram is just a collection of counts. So finding a histogram algorithm in a paper in this field is not actually very difficult. I’ve chosen to write about this particular algorithm for two reasons. First, it’s pretty straightforward to adapt to the secure multiparty computation setting, but it has some interesting twists that let me discuss a bit more about that setting. Second, I started off trying to solve a different problem, and this algorithm has nice properties for my use case, though I won’t discuss that further right now because I don’t have a proof yet that my modifications to the algorithm are actually safe.
If you’d like to compare this with an algorithm that was designed from the start to satisfy the original differential privacy definition in a multiparty computation setting, I recommend Bindschaedler et al, “Achieving Differential Privacy in Secure Multiparty Data Aggregation Protocols on Star Networks”. That algorithm is also fault-tolerant against some number of participants going offline partway through. ↩
The original proposal by Goldreich et al used NOT gates rather than XOR, but if you provide an always-true input, “NOT b” can be rewritten as “1 XOR b”. So Choi et al can compute everything Goldreich et al can, with the same efficiency. ↩
If you really get into the implementation details of the boolean circuit approach, you may also be interested in papers on “oblivious transfer”, such as Donald Beaver, “Precomputing Oblivious Transfer”, and Li et al, “Efficient reduction of 1 out of n oblivious transfers in random oracle model”. ↩
I learned about boolean circuits for addition from David Macaulay’s 1988 children’s book, “The Way Things Work”, which illustrates everything with the help of a herd of friendly woolly mammoths. Honestly I think it’s a fantastic book for curious people of all ages. Apparently there have been two newer editions since the version I read as a child, if you’re concerned about the book’s relevance thirty years later. ↩
There are several combinations of gates that can work to implement a full adder. Wikipedia shows the two AND gates being combined with an OR gate to produce the carry output. However, one of the AND gates computes A AND B, while the other has an input from A XOR B. Since those can’t both be 1 at the same time, the inputs to the OR gate also can’t both be 1 at the same time, so in this case OR and XOR will produce the same result. ↩
In a digital comparator where one operand is constant, the total number of XOR gates depends on several factors. If the left-hand side is constant, then the number of XOR gates is twice the number of 0 bits. If the right-hand side is constant, the number of XOR gates is twice the number of 1 bits, plus the number of 0 bits.
If you wanted to micro-optimize this circuit, you could note that on integers, A<B is equivalent to either NOT (B-1<A) or NOT (B<A+1). That means that if the constant is the left-hand operand you can swap it to the right, or vice versa, for the cost of an additional XOR gate.
Of course this is not worth doing, because in this system, XOR gates are basically free compared to AND gates. But I spent so much time thinking about this useless micro-optimization that, gosh darn it, I’m at least allowed to put it in a footnote. ↩
This protocol for generating a random bit struck me as the obvious thing to do, and simultaneously as way too easy for it to possibly be correct. It wasn’t at all obvious to me that the XOR of a bunch of uniform random bits would still be uniform random, or that a malicious participant couldn’t influence the result one way or the other. Fortunately my friend Joe Ranweiler dug up the perfect Mathematics Stack Exchange answer for this question: “How to prove uniform distribution of m⊕k if k is uniformly distributed?” That proof shows that as long as at least one participant generates their bit uniformly at random, it doesn’t matter what anyone else does: the result will still be a uniform random number. ↩