Over the last year, a lot more people have been using video-conferencing platforms, such as Zoom, because they couldn’t safely meet in person.

I don’t like siloed platforms like Zoom. I don’t want to devolve into a rant here, so I’ll just say: I believe that systems should always work in a federated way if that’s feasible, like how people at gmail.com can email people at outlook.com even though those are operated by completely different companies. And we’ve had email-like standards (SIP) for voice, video, and chat for over two decades; apparently, cell phone carriers have been using them for most of that time, and the current iteration (VoLTE) is actively deployed in 104 countries/territories around the world. So it’s not like we don’t know how to make the technology work.

Recently, I tried to counter my own biases and ask: is there a reason that I consider justified to build a non-federated conferencing service? In particular, what impact do architectural choices have on costs and efficiency for the various participants?

While considering these questions, I dug through a number of research papers on cryptography and on networking, to see whether academia can offer improvements for video-conferencing that industry hasn’t picked up yet.¹

I learned a lot, but ultimately what I can report today are negative results. I’m sharing the ideas I considered in the hopes that you may be inspired to apply some of these research results to some setting where they’re more appropriate. In addition, I’m sharing the process I followed to ultimately reject those ideas in the hopes that you may find the reasoning helpful in your own investigations.

Secure audio conferencing

Audio-only conference calls have an interesting property: it’s possible to mix everyone’s audio into a single stream. With video calls, each participant needs to receive a copy of every other participant’s video, which imposes practical limits on how many people can join a video conference. But no matter how many people are sending audio, if a central server handles mixing all the audio streams together, then each participant’s bandwidth usage is constant.

That sounds like a good justifiable reason to centralize audio-only conference calls, at least.

But I don’t want that centralized service to be able to listen in on my calls. So is there a way to send encrypted audio through the server and let it mix the audio together but without being able to decrypt it?

The answer is yes! In fact while discussing this with some friends, one of them reminded me that a team at Galois, where I used to work, built a prototype of exactly this application during their work on ShareMonad; they wrote about the experience in Launchbury et al’s paper, Application-Scale Secure Multiparty Computation. Their approach requires that you have three servers which you trust won’t share data with each other, which presumably means they’re operated by independent entities and maybe in different legal jurisdictions; as a result, the deployment story seems awfully complicated. Also I think if they’d used the approach I’m about to describe, their implementation would have been simple enough to not make for an interesting academic paper. 😅

Mixing audio fundamentally comes down to taking a sample from each source and adding them together. Because it’s just addition, there are a lot of tricks we can play. A variant of the ShareMonad-style approach, for example, splits a sample into multiple “shares”, where each share is random but the sum of all the shares equals the original value. So each client sends a different share of their samples to each server; each server adds all the shares it received, resulting in one share of the mixed result; and all the servers send their result shares back to the clients, which add up the shares to find the final mixed sample². If all the servers pool together the shares they’ve been given, then they can learn what the original inputs were; but if just one server keeps its shares secret, then none of the servers learn anything about whatever is being discussed on the call.

Another approach is to use homomorphic encryption. In an additively-homomorphic cryptosystem, I can give you some encrypted numbers, and there’s some procedure you can follow to find the encryption of the sum of those numbers, even if I don’t give you the decryption key. The most-used additively homomorphic cryptosystem was published by Paillier in 1999, and simplified and generalized by Damgård and Jurik in 2001. Much of the research since then has been pursuing “Fully Homomorphic Encryption” which supports not just addition but also multiplication (or other pairs of operators), because there aren’t that many applications where just having addition is enough. But audio conferencing is one of those rare exceptions!

Modern cryptosystems generally have a security parameter, where higher numbers are expected to be secure against adversaries with more computing power. In the case of Paillier, if the security parameter is 2048 bits, then the plaintexts it can encrypt are up to 2048 bits and the resulting ciphertexts are 4096 bits.³ This is important because we want as high of a security parameter as we can afford, but at reasonable security levels we can’t do very many encryptions or decryptions per second, so we need to make the best use we can of each one.

So we need to pack multiple samples into a single plaintext. If we’re using 8-bit audio samples, then we can fit about 250 of them into a 2048-bit plaintext⁴. That said, if we add two audio chunks packed this way together, any sample could overflow into the least significant bit of the next sample. So we need to leave an extra zero bit in between each sample. In general, if we have N participants in the conference, then the number of padding bits we need is the base-2 log of N, rounded up. If we leave ourselves 10 bits for each 8-bit sample, then we can have up to four participants and about 200 samples per plaintext. Note that the server doesn’t need to know which bits are padding, so the participants can negotiate amongst themselves regarding the number of samples per plaintext and the number of bits per sample.

Up to this point I’ve assumed that the audio is uncompressed, but there’s one very common audio compression trick that works just fine in this setting: MDCT. Used by everything from AAC to WMA, the MDCT transforms a sequence of time-domain audio samples (plus half as many samples from before and after the current block) into an equal number of frequency bins. The neat thing is that adding two blocks together and then transforming to the frequency domain gives the same result as transforming both blocks and then adding them. So a server performing Paillier additions doesn’t even need to know whether the plaintexts contain time-domain or frequency-domain representations. To compress the transformed blocks, the participants just need to agree on how many bits to use for each frequency bin. For example, they might choose to use fewer bits for frequencies outside the normal range of human speech.

Estimating feasibility

Okay, so let’s estimate how much compute power this plan takes. If we assume 200 samples per plaintext and a sample rate of 8kHz, that means each participant needs to do 40 encryptions and 40 decryptions per second, and the server needs to do 40*(N-1) additions per second for N participants. To get a rough idea of how fast Paillier operations are, I ran the microbenchmarks for the Rust paillier-common crate on my laptop, a 2015 model with a 3.1GHz Intel Core i7 CPU.

In my completely-unscientific testing, one encryption and one decryption of 2048-bit plaintexts together take about 7ms, which is about 28% of the 25ms we have before it’s time to process the next 200 samples. So my laptop could probably handle this scheme, but a phone might not be able to.

The Paillier additions are very fast though, by comparison. My laptop CPU, which is presumably less powerful than a reasonable server, could handle something over 4,000 participants simultaneously on one CPU core.

So this seems feasible and I think that’s a bunch of really cool results, but…

Is this worth doing?

No. It’s good enough to send a separate audio stream for each person who is speaking; that may even be a superior approach. The recipients can mix the audio into a mono stream for playback, and the streams can be end-to-end encrypted using bog-standard symmetric-key cryptography even if they’re routed through a central server.

In a conference call, it isn’t practical to have many people talking simultaneously, because then nobody can make out what anybody is saying. As long as the maximum number of simultaneous speakers is small, the amount of bandwidth required to listen in stays small no matter how many listeners join.

Anyway, the amount of bandwidth needed for compressed audio is tiny. High-quality lossy compression for speech needs maybe 40kbps. Given the parameters I used above, my scheme would require 160kbps in each direction, so as long as there are fewer than eight simultaneous speakers, the simpler system is more efficient even with higher quality audio. Using an MDCT and fewer bits for some frequency bins helps, but it probably doesn’t make up that big of a difference.

I hope it’s clear that none of this means homomorphic encryption schemes, like the Paillier or Damgård–Jurik cryptosystems, are useless. Secure multiparty computation, like ShareMonad, isn’t invalidated as a research field by this result either. But I do think that audio conferencing isn’t going to be a fruitful application domain for any of these tools, because its challenges are better addressed using more straightforward solutions.

Efficient peer-to-peer video conferencing

Unlike audio conferencing, there are no shortcuts in video conferencing: if you want to send video from N people to some recipients, you need to send N times as much video data. Anyway I just got done arguing that the audio shortcut of mixing down to one stream isn’t actually useful, so anything we might say about video applies to some extent to audio too.

But maybe academic research can point the way to more efficient distribution of that video data across the internet?

Before we can investigate that question, we have to decide what we’re trying to optimize.

If you’re coming from the perspective of operating a video-conferencing platform like Zoom, you might look at the costs associated with routing all that video through your own servers and networks, and wonder whether a more peer-to-peer system might save you money.

If your perspective is you don’t trust video-conferencing platforms to keep your meetings secure, then you also might wonder whether you can convince the people you communicate with to use a peer-to-peer system.

On the other hand, if you operate some portion of the public internet, you might wonder whether there’s a way people could use less of your network for video-conferencing, leaving more room for whatever it was we did with the internet before the pandemic.

The bad news is that no matter what, there will be at least N*(N-1) video streams bouncing around for a video-conference where all N participants have their cameras turned on, because one way or another, each participant has to receive one video stream from every participant except themself. If the video is sent to a central server and redistributed from there, the minimum is N*N streams.

Note that since every stream has both a sender and a receiver, if you add up the bandwidth costs for all participants, they come to twice the number of streams.

The good news is that we can shift the costs of those streams around, and perhaps reduce the cost of the underlying infrastructure, by organizing the participants into different structures.

Centralized platforms

The simplest way to engineer a video-conferencing platform is to have every participant connect to a central server, which copies everyone’s video stream to everyone else. This is nice for the participants, who each need enough bandwidth for N video streams, which is the bare minimum. On the other hand, that central server faces N*N video streams, which places much harder limits on how many participants the platform can handle in one video conference. I might be able to handle one thousand simultaneous video streams on my gigabit fiber⁵, but then that poor server is facing one million streams for one conference.

The “one central server” plan also isn’t good for those folks operating the public internet. Imagine I placed that server in Europe but everyone in some conference call is in North America. Now there are N*N streams crossing the expensive trans-Atlantic cables even though exactly 0 of them are of interest to anyone on the European side of the ocean.

This isn’t the way any serious platform operates, clearly, but let’s set aside the centralized platforms for a moment and go to the other extreme.

Everyone sends directly to everyone else

The most straightforward way to make a video conferencing system peer-to-peer is to require every participant to open a direct connection to every other participant, with each sending the other their own video feed.

This option is, hands down, the most fair way to distribute the bandwidth costs: every participant needs enough bandwidth for 2*(N-1) streams. No more, no less. That’s up to twice what they needed in the centralized model, but on the other hand there isn’t anybody getting stuck with the dramatically higher N*N bandwidth costs.

Fairness isn’t necessarily a good thing, since participants often have different amounts of bandwidth available. Fairness could mean limiting the size of a video conference to the number of people that the least-well-connected participant has bandwidth for.

That said, direct connections open up the option of sending video encoded at a different bitrate for some recipients, at the cost of more CPU time spent on video encoding. So that tradeoff might pay off.

This option also may offer the lowest end-to-end latency, since each stream goes as directly as possible to its recipient. Or it may not, if some platform builds out a global network optimized for latency and can bypass networks that are optimized for throughput and cost. It’s fun to look at e.g. Zoom’s PeeringDB entry and speculate about what they’re doing with the peering exchange points that they have a presence in on every continent.

On today’s internet, not everyone can connect directly to everyone else, due mostly to NAT. But using direct connections by default and falling back to platform-provided TURN relays still cuts out a lot of the platform’s costs in the centralized model. In fact, operating a global network of TURN relays is currently a useful service in its own right; see, for example, Twilio’s Global Network Traversal Service. I just hope some day we don’t need such services any more.

Overlay multicast

There are often clusters of participants in different geographic locations—say, a few in San Jose, some others in Toronto, and let’s rope in the team in London while we’re at it. Maybe what we should do is pick one well-connected participant in each cluster who handles the long-distance video forwarding to one or two other clusters, and then that representative connects to everyone else in their cluster. If that’s too much for one of those representatives, then we can split that cluster into several smaller clusters, and the top-level representative just has to connect to the mini-cluster representatives.

Any structure like this can be seen as a tree, although in this context it doesn’t actually matter which node you think of as the root. Since every node except the root has a single parent, and every participant’s video is flowing either up or down the tree at every edge, this still results in N*(N-1) video streams, just like when everyone sends directly to each other. The difference is that an individual participant’s bandwidth requirements are now N*D, where D is the degree of that node, or in other words how many other participants it’s connected directly to.⁶

This lets us have a kind of “controlled unfairness”. Each leaf in the tree only connects to its parent, so it needs enough bandwidth for N streams, just like in the central server case. The interior nodes can connect to as many other participants as they can afford. Higher degree makes the tree more flat which decreases worst-case end-to-end latency but, of course, increases bandwidth costs for that participant.

The problem is that participants have incentives that run counter to doing what’s best for the group as a whole: in this context, they often want to minimize their own bandwidth costs. We have a lot of choices for how to structure the tree, but how do we know which ones people will actually use?

While I was researching this topic, I found a paper called “Repeated-Game Modeling of Multicast Overlays”, by Afergan and Sami, published in the proceedings of IEEE INFOCOM 2006. Their model was that only one source would be broadcasting and the root of the tree would receive the broadcast first, so they consider an additional incentive which is to be as close to the root as possible in order to get the lowest latency. (In fairness, that’s what most people mean by “multicast”; I’m abusing the terminology a bit because the same structure works here.) Although that incentive doesn’t apply here, most of their modeling and reasoning is still valid in this situation too.

I enjoyed reading this paper! It provided a convincing way to reason about individual incentives, and turned the results into system design principles for encouraging individuals to cooperate.

There are many other papers on the topic of “overlay multicast” which suggest different ways to structure these trees, so anyone who wants to go down this road has lots to work with. But…

Is this worth doing?

Specifically, should video conferencing avoid having a centralized platform but also avoid having every participant connect directly to each other?

This isn’t totally clear, but no, I don’t think so.

The first model, with a central server, is roughly the same as a tree where the root has all other participants as its direct children. I find this to be a useful way to think about how centralized platforms can scale to multiple servers: organize the servers in a tree, and let conference participants connect (as leaves of the tree) to whichever server is geographically closest to them.

Keeping the interior nodes of the tree under the control of a centralized platform means that all the incentives are aligned: the operators of those nodes care about the sustainability of the network as a whole, because it’s their network. They also have the ability to place those nodes wherever they’re most needed, rather than relying on the geographic distribution of the participants, and they have the ability to control the quality of the connections between those nodes.

In the peer-to-peer setting, requiring every participant to connect directly to each other doesn’t impose that much of a bandwidth cost. For two participants the cost is exactly the same; for three, it’s 33% higher. As the number of participants grows, the bandwidth cost that the peer-to-peer participants have to shoulder tends asymptotically toward twice the cost of using a central server.

Forming an overlay multicast tree is complicated because, as the repeated-game paper showed, protocol and policy choices must be made with the participants’ incentives in mind. Since the only advantage over direct connections is that some nodes save a little bandwidth by pushing part of their load onto others, I don’t think that complexity is justified.

Most of these considerations would go out the window if IP multicast had been widely deployed on the public internet. If the routing infrastructure at the IP layer supported sending only one copy of a stream across a particular link even when multiple recipients want it on the other end, then I would say everybody should just use that and call it a day. Every participant’s bandwidth costs would be from just N video streams, like in the central server case but without any central servers, and every video stream would cross any given network link exactly once. But the key there is that it has to be integrated with the underlying routing decisions, not bolted on as an application-level overlay.

sigh… maybe someday we’ll get an internet that does everything, but I’m not holding my breath.

Conclusions

N-to-N audio/video conferencing is tricky to make scalable, but there’s been significant demand lately. Given how much academic research hasn’t found industry adoption, it’s worth checking whether there are hidden gems that can drive the next big innovation. In this case, the research I evaluated didn’t yield any of those, which always feels disappointing.

But I hope you’ve found the process I went through interesting, at least, and perhaps picked up some tricks that you’ll be able to apply to other challenges you encounter some day.

Footnotes

Cryptography is a fascinating field with all sorts of strange and wonderful theoretical results. So it seems a shame that out of that entire field, the only things impacting most people’s life day-to-day are public-key encryption, symmetric-key encryption, and secure hash functions. Granted, these three tools have had a huge impact; all three are necessary to make the little “lock icon” in web browsers meaningful. (In addition, secure hash functions have a huge impact on the world’s electricity consumption via proof-of-work blockchains such as Bitcoin; most tools have both good and bad uses, after all.)

More generally, there’s a lot of academic research out there that has not found applications besides publishing more research papers. That doesn’t mean it was a bad idea to do that research! It may mean the right application for it just hasn’t appeared yet, or it may inspire other research that’s more immediately applicable, or it may just serve to show that people shouldn’t waste time doing that particular thing again. All of those outcomes are valuable. ↩
The ShareMonad implementation was made more complicated than this because they had the servers change between logarithmic and linear representations as well as clamping the mixed samples back into valid dynamic range. They also generate a different mix for each client, excluding that client’s input while mixing all the other streams together. But all that work can be done on the clients, either before encrypting their individual samples or after decrypting the mixed samples. The only work which requires seeing samples from multiple clients is the addition step, and that part is easy. ↩
The Damgård–Jurik cryptosystem supports a parameter called s, where the plaintext can be s times as long as the security parameter, and the ciphertext will be s+1 times as long, for a ciphertext expansion rate of (s+1)/s. Paillier is the special case where s is 1, so the ciphertext expands by a factor of 2 compared to the plaintext; higher values of s are more space-efficient but a little slower per bit.

This was not clear to me from the Wikipedia articles on the subject, so to get the whole story I read the paper “A Generalization of Paillier’s Public-Key System with Applications to Electronic Voting” by Damgård, Jurik, and Nielsen. That’s a fascinating paper in its own right, including a bunch of useful zero-knowledge proofs that are relevant to electronic voting. It also includes performance measurements for a 750MHz Pentium III, which were not super helpful for my purposes. ↩
Although 256 8-bit values fit in 2048 bits, a Paillier plaintext must be numerically less than the public key, which is the product of two 1024-bit prime numbers, and is therefore some number smaller than 2^2048. ↩
Yes, yes, I know, if I try to display 1,000 video streams on my screen, they’ll all be so small as to be useless. An alternate title for this post could have been “Asymptotic complexity isn’t everything.” ↩
I find it informative to consider some specific shapes that an overlay multicast tree may take on:

In the extreme case where every node except the last has exactly one child, the two nodes at the ends (the root and the leaf) have degree 1 and the longest possible delay from each other; everyone else has degree 2, so they need only slightly more bandwidth than in the case where everyone connects directly to each other.

If the tree is a balanced binary tree, then about half the participants have degree 1, while the rest have degree 3, except for the root which has degree 2. ↩