A number of the proposals for how to manage the COVID-19 pandemic rely on being able to determine who has come into contact with infected people and therefore are at risk of infection themselves. Singapore, Taiwan and Israel have already deployed phone-based tracking technology and several recent proposals for re-opening the US economy depend on some sort of contact tracing system. There has been a huge amount of work in this area (see the list here), with perhaps the best known effort being the joint announcement by Apple and Google that they would be building this kind of functionality into iOS and Android.
To some extent what’s going on here is just that this is a nicely packaged, accessible, technical problem — learn some things, keep others secret? Sounds like a job for crypto! — and so we have a number of approaches that are quite similar. However, the other thing you see is that these solutions embed quite different assumptions about how they are going to be used and what kind of privacy properties you need and that ends up giving you a variety of different designs.1
A Centralized System (BlueTrace)
Let’s start by looking at Singapore’s system, BlueTrace, which describes itself as a “Privacy-Preserving Cross-Border Contact Tracing” system. As shown in the figure below, BlueTrace works by having the health authority run a central server which issues each user a series of TempIDs, each of which is an encrypted token that contains the user’s identity and is good for about 15 minutes. When two devices encounter each other, they exchange TempIDs, so your device gradually accumulates a list of the TempIDs of all the devices you have come into contact with. If you test positive, you upload all those TempIDs to the health authority (your own TempIDs are irrelevant), which then decrypts them and is able to identify all the people that you might have infected and can take appropriate action.
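To make the data flow concrete, here is a minimal sketch of the centralized pattern. It is not the BlueTrace implementation: the class and method names are hypothetical, and a private lookup table on the server stands in for the encrypted token (the property modeled is only that nobody but the health authority can map a TempID back to a user).

```python
import secrets


class HealthAuthority:
    """Toy central server (hypothetical). A private lookup table stands in
    for the encrypted tokens: only the server can resolve a TempID."""

    def __init__(self):
        self._owner_of = {}  # TempID -> user identity

    def issue_temp_id(self, user_id):
        temp_id = secrets.token_hex(16)  # opaque to everyone but the server
        self._owner_of[temp_id] = user_id
        return temp_id

    def resolve_contacts(self, uploaded_temp_ids):
        # Map the TempIDs a positive user collected back to real identities.
        return {self._owner_of[t] for t in uploaded_temp_ids if t in self._owner_of}


class Device:
    def __init__(self, user_id, authority):
        self.user_id = user_id
        self.authority = authority
        self.current_temp_id = authority.issue_temp_id(user_id)
        self.seen = []  # TempIDs received from nearby devices

    def rotate(self):
        # In the real system this happens roughly every 15 minutes.
        self.current_temp_id = self.authority.issue_temp_id(self.user_id)

    def encounter(self, other):
        # Both devices record the other's current TempID.
        self.seen.append(other.current_temp_id)
        other.seen.append(self.current_temp_id)


authority = HealthAuthority()
alice = Device("alice", authority)
bob = Device("bob", authority)
carol = Device("carol", authority)

alice.encounter(bob)
bob.rotate()
alice.encounter(bob)  # same person, different TempID

# Alice tests positive and uploads the TempIDs she received.
contacts = authority.resolve_contacts(alice.seen)
print(contacts)  # {'bob'}
```

Note that Alice holds two unrelated-looking TempIDs for Bob and cannot link them herself, while the authority resolves both to the same person once she uploads.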
The BlueTrace protocol provides good privacy against other people and limited privacy against the health authority. Specifically, other people never need to learn your COVID status at all (unless someone tells them), both because your TempIDs are encrypted and because they are kept on your device unless you test positive, and even then are sent only to the health authority. The health authority doesn’t learn anything until you test positive, but after that happens they learn all of your contacts. This is intentional: the design explicitly assumes that the health authority will know people’s contact status and take action.
Unsurprisingly, many have concerns about a system which allows the government to see all your contacts (see, for instance, the Chaos Computer Club’s list of desirable properties or the ACLU’s principles). There have been a number of designs that are instead decentralized, notably the Apple/Google design and the DP^3T system designed by EPFL and ETHZ. These are designed to allow people to determine whether they have been in contact with someone infected without allowing the health authority to determine people’s contacts. However, as we see below, there is some difficulty around exactly how much people should learn about the contacts they have had.
The specific details of individual proposals vary a lot but the figure below shows a simplified design: whenever the app on your phone sees that it’s near another phone running the app it generates a random number and sends it to that phone; the other phone does the same. Each app remembers all the numbers it has sent and received so at the end of the day you end up with a pile of stored numbers. If you later test positive, you push some button on the app which publishes all of the values that you sent. Every so often, your app downloads the list of published values and looks to see if any of them match the values it received. If any do, that means that you have been in contact with someone who tested positive. Note the important difference from BlueTrace in that you upload the IDs you sent, not those you received, and so the health authority never learns who you came into contact with.
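The simplified design above can be sketched in a few lines. This is an illustration only, not the Apple/Google or DP^3T protocol; the `Phone` class, its method names, and the shared `bulletin_board` set are all hypothetical stand-ins for the app and the published list.

```python
import secrets


class Phone:
    """Toy model of one app instance in the simplified decentralized design."""

    def __init__(self):
        self.sent = []         # random values this phone handed out
        self.received = set()  # random values this phone heard from others

    def meet(self, other):
        # Each side generates a fresh random value and sends it to the other.
        mine, theirs = secrets.token_hex(16), secrets.token_hex(16)
        self.sent.append(mine)
        other.received.add(mine)
        other.sent.append(theirs)
        self.received.add(theirs)

    def report_positive(self, bulletin_board):
        # Publish only the values we SENT, never the ones we received.
        bulletin_board.update(self.sent)

    def was_exposed(self, bulletin_board):
        # Matching happens locally on the phone.
        return not self.received.isdisjoint(bulletin_board)


bulletin_board = set()  # stands in for the published list
alice, bob, carol = Phone(), Phone(), Phone()

alice.meet(bob)
bob.meet(carol)

alice.report_positive(bulletin_board)
print(bob.was_exposed(bulletin_board))    # True
print(carol.was_exposed(bulletin_board))  # False
```

The published set contains only values Alice generated, so whoever runs the bulletin board sees that Alice reported but gets no list of the people she met.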
This particular design isn’t very efficient because it involves publishing a huge number of values, and so many of the real designs involve generating the values in some deterministic fashion which makes publishing them more efficient, but it’s close enough to let us see the privacy properties. First, let’s make sure people learn what we expect them to learn: As expected, the operator of the system learns who has tested positive because they get to see who publishes their values2. Similarly, people get to learn that they have been in contact with someone who has tested positive by looking to see if their received values overlap with the published sent values.
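One way to picture the "deterministic fashion" mentioned above is to derive each broadcast value from a per-day secret, so that a positive user publishes one short key rather than every value. The sketch below uses HMAC-SHA256 for the derivation. That choice, the 96 intervals per day, and the 16-byte truncation are assumptions made for illustration and are not the exact derivation used by any of the real designs.

```python
import hashlib
import hmac
import secrets

INTERVALS_PER_DAY = 96  # assumption: one value per 15-minute interval


def daily_values(day_key):
    """Derive every broadcast value for one day from a single secret key."""
    return [
        hmac.new(day_key, i.to_bytes(2, "big"), hashlib.sha256).digest()[:16]
        for i in range(INTERVALS_PER_DAY)
    ]


def exposed(received, published_day_keys):
    """Re-derive values from each published key and compare them locally."""
    return any(
        value in received
        for key in published_day_keys
        for value in daily_values(key)
    )


# An infected user broadcasts derived values during the day...
infected_key = secrets.token_bytes(32)
broadcasts = daily_values(infected_key)

# ...and a nearby phone happens to hear the value from interval 40.
received = {broadcasts[40], secrets.token_bytes(16)}

# Later, only the 32-byte key is published, not all 96 values.
print(exposed(received, [infected_key]))             # True
print(exposed(received, [secrets.token_bytes(32)]))  # False
```

The trade-off is that all values derived from one published key become linkable to each other once the key is public, which the purely random version avoids.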
Next, we need to ask if people learn anything besides what they were supposed to learn. Because the health authority only learns what messages infected people sent, it doesn’t get to trace their contacts. And as long as the numbers are random, you can’t use this to track someone. However, it turns out that you get to learn not only that you were in contact with someone who tested positive but also who tested positive as long as you record who you were near at the time you received each value. This doesn’t sound that terrible, but consider an attacker who puts up a combination phone/camera outside of a testing clinic. Whenever someone walks by he records their value and takes a picture of them. At the end of every day he looks to see which values have been published and then uses facial recognition to determine their real identities. This kind of setup is very cheap and it would be easy to learn the COVID status of many people.
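The reason this attack is so cheap is that it amounts to a simple join between what the attacker recorded and what was published. A minimal sketch, with a hypothetical `deanonymize` function and made-up placeholder values and photo labels:

```python
def deanonymize(observations, published):
    """observations: (value, photo_label) pairs recorded at the clinic door.
    Returns the photos whose recorded value later appears in the published list."""
    published = set(published)
    return [photo for value, photo in observations if value in published]


# Placeholder data: what the attacker's phone/camera logged during the day.
observations = [
    ("v1", "photo_001.jpg"),
    ("v2", "photo_002.jpg"),
    ("v3", "photo_003.jpg"),
]

# Placeholder data: values published by people who later tested positive.
published = {"v2", "x9"}

flagged = deanonymize(observations, published)
print(flagged)  # ['photo_002.jpg']
```

Nothing here requires breaking any cryptography: the attacker only uses information the design hands to every participant.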
It’s possible to mostly remove this attack at the cost of giving the operator more information: each user can upload all the values they receive and have the operator tell them if there is a match. This trades off one kind of privacy threat (third parties) for another (the health authority) and it’s worth noting that this form of attack is very hard with BlueTrace. With enough fancy cryptography3, it’s probably possible to get back to the “ideal” state in which the user just learns whether they have been in contact with a single person who was infected. However, there’s a tradeoff here: Users may want more information than just whether they were in contact with someone; for instance they might want to know when and for how long. If we design a system that hides this information from the user, then they may find the information less useful than if they were able to know “I was next to Joe for an hour and he sneezed on me and now he’s positive”. It seems quite difficult if not impossible to design a system which lets people have enough information to feel like they understand their risk and doesn’t also make it possible for attackers with modest resources to learn a lot of people’s COVID status, because this is basically the same information.
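As a rough sketch of the operator-side matching variant described at the start of the paragraph above: users upload the values they received and get back only a yes/no answer. The `MatchingOperator` class and its method names are hypothetical, and the sketch deliberately shows what the operator gets to see in exchange.

```python
class MatchingOperator:
    """Toy operator that performs the match on behalf of users."""

    def __init__(self):
        self._published = set()  # values sent by users who tested positive
        self.query_log = []      # what the operator learns from each query

    def report_positive(self, sent_values):
        self._published.update(sent_values)

    def check(self, received_values):
        # The operator sees the user's full received list...
        self.query_log.append(set(received_values))
        # ...but the user learns a single bit, not which value matched.
        return not self._published.isdisjoint(received_values)


operator = MatchingOperator()
operator.report_positive(["v2", "v7"])

answer_bob = operator.check(["v1", "v2", "v3"])
answer_carol = operator.check(["v4", "v5"])
print(answer_bob, answer_carol)  # True False
```

Because the answer does not say which value matched, the camera-outside-the-clinic attacker can no longer pair a specific recorded value with a positive result, but the operator now holds every querying user's received values.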
What do we want, anyway?
It’s tempting, of course, to ask if one design is better than the others, but upon closer inspection, it seems like there are really three separate use models people have in mind here:
- Inform the authorities about who might need to be tested or quarantined.
- Serve as a sort of digital permission slip to access various services (see, for instance, this proposal by the Center for American Progress4).
- Inform people that they might have been infected so they can consider getting tested.
If you are trying to deploy the first kind of system, then it doesn’t make any sense to try to avoid the health authority learning who might have come in contact with infected people because the health authority staff need to reach out to them. On the other hand, if you are trying to deploy the second or third type of system, then you probably do want to protect the user’s data from the health authority as much as possible, and then you have to ask how much you want users of the system to learn.
What this really comes down to is the question of what we are trying to accomplish, which in this case means: what do we want our contact tracing system to do?
- Is it providing information for users of the system or for public health authorities?
- What do we expect to do with this information? Notify people? Let them do things?
- How much are we comfortable with users learning about other people’s COVID status?
- How much are we comfortable with the operator learning about people’s COVID status?
- How much complexity are we willing to tolerate? This is a matter of both implementation cost and of user confidence in the system.
- Are we willing to force people to participate in this system?
Any system design necessarily embodies our answers to these questions, but these are fundamentally policy questions, not technology questions. Once we know the answers to them, we will know what kind of system we want and may be able to design something that meets our needs.
Thanks to Chris Wood, Dan Boneh, Henry Corrigan-Gibbs, and Luke Crouch for helpful discussions on this topic. Thanks especially to Laura Thomson for the taxonomy in the final section.
- For technical readers, Cho, Ippolito, and Yu do a reasonable job of covering the desiderata and alternative designs here. ↩
- It’s possible to remove this property by having users submit their values anonymously. The Apple/Google system tries to split the difference by just requiring the operator to delete this data. ↩
- As the DP^3T white paper observes, allowing people to learn some information about others’ infection status is inherent in any system which allows users to determine if they have been in contact with an infected person. If you don’t see that many people and you learn about when you were infected, then you can infer who the report is about. There are a variety of mitigations which can reduce this risk but at the end of the day some level of exposure is just built into the system. ↩
- “Airline passengers must download the Contact Tracing app, confirm no close proximity to a positive case, and pass a fever check or show documentation of immunity from a serological test”. ↩