August 27, 2010
Why SpiderOak doesn’t de-duplicate data across users (and why it should worry you if we did)
One of the features of SpiderOak is that if you back up the same file
twice, on the same computer or different computers within your account, the 2nd
copy doesn’t take up any additional space. This also applies if you have
several versions of a file as it evolves over time — we only need to save the
new data blocks.
Some storage companies take this de-duplication to a second level, and do a
similar form of de-duplication across all the data from all their customers.
It’s a great deal for the company. They can sell the bytes of storage to every
user at full price while incurring zero additional cost. In some ways it’s
helpful to the user too — uploads are certainly faster when you don’t have to
transfer the data!
How does cross-user data de-duplication even work?
The entire process of a server de-duplicating files that haven’t even been
uploaded to the server yet is a bit magical, and works through the properties
of cryptographic hash functions. These allow us to make something like a
fingerprint of any file. Like people, no two files should have the same
fingerprints, right? The server can just keep a database of file fingerprints
and compare any new data to these fingerprints.
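As a rough illustration of the idea, here is a minimal sketch of fingerprint-based de-duplication. It assumes SHA-256 as the hash function and a two-step exchange (ask first, upload only if needed); the `DedupStore` class and the `backup` helper are hypothetical names invented for this example, not anyone's actual protocol.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest to serve as the file's fingerprint."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Toy server (hypothetical) that keeps one copy of each distinct file."""

    def __init__(self):
        self.blobs = {}  # fingerprint -> file content, stored once

    def has(self, fp: str) -> bool:
        """Step 1: the client sends only a fingerprint and asks if it is known."""
        return fp in self.blobs

    def put(self, fp: str, data: bytes) -> None:
        """Step 2: the client uploads content only when the server lacks it."""
        self.blobs[fp] = data


def backup(store: DedupStore, data: bytes) -> bool:
    """Back up one file; return True if the content had to be uploaded."""
    fp = fingerprint(data)
    if store.has(fp):
        return False  # duplicate: nothing is transferred
    store.put(fp, data)
    return True


store = DedupStore()
first = backup(store, b"same song")   # content is uploaded
second = backup(store, b"same song")  # recognized by fingerprint alone
```

The second call never sends the file content, which is where both the speed-up and the storage savings come from.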
So it’s possible for the server to de-duplicate and store my files, knowing
only the fingerprints. So, how does this affect my privacy at all then?
With only the knowledge of a file’s fingerprint, there’s no clear way to
reconstruct the file the fingerprint was made from. We could even use a
technique for prepending deduplicated files with some random data when making
fingerprints, so they would not match outside databases of common files and
However, imagine a scenario like this. Alice has operated the best BBQ
restaurant in Kansas City for decades. No one can match Alice’s amazing sauce.
Suddenly Mallory opens a BBQ joint right across the street, with better prices
and sauce that’s just as good! Alice is pretty sure Mallory has stolen the
recipe right off her computer! Her attorney convinces a court to issue a
subpoena to SpiderOak: Does Mallory have a copy of her recipe? “How would we
know? We have no knowledge of his data beyond its billable size.” Exasperated,
the court rewrites their subpoena, “Does Mallory’s data include a file with
matching fingerprints from the provided recipe file here in exhibit A?” If
we have a de-duplication database, this is indeed a question we can answer, and
we will be required to answer. As much as we enjoyed Alice’s BBQ, we never
wanted to support her cause by answering a 3rd party’s questions about customer
Imagine more everyday scenarios: a divorce case; a patent, trademark, or
copyright dispute; a political case where a prosecutor wants to establish that
the high-level defendant “had knowledge of” the topic. Establishing that they
had a document about whatever it was in their personal online storage account
might be very interesting to the attorneys. Is it a good idea for us to be
even capable of betraying our customers like that?
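To make the concern concrete: once a service keeps a global fingerprint index, the rewritten subpoena question becomes a simple lookup. The sketch below is only an assumption about how such an index might look (a map from SHA-256 fingerprint to account names); `record_upload` and `accounts_holding` are hypothetical helpers, and the account names are taken from the story above.

```python
import hashlib
from collections import defaultdict

# Hypothetical global index: fingerprint -> accounts that stored that file.
owners = defaultdict(set)


def record_upload(account: str, data: bytes) -> None:
    """What a de-duplicating server would note on every upload."""
    owners[hashlib.sha256(data).hexdigest()].add(account)


def accounts_holding(exhibit: bytes) -> set:
    """Answer: which accounts hold a file matching this exhibit?"""
    fp = hashlib.sha256(exhibit).hexdigest()
    # .get() avoids creating an empty entry for unknown fingerprints.
    return set(owners.get(fp, set()))


recipe = b"Alice's secret sauce: ..."
record_upload("alice", recipe)
record_upload("mallory", recipe)
record_upload("mallory", b"tonight's menu")

matches = accounts_holding(recipe)
```

Note that the server never needs to read the recipe itself; the fingerprint is enough to answer the question.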
Bonus: Deduping via Cryptographic Fingerprints Enables The Ultimate Sin
The ultimate sin from a storage company isn’t simply losing customer data.
That’s far too straightforward a blunder to deserve much credit, really.
The ultimate sin is when a storage company accidentally presents Bob’s data
to Alice as if it were her own. At once Bob is betrayed and Alice is
frightened. This is what can happen if Bob and Alice each have different files
that happen to have the same fingerprints.
Actually, cryptographic hashes are more like DNA evidence at a crime scene
than real fingerprints — people with identical DNA markers can and do exist.
Cryptographers have invented many smart ways to reduce the likelihood of this,
but those ways tend to make the calculations more expensive and the database
larger, so some non-zero level of acceptable risk of collisions must be
determined. In a large enough population of data, collisions happen.
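For a feel of how that risk scales, the standard birthday-bound approximation can be computed directly. This sketch assumes an idealized hash whose outputs are uniformly distributed, which real hash functions only approximate; `collision_probability` is a name made up for this example.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate of P(at least one collision).

    Uses p ~= 1 - exp(-n(n-1) / (2 * 2**bits)), assuming an ideal,
    uniformly distributed hash (an assumption, not a guarantee).
    """
    pairs = num_items * (num_items - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# The same data set looks very different at different fingerprint widths.
for bits in (32, 64, 128, 256):
    print(bits, collision_probability(10**12, bits))
```

Shorter fingerprints keep the database small but push the probability up quickly as the number of stored items grows, which is the trade-off described above.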
This all makes for an entertaining conversation between Alice and Bob when
they meet each other this way. Hopefully they’ll tell the operators of the
storage service, which will otherwise have no way of even knowing this error
has happened. Of course, it’s still rather unlikely to happen to you…
There’s a Public Information Leak Anyone can Exploit
Any user of the system can check if a file is already contained within the
global storage set. They do this simply by adding the file to their own storage account,
and observing the network traffic that follows. If the upload completes
without transferring the content of the file, it must be in the backup
For a small amount of additional work they could arrange to shut down the
uploading program as soon as they observe enough network traffic to know the
file is not a duplicate. Then they could check again later. In this way, they
could check repeatedly over time, and know when a given file enters the global
If you wanted to, you could check right now if a particular file was already
backed up by many online storage services.
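The probing technique described above can be simulated in a few lines. This is a toy model, not any real service's protocol: `GlobalDedupServer`, `offer`, `receive`, and `probe` are hypothetical names, and the "network traffic" is reduced to whether the server asks for the file content.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class GlobalDedupServer:
    """Toy server (hypothetical) de-duplicating across all users."""

    def __init__(self):
        self.known = set()  # fingerprints of every file any user stored

    def offer(self, fp: str) -> bool:
        """Client announces a fingerprint; True means 'please send the content'."""
        return fp not in self.known

    def receive(self, data: bytes) -> None:
        """Client actually transfers the content."""
        self.known.add(fingerprint(data))


def probe(server: GlobalDedupServer, data: bytes) -> bool:
    """Return True if the file is already in the global set.

    The prober stops as soon as the server asks for content, so a
    negative result leaves the global set unchanged.
    """
    wants_content = server.offer(fingerprint(data))
    return not wants_content  # receive() is never called


server = GlobalDedupServer()
document = b"confidential memo, copy #17"

before = probe(server, document)        # not stored by anyone yet
still_absent = probe(server, document)  # probing did not add it
server.receive(document)                # some other user backs it up
after = probe(server, document)         # now the prober can tell
```

Because an aborted probe adds nothing to the server, it can be repeated indefinitely until the answer changes.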
How might someone be able to maliciously use this property of a global
de-duplication system to their advantage?
- You could send a new file to someone using the storage service and know
for sure when it had arrived in their possession
- You could combine this with the Canary Trap method
(http://en.wikipedia.org/wiki/Canary_trap) to
expose the specific person who is leaking government documents or corporate
trade secrets to journalists
- You could determine whether your copyrighted work exists on the backup
service, and then sue the storage service for information on users storing the
There are also categories of documents that only a particular user is likely
How much space savings are we really talking about?
Surely more than a few users have the same Britney Spears mp3s and other
predictable duplicates. Across a large population, might 30%, 40%, or perhaps
even 50% of the data be redundant? (Of course there should be greater
likelihood of matches as the total population increases. This effect of
increasing de-duplication diminishes though: it is more significant as the data
set grows from 1 user to 10,000 users than from 10,000 users to 20,000 users,
and so on.)
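The diminishing effect can be illustrated with a small expected-value model. Everything here is an invented assumption for illustration, not data from any real service: a catalog of 1,000 "popular" files where the file at popularity rank r is held by a fraction 0.5 / r of users, plus 20 files per user that nobody else has. The model counts files rather than bytes.

```python
def expected_savings(users, popularity, unique_per_user):
    """Expected fraction of uploaded files removed by global de-duplication.

    popularity[i] is the (assumed) probability that any one user holds
    popular file i; unique_per_user files are never shared.
    """
    shared_per_user = sum(popularity)
    # A popular file is stored once if at least one user holds it.
    distinct_shared = sum(1.0 - (1.0 - p) ** users for p in popularity)
    uploaded = users * (shared_per_user + unique_per_user)
    stored = distinct_shared + users * unique_per_user
    return 1.0 - stored / uploaded


# Hypothetical parameters, chosen only to show the shape of the curve.
POPULARITY = [0.5 / rank for rank in range(1, 1001)]
UNIQUE_PER_USER = 20

for n in (1, 10_000, 20_000):
    print(n, round(expected_savings(n, POPULARITY, UNIQUE_PER_USER), 4))
```

In this model the savings can never exceed the share of each user's files that are popular at all, so growth in the user base matters less and less once the popular catalog is already stored.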
In our early planning phase with SpiderOak, and during the first few months
while we operated privately before launch, we did a study with a population of
cooperative users who were willing to share fingerprints, anonymized as much as
was practical. Of course, our efforts suffered from obvious selection bias,
and probably numerous other drawbacks that make them unscientific. However,
even when we plotted the equations up to very large populations, we found that
the savings were unlikely to be as much as 20%. We chose to focus instead on
developing other cost advantages, such as building our own backend storage
What if SpiderOak suddenly decides to start doing this in the future?
We probably won’t… and if we did, it’s not possible to do so
retroactively with the data already stored. Suppose we were convinced someday;
here are some ways we might minimize the dangers:
- We would certainly discuss it with the SpiderOak community first and incorporate the often-excellent suggestions we receive
- It would be configurable according to each user’s preference
- We would share some portion of the space savings with each customer
- We would only de-duplicate on commonly shared and traded filetypes, like mp3s, where it’s most likely to be effective, and least likely to be harmful