August 27, 2010

Why SpiderOak doesn’t de-duplicate data across users (and why it should worry you if we did)


One of the features of SpiderOak is that if you back up the same file
twice, on the same computer or on different computers within your account, the
second copy doesn’t take up any additional space. The same applies if you keep
several versions of a file as it evolves over time — we only need to save the
new data blocks.
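
To make the idea concrete, here is a minimal sketch (in Python; the block size
and data structures are illustrative assumptions, not SpiderOak’s actual
storage format) of block-level de-duplication within a single account: split
each file version into blocks, fingerprint each block, and transfer only the
blocks whose fingerprints haven’t been stored before.

    import hashlib

    BLOCK_SIZE = 64 * 1024        # illustrative block size, not SpiderOak's
    stored_blocks = {}            # fingerprint -> block data, per account

    def store_version(path):
        """Save a file version, transferring only blocks not already stored."""
        manifest = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                if digest not in stored_blocks:
                    stored_blocks[digest] = block   # new data: must be transferred
                manifest.append(digest)             # duplicate data: just referenced
        return manifest   # the version is recorded as a list of block fingerprints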

Some storage companies take this de-duplication to a second level, and do a
similar form of de-duplication across all the data from all their customers.
It’s a great deal for the company: they can sell the same bytes of storage to
every user at full price while paying to store them only once. In some ways
it’s helpful to the user too — uploads are certainly faster when you don’t
have to transfer the data!

How does cross-user data de-duplication even work?

The entire process of a server de-duplicating files that haven’t even been
uploaded to the server yet is a bit magical, and works through the properties
of cryptographic hash functions. These allow us to make something like a
fingerprint of any file. Like people, no two files should have the same
fingerprints, right? The server can just keep a database of file fingerprints
and compare any new data to these fingerprints.
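
As a rough sketch of the idea (in Python; the fingerprint database and
functions here are hypothetical, not any particular service’s protocol): the
client computes a cryptographic hash of the file, and the server compares it
against the fingerprints it has already seen.

    import hashlib

    known_fingerprints = set()    # hypothetical server-side fingerprint database

    def fingerprint(path):
        """Compute a SHA-256 'fingerprint' of a file by streaming it in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def needs_upload(path):
        fp = fingerprint(path)
        if fp in known_fingerprints:
            return False              # duplicate: store only a reference to it
        known_fingerprints.add(fp)
        return True                   # new content: the client must transfer it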

So it’s possible for the server to de-duplicate and store my files, knowing
only the fingerprints. So, how does this affect my privacy at all then?

With only the knowledge of a file’s fingerprint, there’s no clear way to
reconstruct the file the fingerprint was made from. We could even use a
technique of prepending some random data to files when making their
fingerprints, so they would not match outside databases of common files and
their fingerprints.
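
A sketch of that technique, assuming a single service-wide secret value (the
name SERVICE_SALT is made up for illustration): mixing it into every
fingerprint still lets the service match customers’ files against each other,
but the resulting fingerprints no longer line up with public databases of
common-file hashes.

    import hashlib, os

    SERVICE_SALT = os.urandom(32)     # chosen once and kept private by the service

    def salted_fingerprint(path):
        h = hashlib.sha256()
        h.update(SERVICE_SALT)        # the random data prepended before hashing
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()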

However, imagine a scenario like this. Alice has operated the best BBQ
restaurant in Kansas City for decades. No one can match Alice’s amazing sauce.
Suddenly Mallory opens a BBQ joint right across the street, with better prices
and sauce that’s just as good! Alice is pretty sure Mallory has stolen the
recipe right off her computer! Her attorney convinces a court to issue a
subpoena to SpiderOak: Does Mallory have a copy of her recipe? “How would we
know? We have no knowledge of his data beyond its billable size.” Exasperated,
the court rewrites its subpoena: “Does Mallory’s data include a file with
fingerprints matching the recipe file provided here in exhibit A?” If
we had a de-duplication database, this would indeed be a question we could
answer, and one we would be required to answer. As much as we enjoyed Alice’s
BBQ, we never wanted to support her cause by answering a third party’s
questions about customer data.

Imagine more everyday scenarios: a divorce case; a patent, trademark, or
copyright dispute; a political case where a prosecutor wants to establish that
the high-level defendant “had knowledge of” the topic. Establishing that the
defendant kept a document on the subject in their personal online storage
account might be very interesting to the attorneys. Is it a good idea for us
to even be capable of betraying our customers like that?

Bonus: Deduping via Cryptographic Fingerprints Enables The Ultimate Sin

The ultimate sin from a storage company isn’t simply losing customer data.
That’s far too straightforward a blunder to deserve much credit, really.

The ultimate sin is when a storage company accidentally presents Bob’s data
to Alice as if it were her own. At once Bob is betrayed and Alice is
frightened.
This is what can happen if Bob and Alice each have different files
that happen to have the same fingerprints.

Actually, cryptographic hashes are more like DNA evidence at a crime scene
than real fingerprints — people with identical DNA markers can and do exist.
Cryptographers have invented many smart ways to reduce the likelihood of this,
but those ways tend to make the calculations more expensive and the database
larger, so some non-zero level of collision risk has to be accepted. In a
large enough population of data, collisions happen.
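
For a sense of scale, the usual back-of-the-envelope “birthday bound” puts the
chance of at least one accidental collision among n stored items at roughly
n^2 / 2^(b+1) for a b-bit hash (a rough approximation that holds while the
probability stays small):

    def collision_probability(n_items, hash_bits):
        # Birthday-bound approximation: n^2 / 2^(b+1)
        return (n_items ** 2) / 2.0 ** (hash_bits + 1)

    # Roughly 1.5e-15 for a trillion items under a 128-bit hash: tiny, but not
    # zero, and it grows quadratically as the stored population grows.
    print(collision_probability(10 ** 12, 128))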

This all makes for an entertaining conversation between Alice and Bob when
they meet each other this way. Hopefully they’ll tell the operators of the
storage service, which will otherwise have no way of even knowing this error
has happened. Of course, it’s still rather unlikely to happen to you…

There’s a Public Information Leak Anyone can Exploit

Any user of the system can check if a file is already contained within the
global storage set. They do this simply by adding the file to their own storage account,
and observing the network traffic that follows. If the upload completes
without transferring the content of the file, it must be in the backup
somewhere already.
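
A sketch of why this leak exists wherever de-duplication is done client side
(the ask_server_has_fingerprint callable below is a hypothetical stand-in for
whatever query a real client makes; actual protocols differ): the “upload”
reduces to a yes/no question about a fingerprint, and the answer itself
reveals whether anyone, anywhere, has already stored that exact file.

    import hashlib

    def file_already_in_global_store(path, ask_server_has_fingerprint):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        # If the answer is "already have it", no content gets transferred --
        # which also tells this user that the file exists in someone's account.
        return ask_server_has_fingerprint(h.hexdigest())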

For a small amount of additional work they could arrange to shut down the
uploading program as soon as they observe enough network traffic to know the
file is not a duplicate. Then they could check again later. In this way, they
could check repeatedly over time, and know when a given file enters the global
storage set.

If you wanted to, you could check right now if a particular file was already
backed up by many online storage services.

How might someone be able to maliciously use this property of a global
de-duplication system to their advantage?

  • You could send a new file to someone using the storage service and know
    for sure when it had arrived in their possession
  • You could combine this with the Canary Trap method
    (http://en.wikipedia.org/wiki/Canary_trap) to expose the specific person
    who is leaking government documents or corporate trade secrets to
    journalists
  • You could determine whether your copyrighted work exists on the backup
    service, and then sue the storage service for information on users storing the
    work

There are also categories of documents that only a particular user is likely
to have, so checking for one of them amounts to checking on that specific
person.

How much space savings are we really talking about?

Surely more than a few users have the same Britney Spears mp3s and other
predictable duplicates. Across a large population, might 30%, 40%, or perhaps
even 50% of the data be redundant? (Of course, matches become more likely as
the total population increases, but the effect diminishes: de-duplication
gains more as the data set grows from 1 user to 10,000 users than from 10,000
users to 20,000 users, and so on.)

In our early planning phase with SpiderOak, and during the first few months
while we operated privately before launch, we did a study with a population of
cooperative users who were willing to share fingerprints, anonymized as much as
was practical. Of course, our efforts suffered from obvious selection bias,
and probably numerous other drawbacks that make them unscientific. However,
even when we plotted the equations up to very large populations, we found that
the savings was unlikely to be as much as 20%. We chose to focus instead on
developing other cost advantages, such as building our own backend storage
clustering software.

What if SpiderOak suddenly decides to start doing this in the future?

We probably won’t… and even if we did, it would not be possible to apply it
retroactively to the data already stored. Suppose we were convinced someday;
here are some ways we might minimize the dangers:

  • We would certainly discuss it with the SpiderOak community first and incorporate the often-excellent suggestions we receive
  • It would be configurable according to each user’s preference
  • We would share some portion of the space savings with each customer
  • We would only de-duplicate on commonly shared and traded filetypes, like mp3s, where it’s most likely to be effective, and least likely to be harmful

Comments
  1. With SpiderOak, isn't the file encrypted from the application on my PC through to writing to disk on SpiderOak's servers? I thought SpiderOak did not have a useful version of a user's password, only a hash of some sort. So then, if user Bob and user Alice separately upload the same file, I thought it would be the case that the files would appear as different files to SpiderOak… Technically, then, SpiderOak wouldn't be capable of de-duplication across users, in the current "zero-knowledge" design. Is that correct?

    If, however, this cryptographic fingerprint is available to SpiderOak, then the contents of a file can be inferred whenever its fingerprint matches that of another known file, and "zero-knowledge" really isn't a fair term to describe the process.

  2. Scott — Yes, correct on all points. And we have no intention of ever knowing file fingerprints.

    Every file is crypto'd in a unique context, so even when many users do upload the same plaintext data blocks, they aren't the same on the server.
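
    A tiny sketch of what "unique context" means in practice (illustrative
    only, using AES-GCM from the pyca/cryptography package rather than our
    actual scheme): the same plaintext block, encrypted under two different
    users' keys with fresh nonces, reaches the server as two unrelated byte
    strings with two unrelated hashes.

        import hashlib, os
        from cryptography.hazmat.primitives.ciphers.aead import AESGCM

        plaintext = b"the same data block from two different users"

        def stored_block(user_key):
            nonce = os.urandom(12)    # fresh for every encryption
            return nonce + AESGCM(user_key).encrypt(nonce, plaintext, None)

        alice_block = stored_block(AESGCM.generate_key(bit_length=256))
        bob_block = stored_block(AESGCM.generate_key(bit_length=256))

        # Different ciphertext, different fingerprints: the server cannot tell
        # that these started out as the same file.
        print(hashlib.sha256(alice_block).hexdigest())
        print(hashlib.sha256(bob_block).hexdigest())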

  3. Hi Scott,

    Can you comment on two other possible reasons?

     1. Spideroak could be turned from a content store into a content host.
     2. Spideroak would have the ability to easily remove all copies of a given file.

    To elaborate on the first point, if data was shared across users, having the hash alone would be enough to retrieve a file. The client could be modified so that it never operated on the files themselves and instead used supplied hashes. The person would then "upload" a file they never had and then download it to get a copy of the file. SpiderOak would be turned from a backup service into a hosting service.

    Under this hosting scenario it would seem reasonable for it to be ordered that SpiderOak stop hosting a given file – that is, to remove all copies of a given file.

    Did these possibilities play any role in the decision?

    Thanks,
    Jonathan. 

  4. Jonathan – I can't imagine the shame of trying to explain to our customers how we had to remove something from their data!

    Following your line of thought on the difference between content store vs. host: Yes; we definitely don't see SpiderOak as a hosting service, and if it were used that way, it would have to be priced much higher because the costs involved are just greater. SpiderOak makes your backup/synced data available to you whenever you need it, but that's very different from the demands of providing a primary data store that is constantly accessed as the original source. Indeed, one of the reasons we started SpiderOak is that we knew we could offer backup at a better price than general hosting companies.

  5. Hi Alan, nice post. Regarding backing up a file and then modifying the file: You say you only need to save the new data blocks. What are these data blocks, and how big are they? If you have multiple small files, do their leftover parts get thrown together, or do they each get their own block, or what? If, say, I remove a character from the beginning of a file, how can SpiderOak avoid sending the whole file again?

  6. "Deduping via Cryptographic Fingerprints Enables The Ultimate Sin" — It is expected that a SHA-1 collision will be published soon. But arbitrary collisions do not translate into viable attacks against de-duplicated backup databases. There is no current known method for Bob to generate a collision with Alice's file. Additionally, SHA-256 should be used anyway.

    The information leak where anyone can check for the presence of an arbitrary file in the backup store is a protocol vulnerability which can be fixed by requiring the user to always upload a file, and removing it afterwards if it is a duplicate — similar to how byte-by-byte comparison algorithms expose data through timing attacks.

    These two complaints against deduplication are really only complaints about a hypothetical naïve deduplication implementation with a weak hash function. The only valid complaint here is that deduplication requires that file hashes be shared with the operators of the service.

  7. Anonymous —

    Collisions: Agree that stronger hash functions mitigate the risks, but they often cost enough additional client-side CPU time that I have seen implementations that just use md5. If I recall correctly, rsync actually uses md4 because it's faster.

    Protocol weakness — total agreement, but please note that popular online storage services today have this protocol weakness. The incentive to save the expense of bandwidth by doing the de-dupe client side is pretty strong.

    Another detail is that users can operate their own, perhaps rogue, clients (i.e., clients other than those provided by the storage service). So, you have to guard against clients intentionally poisoning the de-duplication database.

  8. Sam – Thanks for the questions — I'm going to describe the SpiderOak algorithm that handles the topics you mentioned in a future blog post. I am going to need some graphics as it is a bit hard to explain without visualizations. For background, it will be easiest to approach if you already know how the Rsync algorithm works, which SpiderOak greatly extends, and is fascinating reading if you care about these things (and very boring if you don't!) In any case, it's described here: http://samba.anu.edu.au/rsync/tech_report/
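
    Until that post exists, here is a toy version of the weak rolling checksum
    at the heart of rsync (illustrative only; it follows the tech report above,
    not SpiderOak's code). Because the checksum can be slid along a file one
    byte at a time cheaply, matching blocks can still be found even after an
    insertion near the start of a file:

        MOD = 1 << 16

        def weak_checksum(block):
            # block is a bytes object; returns the two 16-bit halves (a, b).
            a = sum(block) % MOD
            b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
            return a, b

        def roll(a, b, out_byte, in_byte, block_len):
            # Update (a, b) when the window slides one byte to the right.
            a = (a - out_byte + in_byte) % MOD
            b = (b - block_len * out_byte + a) % MOD
            return a, b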

  9. Alan – 2/3rds of my SO storage (100+ gb) is music. On my "high speed" DSL line it took 45 days to upload the music, the majority of the mp3s and mp4s many users might have as well.

    Perhaps it may be useful to have the capability to avoid initial uploads for user-selected directories by first checking hash collisions of the directory files before uploading? I understand poisoned hashes, but with SHA-128+, etc., isn't this a bit remote? And even if SHA-128 or other strong one-way functions are slow, they are still a lot faster than uploading an encrypted binary file.

    (Obviously, someone like Google, if they offered backups might sell the information that I like a particular band / genre to 3rd parties, but it's a trade off I may make to use less space and obtain faster file uploads.)

    In the extreme, suppose a Category 4 or 5 hurricane is approaching the eastern US coastline and people begin thinking of backing up files. While scanned photos, personal files, and documents cannot be replaced, what average user is going to triage his files: replaceable, irreplaceable, and junk?

    And if the average user is like me, with a lot of music, then it's goodbye bandwidth.

    Regards, Eric.

  10. Noob here, but could this security/privacy vulnerability of de-duplication be circumvented by simply packing the sensitive file (together with some random stuff) into a .zip file and uploading the latter? To my mind, such an attack scenario seems pretty unlikely, whereas a 20% saving of storage space and bandwidth is very real.

  11. This is the single biggest reason why I used Spideroak. It demonstrates that the company 'gets it' when it comes to security, and demonstrates a commitment to that principle.

  12. May I suggest you link the Engineering Matters discussion of Storage Redundancy Savings to this thread? When I noted the FAQ's promotion of deduplication as beneficial to the user, I very nearly took a pass on using SpiderOak. I was concerned about the implications of global deduplication that Alan raises above. Only by searching back through the blog archive was I able to find this reassuring thread.

  13. @Catheryne
    The problem is that once you obfuscate the data in any way (encrypt, zip, etc.) you make it impossible for the provider to deduplicate it against global data anyway, so you would not get any bandwidth savings from this.

    We at SpiderOak do deduplicate within your account, saving YOU as much storage and bandwidth as possible. We just don't deduplicate ACROSS accounts.