October 26, 2009

Why and How SpiderOak architecture is different from other online storage services: The surprising consequences for database design from our Zero-Knowledge Approach to privacy.

First, let’s consider the design of products like Mozy or SugarSync. On the server they have a database of all of your files. This database includes folder names, filenames, last modification times, sizes, etc., all plainly readable. Maybe they encrypt the data (the contents of each file). The backup client on your computer knows which files need to be uploaded by talking to the server and querying for differences between the local file system and the remote database.
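That diff-driven approach can be sketched in a few lines. This is a generic illustration, not any vendor's actual API; the function and data shapes here are hypothetical.

```python
# Hypothetical sketch of the diff-based design described above: the client
# compares its local file listing against the server's plaintext database
# of names, sizes, and modification times, then uploads whatever differs.

def files_needing_upload(local_files, server_db):
    """Return paths whose (size, mtime) record differs from the server's."""
    return [path for path, meta in local_files.items()
            if server_db.get(path) != meta]
```

Note that this only works because the server's database is plainly readable; it is exactly the design SpiderOak avoids.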

In the SpiderOak world, no such central database of your files exists. Rather, you keep your own database [1]. If you have several computers all connected to your SpiderOak account, each of them maintains a local database giving them a full view into your account-wide storage.

This client-side database is updated continuously as uploads from all computers in your account progress. Each upload is a transaction. We stuff changes into a transaction until it reaches 10 meg or 500 files [2]. The contents of the transaction are sequentially numbered data blocks (the data) and entries in sequentially numbered journals (meta-data). For each transaction, the server stores everything, and passes only the meta-data along to all the other devices in your account.
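The batching rule above can be sketched as follows. This is a simplified illustration of the 10 meg / 500 file thresholds, not SpiderOak's actual client code; the names are hypothetical.

```python
# Hypothetical sketch of the transaction-batching rule: changes accumulate
# until the batch reaches 10 MB of data or 500 files, then the transaction
# is sealed and uploaded as a single unit.

MAX_BYTES = 10 * 1024 * 1024  # 10 MB
MAX_FILES = 500

def batch_changes(changes):
    """Group (path, size_in_bytes) changes into transaction batches."""
    batch, batch_bytes = [], 0
    for path, size in changes:
        batch.append((path, size))
        batch_bytes += size
        if batch_bytes >= MAX_BYTES or len(batch) >= MAX_FILES:
            yield batch
            batch, batch_bytes = [], 0
    if batch:  # flush any partial final transaction
        yield batch
```

Because a transaction is the unit of upload, everything in one batch moves (and reports progress) together, which is the behavior footnote [2] describes.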

In this sense, SpiderOak is really more of a peer-to-peer application than a client-server application. The traffic all goes through central servers but that’s just a conveniently reliable medium for data-passing and storage. The servers can’t read any of it.

There are some clear benefits and challenging drawbacks of peer-to-peer database architecture.

The biggest benefit is obviously the stated goal of preserving full and complete privacy – or ‘Zero-Knowledge’ as we call it.

One drawback is that it’s harder to program. Usually complex systems are built first with central management and then evolve into a more peer-based architecture for scalability reasons. Think of Napster evolving into things like Gnutella and BitTorrent. We enjoy the challenge of creating privacy-preserving software that works just as well as the alternatives. Indeed, this is one of the main reasons we started SpiderOak. However, it does mean that almost all features require more implementation work than they might if we had chosen unencrypted storage.

Another drawback is that CPU and memory use are sometimes higher. We’re working steadily to minimize this, but ultimately SpiderOak simply has more work to do than products that don’t maintain a ‘Zero-Knowledge’ orientation. SpiderOak is much better in recent versions at minimizing system resources than it has been historically, and – in the next few versions – this will dramatically improve again.

Yet another drawback is that computational work is duplicated on each client. Instead of the server updating a single database once for each change, each of your computers updates its database for all changes. (Obviously this disadvantage doesn’t apply if you only use one computer with SpiderOak.)

That said, a surprising benefit is the implications for total service cost. You may have noticed that SpiderOak offers some of the best pricing per gigabyte for online storage available anywhere. There are other factors contributing to this, but it definitely helps that SpiderOak clients handle most of the database work. The server’s role is mostly relegated to data storage and retrieval. This lets us focus on building servers with very dense storage without the need for high-speed databases and lots of system memory to run them in. (Although some of those needs reappear for servicing functions like Web-Access and SpiderOak Shares.)

For us, regardless of the advantages and drawbacks of the decisions we made, the choice has always been clear. We set out to build a backup system we ourselves felt comfortable using, which is why ‘Zero-Knowledge’ privacy was always the right path for us.

1 If you want you can go examine this database yourself. Hint: Use the libraries from our code page to make more sense of the database structure. The database files implement a complete transactional filesystem inside a database, as well as some relational tables.

2 If you ever wondered why, in the upload status, you’ll see several files all uploading at the same time, with the percentage complete changing in unison for all of them, this is why. Each transaction is uploaded as a unit. The server doesn’t know which or how many of your files it contains.

  1. Always interesting to hear about the inner workings of a great piece of software! Nevertheless I hope there are some features coming that are not only visible behind the scenes…

  2. I imagine that a copy of the local database is backed up to the server (encrypted) and can be retrieved and decrypted if you lose the local copy.

  3. Brian Glass — Good question. The local database can be reconstructed by any device in your account by retrieving all the journals (meta-data) from the server and reading them. This is what a client does whenever you add a new device or reinstall an existing device within your account. (And at the moment this process can sometimes take several minutes longer than it should, but that should be corrected in the next few days as we improve the protocol there.)
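The reconstruction described in that reply amounts to replaying the journals in sequence order. A minimal sketch, with field and operation names that are purely illustrative (not SpiderOak's actual journal format):

```python
# Hypothetical sketch of rebuilding the client-side database by replaying
# journals fetched from the server in sequence-number order; later entries
# supersede earlier ones for the same path.

def rebuild_view(journals):
    """Replay journal entries to reconstruct the account-wide file view."""
    view = {}
    for journal in sorted(journals, key=lambda j: j["sequence"]):
        for entry in journal["entries"]:
            if entry["op"] == "put":
                view[entry["path"]] = entry["meta"]
            elif entry["op"] == "delete":
                view.pop(entry["path"], None)
    return view
```

Because the journals are an append-only record of every transaction, any device that can fetch and decrypt them can rebuild the full database from scratch.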

    Ted — Yes, keys are rather important! We once talked to a customer (before they came to SpiderOak) who had been rolling their own offsite backups with GnuPG signed tarballs shipped offsite. Backups were routinely audited and found to be in good working order. However, when they went to restore after a real crash, they realized that their only copy of the GPG key was _inside_ the backup tarball. Not very useful.

    With SpiderOak of course, the keys are transparent to the user. You just have to (have to!) remember your password/phrase. The outer-level key is derived from the passphrase.
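Deriving a key from a passphrase is typically done with a salted, iterated key-derivation function. The sketch below uses PBKDF2 from Python's standard library to show the general technique only; it is not SpiderOak's actual scheme, and the salt size and iteration count are illustrative assumptions.

```python
import hashlib
import os

# Generic sketch of deriving an outer-level key from a passphrase with
# PBKDF2-HMAC-SHA256. The salt must be stored alongside the ciphertext
# so the same key can be re-derived later from the passphrase alone.

def derive_outer_key(passphrase, salt=None, iterations=200_000):
    """Derive a 256-bit outer key; returns (salt, key)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for a new account
    key = hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"),
                              salt, iterations, dklen=32)
    return salt, key
```

The important property is the one the comment above points at: the key exists only where the passphrase is entered, so losing the passphrase means losing the data.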

  4. Thank you for a great explanation – this really helped me grasp the behavior I was observing; now it makes sense.

  5. Posts like this are exactly the reason you gained me as a customer. I like the honest tone and the expertise shown in the text. I highly recommend SO to all of my friends.

    (And please, become open source as fast as possible. Just to remove the last doubts about whether you are really doing what you claim, and to let other smart people have a look at the implementation of your concepts.)

  6. I really like that you aren't scared of going into details about the technical workings behind spideroak.
    Just bought myself a 100 GB account :)

  7. How can SpiderOak comply with US law if they can't decrypt the data? Isn't SpiderOak an American company? I like your ideas about never being able to access the customers' data, but since "The Patriot Act" was signed, all American companies must be able to retrieve customer data if the government demands it. Or is the encryption weak enough to be brute-forced in a reasonable time?

  8. @Tobbe – you are only sending encrypted blocks to SpiderOak. The blocks are encrypted on the client side, and simply stored at SpiderOak.

    If subpoenaed and the warrant was valid, SpiderOak could be legally compelled, in some rare circumstance, to give law enforcement what you gave them – yep – the encrypted blocks. Pretty useless to anyone except the owner, YOU.

    This would be COMPLETELY different, if, say, your data (actual data, file names, etc.) was uploaded as-is, and stored on their servers, and then SpiderOak were compelled to divulge the data… IF that were the case, then THEY would have to give up THEIR encryption keys to the files, yes. …but, luckily the Zero-Knowledge design is sound, and NO ONE has the keys to your data – except YOU.

    Keep up the great work SpiderOak.