October 26, 2009
Why and How SpiderOak’s Architecture Is Different from Other Online Storage Services: The Surprising Consequences for Database Design of Our Zero-Knowledge Approach to Privacy.
First, let’s consider the design of products like Mozy or SugarSync. On the server, they have a database of all of your files. This database includes folder names, file names, last modification times, sizes, etc., all plainly readable. Maybe they encrypt the data (the contents of each file). The backup client on your computer works out which files need to be uploaded by talking to the server and querying for differences between the local file system and the remote database.
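As a rough illustration of that centralized design, the comparison step might look like the Python sketch below. The function name `files_to_upload` and the dictionary layout are assumptions made for this example; nothing here is taken from how Mozy or SugarSync are actually implemented.

```python
def files_to_upload(local, remote):
    """Return the paths that are new or changed on the local machine.

    Both arguments are hypothetical snapshots mapping a file path to a
    (modification_time, size) tuple. `remote` stands in for the
    plainly readable server-side database described above.
    """
    return sorted(
        path
        for path, stat in local.items()
        if remote.get(path) != stat  # missing on the server, or different
    )


local_files = {
    "docs/report.txt": (1256500000, 2048),
    "docs/notes.txt": (1256400000, 512),
    "photo.jpg": (1256300000, 900000),
}
server_database = {
    "docs/report.txt": (1256000000, 1024),  # older version on the server
    "docs/notes.txt": (1256400000, 512),    # unchanged
}

pending = files_to_upload(local_files, server_database)
print(pending)
```

The point of the sketch is only that the server must be able to read file names and timestamps for this comparison to work on the server side.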
In the SpiderOak world, no such central database of your files exists. Rather, you keep your own database (see note 1). If you have several computers all connected to your SpiderOak account, each of them maintains a local database that gives it a full view of your account-wide storage.
This client-side database is updated continuously as uploads from all computers in your account progress. Each upload is a transaction. We stuff changes into a transaction until it reaches 10 MB or 500 files (see note 2). The contents of the transaction are sequentially numbered data blocks (the data) and entries in sequentially numbered journals (the metadata). For each transaction, the server stores everything, and passes only the metadata along to all the other devices in your account.
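To make the batching rule concrete, here is a minimal Python sketch. The size and file-count limits come from the description above; the names, the exact byte count used for the size limit, and the choice to close a transaction as soon as either limit is reached are assumptions for illustration, not SpiderOak's actual code. Encryption and the data block and journal formats are left out.

```python
from dataclasses import dataclass, field

MAX_BYTES = 10 * 1024 * 1024  # size limit; exact byte count is an assumption
MAX_FILES = 500               # file-count limit from the text


@dataclass
class Transaction:
    number: int                                 # sequential transaction number
    files: list = field(default_factory=list)   # (path, size) pairs
    total_bytes: int = 0

    def is_full(self):
        return self.total_bytes >= MAX_BYTES or len(self.files) >= MAX_FILES


def batch_changes(changes):
    """Group (path, size) changes into sequentially numbered transactions."""
    transactions = []
    current = Transaction(number=1)
    for path, size in changes:
        current.files.append((path, size))
        current.total_bytes += size
        if current.is_full():
            # Close this transaction and start the next one.
            transactions.append(current)
            current = Transaction(number=current.number + 1)
    if current.files:
        transactions.append(current)  # final, partially filled transaction
    return transactions


small_files = [(f"file{i}.txt", 1024) for i in range(1200)]
for tx in batch_changes(small_files):
    print(tx.number, len(tx.files), tx.total_bytes)
```

With many small files the file-count limit closes each transaction; with a few large files the size limit does.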
In this sense, SpiderOak is really more of a peer-to-peer application than a client-server application. The traffic all goes through central servers but that’s just a conveniently reliable medium for data-passing and storage. The servers can’t read any of it.
A peer-to-peer database architecture has some clear benefits and some challenging drawbacks.
The biggest benefit is obviously the stated goal of preserving full and complete privacy – or ‘Zero-Knowledge’ as we call it.
One drawback is that it’s harder to program. Usually complex systems are built first with central management and then evolve toward a more peer-to-peer architecture for scalability reasons. Think of Napster evolving into things like Gnutella and BitTorrent. We enjoy the challenge of creating privacy-preserving software that works just as well as the alternatives. Indeed, this is one of the main reasons we started SpiderOak. However, it does mean that almost all features require more implementation work than they would if we had chosen unencrypted storage.
Another drawback is that CPU and memory use are sometimes higher. We’re working steadily to minimize this, but ultimately SpiderOak simply has more work to do than products that don’t maintain a ‘Zero-Knowledge’ orientation. Recent versions of SpiderOak use far fewer system resources than earlier ones did, and we expect this to improve dramatically again over the next few versions.
Yet another drawback is that computational work is duplicated on each client. Instead of the server updating a single database once for each change, each of your computers updates its database for all changes. (Obviously this disadvantage doesn’t apply if you only use one computer with SpiderOak.)
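The duplicated work can be pictured with a small Python sketch in which every device replays the same numbered journals into its own local database. The class name, the journal entry layout, and the `put`/`delete` operations are hypothetical, chosen only for this example; real journals are encrypted and their format is not described here.

```python
class ClientDatabase:
    """A toy stand-in for the local database kept on each device."""

    def __init__(self):
        self.files = {}         # path -> metadata dict
        self.last_journal = 0   # highest journal number applied so far

    def apply_journal(self, number, entries):
        # Journals are sequentially numbered, so apply them strictly in order.
        if number != self.last_journal + 1:
            raise ValueError(f"expected journal {self.last_journal + 1}, got {number}")
        for op, path, meta in entries:
            if op == "put":
                self.files[path] = meta
            elif op == "delete":
                self.files.pop(path, None)
            else:
                raise ValueError(f"unknown operation: {op}")
        self.last_journal = number


journals = [
    (1, [("put", "a.txt", {"size": 10})]),
    (2, [("put", "b.txt", {"size": 20}), ("delete", "a.txt", None)]),
]

laptop, desktop = ClientDatabase(), ClientDatabase()
for device in (laptop, desktop):
    for number, entries in journals:
        device.apply_journal(number, entries)  # same work repeated per device

print(laptop.files == desktop.files)
```

Each device ends up with the same view of the account, at the cost of doing the update once per device rather than once on a server.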
That said, a surprising benefit is the effect on total service cost. You may have noticed that SpiderOak offers some of the best pricing per gigabyte for online storage available anywhere. There are other factors contributing to this, but it definitely helps that SpiderOak clients handle most of the database work. The server’s role is mostly relegated to data storage and retrieval. This lets us focus on building servers with very dense storage, without the need for high-speed databases and lots of system memory to run them in. (Some of those needs do reappear for servicing functions like Web-Access and SpiderOak Shares.)
For us, regardless of the advantages and drawbacks of the decisions we made, the choice has always been clear. We set out to build a backup system we ourselves felt comfortable using, which is why ‘Zero-Knowledge’ privacy was always the right path for us.
1 If you want, you can examine this database yourself. Hint: Use the libraries from our code page to make more sense of the database structure. The database files implement a complete transactional filesystem inside a database, as well as some relational tables.
2 If you have ever wondered why the upload status shows several files uploading at the same time, with the percentage complete changing in unison for all of them, this is why. Each transaction is uploaded as a unit. The server doesn’t know which of your files it contains, or how many.