November 7, 2011

SpiderOak’s new Amazon S3 alternative is half the cost and open source

with 20 comments

As 37signals famously described, in the software business we almost always create valuable byproducts. To build a privacy-respecting backup and sync service that was affordable, we also had to build a world class long term archival storage system.

We had to do it. Most companies in the online backup space (BackBlaze, Carbonite, Mozy, and SpiderOak, to name a few) have made substantial investments in creating an internal system to cost-effectively store data at massive scale. Those who haven't, such as Dropbox and JungleDisk, are not price competitive per GB and put their efforts into competing on other factors.

Long term archival data is different from everyday data. It's created in bulk, generally ignored for weeks or months with only small additions and accesses, and restored in bulk (and then often in a hurried panic!).

This access pattern means that a storage system for backup data ought to be designed differently than a storage system for general data. Designed for this purpose, reliable long term archival storage can be delivered at dramatically lower prices.

Unfortunately, the storage hardware industry does not offer great off-the-shelf solutions for reliable long term archival data storage. Consider the NAS, SAN, and RAID offerings across the spectrum of storage vendors: they are unsuitable for one or both of these reasons:

  1. Unreliable: They do not protect against whole machine failure. If you have enough data on enough RAID volumes, over time you will lose a few of them. RAID failures happen every day.
  2. Expensive: Pricey hardware and high power consumption. This is because you are paying for low-latency performance that does not matter in the archival data world.

Of course #1 is solvable by making #2 worse. This is the approach of existing general purpose redundant distributed storage systems. All offer excellent reliability and performance but require overpaying for hardware. Examples include GlusterFS, Linux DRBD, MogileFS, and more recently Riak+Luwak. All of these systems replicate data to multiple whole machines, making the combined cluster tolerant of machine failure at the cost of 3x or 4x storage overhead. Nimbus.IO takes a different approach, using parity striping instead of replication for only 1.25x overhead.
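
To make the overhead arithmetic concrete, here is a minimal sketch of parity striping. It uses a single XOR parity segment (which tolerates one lost segment); an actual deployment would use a proper erasure code such as Reed-Solomon to afford several parity segments, and the segment counts here are illustrative assumptions rather than Nimbus.IO's real parameters.

```python
# Toy illustration of parity striping versus replication overhead.
# This sketch uses a single XOR parity segment, which tolerates one
# lost segment; a production system would use a real erasure code
# (e.g. Reed-Solomon) to add several parity segments. The segment
# counts are illustrative assumptions, not Nimbus.IO's actual numbers.
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(data: bytes, k: int = 8) -> list[bytes]:
    """Split data into k equal segments and append one XOR parity segment."""
    if len(data) % k:                      # pad so data divides evenly
        data += b"\x00" * (k - len(data) % k)
    size = len(data) // k
    segments = [data[i * size:(i + 1) * size] for i in range(k)]
    return segments + [reduce(xor, segments)]

segments = stripe_with_parity(b"archival data" * 1000, k=8)
lost = segments.pop(3)                     # simulate one failed node
assert reduce(xor, segments) == lost       # rebuild from the survivors

# Replicating to 3 whole machines stores 3.0x the data; striping into
# 8 data + 2 parity segments stores only 10/8 = 1.25x.
print("replication: 3.0x   parity striping:", 10 / 8, "x")
```

The trade-off, as a commenter notes below, is that rebuilding a lost segment requires reading from many surviving nodes rather than copying from a single replica.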

Customers purchasing long term storage don't typically notice or care about the difference between a transfer starting in 0.006 seconds or 0.6 seconds. That's two orders of magnitude of latency. Customers care greatly about throughput (megabytes per second of transfer speed), but latency (how long until the first byte begins moving) is not relevant the way it is when you're serving images on a website.

Meanwhile, the added cost of supporting those two orders of magnitude of latency is huge. It impacts all three major cost components: bandwidth, hardware, and power consumption.

A service designed specifically for bulk, long-term, high-throughput storage costs easily less than half as much to provide.

Since launching SpiderOak in 2007, we’ve rewritten the storage backend software four times and gone through five different major hardware revisions for the nodes in our storage clusters. Nimbus.IO is a new software architecture leveraging everything we’ve learned so far.

The Nimbus.IO online service is noteworthy in that the backend hardware designs and software are also open source, making it possible either to purchase storage from Nimbus.IO much as you would from S3, or to run storage clusters locally on site.
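
The post does not show the API itself, so the following is a purely illustrative sketch of what an S3-style client interaction might look like. Only the PUT and LISTMATCH request names come from the pricing notes in the comments below; the host, URL layout, and auth header are invented for illustration.

```python
# Hypothetical sketch of an S3-style archive API client. The endpoint,
# URL layout, and auth header are invented for illustration; only the
# PUT and LISTMATCH request types are named in the pricing comment
# below. Consult the Nimbus.IO source for the real protocol.
import urllib.request

BASE = "https://cluster.example.nimbus.io/data"                 # hypothetical
HEADERS = {"Authorization": "NIMBUS <access-key>:<signature>"}  # hypothetical

def put_archive(key: str, payload: bytes) -> int:
    """Store one archive blob under a key (a PUT request)."""
    req = urllib.request.Request(f"{BASE}/{key}", data=payload,
                                 headers=HEADERS, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status

def list_match(prefix: str) -> bytes:
    """List stored keys matching a prefix (a LISTMATCH request)."""
    req = urllib.request.Request(f"{BASE}?action=listmatch&prefix={prefix}",
                                 headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```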

If you are currently using or planning to adopt cloud storage, we hope you will give Nimbus.IO some consideration. Chances are we can eliminate 2/3 of your monthly bill.
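
As a back-of-the-envelope check on that claim, here is the arithmetic for an assumed bulk-backup workload, using the Nimbus.IO prices given in the comments below and approximate late-2011 Amazon S3 first-tier list prices (treat the S3 figures as rough).

```python
# Rough monthly bill comparison for an assumed bulk-backup workload.
# Nimbus.IO rates come from the pricing comment below; the S3 rates are
# approximate late-2011 first-tier list prices, not exact figures.
stored_gb = 500      # assumption: 500 GB kept in storage
restored_gb = 50     # assumption: 10% restored (transfer out) per month

nimbus = stored_gb * 0.06 + restored_gb * 0.06   # $/GB-month + $/GB out
s3 = stored_gb * 0.14 + restored_gb * 0.12       # approximate 2011 rates

print(f"Nimbus.IO: ${nimbus:.2f}/month")         # $33.00
print(f"S3:        ${s3:.2f}/month")             # $76.00
print(f"savings:   {1 - nimbus / s3:.0%}")       # ~57%, before request fees
```

How close the savings come to 2/3 depends on your workload's mix of storage, transfer, and request charges.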

Comments
  1. Suggestion: I'm assuming you want an email address for the invite field… you may want to specify that.

  2. Wait is this $0.06/(GB*month) or $0.06/(GB*year) or something else I'm missing? If the latter I can decommission my server as soon as I get the invite…

  3. While I would love for Alex's understanding of the price [$6 per year for 100GB] to be right, looking at the other prices it seems too much to wish for.

    If the price remains what it is, will it be possible to bill in smaller units of 10 or 5 GB?

    When will you "Open source the SpiderOak client software " ? https://blog.spideroak.com/20110628114417-spideroak-looking-inward
    Till that happens we can not edit the user interface and find other use cases for this service. Give us the choice!!!

    > If the price remains what it is, will it be possible to bill in smaller units of 10 or 5 GB?

    yep…

  5. $6 per 100GB of purchased storage. Transfer in is always free. Transfer out is $0.06/GB. PUT and LISTMATCH requests are $0.01 per 1000. Other requests are $0.01 per 10,000 or free.

  6. I'm eagerly awaiting this. I hope that Nimbus will provide a filesystem-like interface as well; currently I back up information to Amazon S3 via s3fs, and I'd love to be able to do this with Nimbus too (bonus points if it's Windows compatible (or WebDAV) as well). A good client with smart caching can make offsite storage for infrequently used files (but with occasional access) quite pleasant.

  7. I agree with John. Can you share a few more details about the architecture? What components are open source and what are not?

  8. @john Dickinson @Slav
    Nimbus.io will be 100% open source. This means both the software and the hardware.

  9. I'm running a 100TB GlusterFS cluster across 6 nodes, but I am interested to know how your approach differs from theirs (aside from what's on your Arch page), and what benefits it would offer when running standalone (i.e., in my own datacenter).

  10. The biggest advantage of replication-based systems is that geographic redundancy is simple. LA can go down in an earthquake and your data has no downtime.

    With parity-based storage, rebuilding is super painful. If you lose a node, you need to do crazy reads all across the planet to maintain uptime. And if a datacenter goes down in a fire, you have probably lost all your data completely.

    How do you deal with geographic redundancy of your data?

  11. Will you have a client similar to (but hopefully more lightweight than) the SpiderOak app? Something that's cross-platform, with a Linux CLI. I don't use any of the advanced features in SpiderOak like sync folders or shared files; I just use it for offsite backup. If I could get that for less, I'm all for it.

  12. I'm a bit confused… Am I on some kind of non-Amazon storage plan now, and will it be cheaper to switch to one?

  13. @todd, we are playing around with what we can build client-wise for this. At launch, however, we will be providing only the API and storage backend, for bulk data storage.

    @snirp I see no reason why not, unless you have exceptionally high demands on response time.