Utility Computing Meets Personal Backups
Short Version
A drive died in my fileserver, providing the impetus to use “Cloud Computing” to do my personal backups.
S3 is Amazon’s Simple Storage Service, one of the components of Amazon Web Services (AWS). AWS is a set of Cloud Computing resources for creating scalable Internet businesses which can benefit from a scale-on-demand scheme. In particular, if you have an Internet business that has large spikes in traffic, then AWS can lower costs dramatically since you pay only for the resources you use.
However, back-of-the-envelope calculations show that for applications which don’t require scaling (e.g. home backups), AWS can be fairly pricey when compared to simple alternatives like web hosting services, even when availability is considered. Using rsync to back up to a web hosting service can be done for roughly 1/7th the price of Amazon’s S3 or less, with very little setup involved. Availability can be easily enhanced beyond that provided by S3 by employing a second hosted server, still at less than a third of the price.
AWS is designed for applications which could benefit from a scale-on-demand scheme. If your application doesn’t need this, then it might not make sense economically.
Long Version
The Joys and Sorrows Of Owning Shiny Metal Boxes
One of the SATA drives on my home file server died a horrible death this week. The server had been acting funny all week, giving me the occasional strange I/O error. Finally, the system would not boot, instead showing the familiar “Insert CD:” error that shows on the screen when the boot drive fails. Attempting to auto-recognize the first SATA drive from the BIOS made the computer think for a while, and then come back without finding anything. Unfortunately, this box was also running SlimServer, which provides access to my music collection from my Squeezebox. No streaming music for the BBQ I’m hosting this weekend (we’ll have to switch to the redundant iPod + boombox).
I originally built the computer as a tiny fileserver-on-the-cheap for things like my videos, music collection, and family photo album. At the time, 2 years ago, I had been looking at SOHO NAS systems for a while, but just couldn’t get over how much they cost. Also, for the price I thought I deserved unfettered access to the underlying server to put whatever I wanted on it. Whenever I see some new device with Linux on it, the hacker in me wants access to bend it to my will. For the server, I used the tiniest enclosure I could find that fit microATX, which at the time was the Antec 1380, a small cube about the size of 2 shoe boxes. One slight design flaw was putting a powerful (but inexpensive) CPU in the box: an AMD dual-core 3800. Yes, I succumbed to market forces and, in the popular vernacular, “biggie sized it.” While I got the CPU at a good price, it ran a bit hot. In retrospect I wonder if this extra heat caused my drive to fail.
If I had to buy my own equipment today, I’d probably go with the Thecus 5200 (dedicated SOHO NAS). Let’s face it, owning tiny, shiny boxes packed with technology is fun. But it’s also expensive, and eventually, everything breaks.
Cloud Computing To The Rescue?
A natural response to owning things that break is to not own them anymore, which brings us to Utility or Cloud Computing. I had preferred the earlier term “Utility Computing,” as I thought this moniker more accurate, although Cloud Computing does sound cooler. Eventually the coolest sounding name wins mindshare and we are forced to use it.
Also, a problem with the term Utility Computing is that the analogy of “computing power as electricity” is limited. Utility Computing is less like electricity, more like Lego pieces. Utility, er, rather Cloud Computing already has a variety of different payment models and service types, each with different features which are important for solving different types of problems. One important feature to emerge is the ability to scale-on-demand, which is being pioneered by companies like Amazon, with their Web Services (AWS).
Scale-on-demand (my own term) is a really cool feature, and it is part of the basic strategy for Amazon Web Services. The basic argument is that if one must own or rent enough equipment to cover peak use, then that equipment will not be fully utilized. Consider a company whose normal load is 1/10th of its periodic spikes in activity, say around noon. In own/rent models, the company would need 10 times the equipment just to handle the spikes, even if the spikes only lasted for 1 hour a day. The average load is far lower: one computer running for 24 hours, plus 9 extra computers for the 1 hour spike, comes to 24 + 9 = 33 computer-hours a day, and 33/24 = 1.375. So we would need 1.375 computers instead of 10 if we could spread our load out evenly over time.
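The arithmetic above can be checked with a few lines of Python. This is only a sketch of the toy example: the function name and parameters are mine, and it assumes one flat baseline load plus a single daily spike.

```python
def average_machines(baseline, peak, peak_hours, period_hours=24.0):
    """Average number of machines kept busy over one period.

    baseline   -- machines needed outside the spike
    peak       -- machines needed during the spike
    peak_hours -- length of the spike within the period
    """
    machine_hours = baseline * (period_hours - peak_hours) + peak * peak_hours
    return machine_hours / period_hours


# The example from the text: 1 machine normally, 10 during a 1 hour spike.
avg = average_machines(baseline=1, peak=10, peak_hours=1)
print(f"average load: {avg:.3f} machines")
print(f"utilization of a fleet sized for the peak: {avg / 10:.2%}")
```

The second figure is the interesting one: a fleet sized for the peak sits mostly idle, which is the gap a scale-on-demand provider can sell into.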
This shows that there is room for a middleman who aggregates computing resources to make a profit. If the middleman can effectively spread out utilization spikes so that the utilization pattern for a large number of companies approaches the average utilization, then machines can approach 100% utilization. This is exactly the approach that Amazon is taking with its various services which make up AWS.
AWS has many individual component services which each have different applications. Those of you who work in the Internet space or in distributed computing will recognize many of the pieces of scalable Internet storage – entity storage, MySQL database instances, key-value persisted storage, virtual instances (à la Xen), durable queues. There are already many startup companies that have built their entire infrastructure using AWS.
Are Personal Backups Using S3 Cost Effective?
I’ve seen a lot of blog posts about doing personal backups to S3, which one must admit sounds extremely cool at first glance. S3 is Amazon’s Simple Storage Service, which allows one to write, read and delete objects of up to 5GB. The objects are retrieved from buckets you define via the bucket id and object name.
The best way to do incremental backups is to look at the state of what is backed up, and then produce a delta of what has changed. This way, we only send the portions of the files that have changed. This delta system works extremely well for many types of files. And it turns out there is a well-known and dependable tool which does this well: rsync. Rsync calculates a delta against the remote copy, and sends only the minimal set of file updates via ssh. There are also some clever minimal backup packages that use Unix hard links to provide “minimal space snapshots.” This technique is so cool and simple, but it does rely on hard links, which are not available on every OS (but all *nix have them). If hard links aren’t available in a particular OS/filesystem, rsync can be trivially used to maintain a single up-to-date snapshot (a cron job invoking rsync is all it takes).
Note that rsync should be installed on any Linux host from a web hosting company, so if you have ssh access, you can use this backup technique, with almost no setup required. Here are some links:
- Rsync – file copy utility which computes and transmits minimal deltas to update.
- Rsnapshot – a backup program based on rsync and written in perl which implements the minimal space snapshot idea with hard links.
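To make the hard-link snapshot idea concrete, here is a rough Python sketch that only builds the rsync command line for one dated snapshot. The host name, paths, and function name are made up for illustration; the flags used (`-a`, `-e ssh`, `--link-dest`) are standard rsync options. With `--link-dest`, files unchanged since the previous snapshot are hard-linked on the remote side instead of being stored again.

```python
import shlex
from datetime import date, timedelta
from pathlib import PurePosixPath


def build_snapshot_command(source, remote, snapshot_root, today, previous=None):
    """Return the rsync argument list for one dated snapshot.

    snapshot_root should be an absolute path on the remote host, so the
    --link-dest directory is resolved the same way as the destination.
    """
    root = PurePosixPath(snapshot_root)
    cmd = ["rsync", "-a", "-e", "ssh"]
    if previous is not None:
        # Unchanged files become hard links into the previous snapshot.
        cmd.append(f"--link-dest={root / previous.isoformat()}")
    cmd += [source, f"{remote}:{root / today.isoformat()}/"]
    return cmd


# Hypothetical host and directories, purely for illustration.
today = date(2008, 8, 2)
cmd = build_snapshot_command(
    source="/home/me/photos/",
    remote="me@backup.example.com",
    snapshot_root="/home/me/snapshots",
    today=today,
    previous=today - timedelta(days=1),
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from a nightly cron job would perform the backup. Rsnapshot does this bookkeeping (and the rotation of old snapshots) for you, so treat this as an illustration of the idea rather than a replacement.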
Next, let’s look at the cost analysis. Remember that Amazon Web Services can help us save money, but is designed from the standpoint of providing scale-on-demand ability versus purchasing for peak utilization. We are essentially “buying insurance” against usage spikes. But our simple backup application is completely predictable, and will never have any usage spikes. This is for our home backups, not to provide scalable backups for all the consumers out there on the Web. The backups are easy to schedule so that there are never any spikes, and thus we can completely predict our usage. Not only that, but backups are not latency sensitive – I just want to make sure they happen every night. I don’t really care if they take an extra 10 minutes.
S3 costs $0.15 per GB per month for storage. For 500 GB, this would be $75 a month. Web hosting services provide a variety of monthly plans from about $11, and give between 500 and 2500 GB. In fact, there are services which offer “unlimited storage.” Unlimited storage is simply another play on the averages. If you have enough customers, you can give people unlimited storage, because your average storage will be extremely low. It’s like an all-you-can-eat buffet: the average person is paying for the people loading their plates.
Thus, purely from a cost perspective, storage for backups from a web hosting company should be roughly 1/7th – 1/34th the cost of Amazon S3 ($11 against $75 for 500 GB, or against $375 for 2500 GB).
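For anyone who wants to plug in their own numbers, the comparison is easy to reproduce. This sketch uses the prices quoted above ($0.15 per GB per month for S3, about $11 a month flat for hosting) and, as a simplifying assumption, ignores S3 data transfer and request fees.

```python
S3_PRICE_PER_GB_MONTH = 0.15   # storage price quoted in the text
HOSTING_FLAT_MONTHLY = 11.00   # approximate entry-level hosting plan


def s3_monthly_cost(gigabytes):
    """Monthly S3 storage bill, storage only (no transfer or request fees)."""
    return gigabytes * S3_PRICE_PER_GB_MONTH


def cost_ratio(gigabytes):
    """How many times more S3 costs than the flat hosting plan."""
    return s3_monthly_cost(gigabytes) / HOSTING_FLAT_MONTHLY


for gb in (500, 2500):
    print(f"{gb} GB: S3 ${s3_monthly_cost(gb):.2f}/month, "
          f"{cost_ratio(gb):.1f}x the hosting plan")
```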
One final issue to look at is availability. Amazon has an actual SLA for S3 which guarantees 99.9% uptime, or you will get a reduced rate for the period of the outage. This is actually pretty cool, as they will return 10% to 25% of your money for that month. Keep in mind this is good, as the incentives will push Amazon to get better and better, but it’s no guarantee of stability. Once they have a “really bad month,” the incentive to keep the month good goes away. After all, under the Amazon S3 Service Level Agreement they could be down the entire month (0% uptime), but still charge you 75% of your storage fees. In extremis this sounds ridiculous, but this is a problem in the structure of most SLAs, not just Amazon’s.
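The shape of that incentive is easier to see as a step function. The sketch below is my reading of the terms, not Amazon’s wording: the 10% and 25% credits and the 99.9% guarantee come from the discussion above, while the 99% breakpoint between the two tiers is an assumption on my part.

```python
def sla_credit_fraction(uptime):
    """Fraction of the monthly bill credited back for a given uptime (0.0-1.0).

    The 0.99 breakpoint between the 10% and 25% tiers is an assumption.
    """
    if uptime >= 0.999:
        return 0.0   # guarantee met, no credit
    if uptime >= 0.99:
        return 0.10  # missed the guarantee, smaller credit
    return 0.25      # worst tier: the credit never grows beyond this


def amount_charged(bill, uptime):
    """What you still pay after the service credit is applied."""
    return bill * (1 - sla_credit_fraction(uptime))


for uptime in (0.9995, 0.995, 0.5, 0.0):
    print(f"uptime {uptime:.2%}: pay ${amount_charged(75.0, uptime):.2f} of $75.00")
```

Once uptime drops below the last breakpoint the charge stays flat all the way down to 0% uptime, which is exactly the “really bad month” problem.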
In the month of July, Amazon S3 had an 8-hour outage due to a data corruption problem. Amazon CTO Werner Vogels mentions in his blog that the root cause was single-bit corruption of internal state messages that are distributed via Gossip techniques. It’s good to keep in mind that the core technologies for AWS are new and so have a few minor kinks to work out. While Amazon does not have to rely on software upgrades to existing services for revenue, they still have to provide missing features for their customers. So I would expect errors like this to abate over time, but not completely vanish. In a distributed system, any error, regardless of how small, becomes serious through amplification.
This puts them at roughly 98.9% for the month of July (8 hours out of the 744 hours in a 31-day month), just for this event. To their credit, the response was very professional, swift and public. CEO Jeff Bezos even talked about it. To put this in perspective, service companies usually apologize in private, and try to make it up when the monthly SLA reports are given to customers. In my mind this is one of the things, besides the technology, which puts Amazon way out in front in the race to define this industry. I especially like the Amazon AWS dashboard which shows their history of outages.
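Turning an outage into a monthly uptime percentage is a one-liner, sketched here in Python. The function name is mine, and the only inputs are the 8 hours quoted above and the length of the month.

```python
def monthly_uptime(outage_hours, days_in_month):
    """Uptime fraction for a month with a given number of outage hours."""
    total_hours = days_in_month * 24
    return 1 - outage_hours / total_hours


# The 8 hour outage, measured against July (31 days) and a generic 30-day month.
july = monthly_uptime(8, days_in_month=31)
generic = monthly_uptime(8, days_in_month=30)
print(f"31-day month: {july:.2%}")
print(f"30-day month: {generic:.2%}")
```

Either way the month lands well below the 99.9% guarantee, which allows only about 43 minutes of downtime in a 30-day month.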
The web hosting companies have no SLAs for most self-serve customers (although upscale web hosting, which costs more, may have SLAs for businesses). Uptime is measured by third party companies. It’s not clear how the measurements are taken or what the relationships are with the hosting companies. I have to believe the hosting companies are providing advertising money to these uptime companies, which doesn’t bode well for the objectivity of the measuring company. Still, it is not hard to find companies with uptime of 99.5%. This would equate to about 3.6 hours of downtime a month, which for a backup solution seems completely acceptable, especially at roughly 1/7th the price. This does not take into account data transfer or computational costs, which would be free for the web hosting company, but translate into even higher S3 costs.
Also note you can have a secondary web hosting service, or even a different server with the same service, to get much better availability through trivial redundancy. If outages on the two servers are independent, two hosting servers at 99.5% each would give availability of 99.9975%, which is about 1 minute of downtime a month, and far better than Amazon, at less than a third of the cost.
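The redundancy figure comes from multiplying failure probabilities. Here is a small Python sketch of that calculation; it assumes the servers fail independently (two boxes at the same hosting company may well not) and uses a 30-day month of 720 hours.

```python
def combined_availability(*availabilities):
    """Availability of a set of redundant servers: up if at least one is up.

    Assumes outages are independent of each other.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1 - availability
    return 1 - all_down


def downtime_minutes(availability, hours_in_month=720):
    """Expected minutes of downtime in a month at the given availability."""
    return (1 - availability) * hours_in_month * 60


single = 0.995
pair = combined_availability(single, single)
print(f"one server:  {single:.4%} up, {downtime_minutes(single):.0f} min down per month")
print(f"two servers: {pair:.4%} up, {downtime_minutes(pair):.2f} min down per month")
```

For backups the independence assumption is less critical than it looks: a missed nightly run can simply be retried the next night.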
Note that either of these solutions is superior to owning your own hardware, as Jeremy Zawodny calculates. Also, if you are interested in software which backs up to S3, Jeremy has done the legwork for you.
Amazon S3 (and AWS) really shines as a model for scale-on-demand. This makes a lot of sense for things like a website with a video that becomes popular overnight, and needs to scale up to meet a rush of users. But the scale-on-demand approach loses the pricing advantage when the max utilization is close to the average utilization. For long term storage this tends to be true, as it grows very slowly. I’m thinking S3 is designed more for scaling I/O reads and writes, which are costly features our backup solution won’t use.
Don’t get me wrong, I think S3 and AWS in general are the coolest thing since sliced bread, but they don’t make sense for every project (especially where scalability isn’t important). We’ll come back to some cool uses of AWS services in future articles where the model does make sense.
Giving Dreamhost A Try
I’d heard good things about Dreamhost, so I’m giving them my business. Setting up backups was trivial: rsync was already installed, so my first rsync command worked on the first attempt. We’ll see how the availability is over the coming months.