How do you handle Amazon EC2's failures?

We've just moved our website to Amazon EC2, and within about 2 hours of going live, our proxy server went down. Just disappeared. We couldn't even terminate the instance.

OK, temporary glitch. It happens.

2 weeks go by, then last night, our alarms go crazy. All 3 database servers have gone down. They're there, just not responding to ssh or even ping.

We try to reboot the instances. From the console log, I can see that they reboot. Still not accessible. I launch another instance of our DB AMI. It boots, but is also unresponsive.

Eventually we boot a vanilla AMI, reinstall the DB and attach the EBS volumes.

Next surprise - 2 of our EBS volumes have disappeared - or at least the data on them has. Fortunately, they were redundant copies. But what happens if next time they're not?

All in all we had 2 hours of downtime. More than the previous 3 years with dedicated servers put together.

How on earth do other companies maintain their uptime (and data!) on a service that seems to fail way too frequently?


I'm afraid this doesn't help as such.. but...

I've had an instance running for over a year without issue (it's in the US zone, don't know if that makes any difference).

For EBS, I take hourly/daily/weekly/monthly snapshots to S3, rotating all except the monthly ones. Net::Amazon::EC2 lets you use the API for this.
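
For anyone wanting to try the same rotation, here is a rough sketch of the pruning logic in Python. It is not the script I actually run, and it makes no EC2 calls at all. The tier names, the retention counts, and the `snapshots_to_delete` helper are all made up for illustration. It assumes you record each snapshot's id, tier, and creation time yourself, and that you pass the returned ids to whatever delete call your EC2 library provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical retention counts per tier (assumed values, tune to taste).
# None means "never rotate", which is how the monthly snapshots are treated.
RETENTION = {"hourly": 24, "daily": 7, "weekly": 4, "monthly": None}


def snapshots_to_delete(snapshots, retention=RETENTION):
    """Return the sorted ids of snapshots that have aged out of their tier.

    `snapshots` is an iterable of (snapshot_id, tier, created_at) tuples.
    Tiers missing from `retention` are left alone, to err on the side of
    keeping data.
    """
    by_tier = defaultdict(list)
    for snap_id, tier, created_at in snapshots:
        by_tier[tier].append((created_at, snap_id))

    doomed = []
    for tier, items in by_tier.items():
        keep = retention.get(tier)
        if keep is None:
            continue  # monthly (or unknown tier): keep everything
        items.sort(reverse=True)  # newest first
        doomed.extend(snap_id for _, snap_id in items[keep:])
    return sorted(doomed)
```

With 30 hourly snapshots and the counts above, the 6 oldest hourly ones come back for deletion and the monthly ones are never touched.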

I also run a spare machine on another cloud for major disaster recovery scenarios, although I've not had to use it yet *cross fingers*

You could always give Rackspace a try with their Cloud Servers. When those go down your data doesn't get wiped away. Their support also seems to be very good, although I haven't really experienced that (no need yet...) as I currently only use Cloud Files.


About Clinton Gormley

The doctor will see you now...