How do you handle Amazon EC2's failures?
We've just moved our website to Amazon EC2, and within about 2 hours of going live, our proxy server went down. Just disappeared. We couldn't even terminate the instance.
OK, temporary glitch. It happens.
2 weeks go by, then last night, our alarms go crazy. All 3 database servers have gone down. The instances are still listed, but they're not responding to SSH or even ping.
We try to reboot the instances. From the console log, I can see that they reboot. Still not accessible. I launch another instance of our DB AMI. It boots, but is also unresponsive.
Eventually we boot a vanilla AMI, reinstall the DB and attach the EBS volumes.
Next surprise - 2 of our EBS volumes have disappeared - or at least the data on them has. Fortunately, they were redundant copies. But what happens if next time they're not?
All in all, we had 2 hours of downtime. That's more than in the previous 3 years with dedicated servers put together.
How on earth do other companies maintain their uptime (and data!) on a service that seems to fail way too frequently?