Amazon have suffered a number of problems in the last couple of days. Disturbingly, it seems that Amazon's concept of separate "availability zones" - parts of their infrastructure which should be completely unaffected by issues elsewhere - may not have worked as planned. As a result, a number of high-profile websites have been knocked off the net: it's been fashionable to host infrastructure on the "public cloud", and the Web 2.0 crowd have suffered this week.
Almost exactly a year ago I wrote about our approach to achieving reliability and to using third parties within our infrastructure; it's worth reiterating now:
Basically, we think that the cloud model works - brilliantly - where there's a simple customer/supplier relationship, but that it can break down when there are hierarchies of services unless you think very carefully about how you will deal with the contingencies. It's a little different from traditional business relationships, where you have the luxury of at least a little time to sort out most issues: we need our infrastructure to be always available and reliable. We don't want to be left trying to diagnose a third-party infrastructure (like EC2), responsible for sorting out issues within it but without the ability to do so.
So we took a different route. We built our own infrastructure and we are responsible for its management - right down to the hardware. Although we do use third parties for some of the components, there is always redundancy: multiple networks, multiple locations. If a provider fails to deliver a service, we can call on an alternative, so we can be certain of delivering the service levels we commit to in our SLA. Maybe we will use some Amazon services in the future, but if we do, they'll be non-core and we'll be sure to have a backup plan.
In the meantime, we can see that our approach has worked, and we are able to report some excellent availability statistics for the past twelve months. We are not complacent and we do not take this for granted: we will continue to be as proactive as we can to reduce risk further.
I will be interested to read Amazon's post-mortem analysis of what went wrong, to see what we can learn from it.