Thoughts on “Amazon Offsite Backup”

A lot of people ask me what my plan is “when Amazon goes down”. It is hard to answer this question directly, since I think most users still see AWS as one global, all-encompassing cloud service. In reality it is:

  • Eight regions (nine if you count GovCloud) in different parts of the world.
  • Multiple availability zones in each region, providing physical isolation.

Amazon’s default advice is that it is your responsibility to make sure your application can survive an Availability Zone outage – and in my case I almost can: databases are Multi-AZ, webservers are Multi-AZ. The only piece of infrastructure that currently violates this is a search service that ties us to us-east-1a via an EBS volume.
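If you want to audit where single-AZ dependencies like that are hiding, a quick pass over the EC2 and RDS APIs will list them. This is only a minimal sketch using boto3 (not something from our actual setup); it assumes your credentials and default region are already configured.

    import boto3

    ec2 = boto3.client("ec2")
    rds = boto3.client("rds")

    # Which AZ is each EBS volume pinned to? Attached volumes are the ones
    # that can tie a service (like our search box) to a single AZ.
    for vol in ec2.describe_volumes()["Volumes"]:
        attached = [a["InstanceId"] for a in vol["Attachments"]]
        print(vol["VolumeId"], vol["AvailabilityZone"], attached or "unattached")

    # Which RDS instances are actually running Multi-AZ?
    for db in rds.describe_db_instances()["DBInstances"]:
        print(db["DBInstanceIdentifier"], "Multi-AZ" if db["MultiAZ"] else "single-AZ")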

However, Availability Zones won’t cover regional disasters, such as Hurricane Sandy, and they won’t cover all of Amazon’s oopses.

For the applications for which we need higher availability than Multi-AZ, I would much rather exhaust all of AWS’s seven other regions first, since I can guarantee 100% compatible APIs. Only once I’ve finished with that list is it time to start looking at third-party providers. I think only a few edge cases fit in this category, such as NSD, which exists to increase the gene pool against software flaws/exploits.

It is also very easy to purchase a DNS service with latency-based routing and failover (via a probe URL you specify) from providers like DynDNS and Neustar’s UltraDNS, to implement either an active/passive or an active/active setup (the latter requires application support). AWS even announced DNS-based failover this year, but at the moment it has a critical limitation: it cannot health check its own load balancers. Maybe in the future this will get even easier!
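For reference, here is roughly what the DNS side of an active/passive setup looks like when driven through Route 53’s API. This is a hedged sketch using boto3; the hosted zone ID, domain name and IP addresses are all hypothetical, and it deliberately points at plain instance IPs rather than load balancers, given the health-check limitation above.

    import boto3

    route53 = boto3.client("route53")

    # Health check that probes the primary endpoint's URL (hypothetical values).
    hc = route53.create_health_check(
        CallerReference="primary-web-check-1",
        HealthCheckConfig={
            "Type": "HTTP",
            "IPAddress": "203.0.113.10",
            "Port": 80,
            "ResourcePath": "/healthcheck",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # Active/passive: the PRIMARY record is served while its health check
    # passes; otherwise Route 53 falls back to the SECONDARY record.
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE12345",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A",
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "TTL": 60, "ResourceRecords": [{"Value": "203.0.113.10"}],
                "HealthCheckId": hc["HealthCheck"]["Id"],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "www.example.com.", "Type": "A",
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "TTL": 60, "ResourceRecords": [{"Value": "198.51.100.20"}],
            }},
        ]},
    )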

Benchmarking & Prewarming Amazon ELBs

One of the things we discovered when benchmarking our improvements to OpenX 2.6 is that it is actually very difficult to do so on EC2. My assumption about how ELBs work inside Amazon is that they are built on top of EC2 instances, and you start off with one EC2 instance per availability zone you have selected. The load balancers are then load balanced themselves via DNS round-robin. This allows Amazon to treat every AZ as physically isolated, without cross-talk interdependencies.
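You can see the DNS round-robin part for yourself by resolving the load balancer’s hostname a few times. A tiny sketch (the ELB hostname below is hypothetical):

    import socket
    import time

    ELB_DNS = "my-elb-1234567890.us-east-1.elb.amazonaws.com"  # hypothetical

    # Each resolution returns the ELB nodes currently in rotation,
    # typically one IP per availability zone you enabled.
    for _ in range(3):
        _, _, addresses = socket.gethostbyname_ex(ELB_DNS)
        print(addresses)
        time.sleep(60)  # ELB records use a 60-second TTL, so answers rotate over time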

So now the part where I said it is difficult:

  • If you fire traffic at your load balancer in a naive way, what you will often find is that you always hit just a single load balancer node in one availability zone. In my experience this seems to max out at around 20K requests/minute, even if you have sufficient capacity behind the balancers. (See the sketch after this list for one way to spread traffic across all the nodes.)

  • Even if you fire traffic from multiple locations to get around the cached DNS result, the balancer still starts off scaled down. Like I said above, I think you start with one EC2 instance per selected availability zone. Amazon seems to employ their own auto scaling to detect how much capacity you need and expand the resources accordingly. From my anecdotal evidence you should expect this to take 30 minutes to an hour.
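To get around the first problem, the benchmark client can resolve the ELB’s hostname itself and round-robin requests across every node, rather than letting the OS resolver pin it to one IP. A minimal sketch, assuming plain HTTP and the third-party requests library; the hostname and Host header are hypothetical:

    import itertools
    import socket

    import requests  # third-party: pip install requests

    ELB_DNS = "my-elb-1234567890.us-east-1.elb.amazonaws.com"  # hypothetical
    HOST_HEADER = "www.example.com"                            # hypothetical

    # Find every ELB node currently in DNS rotation (roughly one per AZ).
    _, _, elb_ips = socket.gethostbyname_ex(ELB_DNS)

    # Spread requests across the nodes ourselves instead of relying on the
    # resolver, which usually caches a single answer and pins you to one AZ.
    for ip in itertools.islice(itertools.cycle(elb_ips), 1000):
        resp = requests.get(f"http://{ip}/", headers={"Host": HOST_HEADER}, timeout=5)
        print(ip, resp.status_code)

Note this only works cleanly for plain HTTP; with HTTPS the certificate check (and SNI) is tied to the hostname, so you would need a client that can pin the connection address while keeping the original hostname.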

We went live in December 2011 with our OpenX 2.6 changes having already discovered this pre-warming limitation, but expecting it to take closer to 20 minutes (we were on a short deadline, running out of capacity in our data center). It was a test of nerves, to say the least.

What I know now is that you don’t have to take the hit at all. All you need to do is buy support from Amazon, then open a ticket and ask them to manually scale the load balancer up to handle X requests. They will ask you to specify the timeframe you need this manual scaling for (since they don’t like to keep things in manual mode), but other than that it avoids all the pains I spoke of. Fast forward to 2012, and I managed to serve a peak of 242K requests/minute during an Apple product launch, and the servers didn’t break a sweat.