Build a MySQL Fabric Farm in one step using AWS CloudFormation

I have been building a CloudFormation template for MySQL Fabric as an experiment to kick the tyres and further my understanding of how it all works.

For those not familiar, CloudFormation is a way of representing a collection of Amazon Cloud resources (a "stack") into a static json file (a "template"). In my previous company, we managed our AWS account almost exclusively via CloudFormation, and there are two key advantages I see with managing resources via templates:

  • It allows for all environments to be synchronized (dev/qa/production).
  • It allows for very fast disaster recovery (you can very quickly pick up and restore in another region).

CloudFormation is also a great-fit for MySQL Fabric, since it can remove many of the essential bootstrap steps that come with building a distributed system. It took a bit of work, but I managed to build a template to bring Fabric installation down to three simple questions:

Here are some of the interesting parts:

  • I am using the latest Amazon Linux AMI + the official MySQL yum repositories.
  • The fabric HA group GLOBAL1 is created in an AutoScalingGroup.
  • When the instance in the AutoScalingGroup starts up, it talks back to the fabric backing store to add itself to the HA group.
  • If the AutoScalingGroup is expanded to 2, a new instance will talk to the fabric backing store, find a peer in the same group and create a backup. Once the backup is restored on the new server, it will add itself to the Fabric HA group.

The less interesting part of course is that this is a proof-of-concept. It needs a bit of work to improve error checking etc. You can find the resulting code in my mysql-fabric-scripts project. The file stack.json can be uploaded directly into the Amazon Web Console.

Installing the latest MySQL 5.6 on Amazon Linux using official repos

In case you missed the announcement Monday, there are now official yum repositories for Fedora 18, 19 and Enterprise Linux 6 (Red Hat and Oracle Linux).

In my case however, I often use Amazon Web Services, and I find that Amazon Linux is just a little bit friendlier to use, since it has better mirror configuration out of the box.

The good news for me is that the official MySQL yum repositories (unofficially) work just fine. Here is an example installation:

sudo yum localinstall http://repo.mysql.com/mysql-community-release-el6-3.noarch.rpm
sudo yum install mysql-community-server
sudo service mysqld start

Configure mysqld to start on boot:

sudo chkconfig mysqld on
chkconfig --list mysqld

And that's it, you're all set up and running. One thing I noticed, is the configuration file included is very clean! This is because MySQL has improved all the defaults in 5.6:

# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html

[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

# Recommended in standard MySQL setup
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

It doesn't get much simpler than this. I love it!

Norvald has also blogged about some of the other improvements to the packaging (systemd,
tmpfiles.d, SELinux, multilib).

Thoughts on "Amazon Offsite Backup"

A lot of people that ask me what my plan is "when Amazon goes down". It is hard to answer this question directly, since I think most users still see AWS as one cloud or global all-encompassing service. In reality it is:

  • Eight regions (9 if you count gov cloud) in different parts of the world.
  • Multiple availability zones in each region, providing physical isolation.

Amazon's default advice is that it is your responsibility to make sure your application can survive an Availability Zone outage - and in my case I almost can: databases are Multi-AZ, webservers are Multi-AZ. The only piece of infrastructure that currently violates this is a search service that ties us to us-east-1a via an EBS volume.

However, Availability Zones won't cover regionalized disasters, such as Hurricane Sandy, and it won't cover all of Amazon's oopses.

For the applications which we need higher availability than multi-AZ, I would much rather exhaust all of AWS's seven other regions since I can guarantee 100% compatible APIs. When I've finished with this list, to me it's time to start looking at third party providers. I think only a few edge cases fit in this category, such as NSD existing to increase the gene pool against software flaws/exploits.

It is also very easy to purchase a DNS service with latency-based routing and failover (via a probe URL you can specify) with providers like DynDNS and Neustar's UltraDNS to implement an active/passive or an active/active (requires application support). AWS even announced DNS based failover this year, but at the moment it has a critical limitation that it can not health check its load balancers. Maybe in the future this will get even easier!

Benchmarking & Prewarming Amazon ELBs

One of the things we discovered when benchmarking our improvements to OpenX 2.6 - is that it is actually very difficult to do so on EC2. How I assume ELBs work inside Amazon, is that they are built on top of EC2 instances, and you start off with one EC2 instance per availability zone you have selected. The load balancers are then load balanced themselves via DNS round-robin. This allows Amazon to treat every AZ as physically isolated without cross-talk interdependencies.

So now the part where I said it is difficult:

  • If you fire traffic at your load balancer in a naive way - what you will often find is that you always hit just a single load balancer in one availability zone. This seems to maybe max out at 20K requests/minute even if you have sufficient capacity behind the balancers.

  • Even if you fire traffic from multiple locations to get around the cached DNS result, it still starts off scaled down. Like I said above, I think you start off with one EC2 instance per availability zone selected. Amazon seems to employ their own auto scaling to detect how much capacity you need and expand the resources based on this. From my anecdotal evidence you should expect this to take 30 minutes to 1hr.

We went live in December 2011 with our OpenX 2.6 changes actually knowing/having discovered this pre-warming limitation, but expecting it to be closer to 20 minutes (we were on a short deadline running out of capacity in our data center). It was a test of nerves to say the least.

What I know now, is that you don't have to take the hit at all. All you need to do is buy support with Amazon, and then open a ticket and ask them to manually scale up to be able to handle X requests. They will ask you to specify a timeframe you need this manual scaling for (since they don't like to keep things in manual mode), but other than that this avoids all the pains I spoke of. Fast forward to 2012, and I managed to serve 242K requests/minute peak during an Apple product launch, and servers didn't break a sweat.

Scaling OpenX Click and Impression Logging

By coincidence, my last two jobs I've been in have seen me administer a very high traffic ad server. Or maybe not so much a coincidence, since InnoDB is a great OLTP database with a focus on consistent response time - not just raw throughput. It is also very well optimized for the simpler style queries (primary key lookups, simple secondary key etc) which seems to be all that is needed to serve ads.

Anyway, in 2011/2012 I found myself managing OpenX - and discovered some quirks with it.

OpenX 2.6

When I first inherited an OpenX 2.6 install, it was configured in a mode called distributed statistics. What this means is that you have one central database server, and then a series of slave MySQL servers. The impression and click recording web servers write to a slave server, which then has a cron that batches up and writes to the master database server. The master database server always has a copy of all the data - and has a cron that runs a summarization of the data to feed reports.

What I didn't like about this design is that the master always required all of the data anyway. so the slaves weren't necessarily adding any capacity. What they were doing however was not without benefit - they were a poor man's message queue. As data arrived there was some opportunity for flow control so your peaks would be smoothed out. It also enable you to optimize for throughput rather than response and put load on the master without impact to ad serving.

Message queues are a great part of any architecture - I am in love with them. However, in OpenX particular implementation, since the slaves didn't have all the data, and the master's summarization used temporary tables it also caused a lot of problems. We couldn't upgrade to row-based replication because the summarization process created temporary tables with the TYPE=X syntax which had since been deprecated - locking us into an earlier version of MySQL as well.

I also profiled the code and noticed that our performance was very erratic. Sometimes requests were super fast, other times they would wait minutes. What I discovered was that the cron that ran on the slave servers was blocking - when it ran every 5 minutes (configurable) it would CHECK TABLE, then REPAIR TABLE if required! - a paranoid design that is centered around MyISAM. InnoDB has internal consistency checking with page checksums and doesn't require this. So the first thing I did to improve OpenX 2.6 performance was comment out CHECK TABLE/REPAIR TABLE, and things were instantly much better.

When Open 2.6 was released in 2008, maybe more people had single core machines and web servers processed less traffic - so they may not have noticed exactly how hot of a lock waiting on CHECK TABLE/REPAIR TABLE was - which is a single threaded process inside MySQL. Since our slave servers sat on the web servers, we had these locks fairly well distributed, and since I don't think the servers synced with NTP, this managed to stay functional for years because machines would fall in and out of service from the load balancer on different cycles. However, as traffic per server increased, it also got much worse.

I also changed the slave servers to not be on the web servers - but a separate tier of servers, which supported x3 web servers per slave without any issues. To me this was more elegant architecturally, because of the time consuming process of rebuilding slaves as they failed due to the master's use of temporary tables, and that database servers tend to have uniquely different needs to web servers (more memory, faster disks, CPUs less important). However, my little architecture optimization presented me with another problem - the cron that ran on the slaves was consuming too much memory and dying!

What I discovered was that it was simply reading from a few tables, and then inserting those rows on the master (like I said above - a poor man's message queue). But it did this in PHP and used all sorts of associative arrays that were not memory efficient. I managed to reverse engineer and change this cron to be just a simple shell script which used the MySQL CLI and was incredibly more efficient. It also meant that my CHECK TABLE/REPAIR TABLE patch was obsolete - which made me comfortable having slightly less OpenX code to maintain.

However, OpenX 2.6 was still a beast at best. The 'monthly' part of the maintenance script on the master couldn't delete old impression rows as fast as new ones came in - and the total impressions table was over 1TB and caused a lot of locking inside InnoDB! So I commented out the monthly cleanup code and wrote a standalone script to constantly cleanup old versions of the data. However, my standalone script wasn't fast enough either, so every month we would temporarily disable the slave crons and master cron while we created a new table like the previous table and then inserted just the small fraction of data we required, renaming and dropping the old table. It was very high effort to maintain, and not particularly operationally friendly.

OpenX 2.8

OpenX obviously figured out on their own that there was a problem with how they were collecting data, because with 2.8 they changed from using raw impression logging of one row per impression to something they called Bucket Based Logging . What they would do, is record only a running count of impressions per ad-zone/creative-id/time-period, and perform an insert on duplicate key update count = count+1. They still had the ability to have "distributed" statistics with the slaves having a part of the data, and the maintenance on the slaves would have a special update to the master with an insert on duplicate key update count = count + my_number.

Taking from my earlier thought process where the slaves are really just message queues, I managed to find a really non invasive way to modify the OpenX code to not write to MySQL, but instead write to a message queue - in my case ZeroMQ. I then wrote some proof of concept code (that a colleague of mine later improved) which would receive from the message queue, and then merge requests by ad-zone/creative-id/time-period and then insert directly onto the master server. Eliminating those pesky slaves!

I left before the project could go live - and from what I heard afterwards, it unfortunately never did. But when I reflect on it, I really don't like OpenX's 2.8 bucket based logging and from an operational perspective - I think this would have still been hire maintenance. If one slave was too slow to phone home statistics before the master ran its summary cron, you will be missing statistics. And since the on duplicate key update count = count+1 style design is not idempotent - you can't easily regenerate statistics to fix data. You are stuck with inaccurate results.

What I would have done now

I would have kept the bucket based logging in 2.8, but cut out the message queue. I would have changed the code that records the click and/or impression to essentially be a no-op, and configured Apache so that it would write out a log file every 5 minutes in a format that records the fields I need:

{% raw %}
# /etc/httpd/conf.d/openx.conf
LogFormat "%{%Y%m%d-%H%M%S}t\t%>s\t%r\t%{X-Forwarded-For}i\t%{User-agent}i\t%{Referer}i" openxFmt
CustomLog "|usr/sbin/rotatelogs /var/log/httpd/in/openx.log-%Y%m%d.%H%M%S 300" openxFmt
{% endraw %}

Then I would have written a cron to find all log files not currently open, and then summarize them similar to how my message queue reader did + upload the log file to permanent storage (in our case Amazon S3 would have worked). I don't think that MySQL is the correct place to store the raw data - but it's essential to be able to reprocess as necessary:

for LOGFILE in `ls /var/log/httpd/in/openx.*`; do

if /usr/sbin/lsof -c rotatelog | grep -F $LOGFILE; then
  echo "Log file $LOGFILE is currently open.  Skipping this iteration.";
else
 echo "Log file $LOGFILE is unopened.  Doing stuff..";
 doStuffAndMoveSomewhere();
fi
done

Keeping these log files in plain text on S3 make them easy victims for Amazon EMR - which can read and process directly, and then store the results in MySQL. I would have loved to have eliminated the summarization cron that ran on the master and turned it into a hive query. Then I would have used sqoop to write the results back to MySQL. I am sure refactoring the summarization/maintenance cron to Hive would have been difficult - there were a lot of strange behaviors in it, but it would have made things idempotent, reduced a lot of load on my MySQL servers (which are not good at the complex summarization queries) and meant I could avoid the temporary tables which made replication slaves hard. I would have kept one replication slave for HA.

Also - the more I think about it, being able to keep raw log files is so important. They are incredibly easy to compress and retain, and often take up a lot less space than any (row-store) database indexed equivalent would.