When does MySQL data get loaded in and out of cache?

A cold cache, or a poorly tuned cache can be responsible for a number of performance problems. If we look at the data and indexes of InnoDB, the cache responsible is called the InnoDB buffer pool.
In earlier versions of MySQL the default value for the setting innodb_buffer_pool_size was 8M – which is a little out of date when you consider the recommended value to be 50-80% of system memory. In MySQL 5.5 the default value was increased to 128M, which is a comprimise made for users that may install MySQL on a development machine and not have it running as a dedicated server. In production, it is unlikely that you will buy a new server with less than 64GB of RAM, so it is quite typical that this setting is 50GB+
So lets talk about the behaviour of the InnoDB buffer pool –

Up until and including MySQL 5.5

When MySQL starts up, the buffer pool is completely empty. As queries are executed, MySQL will determine which pages are required – and if they are not in memory, they will be loaded from disk.
Initially, performance will be very poor – since there will be 100% cache misses, but over time as the buffer pool fills this should reduce.
Provided there is enough memory – MySQL will just keep loading pages into cache, and warm to a state where it will not have to perform any reads to the disk at all. Only writes will go to disk.

From MySQL 5.6 onwards

Because server warm time is becoming a more serious problem (it always existed, but it is exacerbated by us now having an abundance of RAM, but if we are using hard drives, they are not really any faster), Oracle introduced a feature in MySQL 5.6 – buffer pool warming.
What it does is on demand or on shutdown saves the addresses (space_id + page_id) of InnoDB pages to permanent storage.
Upon startup, these pages can then automatically be loaded back into memory straight away so that we can have cache misses for less time.
This feature is not turned on by default, but I suspect in a future version it may be. There are a number of other advantages to pre-warming such as being able to sort and merge read requests and load the data in much faster. Also, since it is only the addresses of pages being saved it only takes 8 bytes to point to each 16KB page, and there are no risks if there have been modifications since the last address saving operation ran.

Cache evictions

So we’ve described the “best case” so far, which is that data only gets loaded into cache and performance keeps getting better and better. However in practice, our cache is never unlimited. When our buffer pool gets to the point of being completely full it will need to free space in order to load new pages into memory. The behavior of this ‘eviction’ is version specific.

Up until MySQL 5.5

InnoDB previously used a classic Least Recently Used (LRU) algorithm. What this means is that each page has an order in a list. I like to think of it as a “Hot or not” score. Each time a page is accessed it gets promoted up the list to be considered “more hot”. When it comes to freeing space, InnoDB just picks the least hot page and frees it from memory to make space for the next page.
In typical operation MySQL will not free pages from memory unless it needs to make space for another page. I say ‘typical’, because pages will be freed if a table is dropped for example.

MySQL 5.5 onwards

There are a number of operational limitations with using a classic LRU implementation on a database which has many users, and typically more than one workload. For example lets say that everything is working fine and we’ve figured out what is “hot or not”, noting that we do not need as much memory as we do data because in most cases there will be pages that are not accessed frequently. Now imagine that a series of queries come in from mysqldump that want to run tablescans to export the data. What can happen, is the pages loaded in from the tablescans push out some of the good pages in our buffer pool, and even worse, as soon as the mysqldump operation is complete, they will no longer be required. So now post-mysqldump we have random IO as we try and re-settle cache-contents again.
So in MySQL 5.5 the LRU was split into two sublists:
– A ‘young’ sublist for our frequently accessed pages (default: 63%)
– An ‘old’ sublist for our table scan page accesses (default: 37%)
The configuration variable innodb_old_blocks_pct was introduced to configure the old/new split, and a variable innodb_old_blocks_time (default in 5.5: 0, in 5.6: 1000) was introduced to specify a minimum amount of milliseconds that a freshly loaded page must stay in the old sublist before it is ellible to be promoted to the young sublist.
As mentioned in the manual, and in blog posts it frequently makes sense to configure the innodb_old_blocks_time to a larger value, for example 1000 (ms).

A note on page churn

If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called “the 5 minute rule“. This means that if you load a page into memory and are going to need it again in the next five minutes – it should not need to be churned out.


I find a good interview question for a MySQL DBA position is to ask what the following commands actually do in InnoDB, which has been the default storage engine since MySQL 5.5. From my perspective there is a lot of miss-understanding what still applies.


From the MySQL manual:

ANALYZE TABLE analyzes and stores the key distribution for a table. During the analysis, the table is locked with a read lock for InnoDB and MyISAM.

What this means is, as part of query optimization MySQL will often have to decide which is the best index if there are multiple candidates, which indexes should be avoided, and what order should tables be joined in. Indexes need to eliminate work – so if for example you were trying to index a column called “Country” in a table full of all people in the USA, then it would be faster to avoid that index.
What is also important to note, is that InnoDB will often update these statistics internally without you needing to do so, and that the ‘read lock’ that the manual describes is kind of weird (shameless plug for a bug I filed back in 2007).
Up until 5.6, statistics are in memory only, and very coarse – sampling just a small amount of data. With 5.6, there is now the new Persistent Optimizer Statistics enhancement as well as innodb_stats_auto_recalc is also a variable, and STATS_PERSISTENT, STATS_AUTO_RECALC, and STATS_SAMPLE_PAGES are configurable per table as CREATE TABLE options. It’s a shame features like this don’t get enough press.
I still find myself using ANALYZE TABLE as part of debugging slow queries to confirm that it is not a stale causing a problem, but I don’t find any routine operational use case for it.
Pro-tip: My explaination above with a “Country” table being filled with all “USA” was a bit of a simplification, since not all optimizer decisions will use these pre-computed statistics. I believe in many cases if we are just talking about a single table (and not trying to determine join order), the optimizer will just ask InnoDB to estimate how many records there are in a given range. I am hoping these sort of behaviors will be more exposed in future MySQL versions. I was very happy to see EXPLAIN for UPDATE and DELETE statements in MySQL 5.6 – and in some cases the new JSON EXPLAIN format seems to show more information.


CHECK TABLE to me is a MyISAM-ism. As noted in the manual:

If CHECK TABLE finds a problem for an InnoDB table, the server shuts down to prevent error propagation. Details of the error will be written to the error log.

InnoDB is an ACID compliant (durable) data store, so it doesn’t have the same inconsistency situations that MyISAM does. The behavior that the manual is describing here is the same behavior that happens if the server detects corruption through standard operation – which is detected via CRC checksums on each page it stores.
As Oli writes check table probably won’t work on very large tables (> 200-400 GB), so this command does not really have any practical use for me.


From the MySQL manual:

REPAIR TABLE only applies to MyISAM, ARCHIVE, and CSV tables. See Section 14.3, “The MyISAM Storage Engine”, and Section 14.6, “The ARCHIVE Storage Engine”, and Section 14.5, “The CSV Storage Engine”

So no use for this either.


For InnoDB, the wording in the in the MySQL manual and the error message is very specific on this one:

For InnoDB tables, OPTIMIZE TABLE is mapped to ALTER TABLE, which rebuilds the table to update index statistics and free unused space in the clustered index. Beginning with MySQL 5.1.27, this is displayed in the output of OPTIMIZE TABLE when you run it on an InnoDB table, as shown here:

mysql> OPTIMIZE TABLE foo;
| Table    | Op       | Msg_type | Msg_text                                                          |
| test.foo | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| test.foo | optimize | status   | OK                                                                |

So what this means, is that internally we create a new table – much like the existing table, then we trickle load the data into it one row at a time. For the clustered index (aka primary key) this may result in a significant space saving if data was inserted out of order, or if there have been modifications which have caused there to be some gaps which have affected fill-factor. It’s important to note while explaining this, that some gaps are expected – as InnoDB only fills pages 15/16ths full.
For the secondary key indexes they will be trickle loaded one row at a time in the order of the clustered index – which may result in them going straight back to being fragmented anyway. InnoDB’s implementation of MVCC does have multiple versions in the secondary indexes however, so it is possible that gaps may be reclaimed here.
In MySQL 5.5, InnoDB introduced a feature called “fast index create” that could create these secondary indexes more optimally by presorting the data first and then creating the index. However, this feature is not tied into OPTIMIZE TABLE (yet) in official Oracle MySQL releases. See Bug #57583.
Due to it’s massively high cost, I find I run OPTIMIZE a lot less than other people (read: almost never), and Baron Schwartz has even has a humorous way of describing it:

.. like something you’d hear from a naive Windows user who buys a $99 piece of software to make his PC “boot faster” or “fix his registry” or something.

Again, this post being related to InnoDB – OPTIMIZE was important for MyISAM Dynamic tables. Read the manual for more.

Fastest way to estimate rows in a table

A friend wrote to me recently with a question. He was working on a method to ship application metrics to statsd on a 1 minute interval. He had three examples of how to estimate the number of rows in a table and he wanted to know the difference between them.

Data length/average row length

{% raw %}
The example given:

WHERE TABLE_NAME = 'line_items';
|              10497541.7528 |
1 row in set (0.03 sec)

I have actually never thought of using this method! I don’t think it’s accurate though, since data length has deleted space + additional preallocated or overhead space. For example a page file is only 15/16ths in InnoDB. So as you can see the number it returns is just over 10% higher than the actual number of rows (9441296).

Table rows from Information Schema

The example given:

WHERE TABLE_NAME = 'line_items';
|    9800078 |
1 row in set (0.03 sec)

This method takes a number that InnoDB provides, which in this case is accurate to within 4% of the actual number of rows (estimating slightly over in this case). So the question is really about how efficient a count already provided is, and where does it come from.
InnoDB samples a number of random pages in the table, and then estimates the total rows for the whole table. By default this sampling is just 8 pages in MySQL 5.5 (configurable via innodb_stats_sample_pages), but is greatly improved in MySQL 5.6 – with 20 pages sampled by default. The option is now called innodb_stats_persistent_sample_pages – a reference to the new persistent statistics feature!
So based on it being a fixed number of pages to examine, it is also going to scale reasonably with table growth. Pro tip: It is quite possible it may look much slower as soon as the table does not fit in memory, since 8 random pages could mean > 8 random IOs.

Select count(1)

The example given:

mysql> SELECT COUNT(1) FROM line_items;
| COUNT(1) |
|  9441296 |
1 row in set (2.03 sec)

This requires an index scan of the primary key. It’s important to explain why that is, since this behavior differs from MyISAM. InnoDB supports MVCC which is an important feature to allow concurrent access to rows without having to need readers set locks blocking other users from writing. In a practical sense what this feature means, is that at any one point in time there will be multiple versions of a row. The actual count(1) will depend on the time your transaction started, and its isolation level.
This solution will not scale well as the number of rows in the table grows and while storage-engine development is outside of my expertise, I suspect it is unlikely that this will be improved in any future MySQL versions. My reasoning is that I can not see an easy way of maintaining multiple different pre-computed counts without introducing any new global locking or overhead – which is a big no-no in being able to scale on multiple cores/cpus.
Finally, of the three solutions, this is the only 100% accurate method to be able to tell the exact number of rows in the table.
{% endraw %}

Thoughts on "Amazon Offsite Backup"

A lot of people that ask me what my plan is “when Amazon goes down”. It is hard to answer this question directly, since I think most users still see AWS as one cloud or global all-encompassing service. In reality it is:

  • Eight regions (9 if you count gov cloud) in different parts of the world.
  • Multiple availability zones in each region, providing physical isolation.

Amazon’s default advice is that it is your responsibility to make sure your application can survive an Availability Zone outage – and in my case I almost can: databases are Multi-AZ, webservers are Multi-AZ. The only piece of infrastructure that currently violates this is a search service that ties us to us-east-1a via an EBS volume.
However, Availability Zones won’t cover regionalized disasters, such as Hurricane Sandy, and it won’t cover all of Amazon’s oopses.
For the applications which we need higher availability than multi-AZ, I would much rather exhaust all of AWS’s seven other regions since I can guarantee 100% compatible APIs. When I’ve finished with this list, to me it’s time to start looking at third party providers. I think only a few edge cases fit in this category, such as NSD existing to increase the gene pool against software flaws/exploits.
It is also very easy to purchase a DNS service with latency-based routing and failover (via a probe URL you can specify) with providers like DynDNS and Neustar’s UltraDNS to implement an active/passive or an active/active (requires application support). AWS even announced DNS based failover this year, but at the moment it has a critical limitation that it can not health check its load balancers. Maybe in the future this will get even easier!

Benchmarking & Prewarming Amazon ELBs

One of the things we discovered when benchmarking our improvements to OpenX 2.6 – is that it is actually very difficult to do so on EC2. How I assume ELBs work inside Amazon, is that they are built on top of EC2 instances, and you start off with one EC2 instance per availability zone you have selected. The load balancers are then load balanced themselves via DNS round-robin. This allows Amazon to treat every AZ as physically isolated without cross-talk interdependencies.
So now the part where I said it is difficult:

  • If you fire traffic at your load balancer in a naive way – what you will often find is that you always hit just a single load balancer in one availability zone. This seems to maybe max out at 20K requests/minute even if you have sufficient capacity behind the balancers.

  • Even if you fire traffic from multiple locations to get around the cached DNS result, it still starts off scaled down. Like I said above, I think you start off with one EC2 instance per availability zone selected. Amazon seems to employ their own auto scaling to detect how much capacity you need and expand the resources based on this. From my anecdotal evidence you should expect this to take 30 minutes to 1hr.

We went live in December 2011 with our OpenX 2.6 changes actually knowing/having discovered this pre-warming limitation, but expecting it to be closer to 20 minutes (we were on a short deadline running out of capacity in our data center). It was a test of nerves to say the least.
What I know now, is that you don’t have to take the hit at all. All you need to do is buy support with Amazon, and then open a ticket and ask them to manually scale up to be able to handle X requests. They will ask you to specify a timeframe you need this manual scaling for (since they don’t like to keep things in manual mode), but other than that this avoids all the pains I spoke of. Fast forward to 2012, and I managed to serve 242K requests/minute peak during an Apple product launch, and servers didn’t break a sweat.

Scaling OpenX Click and Impression Logging

By coincidence, my last two jobs I’ve been in have seen me administer a very high traffic ad server. Or maybe not so much a coincidence, since InnoDB is a great OLTP database with a focus on consistent response time – not just raw throughput. It is also very well optimized for the simpler style queries (primary key lookups, simple secondary key etc) which seems to be all that is needed to serve ads.
Anyway, in 2011/2012 I found myself managing OpenX – and discovered some quirks with it.

OpenX 2.6

When I first inherited an OpenX 2.6 install, it was configured in a mode called distributed statistics. What this means is that you have one central database server, and then a series of slave MySQL servers. The impression and click recording web servers write to a slave server, which then has a cron that batches up and writes to the master database server. The master database server always has a copy of all the data – and has a cron that runs a summarization of the data to feed reports.
What I didn’t like about this design is that the master always required all of the data anyway. so the slaves weren’t necessarily adding any capacity. What they were doing however was not without benefit – they were a poor man’s message queue. As data arrived there was some opportunity for flow control so your peaks would be smoothed out. It also enable you to optimize for throughput rather than response and put load on the master without impact to ad serving.
Message queues are a great part of any architecture – I am in love with them. However, in OpenX particular implementation, since the slaves didn’t have all the data, and the master’s summarization used temporary tables it also caused a lot of problems. We couldn’t upgrade to row-based replication because the summarization process created temporary tables with the TYPE=X syntax which had since been deprecated – locking us into an earlier version of MySQL as well.
I also profiled the code and noticed that our performance was very erratic. Sometimes requests were super fast, other times they would wait minutes. What I discovered was that the cron that ran on the slave servers was blocking – when it ran every 5 minutes (configurable) it would CHECK TABLE, then REPAIR TABLE if required! – a paranoid design that is centered around MyISAM. InnoDB has internal consistency checking with page checksums and doesn’t require this. So the first thing I did to improve OpenX 2.6 performance was comment out CHECK TABLE/REPAIR TABLE, and things were instantly much better.
When Open 2.6 was released in 2008, maybe more people had single core machines and web servers processed less traffic – so they may not have noticed exactly how hot of a lock waiting on CHECK TABLE/REPAIR TABLE was – which is a single threaded process inside MySQL. Since our slave servers sat on the web servers, we had these locks fairly well distributed, and since I don’t think the servers synced with NTP, this managed to stay functional for years because machines would fall in and out of service from the load balancer on different cycles. However, as traffic per server increased, it also got much worse.
I also changed the slave servers to not be on the web servers – but a separate tier of servers, which supported x3 web servers per slave without any issues. To me this was more elegant architecturally, because of the time consuming process of rebuilding slaves as they failed due to the master’s use of temporary tables, and that database servers tend to have uniquely different needs to web servers (more memory, faster disks, CPUs less important). However, my little architecture optimization presented me with another problem – the cron that ran on the slaves was consuming too much memory and dying!
What I discovered was that it was simply reading from a few tables, and then inserting those rows on the master (like I said above – a poor man’s message queue). But it did this in PHP and used all sorts of associative arrays that were not memory efficient. I managed to reverse engineer and change this cron to be just a simple shell script which used the MySQL CLI and was incredibly more efficient. It also meant that my CHECK TABLE/REPAIR TABLE patch was obsolete – which made me comfortable having slightly less OpenX code to maintain.
However, OpenX 2.6 was still a beast at best. The ‘monthly’ part of the maintenance script on the master couldn’t delete old impression rows as fast as new ones came in – and the total impressions table was over 1TB and caused a lot of locking inside InnoDB! So I commented out the monthly cleanup code and wrote a standalone script to constantly cleanup old versions of the data. However, my standalone script wasn’t fast enough either, so every month we would temporarily disable the slave crons and master cron while we created a new table like the previous table and then inserted just the small fraction of data we required, renaming and dropping the old table. It was very high effort to maintain, and not particularly operationally friendly.

OpenX 2.8

OpenX obviously figured out on their own that there was a problem with how they were collecting data, because with 2.8 they changed from using raw impression logging of one row per impression to something they called Bucket Based Logging . What they would do, is record only a running count of impressions per ad-zone/creative-id/time-period, and perform an insert on duplicate key update count = count+1. They still had the ability to have “distributed” statistics with the slaves having a part of the data, and the maintenance on the slaves would have a special update to the master with an insert on duplicate key update count = count + my_number.
Taking from my earlier thought process where the slaves are really just message queues, I managed to find a really non invasive way to modify the OpenX code to not write to MySQL, but instead write to a message queue – in my case ZeroMQ. I then wrote some proof of concept code (that a colleague of mine later improved) which would receive from the message queue, and then merge requests by ad-zone/creative-id/time-period and then insert directly onto the master server. Eliminating those pesky slaves!
I left before the project could go live – and from what I heard afterwards, it unfortunately never did. But when I reflect on it, I really don’t like OpenX’s 2.8 bucket based logging and from an operational perspective – I think this would have still been hire maintenance. If one slave was too slow to phone home statistics before the master ran its summary cron, you will be missing statistics. And since the on duplicate key update count = count+1 style design is not idempotent – you can’t easily regenerate statistics to fix data. You are stuck with inaccurate results.

What I would have done now

I would have kept the bucket based logging in 2.8, but cut out the message queue. I would have changed the code that records the click and/or impression to essentially be a no-op, and configured Apache so that it would write out a log file every 5 minutes in a format that records the fields I need:
{% raw %}
# /etc/httpd/conf.d/openx.conf
LogFormat “%{%Y%m%d-%H%M%S}t\t%>s\t%r\t%{X-Forwarded-For}i\t%{User-agent}i\t%{Referer}i” openxFmt
CustomLog “|usr/sbin/rotatelogs /var/log/httpd/in/openx.log-%Y%m%d.%H%M%S 300” openxFmt
{% endraw %}
Then I would have written a cron to find all log files not currently open, and then summarize them similar to how my message queue reader did + upload the log file to permanent storage (in our case Amazon S3 would have worked). I don’t think that MySQL is the correct place to store the raw data – but it’s essential to be able to reprocess as necessary:

for LOGFILE in `ls /var/log/httpd/in/openx.*`; do
if /usr/sbin/lsof -c rotatelog | grep -F $LOGFILE; then
  echo "Log file $LOGFILE is currently open.  Skipping this iteration.";
 echo "Log file $LOGFILE is unopened.  Doing stuff..";

Keeping these log files in plain text on S3 make them easy victims for Amazon EMR – which can read and process directly, and then store the results in MySQL. I would have loved to have eliminated the summarization cron that ran on the master and turned it into a hive query. Then I would have used sqoop to write the results back to MySQL. I am sure refactoring the summarization/maintenance cron to Hive would have been difficult – there were a lot of strange behaviors in it, but it would have made things idempotent, reduced a lot of load on my MySQL servers (which are not good at the complex summarization queries) and meant I could avoid the temporary tables which made replication slaves hard. I would have kept one replication slave for HA.
Also – the more I think about it, being able to keep raw log files is so important. They are incredibly easy to compress and retain, and often take up a lot less space than any (row-store) database indexed equivalent would.