Estimating MySQL's Working Set with information_schema

In an earlier post, I wrote about when MySQL loads data in and out of cache. At the end of that post, I talked about the concept of page churn and having enough memory for your working set:

“If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called “the 5 minute rule”. This means that if you load a page into memory and are going to need it again in the next five minutes – it should not need to be churned out.”

To be able to calculate this precisely, we would need to collect:
– The current number of pages in the buffer pool.
– The unique pages loaded in over a 5 minute period.
MySQL doesn’t have a hook to tell us when a page is loaded into memory, but as of MySQL 5.6 it does have an information_schema table called INNODB_BUFFER_PAGE. So while we will not be able to get the exact number, polling this table at a frequent enough interval and storing the unique results (space_id + page_id) in a temporary table should get close.
So I wrote a stored procedure estimate_working_set.sql to do exactly this. It accepts two arguments: sleep time and iterations (a rough sketch of the approach is included at the end of this post). Here is an example ~5 minute observation on my laptop:

mysql> call test.estimate_working_set(10, 30);
+----------+
| progress |
+----------+
| 1/30     |
+----------+
1 row in set (10.13 sec)
+----------+
| progress |
+----------+
| 2/30     |
+----------+
1 row in set (21.19 sec)
.. lines omitted for brevity ..
+----------+
| progress |
+----------+
| 29/30    |
+----------+
1 row in set (5 min 41.97 sec)
+----------+
| progress |
+----------+
| 30/30    |
+----------+
1 row in set (5 min 54.72 sec)
+----------------------+
| pages_in_working_set |
+----------------------+
|               100679 |
+----------------------+
1 row in set (5 min 55.61 sec)
Query OK, 0 rows affected (5 min 55.71 sec)

So in my case my working set is 100679 pages. I am not using compressed pages, so each page is 16 KiB; dividing by 64 converts pages to MiB: 100679/64 ≈ 1573 MiB of working set. I have a 128M buffer pool – so I desperately need to increase it or add more memory.
Warning! This stored procedure will be expensive on a server with a large buffer pool. Use at your own risk!
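For reference, here is a rough sketch of how such a procedure can be put together. This is not the actual estimate_working_set.sql (the procedure and table names below are just illustrative), but it shows the idea: poll INNODB_BUFFER_PAGE on an interval and de-duplicate the observed page addresses into a temporary table.

DELIMITER $$
CREATE PROCEDURE test.estimate_working_set_sketch(IN sleep_seconds INT, IN iterations INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  -- collect every unique (space, page) address observed in the buffer pool
  DROP TEMPORARY TABLE IF EXISTS test.seen_pages;
  CREATE TEMPORARY TABLE test.seen_pages (
    space INT UNSIGNED NOT NULL,
    page_number INT UNSIGNED NOT NULL,
    PRIMARY KEY (space, page_number)
  ) ENGINE=MEMORY;
  WHILE i < iterations DO
    -- INSERT IGNORE de-duplicates against pages we have already seen
    INSERT IGNORE INTO test.seen_pages (space, page_number)
      SELECT SPACE, PAGE_NUMBER FROM information_schema.INNODB_BUFFER_PAGE;
    SET i = i + 1;
    SELECT CONCAT(i, '/', iterations) AS progress;
    DO SLEEP(sleep_seconds);
  END WHILE;
  SELECT COUNT(*) AS pages_in_working_set FROM test.seen_pages;
END$$
DELIMITER ;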

Optimizing IN Subqueries in MySQL 5.6

I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.

Typical IN subquery

This is a very typical query developers run, which historically has performed very poorly on MySQL:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Notice that in MySQL 5.6 – the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize this as a constant, and instead executed this as a DEPENDENT SUBQUERY (aka Correlated subquery) for each row it found in the city table (an estimated 3984 rows)!
MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY,Name
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ref
possible_keys: PRIMARY,Name
          key: Name
      key_len: 52
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

The index doesn’t affect MySQL 5.5, which still executes it as a DEPENDENT SUBQUERY, but take a look at MySQL 5.6 – 1 row from the Country table (from an index!) and then 1 row from the City table. This optimizes great!
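For comparison, the way most of us have historically worked around this on older versions is to rewrite the IN subquery as a join by hand. A sketch of that rewrite (safe here because Country.Code is the primary key, so the join cannot produce duplicate City rows):

-- manual rewrite of the IN subquery as a join
SELECT City.*
  FROM City
  JOIN Country ON Country.Code = City.CountryCode
 WHERE Country.Name = 'United States';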

More complex IN example

In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I’ll have to wait a bit longer:

mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
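As an aside, a commonly used workaround (which I have not benchmarked here) is to wrap the LIMIT inside a derived table, so that the IN subquery itself no longer contains a LIMIT directly:

SELECT * FROM City
 WHERE CountryCode IN (
   -- the derived table hides the LIMIT from the IN/ALL/ANY/SOME restriction
   SELECT Code FROM (
     SELECT Code FROM Country ORDER BY Population DESC LIMIT 1
   ) AS top_country
 );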

So here is my second attempt:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 219
        Extra:
3 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: NULL
3 rows in set (0.00 sec)

MySQL 5.5 could always optimize the population = scalar subquery, but not the IN subquery. Similar to the above example, I would expect the subqueries here to be unraveled as constants as well. If I add an index on population, you can really see this happen:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY,Population
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE country add index (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ref
possible_keys: PRIMARY,Population
          key: Population
      key_len: 4
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)

This is looking really good in MySQL 5.6. I had a bit of a huh? moment when trying to read what the “Select tables optimized away” in step #3 meant. This led me to try EXPLAIN EXTENDED, where I discovered a little gem:

mysql5.5.31 > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`city` where
 <in_optimizer>(`world`.`city`.`CountryCode`,
 <exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in
 country on PRIMARY where ((`world`.`country`.`Population` =
 (select max(`world`.`country`.`Population`) from `world`.`country`)) and
 (<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`)))))
1 row in set (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city`
where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`)
and (`world`.`country`.`Population` = (/* select#3 */
select max(`world`.`country`.`Population`) from `world`.`country`)))
1 row in set (0.00 sec)

EXPLAIN EXTENDED (followed by SHOW WARNINGS) shows the approximate query that MySQL is going to execute after the optimizer has applied its optimizations and transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in the query execution (the IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:

mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

I believe what’s happening here is that during optimization MySQL opens the Population index and looks at the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I’ve seen it do this before; here is a more common example of this cheating happening:

mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra:
1 row in set, 1 warning (0.00 sec)
mysql5.5.31 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,
'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS
 `CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city`
WHERE 1
1 row in set (0.00 sec)

Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has EXPLAIN FORMAT=JSON, which looks much nicer to me and includes more information:

mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "country",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "Population"
          ],
          "key": "Population",
          "used_key_parts": [
            "Population"
          ],
          "key_length": "4",
          "ref": [
            "const"
          ],
          "rows": 1,
          "filtered": 100,
          "using_index": true,
          "attached_condition": "(`world`.`country`.`Population` = (/* select#3 */
          select max(`world`.`country`.`Population`) from `world`.`country`))",
          "attached_subqueries": [
            {
              "dependent": false,
              "cacheable": true,
              "query_block": {
                "select_id": 3,
                "table": {
                  "message": "Select tables optimized away"
                }
              }
            }
          ]
        }
      },
      {
        "table": {
          "table_name": "City",
          "access_type": "ref",
          "possible_keys": [
            "CountryCode"
          ],
          "key": "CountryCode",
          "used_key_parts": [
            "CountryCode"
          ],
          "key_length": "3",
          "ref": [
            "world.country.Code"
          ],
          "rows": 1,
          "filtered": 100
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

While it’s possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn’t hurt either 😉 I’m really looking forward to what tools can be built to take advantage of the JSON explain output.

Review of SQL Performance Explained by Markus Winand

I picked up a copy of SQL Performance Explained last week, after having been a long time fan of Markus’ site Use The Index, Luke!.
What I love most about use-the-index-luke is the 3-minute test. Seriously – try taking it!
Anyway, here is what I have to say about the book:

SQL Performance Explained
  • Quality : The book is not published by a major publisher, so I was not sure if I should expect a bound stack of photocopies. To my delight, it is the same quality as any other book, and clearly has a copy editor. I didn’t notice any grammatical or copy errors.

  • Length : It is a short book at 164 pages + 28 pages of appendixes. I personally love this – because the author has focused on what is important, and not added introductory chapters under pressure from a publisher to expand the page count. I kept wondering if it could be smaller (in an effort to make sure every developer on a team reads it), but I don’t think this is reasonable either. I thought it was smart in the preface to say that the book only covers B-tree indexes, and the length was therefore “about right”.

  • Content : Very high quality content – it is simply amazing that one author has this depth of knowledge on 4 different database vendors (Oracle, SQL Server, MySQL and PostgreSQL). I like the organization which is largely around statement level behavior, and keeps it a developer-focused book. I also like that the first chapter dives right into the thick of it by describing the anatomy of an index. The theory aspect of this chapter may be a little bit difficult for developers with less of a low-level background to grasp in one read – but I couldn’t think of any other way around it. It covered a good/appropriate level of detail.

  • Small Criticisms : Sometimes it is difficult to be generic in advice, and keep in mind that this book covers Oracle, SQL Server, MySQL and PostgreSQL. I think the book did pretty well – but I did notice an example in Chapter 2, where it describes an index range scan (p39) that I know is not possible until Index Condition Pushdown in MySQL 5.6. It did warn that composite indexes should be in the form const, range, but for MySQL the wording probably should have been a little stronger. I don’t want you to take this criticism too seriously, either. If the author were to give database-specific and version-specific advice, it would make for a very dull read.

Overall:
As the author himself says, SQL does a good job at separating out what data is needed from how data will actually be retrieved. This makes performance characteristics not always clear cut.
I think that for a development team building database applications, everyone understanding these basic characteristics is like a chef knowing how to cook an omelette: it is such a critical fundamental that it will never go away.
If there are developers that aren’t getting 5/5 on the Use the Index, Luke! 3-minute-test, then I can’t think of a better resource to get them there. The book is a short read, and to the point!
Final rating: 4.5 stars.
Buy the book online from sql-performance-explained.com.

Experimenting with MySQL 5.7

I was playing around with MySQL 5.7 this weekend and before having read the changelog, I managed to spot these two little gems.

Duplicate Indexes

“The server now issues a warning if an index is created that duplicates an existing index, or an error in strict SQL mode.” Bug #37520

Example Testcase:

mysql> SHOW CREATE TABLE city\G
*************************** 1. row ***************************
       Table: city
Create Table: CREATE TABLE `city` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Name` char(35) NOT NULL DEFAULT '',
  `CountryCode` char(3) NOT NULL DEFAULT '',
  `District` char(20) NOT NULL DEFAULT '',
  `Population` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ID`),
  KEY `CountryCode` (`CountryCode`),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `Country` (`Code`)
) ENGINE=InnoDB AUTO_INCREMENT=4080 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql> ALTER TABLE city add index (countrycode);
ERROR 1831 (HY000): Duplicate index 'CountryCode_2' defined on the table 'world.city'.
This is deprecated and will be disallowed in a future release.

Pretty cool – I know this previously caught a lot of people.
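For servers that are not yet on 5.7, a rough information_schema query can flag indexes that cover exactly the same column list. This simple sketch only catches exact duplicates (it misses prefix-redundant indexes; a tool like pt-duplicate-key-checker does a more thorough job):

SELECT TABLE_SCHEMA, TABLE_NAME, cols,
       GROUP_CONCAT(INDEX_NAME) AS duplicate_indexes
  FROM (
        -- one row per index, with its full column list in key order
        SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
               GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS cols
          FROM information_schema.STATISTICS
         GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
       ) AS per_index
 GROUP BY TABLE_SCHEMA, TABLE_NAME, cols
HAVING COUNT(*) > 1;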

Control-C support in the client

“Previously, Control+C in mysql interrupted the current statement if there was one, or exited mysql if not. Now Control+C interrupts the current statement if there was one, or cancels any partial input line otherwise, but does not exit.” Bug #66583

Example Testcase:

mysql> this is a test
    -> test
    -> test
    -> ^C

So if I want to quit, I can now control-C then type “quit”. This is much more intuitive.

Twice as much RAM != Twice as Fast

We use Amazon RDS for database servers and historically have had a practice of running smaller database servers in QA and DEV to save cost. I would consider this a pretty common practice, but what I have also discovered is that it can lead to fixing non-bugs, or issues that will not arise in production.
Let me try to describe an approximate worst case and some simple math to illustrate:
Production
– 10G of RAM
– 100 IOPS Storage Device (single hard disk)
– Workload is 100% reads
– 10K queries/second
QA
– 9G of RAM
– 100 IOPS Storage Device (single hard disk)
– Same workload (100% reads)
– Can only do 1K queries/second!
Production is serving its 10K queries/second with a 100% cache hit rate, but with the 1G less RAM, QA is only able to serve with a 99% cache hit rate. Each query touches on average 10 distinct pages. So:

10K queries/second * 10 pages per query
= 100K pages/second
100K pages/second * 0.01 cache miss rate
= 1000 IOPS required to cover misses.

.. since we only have 100 IOPS, we can only serve 1/10th of the queries!
In our case it was more like production having 68G of RAM and QA having 34G. The query volume itself was also not necessarily as high, but the queries all needed to access a very large number of distinct pages, causing the same cache misses and page churn.

What MySQL is really good at!

With databases, choices of algorithms influence performance characteristics significantly. It is often very easy to find a situation where one database will perform much worse than others, which is why you will hear database engine developers cry out that benchmarks are flawed.
The best benchmark is one that closely matches what your application actually does, which is why you see the TPC create benchmarks to match hypothetical situations – like a warehouse that has inventory arriving and being shipped out all the time. These situations in turn create “workloads” on the database. To give some context, a workload may perform differently on a different database engine because of how many concurrent modifications are required, the ratio of reads/writes, how much data is modified in a transaction, and where in the data set the reads and writes are (are there hot records or hot tables?). These are just examples – in reality there are too many factors to list.
Which gets me to my main point – what workloads is MySQL with the InnoDB storage engine really good at? In answering this, I really want to try to focus on core data structures and algorithms more than specific features that may or may not be implemented.

Total data size may exceed memory but working set does not

If you have, for example, 100G of data, you do not necessarily require 100G of RAM. MySQL will adapt and keep only the most frequently accessed data (the working set) in memory. This behavior is a result of using a B+Tree index. To simplify, compare this to memcached, where the index is a hash: it really requires as much memory as there is data.
(Note: Memcached of course addresses this by allowing you to distribute the index across many nodes, and hash indexes are generally smaller, so please don’t miss the key point I am making here.)

Modifications are small and transactions are short

When you modify a row in InnoDB, the old version of it is relocated to undo space, and the ‘new’ version is written in place (implementing MVCC). When all existing transactions have finished with the old version of the row, along comes a purge process.
Not all databases implement MVCC this way. For example, another way of implementing MVCC is to just append/write new rows in the same table-space. The advantage of the relocation method is that, if there is sufficient memory to keep the undo pages in memory (and they are not required for very long), the relocation may be a logical relocation only, i.e. no physical IO relocation has to happen. However, in the case that there is not enough memory and the undo pages need to be evicted from the buffer pool, then I could see this potentially taking more effort. I believe rolling back large transactions may be more effort with InnoDB’s design as well – although I have not put much thought into it.

Primary key lookups are fast

InnoDB implements a Clustered Index on the primary key. Clustering in this case refers to a close arrangement/organization on disk, not multiple compute instances working together. This makes primary key lookups in InnoDB particularly fast – although admittedly probably not as fast as databases that implement a hash index.

Indexed ranged lookups are supported

Some index structures are only suitable for fixed primary key lookups, or emulate other types of access by pre-computing a cache of values and packing them into one row/column. Because InnoDB uses a B+Tree index, it is very flexible in being able to support ranged lookups on primary or secondary keys (example: find rows BETWEEN x AND y). These indexes are also pre-sorted, which (depending on the query) can allow the rows to be efficiently delivered as they are read rather than buffered first.
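As a small illustration using the world sample database from the earlier posts, both of these are simple index range scans, the first on the clustered primary key and the second on the CountryCode secondary index:

-- range on the clustered (primary key) index
SELECT * FROM City WHERE ID BETWEEN 100 AND 200;

-- range on a secondary index; rows come back pre-sorted by CountryCode
SELECT * FROM City WHERE CountryCode BETWEEN 'USA' AND 'UZB';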

Query performance is highly uniform

InnoDB is optimized to provide stable response times to user-serving applications, rather than just peak throughput in queries per second. There are a number of features which help deliver this: converting as much IO as possible to background work via the transaction log files, adaptive flushing to make sure that background work doesn’t fall behind, even the implementation of MVCC I described above.
InnoDB also supports covering indexes, which means that queries that do not SELECT * sometimes have a chance to retrieve all of their data from a secondary index. In cases where not all data fits in memory, covering indexes can be critical to uniformity of performance, because the number of unique pages that need to be looked at is considerably lower.
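A quick illustration of a covering index, again with the world sample database: because every secondary index entry also contains the primary key, this query should be answerable from the CountryCode index alone, and EXPLAIN should report “Using index” in the Extra column:

-- ID (the primary key) is stored in every secondary index entry,
-- so the CountryCode index covers this query by itself
EXPLAIN SELECT ID, CountryCode FROM City WHERE CountryCode = 'USA';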

Table has only a primary key and non-unique secondary keys

So I mentioned already that the total data size can exceed main memory, but InnoDB also has a novel mechanism to delay INSERT/UPDATE/DELETE modifications to secondary indexes if the pages are not loaded into memory. What it will do is buffer the changes in a secondary structure, and then make the modification the next time the page is loaded into memory, or in the background provided the server is not under load. This feature only works when the index is not unique, because otherwise it is not safe to make a modification without knowing whether it violates a constraint. Ideally, rows are also inserted in primary key order.
(In previous versions of InnoDB, this feature was described as the insert buffer).
..
What am I missing? Please try and focus on the algorithms not the features, and the positive not the negative!

When does MySQL perform IO?

In my previous post, I wrote about when data gets loaded in and out of cache. The goal of caching is to avoid having to perform disk IO; if we are talking about hard drives, each operation is about 20 million times slower than CPU operations, or about 100 thousand times slower than memory.
So today I want to cover the question of when we are required to do IO. But before I get there, I first have to describe how IO works on all our modern operating systems.

An Introduction to Buffered IO

When you read or write to a file, by default you are speaking to a layer in between you and your disks called buffered IO. This layer is designed to increase performance and adds important features such as caching. Without going into this in detail, it’s important to note that:

  • When you READ data, you may be reading from the disk or a cache provided by the operating system.

  • When you WRITE data you are not writing to the disks, but to a queue which will be flushed in the background. This queue is a little bit different to how a message queue in an application may behave: it usually has a hard limit on the number of outstanding requests, and it does not always complete requests in purely FIFO order, but in an optimized order that still makes sure there is a maximum time window a request can take (example).

    • WRITE operations are not crash-safe by default! The operating system does provide a way of syncing a file to make sure all of the buffers are flushed down to the disk: an fsync.

    • An fsync is usually very expensive because it forces the queue to empty, and reduces the amount of request reordering and merging. Sequential IO is really important for hard drives.

So now let’s look at the IO created by a MySQL server using the InnoDB storage engine (the default since MySQL 5.5).

Reading pages into memory

If a page is not located in cache already, InnoDB will load it into memory in the foreground while queries are waiting, unless the page loading was triggered via a read ahead operation in which case it will use one of its innodb_read_io_threads to make this happen in the background.

Writing modified pages back to disk

When you make a modification to an InnoDB page, InnoDB does not write it immediately to disk. Instead a delta is written to the transaction log (see below).
These modified-only-in-memory pages are typically referred to as dirty pages. Flushing dirty pages happens as part of InnoDB’s background process, which continually performs work every second and adjusts well to peaks and troughs in load thanks to a feature introduced in MySQL 5.5 and further improved in MySQL 5.6 called adaptive flushing. However, if there are too many dirty pages and there is not enough free space, InnoDB may be forced to synchronously flush a dirty page to make free space.
This delaying allows InnoDB to perform fewer random writes, since there will usually be multiple changes to the same page, or (by default) to pages in the same extent (a 1MB allocation unit on disk). MySQL 5.6 also allows you to disable this for SSDs.
(Note: Domas makes a comment about the doublewrite buffer below, and how I might be over-selling request merging 🙂 It’s outside of the scope covered here, but described in Facebook’s blog and the MySQL manual).

Writing the transaction log

Any time you modify an InnoDB page, a record of the modification is made in InnoDB’s transaction log system (ib_logfile1 and ib_logfile0 on disk). This transaction log is designed to be written sequentially, and unless the server crashes it will only ever be written to, never read.
The log files are cyclical by design. You can think of them as a concatenation of two files (by default) that behaves like the tread on a tank: the first one is filled, then the second, then InnoDB starts filling the first again. The writes to the transaction log are 512B-aligned and use the operating system’s buffered IO. InnoDB actually relies on the operating system here to provide a cache of the log (which is not guaranteed). Since these writes are 512B and a typical page is 4K, without enough cache the page will need to be read first, partially modified, and then written back.
To reduce log file IO, InnoDB also has a log buffer (configured by innodb_log_buffer_size) that is filled with pending modifications which will be written and synced on any COMMIT by default.
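The sync-on-COMMIT behaviour is controlled by innodb_flush_log_at_trx_commit. A quick sketch of inspecting it, and of the trade-off if you relax it (risking roughly the last second of commits on a crash):

-- 1 (default): write and fsync the log on every COMMIT (fully durable)
-- 2: write on COMMIT, fsync roughly once per second
-- 0: write and fsync roughly once per second
SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 2;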

Writing the slow log and general log

The general log is written out as each event is generated. It is a sequential write, and is never synced and never read back (so it should be fairly low cost).
The slow query log is written out after query execution has finished. Similar to the general log, it should also be very low cost.
Why I say “should” is that this also depends on filesystem choice. An fsync under ext3 does not flush one file, but all files. This has become known as the great fsync() bug. Also, the operating system will usually have a limit on the amount of modifications queued up waiting to be written, so too much log activity could cause synchronous writing and impact other operations (see Domas’ comment below).

IO on a Replication Master (aka binary log enabled)

The binary log is sequential IO, written as part of a COMMIT operation, but by default it is never synced to disk, so it is not crash-safe! Making it crash-safe requires setting sync_binlog=1.
The binary log does not have its own in-memory buffers, so it is possible that if an out-of-date slave comes online, the master may need to read old binary log files in order to send it previous events. However, in practice the master pushes changes to online slaves as they are committed, so it should be write-only.
The server maps changes in the binary log to InnoDB’s transaction log to ensure consistency by default.
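For example, making the binary log durable is a one-line change (with a measurable fsync cost per commit):

-- fsync the binary log as part of every commit
SET GLOBAL sync_binlog = 1;
-- or equivalently, set sync_binlog = 1 in my.cnf under [mysqld]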

IO on a Replication Slave

The relay logs are in the same format as the binary logs, but the IO semantics are a little different. The slave’s IO_THREAD retrieves new events from the master and writes them sequentially to relay log files on the slave. The SQL_THREAD then reads the changes from the relay logs and applies them. This read operation does not have any cache inside MySQL to ensure no IO is performed; instead it relies on the operating system’s buffered IO to assist. It’s possible to decrease the number of relay logs kept on the slave to help them fit in cache, but this also means that if the master fails there may be changes the slave was not aware of.
Writing the relay logs is not crash-safe by default, and requires setting sync_relay_log=1.
Also, the status files which maintain the mapping between the relay logs and the master’s binary logs (which is not 1:1) are not crash-safe. A new option in MySQL 5.6 allows you to use an internal table to maintain these positions for added safety, but it is not enabled by default. The new options are master-info-repository and relay-log-info-repository.
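A sketch of the 5.6 settings for a more crash-safe slave (the repository options store the position information in InnoDB tables instead of the status files; changing them requires the slave threads to be stopped, and all of this can equally go in my.cnf):

STOP SLAVE;
-- store master/relay log positions in crash-safe InnoDB tables
SET GLOBAL master_info_repository = 'TABLE';
SET GLOBAL relay_log_info_repository = 'TABLE';
-- fsync the relay log as it is written
SET GLOBAL sync_relay_log = 1;
START SLAVE;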

Summary

I am sure I made a couple of simplifications (for example not describing the doublewrite buffer), but hopefully it’s a good start. There are a couple of key take away points I want to make:

  • Reads are foregrounded (except read-ahead) and don’t provide the same opportunity to reorder and merge – so that makes them typically random IO and very expensive. If you run a diagnostic command like iostat and see a lot of reads on a warmed-up server, this is something adding memory can usually solve.

  • Most writes are sequential or backgrounded. Both InnoDB and the operating system try to produce sequential IO and merge requests.

  • While InnoDB has always been crash-safe by default, MySQL has not. There are new features in MySQL 5.6 which make replication more reliable.

When does MySQL data get loaded in and out of cache?

A cold cache, or a poorly tuned cache can be responsible for a number of performance problems. If we look at the data and indexes of InnoDB, the cache responsible is called the InnoDB buffer pool.
In earlier versions of MySQL the default value for the setting innodb_buffer_pool_size was 8M, which is a little out of date when you consider that the recommended value is 50-80% of system memory. In MySQL 5.5 the default value was increased to 128M, which is a compromise made for users that may install MySQL on a development machine rather than run it as a dedicated server. In production, it is unlikely that you will buy a new server with less than 64GB of RAM, so it is quite typical for this setting to be 50GB+.
So let’s talk about the behaviour of the InnoDB buffer pool.

Up until and including MySQL 5.5

When MySQL starts up, the buffer pool is completely empty. As queries are executed, MySQL will determine which pages are required – and if they are not in memory, they will be loaded from disk.
Initially, performance will be very poor – since there will be 100% cache misses, but over time as the buffer pool fills this should reduce.
Provided there is enough memory – MySQL will just keep loading pages into cache, and warm to a state where it will not have to perform any reads to the disk at all. Only writes will go to disk.

From MySQL 5.6 onwards

Because server warm time is becoming a more serious problem (it always existed, but it is exacerbated by the fact that we now have an abundance of RAM, while hard drives are not really any faster), Oracle introduced a feature in MySQL 5.6: buffer pool warming.
What it does is, either on demand or on shutdown, save the addresses (space_id + page_id) of the InnoDB pages in the buffer pool to permanent storage.
Upon startup, these pages can then automatically be loaded back into memory straight away, so that we suffer cache misses for less time.
This feature is not turned on by default, but I suspect in a future version it may be. There are a number of other advantages to pre-warming such as being able to sort and merge read requests and load the data in much faster. Also, since it is only the addresses of pages being saved it only takes 8 bytes to point to each 16KB page, and there are no risks if there have been modifications since the last address saving operation ran.
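A sketch of how this can be enabled in 5.6 (the at-shutdown/at-startup options belong in my.cnf, since innodb_buffer_pool_load_at_startup is not dynamic; the _now variables trigger a dump or load on demand):

-- trigger a dump of the current page addresses, or a reload, on demand
SET GLOBAL innodb_buffer_pool_dump_now = ON;
SET GLOBAL innodb_buffer_pool_load_now = ON;
-- and in my.cnf under [mysqld], to make it automatic:
--   innodb_buffer_pool_dump_at_shutdown = ON
--   innodb_buffer_pool_load_at_startup  = ON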

Cache evictions

So we’ve described the “best case” so far, which is that data only gets loaded into cache and performance keeps getting better and better. However in practice, our cache is never unlimited. When our buffer pool gets to the point of being completely full it will need to free space in order to load new pages into memory. The behavior of this ‘eviction’ is version specific.

Up until MySQL 5.5

InnoDB previously used a classic Least Recently Used (LRU) algorithm. What this means is that each page has an order in a list. I like to think of it as a “Hot or not” score. Each time a page is accessed it gets promoted up the list to be considered “more hot”. When it comes to freeing space, InnoDB just picks the least hot page and frees it from memory to make space for the next page.
In typical operation MySQL will not free pages from memory unless it needs to make space for another page. I say ‘typical’, because pages will be freed if a table is dropped for example.

MySQL 5.5 onwards

There are a number of operational limitations with using a classic LRU implementation on a database which has many users, and typically more than one workload. For example, let’s say that everything is working fine and we’ve figured out what is “hot or not”, noting that we do not need as much memory as we do data because in most cases there will be pages that are not accessed frequently. Now imagine that a series of queries comes in from mysqldump wanting to run table scans to export the data. What can happen is that the pages loaded in by the table scans push out some of the good pages in our buffer pool, and even worse, as soon as the mysqldump operation is complete they will no longer be required. So now, post-mysqldump, we have random IO as we try to re-settle the cache contents again.
So in MySQL 5.5 the LRU was split into two sublists:
– A ‘young’ sublist for our frequently accessed pages (default: 63%)
– An ‘old’ sublist for our table scan page accesses (default: 37%)
The configuration variable innodb_old_blocks_pct was introduced to configure the old/new split, and a variable innodb_old_blocks_time (default in 5.5: 0, in 5.6: 1000) was introduced to specify a minimum number of milliseconds that a freshly loaded page must stay in the old sublist before it is eligible to be promoted to the young sublist.
As mentioned in the manual and in blog posts, it frequently makes sense to configure innodb_old_blocks_time to a larger value, for example 1000 (ms).
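Both variables are dynamic, so this can be experimented with on a running server, for example:

SET GLOBAL innodb_old_blocks_time = 1000;  -- ms a newly read page must wait in the old sublist
SET GLOBAL innodb_old_blocks_pct  = 37;    -- percentage of the LRU given to the old sublist (the default)
SHOW GLOBAL VARIABLES LIKE 'innodb_old_blocks%';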

A note on page churn

If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called “the 5 minute rule”. This means that if you load a page into memory and are going to need it again in the next five minutes – it should not need to be churned out.

OPTIMIZE/CHECK/REPAIR/ANALYZE TABLE InnoDB Edition

I find a good interview question for a MySQL DBA position is to ask what the following commands actually do in InnoDB, which has been the default storage engine since MySQL 5.5. From my perspective there is a lot of misunderstanding about what still applies.

ANALYZE TABLE

From the MySQL manual:

ANALYZE TABLE analyzes and stores the key distribution for a table. During the analysis, the table is locked with a read lock for InnoDB and MyISAM.

What this means is, as part of query optimization MySQL will often have to decide which is the best index if there are multiple candidates, which indexes should be avoided, and what order should tables be joined in. Indexes need to eliminate work – so if for example you were trying to index a column called “Country” in a table full of all people in the USA, then it would be faster to avoid that index.
What is also important to note is that InnoDB will often update these statistics internally without you needing to do so, and that the ‘read lock’ that the manual describes is kind of weird (shameless plug for a bug I filed back in 2007).
Up until 5.6, statistics are in memory only and very coarse, sampling just a small amount of data. With 5.6, there is the new Persistent Optimizer Statistics enhancement, innodb_stats_auto_recalc is a variable, and STATS_PERSISTENT, STATS_AUTO_RECALC, and STATS_SAMPLE_PAGES are configurable per table as CREATE TABLE options. It’s a shame features like this don’t get enough press.
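A sketch of the per-table syntax in 5.6 (the table itself is just a made-up example):

CREATE TABLE metrics (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  recorded_at DATETIME NOT NULL
) ENGINE=InnoDB
  STATS_PERSISTENT=1     -- persist statistics in mysql.innodb_table_stats / innodb_index_stats
  STATS_AUTO_RECALC=1    -- recalculate automatically as the table changes significantly
  STATS_SAMPLE_PAGES=50; -- sample more pages than the default for better estimates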
I still find myself using ANALYZE TABLE as part of debugging slow queries, to confirm that stale statistics are not causing a problem, but I don’t find any routine operational use case for it.
Pro-tip: My explanation above, with a “Country” column filled entirely with “USA”, was a bit of a simplification, since not all optimizer decisions will use these pre-computed statistics. I believe in many cases, if we are just talking about a single table (and not trying to determine join order), the optimizer will just ask InnoDB to estimate how many records there are in a given range. I am hoping these sorts of behaviors will be more exposed in future MySQL versions. I was very happy to see EXPLAIN for UPDATE and DELETE statements in MySQL 5.6 – and in some cases the new JSON EXPLAIN format seems to show more information.

CHECK TABLE

CHECK TABLE to me is a MyISAM-ism. As noted in the manual:

If CHECK TABLE finds a problem for an InnoDB table, the server shuts down to prevent error propagation. Details of the error will be written to the error log.

InnoDB is an ACID compliant (durable) data store, so it doesn’t have the same inconsistency situations that MyISAM does. The behavior that the manual is describing here is the same behavior that happens if the server detects corruption through standard operation – which is detected via CRC checksums on each page it stores.
As Oli writes, CHECK TABLE probably won’t work on very large tables (> 200-400 GB), so this command does not really have any practical use for me.

REPAIR TABLE

From the MySQL manual:

REPAIR TABLE only applies to MyISAM, ARCHIVE, and CSV tables. See Section 14.3, “The MyISAM Storage Engine”, and Section 14.6, “The ARCHIVE Storage Engine”, and Section 14.5, “The CSV Storage Engine”

So no use for this either.

OPTIMIZE TABLE

For InnoDB, the wording in the MySQL manual and the error message is very specific on this one:

For InnoDB tables, OPTIMIZE TABLE is mapped to ALTER TABLE, which rebuilds the table to update index statistics and free unused space in the clustered index. Beginning with MySQL 5.1.27, this is displayed in the output of OPTIMIZE TABLE when you run it on an InnoDB table, as shown here:

mysql> OPTIMIZE TABLE foo;
+----------+----------+----------+-------------------------------------------------------------------+
| Table    | Op       | Msg_type | Msg_text                                                          |
+----------+----------+----------+-------------------------------------------------------------------+
| test.foo | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| test.foo | optimize | status   | OK                                                                |
+----------+----------+----------+-------------------------------------------------------------------+

So what this means is that internally MySQL creates a new table, much like the existing table, and then trickle-loads the data into it one row at a time. For the clustered index (aka primary key) this may result in a significant space saving if data was inserted out of order, or if there have been modifications which have left gaps that affect the fill factor. It’s important to note that some gaps are expected, as InnoDB only fills pages 15/16ths full.
The secondary indexes will be trickle-loaded one row at a time in the order of the clustered index, which may result in them going straight back to being fragmented anyway. InnoDB’s implementation of MVCC does keep multiple versions in the secondary indexes however, so it is possible that gaps may be reclaimed here.
In MySQL 5.5, InnoDB introduced a feature called “fast index create” that could create these secondary indexes more optimally by presorting the data first and then creating the index. However, this feature is not tied into OPTIMIZE TABLE (yet) in official Oracle MySQL releases. See Bug #57583.
Due to its massively high cost, I find I run OPTIMIZE a lot less often than other people (read: almost never), and Baron Schwartz even has a humorous way of describing it:

.. like something you’d hear from a naive Windows user who buys a $99 piece of software to make his PC “boot faster” or “fix his registry” or something.

Again, this post being related to InnoDB – OPTIMIZE was important for MyISAM Dynamic tables. Read the manual for more.

Fastest way to estimate rows in a table

A friend wrote to me recently with a question. He was working on a method to ship application metrics to statsd on a 1 minute interval. He had three examples of how to estimate the number of rows in a table and he wanted to know the difference between them.

Data length/average row length

The example given:

mysql> select DATA_LENGTH/AVG_ROW_LENGTH from INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+----------------------------+
| DATA_LENGTH/AVG_ROW_LENGTH |
+----------------------------+
|              10497541.7528 |
+----------------------------+
1 row in set (0.03 sec)

I have actually never thought of using this method! I don’t think it’s accurate though, since data length includes deleted space plus additional preallocated or overhead space. For example, a page is only filled 15/16ths full in InnoDB. So as you can see, the number it returns is just over 10% higher than the actual number of rows (9441296).

Table rows from Information Schema

The example given:

mysql> SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'line_items';
+------------+
| TABLE_ROWS |
+------------+
|    9800078 |
+------------+
1 row in set (0.03 sec)

This method takes a number that InnoDB provides, which in this case is accurate to within 4% of the actual number of rows (estimating slightly high in this case). So the question is really about how cheap this already-provided count is, and where it comes from.
InnoDB samples a number of random pages in the table, and then estimates the total rows for the whole table. By default this sampling is just 8 pages in MySQL 5.5 (configurable via innodb_stats_sample_pages), but is greatly improved in MySQL 5.6 – with 20 pages sampled by default. The option is now called innodb_stats_persistent_sample_pages – a reference to the new persistent statistics feature!
So based on it being a fixed number of pages to examine, it is also going to scale reasonably with table growth. Pro tip: It is quite possible it may look much slower as soon as the table does not fit in memory, since 8 random pages could mean > 8 random IOs.

Select count(1)

The example given:

mysql> SELECT COUNT(1) FROM line_items;
+----------+
| COUNT(1) |
+----------+
|  9441296 |
+----------+
1 row in set (2.03 sec)

This requires an index scan of the primary key. It’s important to explain why that is, since this behavior differs from MyISAM. InnoDB supports MVCC, an important feature that allows concurrent access to rows without readers having to set locks that block other users from writing. In a practical sense, what this feature means is that at any one point in time there may be multiple versions of a row. The actual count(1) will depend on when your transaction started, and on its isolation level.
This solution will not scale well as the number of rows in the table grows, and while storage-engine development is outside my expertise, I suspect it is unlikely that this will be improved in future MySQL versions. My reasoning is that I cannot see an easy way of maintaining multiple pre-computed counts without introducing new global locking or overhead, which is a big no-no for being able to scale on multiple cores/CPUs.
Finally, of the three solutions, this is the only 100% accurate method to be able to tell the exact number of rows in the table.