Deciding whether or not to make MySQL durable

Today I had intended to talk about my approach and thought process in configuring a fresh MySQL server running 5.6 with the InnoDB storage engine. However, it didn't quite work out that way. I managed to get side-tracked by decision #0 - whether or not durability is important.

Introduction

By default, MySQL is configured to lie to you. That is that when you modify data there are no guarantees that your changes will be able to survive a catastrophic failure, or to put in simple terms: a power loss. The advantage of doing this, is MySQL is able to perform additional buffering and merging and consequently achieve better performance. However, the downside is also very clear: losing data is problematic, so you need to decide what is appropriate for your use-case.

I tend to argue that there are only two options of durability (yes or no), since you do not get to decide which transaction you lose - it could be a $10 order or a $10 million dollar order. Once you have decided you can potentially lose one transaction, in reality you can probably lose a few more.

I also warn strongly that there is no point in making MySQL durable if you can not make durability guarantees with the layers below MySQL. MySQL is not alone in lying by default - consumer hard drives have buffers which are not battery backed, as do virtualization platforms.

Durable

Frequently you will hear the example:

"We don't process credit cards so we don't need to be durable".

I tend to disagree with this, as it is too simplistic to be useful. Order processing and user management are two equally valuable examples which require durability. For example: I don't want to revoke someone's credentials only to find with a power loss my change did not take effect! I also do not like that this example omits the truth that lost DBA/developer productivity can be more expensive than hardware. Many businesses do not want to be in a situation where a failure requires post-failure investigation to handle any discrepancies.

I would also mention that the cost of providing the added IOPS durable workloads create is much lower now that we have the ability to install SSDs on our database servers. If the required performance can be achieved with durable or not durable, losing durability is a sub-optimization, not a trade-off.

So having discussed it, here are the settings to change MySQL to be durable:

  • sync_binlog=1

For slaves:
- sync_relay_log=1
- relay-log-info-repository=TABLE
- master-info-repository=TABLE

or:
- sync_relay_log=1
- sync_relay_log_info=1
- sync_master_info=1

I recommend using the first option for slaves, which will use an InnoDB table internally. This is new to MySQL 5.6.

Non-durable

It's best to describe this by case study. In the case of inPowered, we have multiple database servers, but by enlarge we are non-durable. I argue that this is the right choice for a few reasons:

  • We do not have much in the way of user data
  • Most of the data can be regenerated by re-scraping web articles
  • Most of the data is valuable in aggregate (tracking likes/shares of articles) so losing a portion is not critical.
  • I have no reason to believe that Amazon EC2 servers provide durable writes (see introduction), nor do I have the ability to do a true pull the plug test.

Settings to change to be non-durable:

  • innodb_flush_log_at_trx_commit=2

There are fewer settings to change when deciding to be non durable, since this is the default.

Conclusion

I confessed that I am currently managing a non-durable installation, but you may detect from my tone that I am by and large a durability proponent. I do not necessarily agree with 'lie by default', but nor do I hold it against MySQL - it is a common choice amongst databases vendors. I also think that users are often heavily swayed by naive benchmarks, so it is hard to take a high moral ground otherwise.

Did I miss anything? Please leave a comment.

Converting MyISAM to InnoDB and a lesson on variance

I'm about to start working through converting a large MySQL installation to InnoDB. For the particular use case durability is desired, and with MyISAM a loss of power almost certainly guarantees data loss. Durability is not available as an option.

So the normal question people ask is.. okay, what do I pay for this feature?

I think if you are looking at the workload in question I actually believe InnoDB will perform better overall. Row-level locking, multiversion concurrency control, and the correct selection of hardware (fast-disks with a RAID controller + batter backed write cache) will really help it shine.

But my problem is that InnoDB could be better in 9 situations, and worse in 1. What matters more than anything else is performance regressions. That is...

"Our Customers Feel the Variance, Not the Mean" - a key concept in Six Sigma.

The way I like to explain this, is that if we were to go to a restaurant and order the same meal - or go to Toyota and order the same car, we should receive the same product. I can't get the equivalent to a race car and you get a Trabant. Nor can we allow some queries that were already fast enough get faster, and other queries become intolerably slower.

So what I am actively trying to do is identify situations where MyISAM will be faster than InnoDB. And so far I have on my watch list:

  1. Table Scans - MyISAM data is far more compact, and typically table scans need to examine far less data.
  2. Index Scans - In InnoDB, indexes will contain multiple versions and typically be much larger.
  3. Inserts - MyISAM has a special optimization where it can just insert into the end of a table. I think even with the battery backed write cache this will be hard to beat.
  4. Single-threaded workloads - maybe there's an hourly cron that in InnoDB will take significantly longer. InnoDB really performs better in concurrent workloads.

Facebook and Percona have previously written about the importance of variance reduction as well.

I should also note that durability is not the only reason to use InnoDB. Operational tasks like backups become much easier. But that's the subject for a different post :)

Estimating MySQL's Working Set with information_schema

In an earlier post, I wrote about when MySQL loads data in and out of cache. At the end of this post, I talked about the concept of page churn, and having enough memory for your working set:

"If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called "the 5 minute rule". This means that if you load a page into memory and are going to need it again in the next five minutes - it should not need to be churned out."

To be able to calculate this in a precise way we would need to collect:
- The current number of pages in the buffer pool.
- The unique pages loaded in over a 5 minute period.

MySQL doesn't have a hook to be able to tell when a page is loaded in memory, but as of MySQL 5.6 it does have an information_schema table called INNODB_BUFFER_PAGE. So while we will not be able to get the exacty number, polling this table on a frequent enough interval and storing the unique results (space_id + page_id) to a temporary table should get close.

So I wrote a stored procedure estimate_working_set.sql to do exactly this. It accepts two arguments; sleep time and iterations. Here is an example ~5 minute observation on my laptop:

mysql> call test.estimate_working_set(10, 30);
+----------+
| progress |
+----------+
| 1/30     |
+----------+
1 row in set (10.13 sec)

+----------+
| progress |
+----------+
| 2/30     |
+----------+
1 row in set (21.19 sec)

.. lines ommitted for brevity ..

+----------+
| progress |
+----------+
| 29/30    |
+----------+
1 row in set (5 min 41.97 sec)

+----------+
| progress |
+----------+
| 30/30    |
+----------+
1 row in set (5 min 54.72 sec)

+----------------------+
| pages_in_working_set |
+----------------------+
|               100679 |
+----------------------+
1 row in set (5 min 55.61 sec)

Query OK, 0 rows affected (5 min 55.71 sec)

So in my case my working set is 100679 pages. I am not using compressed pages so each page is 16 KiB, or divide by 64 to convert to MiB. 100679/64 = 1573MB working set. I have a 128M buffer pool - so I desperately need to increase it / add more memory.

Warning! This stored procedure will be expensive on a server with a large buffer pool. Use at your own risk!

Optimizing IN Subqueries in MySQL 5.6

I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.

Typical IN subquery

This is a very typical query developers run, which historically has performed very poorly on MySQL:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Notice that in MySQL 5.6 - the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize this as a constant, and instead executed this as a DEPENDENT SUBQUERY (aka Correlated subquery) for each row it found in the city table (an estimated 3984 rows)!

MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY,Name
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

mysql5.6.11 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ref
possible_keys: PRIMARY,Name
          key: Name
      key_len: 52
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

The index doesn't affect MySQL 5.5 - which still executes as a DEPENDENT SUBQUERY, but take a look at MySQL 5.6 - 1 row from the Country table (from an index!) and then 1 row from the City table. This optimizes great!

More complex IN example

In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I'll have to wait a bit longer:

mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So here is my second attempt:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN 
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 219
        Extra:
3 rows in set (0.00 sec)

mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN 
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: NULL
3 rows in set (0.00 sec)

MySQL 5.5 could always optimize the population = scalar subquery, but not the IN subquery. Similar to the above example, I would expect the subqueries here should be unraveled as constants as well. If I add an index on population you can really see this happen:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN 
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY,Population
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)

mysql5.6.11 > ALTER TABLE country add index (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN 
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ref
possible_keys: PRIMARY,Population
          key: Population
      key_len: 4
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)

This is looking really good in MySQL 5.6. I had a bit of a huh? moment when trying to read what the Select tables optimized away step #3 meant. This led me to try using EXPLAIN EXTENDED where I discovered a little gem:

mysql5.5.31 > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`city` where
 <in_optimizer>(`world`.`city`.`CountryCode`,
 <exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in 
 country on PRIMARY where ((`world`.`country`.`Population` = 
 (select max(`world`.`country`.`Population`) from `world`.`country`)) and 
 (<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`)))))
1 row in set (0.00 sec)

mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city` 
where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`) 
and (`world`.`country`.`Population` = (/* select#3 */ 
select max(`world`.`country`.`Population`) from `world`.`country`)))
1 row in set (0.00 sec)

EXPLAIN EXTENDED writes the approximate query that MySQL is going to execute after the optimizer has applied any optimizations or transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in the query execution (IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:

mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

I believe what's happening here, is during optimization MySQL opens the index population and looks at the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I've seen it do this before, here is a more common example of this cheating happening:

mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra:
1 row in set, 1 warning (0.00 sec)

mysql5.5.31 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,
'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1
1 row in set (0.00 sec)

mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS
 `CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` 
WHERE 1
1 row in set (0.00 sec)

Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has FORMAT=JSON, which looks much nicer to me and it includes more information:

mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN 
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "country",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "Population"
          ],
          "key": "Population",
          "used_key_parts": [
            "Population"
          ],
          "key_length": "4",
          "ref": [
            "const"
          ],
          "rows": 1,
          "filtered": 100,
          "using_index": true,
          "attached_condition": "(`world`.`country`.`Population` = (/* select#3 */ 
          select max(`world`.`country`.`Population`) from `world`.`country`))",
          "attached_subqueries": [
            {
              "dependent": false,
              "cacheable": true,
              "query_block": {
                "select_id": 3,
                "table": {
                  "message": "Select tables optimized away"
                }
              }
            }
          ]
        }
      },
      {
        "table": {
          "table_name": "City",
          "access_type": "ref",
          "possible_keys": [
            "CountryCode"
          ],
          "key": "CountryCode",
          "used_key_parts": [
            "CountryCode"
          ],
          "key_length": "3",
          "ref": [
            "world.country.Code"
          ],
          "rows": 1,
          "filtered": 100
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

While it's possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn't hurt either ;) I'm really looking forward to what tools can be built to take advantage of the JSON explain output.

Review of SQL Performance Explained by Markus Winand

I picked up a copy of SQL Performance Explained last week, after having been a long time fan of Markus' site Use The Index, Luke!.

What I love the most about use-the-index-luke, is the 3-minute-test. Seriously - try taking it!

Anyway, here is what I have to say about the book:

SQL Performance Explained
  • Quality : The book is not published by a major publisher, so I was not sure if I should expect a bound stack of photocopies. To my delight, it is the same quality as any other book, and clearly has a copy editor. I didn't notice any gramatical or copy errors.

  • Length : It is a short book at 164 pages + 28 pages of appendixes. I personally love this - because the author has focused on what is important, and not added introductory chapters at the pressure of a publisher to expand page count. I kept wondering if it could be smaller (in an effort to make sure every developer on a team reads it), but I don't think this is reasonable either. I thought it was smart in the preface to say that the book only covers B-tree indexes, and the length was therefor "about right".

  • Content : Very high quality content - it is simply amazing that one author has this depth of knowledge on 4 different database vendors (Oracle, SQL Server, MySQL and PostgreSQL). I like the organization which is largely around statement level behavior, and keeps it a developer-focused book. I also like that the first chapter dives right into the thick of it by describing the anatomy of an index. The theory aspect of this chapter may be a little bit difficult for developers with less of a low-level background to grasp in one read - but I couldn't think of any other way around it. It covered a good/appropriate level of detail.

  • Small Criticisms : Sometimes it is difficult to be generic in advice, and keep in mind that this book covers Oracle, SQL Server, MySQL and PostgreSQL. I think the book did pretty well - but I did notice an example in Chapter 2, where it describes an index range scan (p39) that I know is not possible until Index Condition Pushdown in MySQL 5.6. It did warn that composite indexes should be in the form const, range but for MySQL the wording probably should have been a little stronger. I don't want you to take this criticism too seriously as well. If the author were to give database specific and version specific advice, it would make for a very dull read.

Overall:
As the author himself says, SQL does a good job at separating out what data is needed from how data will actually be retrieved. This makes performance characteristics not always clear cut.

I think that for a development team building database applications everyone understanding these basic characteristics is like a chef knowing how to cook an omelette - it is such a critical fundamental that will never go away.

If there are developers that aren't getting 5/5 on the Use the Index, Luke! 3-minute-test, then I can't think of a better resource to get them there. The book is a short read, and to the point!

Final rating: 4.5 stars.

Buy the book online from sql-performance-explained.com.

Experimenting with MySQL 5.7

I was playing around with MySQL 5.7 this weekend and before having read the changelog, I managed to spot these two little gems.

Duplicate Indexes

"The server now issues a warning if an index is created that duplicates an existing index, or an error in strict SQL mode." Bug #37520

Example Testcase:

mysql> SHOW CREATE TABLE city\G
*************************** 1. row ***************************
       Table: city
Create Table: CREATE TABLE `city` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Name` char(35) NOT NULL DEFAULT '',
  `CountryCode` char(3) NOT NULL DEFAULT '',
  `District` char(20) NOT NULL DEFAULT '',
  `Population` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ID`),
  KEY `CountryCode` (`CountryCode`),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `Country` (`Code`)
) ENGINE=InnoDB AUTO_INCREMENT=4080 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> ALTER TABLE city add index (countrycode);
ERROR 1831 (HY000): Duplicate index 'CountryCode_2' defined on the table 'world.city'. 
This is deprecated and will be disallowed in a future release.

Pretty cool - I know this previously caught a lot of people.

Control-C support in the client

"Previously, Control+C in mysql interrupted the current statement if there was one, or exited mysql if not. Now Control+C interrupts the current statement if there was one, or cancels any partial input line otherwise, but does not exit." Bug #66583

Example Testcase:

mysql> this is a test
    -> test
    -> test
    -> ^C

So if I want to quit, I can now control-C then type "quit". This is much more intuitive.

Twice as much RAM != Twice as Fast

We use Amazon RDS for database servers and historically have had a practice of having smaller database server in QA and DEV to save cost. I would consider this a pretty common practice - but what I have also discovered is that it can lead to fixing non-bugs, or issues that will not arrise in production.

Let me try and describe a (~) worst case and some simple math to illustrate:

Production
- 10G of RAM
- 100 IOPS Storage Device (single hard disk)
- Workload is 100% reads
- 10K queries/second

QA
- 9G of RAM
- 100 IOPS Storage Device (single hard disk)
- Same workload (100% reads)
- Can only do 1K queries/second!

Production is serving its 10K queries/second with a 100% cache hit rate, but with the 1G less RAM, QA is only able to serve with a 99% cache hit rate. Each query touches on average 10 distinct pages. So:

10K queries * 10 pages
= 100K pages/second

100K pages * 0.01 cache miss
= 1OOO IOPS second required to cover misses.

.. since we only have 100 IOPS, we can only serve 1/10th of the queries!

In our case it was more like production having 68G RAM and QA having 34G. The query volume itself was also not necessarily as high, but the queries all needed to access a very large number of distinct pages - causing the same cache miss, and page churn.

What MySQL is really good at!

With databases, choices of algorithms influence performance characteristics significantly. It is often very easy to find a situation where one database will perform much worse than others, which is why you will hear database engine developers cry out that benchmarks are flawed.

The best benchmark is one that closely matches what your application actually does, which is why you see the TPC create benchmarks to match hypothetical situations - like a warehouse that has inventory arriving and being shipped out all the time. These situations in turn create "workloads" on the database. To give some context, a workload may perform differently on a different database engine because of how many concurrent modifications are required, the ratio of reads/writes, how much data is modified in a transaction, and where in the data set the reads and writes are (are there hot records or hot tables?). These are just examples - in reality there are too many factors to list.

Which gets me to my main point - what workloads is MySQL with the InnoDB storage engine really good at? In answering this, I really want to try to focus on core data structures and algorithms more than specific features that may or may not be implemented.

Total data size may exceed memory but working set does not

If you have for example 100G of data, you do not necessarily require 100G of RAM. MySQL will adjust and keep only the most frequently accessed data in memory (aka working set). This behavior is the result of using a B+Tree index. To simplify and compare this to say memcached where the index is a hash - it really requires as much memory as there is data.

(Note: Memcached of course addresses this by allowing you to distribute the index across many nodes, and hash indexes are generally smaller, so please don't miss the key point I am making here.)

Modifications are small and transactions are short

When you modify a row in InnoDB, the old version of it is relocated to undo space, and the 'new' version is written in place (implementing MVCC). When all existing transactions have finished with the old version of the row, along comes a purge process.

Not all databases implement MVCC this way. For example another way of implementing MVCC is to just append/write new rows in the same table-space. The advantage of the relocation method is that if there is sufficient memory to keep the undo pages in memory (and they are not required for very long) is that this relocation may be a logical relocation only. i.e. physical IO relocation doesn't have to happen. However, in the case that there is not enough memory/the undo pages need to be evicted from the buffer pool, then I could see this potentially taking more effort. I believe rolling back large transactions may be more effort with InnoDB's design as well - although I have not put much thought into it.

Primary key lookups are fast

InnoDB implements a Clustered Index on the primary key. Clustering in this case refers to a close arrangement/organization on disk, not multiple compute instances working together. This makes primary key lookups in InnoDB particularly fast - although admitedly probably not as fast as databases that impliment a hash index.

Indexed ranged lookups are supported

Some index structures are suitable for fixed primary key lookups, or emulate other types of access by pre-computing a cache of values and packing them in one row/column. Because InnoDB uses a B+Tree index, it is very flexible in being able to support ranged lookups on primary or secondary keys (example: find rows BETWEEN x AND y). These indexes are also pre-sorted, which (depending on the query) can allow the rows to be efficiently delivered as they are read rather than buffered first.

Query performance is highly uniform

InnoDB is optimized to provide stable response times to user-serving applications, rather than just peak throughput in queries per second. There are a number of features which help deliver this: converting as much IO as possible to be in the background with the transaction log files, adaptive flushing to make sure that background work doesn't fall behind, even the implimentation of MVCC I described above.

InnoDB also supports covering indexes, which means that for queries that do not SELECT * they sometimes have a chance to retrieve all data from the secondary index. For some cases where not all data fits in memory covering indexes can be super critical to uniformity of performance because the number of unique pages that will need to be looked at is considerably lower.

Table has only a primary key and non-unique secondary keys

So I mentioned already that total data-size can exceed main memory, but InnoDB also has a novel mechanism to delay INSERT/UPDATE/DELETE modifications to secondary indexes if the pages are not loaded into memory. What it will do is buffer the changes in a secondary structure, and then make the modification the next time the page is loaded into memory, or in the background provided the server is not under load. This feature only works when index is not unique, because otherwise it is not safe to make a modification not knowing if it violates a constraint. Ideally, the primary key also inserts data in order.

(In previous versions of InnoDB, this feature was described as the insert buffer).

..

What am I missing? Please try and focus on the algorithms not the features, and the positive not the negative!

When does MySQL perform IO?

In my previous post, I wrote about when data gets loaded in and out of cache. The goal of caching is to avoid having to perform disk IO, which if we are talking about hard drives, each operation is about 20 million times slower than CPUs or about 100 thousand times slower than memory.

So today I want to cover the subject of when are we required to do IO? But before I get there I first have to describe how IO works on all our modern operating systems.

An Introduction to Buffered IO

When you read or write to a file, by default you are speaking to a layer in between you and your disks called buffered IO. This layer is designed to increase performance and adds important features such as caching. Without going into this in detail, it's important to note that:

  • When you READ data, you may be reading from the disk or a cache provided by the operating system.

  • When you WRITE data you are not writing to the disks, but to a queue which will be flushed in the background. This queue is a little bit different to how a message queue in an application may behave: it usually has a hard limit to the number of outstanding requests, and it does not always complete requests in purely FIFO order - but an optimized order that still makes sure there is a maximum time window a request can take (example).

    • WRITE operations are not crash-safe by default! The operating system does provide a way of syncing a file to make sure all of the buffers are flushed down to the disk; an fsync.

    • An fsync is usually very expensive because it forces the queue to empty, and reduces the amount of request reordering and merging. Sequential IO is really important for hard drives.

So now lets look at IO created by a MySQL Server using the InnoDB storage engine (default since MySQL 5.5).

Reading pages into memory

If a page is not located in cache already, InnoDB will load it into memory in the foreground while queries are waiting, unless the page loading was triggered via a read ahead operation in which case it will use one of its innodb_read_io_threads to make this happen in the background.

Writing modified pages back to disk

When you make a modification to an InnoDB page, InnoDB does not write it immediately to disk. Instead a delta is written to the transaction log (see below).

These modified-only-in-memory pages are typically reffered to as dirty pages. Writing down dirty pages happens as part of InnoDB's background process which continually performs work every 1 second, and it adjusts well to peaks and troughs in load with a feature introduced in MySQL 5.5 and further improved in MySQL 5.6 called adaptive flushing. However, if there are too many dirty pages and there is not enough free space, InnoDB may be forced to synchronously flush a dirty page to make free space.

This delaying allows InnoDB to perform fewer random writes, since there will usually be multiple changes to the same page, or by default pages in the same extent (1MB allocation unit on disk). MySQL also 5.6 allows you to disable this for SSDs.

(Note: Domas makes a comment about the doublewrite buffer below, and how I might be over-selling request merging :) It's outside of the scope covered here, but described in Facebook's blog and the MySQL manual).

Writing the transaction log

Any time you modify an InnoDB page a record of this modification is made to InnoDB's transaction log system (ib_logfile1 and ib_logfile0 on disk). This transaction log is designed to be sequentially written, and unless the server crashes the server will only ever write and not read from it.

The log files are cyclical by design. You can think of them as a concatentation of two (by default) that behave similar to tread on a tank. So the first one is filled, then the second, then InnoDB will start filling the first again. The writes to the transaction log will be 512B aligned and using the operating system's buffered IO. InnoDB actually relies on the operating system here to provide a cache of the log (which is not guaranteed). Since these writes are 512B - and a typical page is 4K what will happen without enough cache, is that the page will neeed to be read first, to make the partial-modification, then written back.

To reduce log file IO, InnoDB also has a log buffer (configured by innodb_log_buffer_size) that is filled with pending modifications which will be written and synced on any COMMIT by default.

Writing the slow log and general log

The general log is written out as each event is generated. It will be a sequential write, and is never synced or never read back (should be fairly low cost).

The slow query log is written out after query execution has finished. Similiar to the general log, it should also be very log cost.

Why I say "should" is that this also depends on filesystem choice. An fsync under ext3 does not flush one file, but all files. This has become known as the great fsync() bug. Also, the operating system will usually have a limit the amount of modifications queued up waiting to be written, so too much log activity could cause synchronous writing and impact other operations (see Domas' comment below).

IO on a Replication Master (aka binary log enabled)

The binary log is sequential IO written to as part of a COMMIT operation, but by default it is never flushed, so it is not crash-safe! It requires setting sync_binlog=1.

The binary log does not have its own in memory buffers, so it's possible if an out of date slave came online that the master server may need to read old binary log files in order to send previous server events. However in practice the master pushes changes to online slaves as they are committed so it should be write-only.

The server maps changes in the binary log to InnoDB's transaction log to ensure consistency by default.

IO on a Replication Slave

The relay logs are in the same format as the binary logs, but the IO semantics are a little different. The slave's IO_THREAD retrieves new events from the master and writes sequential IO relay log files on the slave. Then the SQL_THREAD then reads the changes from the relay logs and applies them. This read operation does not have any caches inside of MySQL to ensure no IO is performed, and instead relies on the operating system's buffered IO to assist. It's possible to decrease the amount of relay logs kept on the slave to assist in a better 'cache fit', but this also means that if the master fails there may be changes the slave was not aware of.

Writing the relay logs is not crash-safe by default, and requires setting sync_relay_log=1.

Also, the status files which maintain a mapping between the relay logs and master's binary logs (which are not a 1:1) are also not crash-safe. A new option in MySQL 5.6 allows you to use an internal table to maintain these status files for added safety, but it is not enabled by default. The new options are master-info-repository and relay-log-info-repository.

Summary

I am sure I made a couple of simplifications (for example not describing the doublewrite buffer), but hopefully it's a good start. There are a couple of key take away points I want to make:

  • Reads are foregrounded (except read-ahead) and don't provide the same opportunity reorder and merge - so that makes them typically random IO and very expensive. If you run a diagnostic command like iostat and you see a lot of reads on a warmed up server - this is something adding memory can usually solve.

  • Most writes are sequential or backgrounded. Both InnoDB and the operating system try to produce sequential IO and merge requests.

  • While InnoDB has always been crash safe by default, MySQL has not. There are new features in MySQL 5.6 which makes replication more reliable.

When does MySQL data get loaded in and out of cache?

A cold cache, or a poorly tuned cache can be responsible for a number of performance problems. If we look at the data and indexes of InnoDB, the cache responsible is called the InnoDB buffer pool.

In earlier versions of MySQL the default value for the setting innodb_buffer_pool_size was 8M - which is a little out of date when you consider the recommended value to be 50-80% of system memory. In MySQL 5.5 the default value was increased to 128M, which is a comprimise made for users that may install MySQL on a development machine and not have it running as a dedicated server. In production, it is unlikely that you will buy a new server with less than 64GB of RAM, so it is quite typical that this setting is 50GB+

So lets talk about the behaviour of the InnoDB buffer pool -

Up until and including MySQL 5.5

When MySQL starts up, the buffer pool is completely empty. As queries are executed, MySQL will determine which pages are required - and if they are not in memory, they will be loaded from disk.

Initially, performance will be very poor - since there will be 100% cache misses, but over time as the buffer pool fills this should reduce.

Provided there is enough memory - MySQL will just keep loading pages into cache, and warm to a state where it will not have to perform any reads to the disk at all. Only writes will go to disk.

From MySQL 5.6 onwards

Because server warm time is becoming a more serious problem (it always existed, but it is exacerbated by us now having an abundance of RAM, but if we are using hard drives, they are not really any faster), Oracle introduced a feature in MySQL 5.6 - buffer pool warming.

What it does is on demand or on shutdown saves the addresses (space_id + page_id) of InnoDB pages to permanent storage.

Upon startup, these pages can then automatically be loaded back into memory straight away so that we can have cache misses for less time.

This feature is not turned on by default, but I suspect in a future version it may be. There are a number of other advantages to pre-warming such as being able to sort and merge read requests and load the data in much faster. Also, since it is only the addresses of pages being saved it only takes 8 bytes to point to each 16KB page, and there are no risks if there have been modifications since the last address saving operation ran.

Cache evictions

So we've described the "best case" so far, which is that data only gets loaded into cache and performance keeps getting better and better. However in practice, our cache is never unlimited. When our buffer pool gets to the point of being completely full it will need to free space in order to load new pages into memory. The behavior of this 'eviction' is version specific.

Up until MySQL 5.5

InnoDB previously used a classic Least Recently Used (LRU) algorithm. What this means is that each page has an order in a list. I like to think of it as a "Hot or not" score. Each time a page is accessed it gets promoted up the list to be considered "more hot". When it comes to freeing space, InnoDB just picks the least hot page and frees it from memory to make space for the next page.

In typical operation MySQL will not free pages from memory unless it needs to make space for another page. I say 'typical', because pages will be freed if a table is dropped for example.

MySQL 5.5 onwards

There are a number of operational limitations with using a classic LRU implementation on a database which has many users, and typically more than one workload. For example lets say that everything is working fine and we've figured out what is "hot or not", noting that we do not need as much memory as we do data because in most cases there will be pages that are not accessed frequently. Now imagine that a series of queries come in from mysqldump that want to run tablescans to export the data. What can happen, is the pages loaded in from the tablescans push out some of the good pages in our buffer pool, and even worse, as soon as the mysqldump operation is complete, they will no longer be required. So now post-mysqldump we have random IO as we try and re-settle cache-contents again.

So in MySQL 5.5 the LRU was split into two sublists:
- A 'young' sublist for our frequently accessed pages (default: 63%)
- An 'old' sublist for our table scan page accesses (default: 37%)

The configuration variable innodb_old_blocks_pct was introduced to configure the old/new split, and a variable innodb_old_blocks_time (default in 5.5: 0, in 5.6: 1000) was introduced to specify a minimum amount of milliseconds that a freshly loaded page must stay in the old sublist before it is ellible to be promoted to the young sublist.

As mentioned in the manual, and in blog posts it frequently makes sense to configure the innodb_old_blocks_time to a larger value, for example 1000 (ms).

A note on page churn

If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called "the 5 minute rule". This means that if you load a page into memory and are going to need it again in the next five minutes - it should not need to be churned out.