An easy way to describe MySQL's Binary Log Group Commit

It struck me today; there is an easy way to describe MySQL's Binary Log group commit improvements from MySQL 5.0-5.7 by using the example of a single ferry trying to ship passengers from point A to point B:

Ferry from Point A to point B

MySQL 5.0 Behaviour

In MySQL 5.0, the ferry will pick up the next passenger in line from point A, and transfer them to point B. The trip between A and B takes about 10 minutes return trip, so it's possible that several new passengers will arrive while the ferry is in transit. That doesn't matter; when the ferry arrives back at point A, it will only pick up the very next passenger in line.

MySQL 5.6 Behaviour

In MySQL 5.6, the ferry will pick up all passengers from the line at point A, and then transfer them to point B. Each time it returns to point A to pick up new passengers, it will collect everyone who is waiting and transfer them across to point B.

This is measurably better performance in real-life situations where many passengers tend to arrive while waiting for the ferry to arrive back at point A, and the trip between A and B tends to take some time. It is not so measurable in naive benchmarks that run in a single-thread.

There is no configuration necessary to enable group commit in 5.6. It works by default.

MySQL 5.7 Behaviour

MySQL 5.7 behaves similarly to 5.6 in that it will pick up all waiting passengers from point A and transfer them to point B, but with one notable enhancement!

When the ferry arrives back at point A to pick up waiting passengers, it can be configured to wait just a little bit longer with the knowledge that new passengers will likely arrive. For example: if you know the trip between point A and point B is 10 minutes in duration, why not wait an extra 30 seconds at point A before departing? This may save you on roundtrips and improve the overall number of passengers that can be transported.

The configuration variables for artificial delay are binlog-group-commit-sync-delay (delay in microseconds) and binlog-group-commit-sync-no-delay-count (number of transactions to wait for before deciding to abort waiting).

Conclusion

In this example passengers are obviously transactions, and the ferry is an expensive fsync operation. It's important to note that there is just one ferry in operation (a single set of ordered binary logs), so being able to tune this in 5.7 provides a nice level of advanced configuration.

How important is it to use 2-byte and 3-byte integers?

One interesting feature of MySQL, is that it supports a very large number of integer data types. From the MySQL manual:


Type Storage Minimum Value Maximum Value
(Bytes) (Signed/Unsigned) (Signed/Unsigned)
TINYINT 1 -128 127
0 255
SMALLINT 2 -32768 32767
0 65535
MEDIUMINT 3 -8388608 8388607
0 16777215
INT 4 -2147483648 2147483647
0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807
0 18446744073709551615

I have previously written on How important it is to using the correct datatype in MySQL, but I find that between integer types I don't always follow my own advice. I predominantly use TINYINT, INT, and BIGINT, and usually as the default of SIGNED rather than UNSIGNED.

Am I lazy? I think it is best to run through examples...

Primary Key Integer

In this example lets say that we know that we will never have more than 65535 customers, so we have the choice of using a SMALLINT UNSIGNED at 2 bytes, or lazily an INT will accommodate 2147483647 customers at 4 bytes:

Option #1:
CREATE TABLE customers (
id SMALLINT UNSIGNED PRIMARY KEY auto_increment,
company_name VARCHAR(20),
..
..
last_order_date DATETIME
);
Option #2:
CREATE TABLE customers (
id INT PRIMARY KEY auto_increment,
company_name VARCHAR(20),
..
..
last_order_date DATETIME
);

I will project that there are 50000 rows in the table, which is about as close as you would want to get to the maximum range possible of SMALLINT. Using SMALLINT provides a 2 byte saving per row:

2 * 50000 = 100000 = A saving of 97KB

In InnoDB the primary key is included in every other index on the table, so technically we will save an additional ~97KB per-index, but I would say that this saving is too small to matter for most cases.

Attribute on a large table

Option #1:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL,
..
..
);
Option #2:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id INT NOT NULL,
..
..
);

In this example, I will say that there are 1 billion rows in the page_views table:

1 billion * 2 bytes saving per row = 1.8GB

So the choice of using a SMALLINT over an INT has more measurable savings in this case. On a per unindexed column basis I could be saving around 2GB. So with five SMALLINT columns on the same row, I could save 10GB.

Is 10GB worth it? It is maybe easier to answer this question when expressed as a percentage saving for a table. To use some optimistic overhead numbers (that assume very little fragmentation) for InnoDB:

  • ~13 bytes per row (MVCC features)
  • ~1KB per 16K page (page fill factor)

I'll look at two table definitions. One with the best case of all SMALLINT columns in place of INT, and one with a VARCHAR column that is assumed to be 50 bytes on average:

Option #1:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
page_id SMALLINT UNSIGNED NOT NULL, /* or INT */
category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
Option #2:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
page_uri VARCHAR(2048) NOT NULL, /* assume average length of 50 bytes */
category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
# INT/SMALLINT only:
(13B (overhead) + 8B (pkey) + 5*2) * 1.06 * 1 billion = 30.60 GB
(13B (overhead) + 8B (pkey) + 5*4) * 1.06 * 1 billion = 40.47 GB

# One VARCHAR column:
(13B (overhead) + 8B (pkey) + 4*2 + 50 (varchar)) * 1.06 * 1 billion = 77.9GB
(13B (overhead) + 8B (pkey) + 4*4 + 50 (varchar)) * 1.06 * 1 billion = 85.9GB

So in the INT/SMALLINT table there is about a 25% saving by using smaller data types, and about a 10% saving in the case of the single VARCHAR column.

I do think that 25% is worth caring about, so maybe in this case I could follow my own advice a little closer :) However, I would also consider this a close to best-case situation that I have used to illustrate a point. I would expect typical space savings to be less than 10%.

One other point to consider, is that changing the datatype of a column is not supported by Online DDL. It is important not to sub-optimize data type choices and have to make changes later.

Did I miss anything? Please leave a comment.

Optimizing InnoDB Transactions

(This is a cross-post from the Engine Yard blog. The original article appears here.)

Here is a question I've actually been asked a few times:

"I am writing a batch processing script which modifies data as part of an ongoing process that is scheduled by cron. I have the ability to group a number of modifications together into a transaction, but I'm not sure what the correct number is?"

First off, I think that this question is interesting not just in the context of batch processing, but it equally applies to all parts of the application. If you are designing a high throughput system for MySQL, there are actually some potential pain points that you can design your way around.

Potential Pain Points

Here are the situations where the size of the transaction could impact performance:

Very Small Transactions

Using MySQL's auto_commit mode (or modifying just one row at a time) can cause some internal pressure on InnoDB's redo log system, since it has to make sure that its in-memory buffer is safely flushed to disk. This flushing is required to be able to provide durability, and safely recover all changes that have been committed if the server crashes.

Note that this durability requirement can be disabled by setting innodb_flush_log_at_trx_commit=2 (default: 1), or the impact can be reduced by installing a RAID controller with a battery backed write cache or a fast SSD.

Very Long Transactions

To understand what I mean by "long", lets use this hypothetical example:

START TRANSACTION;
/* Modify Data */
sleep(60)
COMMIT;

The problem with these transactions, is that even if they do not modify much data, because of a feature called Multiversion Concurrency Control InnoDB internally needs to keep around older versions of rows that were modified in other transactions. There is an internal clean-up thread which is constantly trying to purge older versions, but in the case of a very long transaction it will sit and wait until after a COMMIT/ROLLBACK event occurs.

In the example above, I used sleep(60) to illustrate a point, but in a very high throughut system it may not even need to be this long to start impacting performce.

Very large Transactions

By very large transactions, I simply mean transactions which modify very large amounts of data. There are a couple of risks that you need to be aware of in this case:

  • Locks will be held for the duration of a transaction. By modifying more rows at once, you may increase the number of deadlocks or introduce lock wait contention.
  • Rolling back transactions can be 30 times the cost of a corresponding apply transaction. It is important to note that rolling back may be something you requested, but it will also happen in the case of a disconnect or a deadlock.

Closing Advice

I think it's easy to get lost in the details and sub-optimize to eliminate any potential problems. Fear not! I have a two step process to avoiding the vast majority of all issues:

  • Move intensive application processing logic to before or after transaction running. While you have a transaction open, try and make any changes that you need to make and COMMIT as soon as possible. i.e.
/* do intensive processing */
START TRANSACTION;
/* Modify Data */
COMMIT;
/* more intensive processing */
  • You usually can't go wrong by using transactions as they were originally intended: commit logical units of work together. As mentioned above you can actually run into problems using auto_commit with row-at-a-time processing, as well as batches that are too large. My advice would be to keep transactions to modifying less than 10K rows at a time.

Good luck!

A closer look at Online DDL in MySQL 5.6

MySQL 5.6 introduced online DDL for InnoDB tables, which means that a number of ALTER TABLE statements no longer block access to data modification while running.

I've written about this feature before: it was enough to make one of my freelancing customers fall in love with MySQL again. Today I wanted to go through what's covered and what's not.

The Manual

The MySQL manual actually does all the work for me, since there is a table under 14.2.12.1. Overview of Online DDL.

The particular column you want to pay attention to is "Allows Concurrent DML?", which means that means that neither reads or writes are blocked.

Operation In-Place? Copies Table? Allows Concurrent DML? Allows Concurrent Query? Notes
CREATE INDEX,
ADD
INDEX
Yes* No* Yes Yes Some restrictions for FULLTEXT index; see next row.
Currently, the operation is not in-place (that is, it
copies the table) if the same index being created was also
dropped by an earlier clause in the same
ALTER TABLE statement.
ADD FULLTEXT
INDEX
Yes No* No Yes Creating the first FULLTEXT index for a table
involves a table copy, unless there is a user-supplied
FTS_DOC_ID column. Subsequent
FULLTEXT indexes on the same table can
be created in-place.
DROP INDEX Yes No Yes Yes Modifies .frm file only, not the data file.
OPTIMIZE TABLE Yes Yes Yes Yes Uses ALGORITHM=INPLACE as of MySQL 5.6.17.
ALGORITHM=COPY is used if
old_alter_table=1 or
mysqld --skip-new
option is enabled. OPTIMIZE TABLE using
online DDL (ALGORITHM=INPLACE) is not
supported for tables with FULLTEXT indexes.
Set default value for a column Yes No Yes Yes Modifies .frm file only, not the data file.
Change auto-increment value
for a column
Yes No Yes Yes Modifies a value stored in memory, not the data file.
Add a foreign key
constraint
Yes* No* Yes Yes To avoid copying the table, disable
foreign_key_checks during
constraint creation.
Drop a foreign key
constraint
Yes No Yes Yes The foreign_key_checks option can be
enabled or disabled.
Rename a column Yes* No* Yes* Yes To allow concurrent DML, keep the same data type and only change the
column name.
Add a column Yes Yes Yes* Yes Concurrent DML is not allowed when adding an
auto-increment
column. Although ALGORITHM=INPLACE is
allowed, the data is reorganized substantially, so it is
still an expensive operation.
Drop a column Yes Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation.
Reorder columns Yes Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation.
Change ROW_FORMAT property Yes Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation.
Change KEY_BLOCK_SIZE property Yes Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation.
Make column NULL Yes Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation.
Make column NOT NULL Yes* Yes Yes Yes When SQL_MODE includes
strict_all_tables or
strict_all_tables, the operation fails
if the column contains any nulls. Although
ALGORITHM=INPLACE is allowed, the data
is reorganized substantially, so it is still an expensive
operation.
Change data type of column No Yes No Yes
Add primary key Yes* Yes Yes Yes Although ALGORITHM=INPLACE is allowed, the data is
reorganized substantially, so it is still an expensive
operation. ALGORITHM=INPLACE is not
allowed under certain conditions if columns have to be
converted to NOT NULL. See
Example 14.9, "Creating and Dropping the Primary Key".
Drop primary key and add another Yes Yes Yes Yes ALGORITHM=INPLACE is only allowed when you add a new
primary key in the same ALTER
TABLE
; the data is reorganized substantially, so
it is still an expensive operation.
Drop primary key No Yes No Yes Restrictions apply when you drop a primary key primary key without
adding a new one in the same ALTER
TABLE
statement.
Convert character set No Yes No Yes Rebuilds the table if the new character encoding is different.
Specify character set No Yes No Yes Rebuilds the table if the new character encoding is different.
Rebuild with FORCE option Yes Yes Yes Yes Uses ALGORITHM=INPLACE as of MySQL 5.6.17.
ALGORITHM=COPY is used if
old_alter_table=1 or
mysqld --skip-new
option is enabled. Table rebuild using online DDL
(ALGORITHM=INPLACE) is not supported
for tables with FULLTEXT indexes.
Rebuild with "null" ALTER TABLE ...
ENGINE=INNODB
Yes Yes Yes Yes Uses ALGORITHM=INPLACE as of MySQL 5.6.17.
ALGORITHM=COPY is used if
old_alter_table=1 or
mysqld --skip-new
option is enabled. Table rebuild using online DDL
(ALGORITHM=INPLACE) is not supported
for tables with FULLTEXT indexes.
Set table-level persistent
statistics options
(STATS_PERSISTENT,
STATS_AUTO_RECALC
STATS_SAMPLE_PAGES)
Yes No Yes Yes Modifies .frm file only, not the data file.

Update: I have updated the table to reflect changes in MySQL 5.6.17.

Find that hard to remember?

What I also like about MySQL 5.6, is that it also introduces syntax to force ALTER TABLE statements to not run if they can not be performed online. Here is an example of using LOCK=NONE (must allow concurrent DML):

mysql> CREATE TABLE a (id INT NOT NULL PRIMARY KEY, a char(100),
 b char(100));
Query OK, 0 rows affected (0.02 sec)

mysql> ALTER TABLE a ADD INDEX idx_a (a), LOCK=NONE;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE a DROP PRIMARY KEY,  LOCK=NONE;
ERROR 1846 (0A000): LOCK=NONE is not supported. 
 Reason: Dropping a primary key is not allowed without 
 also adding a new primary key. Try LOCK=SHARED.

I recommend using this syntax whenever writing migration scripts so you are never surprised by how much impact they will have.

What's upcoming?

In general I would comment that the number of operations which switched to being online in 5.6 is quite substancial. In MySQL 5.7 two additional improvements to online DDL are available, and are already noted in the MySQL 5.7 manual page:

  1. Extending a VARCHAR column is online.
  2. Renaming an index is online.

Have you used Online DDL in 5.6? Please leave a comment and share your experiences.

Reducing MySQL durability for testing

A while ago, I wrote how to make MySQL perform durably in production. Today I want to do the opposite: configure MySQL (with the InnoDB storage engine) for when you really don't care about your data.

Continuous integration/automated testing is a good example of when you want to do this, since you usually test for functionality and not performance characteristics. It also helps in situations when the data is very easy to rebuild.

Obligatory Disclaimer: These settings are not safe for production environments!

Configuration Settings

  • Keep the binary log disabled (i.e. comment out any log-bin lines in configuration). For testing, you don't need to be able to point-in-time restore, or seed replicas with changes.

  • Set sync_frm=0. This makes any DDL (schema changes) unsafe, but removes an fsync that is required to make sure the changes are safely on disk.

  • Set innodb-flush-log-at-trx-commit=0. This reduces redo log file pressure as data is committed. I would rate this as one of the most important settings to change. The particular case when it will help the most is when you are running on hardware without a battery backed write cache.

  • Set innodb-doublewrite=0. This makes recovery unsafe, as InnoDB will not be able to recover from a partially written page. However, as with all these steps, we are working with the assumption that you will just throw out the data when that happens.

  • Set innodb-checksums=0. Page checksums are InnoDB's mechanism to make sure the underlying hardware isn't failing and corrupting data. It's usually only a small hit (although more measurable with SSDs), and is only used on loading/unloading pages from disk.

  • Set innodb_support_xa=0. This makes sure changes between the binary logs and synchronized. Since we don't care about recoverability, we can disable it.

  • Set innodb-log-file-size and innodb-buffer-pool-size appropriately large. I have a guide here. You can typically go much larger on the log files when you don't care about recoverability - 4G or 8G.

  • In your session set unique_checks=0 and foreign_key_checks=0. These two options are optional, since they are the only recommendation listed here which change behaviour. When set they remove internal constraint checking, which means that a unique key may not actually be unique.

    You may notice these options from files created by mysqldump.

What is an fsync

Many of these settings improve performance specifically because they reduce the number of fsync operations that MySQL needs to do. An fsync is required to keep data safe so it's not in an in memory buffer when power is lost.

But in making the buffers safe, fsync operations also reduce the amount of request re-ordering and merging that can be done at the layers below MySQL. This will typically drop performance substantially in all situations except when you have a RAID Controller with a battery backed cache.

A special note on libeatmydata

Stewart Smith also maintains a library to disable all fsync operations called libeatmydata. I do not expect it to deliver large performance gains over the configuration recommended above, but I can think one good situation to use it: when you have a desire to keep MySQL configuration the same in testing as it is in production. libeatmydata provides an external way to change MySQL's behaviour, since it just requires one small change to how mysqld is started.

Did I miss anything? Please leave a comment.

Benchmarking InnoDB page compression performance

InnoDB page compression is a feature that really interests me. I wrote this about it recently when describing how to improve the performance of large tables in MySQL:

"Use innodb page compression. For some workloads (particularly those with lots of char/varchar/text data types) compression will allow the data to be more compact, stretching out that performance curve for longer. It may also allow you to more easily justify SSDs which are typically smaller in capacity. InnoDB page compression was improved a lot in MySQL 5.6, courtesy of Facebook providing a series of patches."

After writing that, I decided to setup an experiment.

The Experiment

I wanted to find data that was typical to be stored in a database, but would also compress well. There is a huge potential for skew here, since if I used dummy data such as 'AAAAAAAAAAAA' it will compress very well. Likewise, jpeg images stored in blobs would unlikely compress any more than they already are. So I arrived at using the English version of Wikipedia, which is hopefully representative of "real data".

So after downloading the data set and importing it from its XML format into MySQL, I ran the following script:

DROP TABLE IF EXISTS text_16K_compressed;
DROP TABLE IF EXISTS text_8K_compressed;
DROP TABLE IF EXISTS text_4K_compressed;
DROP TABLE IF EXISTS text_16K_uncompressed;

CREATE TABLE `text_16K_compressed` (
  `old_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `old_text` mediumblob NOT NULL,
  `old_flags` tinyblob NOT NULL,
  PRIMARY KEY (`old_id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=16;

CREATE TABLE `text_8K_compressed` (
  `old_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `old_text` mediumblob NOT NULL,
  `old_flags` tinyblob NOT NULL,
  PRIMARY KEY (`old_id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

CREATE TABLE `text_4K_compressed` (
  `old_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `old_text` mediumblob NOT NULL,
  `old_flags` tinyblob NOT NULL,
  PRIMARY KEY (`old_id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

CREATE TABLE `text_16K_uncompressed` (
  `old_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `old_text` mediumblob NOT NULL,
  `old_flags` tinyblob NOT NULL,
  PRIMARY KEY (`old_id`)
) ENGINE=InnoDB;

select now();
INSERT INTO text_16K_compressed SELECT * FROM text;
select now();
INSERT INTO text_8K_compressed SELECT * FROM text;
select now();
INSERT INTO text_4K_compressed SELECT * FROM text;
select now();
INSERT INTO text_16K_uncompressed SELECT * FROM text;
select now();

With the results, there's really two aspects to measure compression by here - footprint and import time:

So for 4K and 8K pages, there is a >50% saving in disk space, since the table compresses down from 51G to 21G. In addition, it's pretty clear that (at least on my hardware) the compression doesn't have any impact on the import time: the reduced footprint actually helped the 8K compressed pages take less time to insert than the uncompressed.

So I believe I did manage to prove that compression is very useful. However I also acknowledge that there is some skew in my test:

  • INSERT SELECT is single threaded. Ideally a test should reflect a real-life requirement.
  • My single 5400 RPM hard drive does not reflect common configuration in production.
  • I also disabled the doublewrite buffer and set innodb-flush-log-at-trx-commit=2. It would be nice to demonstrate if this impacts the test.
  • I really should have waited for all dirty pages to flush between all tests. This was an oversight on my part, and I think this advantaged 16K uncompressed insert (which is just a little better than the others).

Full Disclosure

Some details about my setup:

  • Linux Mint 15 (Ubuntu 13.04)
  • MySQL 5.7.2
  • Intel i5-2400 4 core CPU @ 3.10GHz
  • Single 5400 RPM 1TB Hard Drive (WDC WD10EADS-00L5B1)
  • 32G RAM

MySQL Configuration changes from default:

innodb-buffer-pool-size=16G
innodb-log-file-size=4G
innodb-flush-log-at-trx-commit=2
innodb-doublewrite=0
innodb-file-format=barracuda

Raw output from tests:

morgo@shuttle ~/Downloads $ ~/sandboxes/msb_5_7_2/use mediawiki < create-compressed-tables.sql
now()
2013-10-28 16:12:04
now()             
2013-10-29 01:34:48
now()
2013-10-29 11:20:04
now()
2013-10-29 21:13:27
now()
2013-10-30 07:37:48

morgo@shuttle ~/sandboxes/msb_5_7_2/data/mediawiki $ ls -lS text*
-rw-rw---- 1 morgo morgo 63472402432 Oct 28 14:35 text.ibd
-rw-rw---- 1 morgo morgo 53741617152 Oct 30 07:40 text_16K_uncompressed.ibd
-rw-rw---- 1 morgo morgo 38176555008 Oct 29 01:51 text_16K_compressed.ibd
-rw-rw---- 1 morgo morgo 21768437760 Oct 29 21:32 text_4K_compressed.ibd <-- See below
-rw-rw---- 1 morgo morgo 21768437760 Oct 29 11:40 text_8K_compressed.ibd
-rw-rw---- 1 morgo morgo        8642 Oct 28 16:12 text_16K_compressed.frm
-rw-rw---- 1 morgo morgo        8642 Oct 28 16:12 text_16K_uncompressed.frm
-rw-rw---- 1 morgo morgo        8642 Oct 28 16:12 text_4K_compressed.frm
-rw-rw---- 1 morgo morgo        8642 Oct 28 16:12 text_8K_compressed.frm
-rw-rw---- 1 morgo morgo        8642 Oct 27 19:22 text.frm

An earlier version of this post featured incorrect results for 4K compressed pages (credit: @Skunnyk). Here are the results for 4K pages re-processed:

morgo@shuttle ~/Downloads $ ~/sandboxes/msb_5_7_2/use mediawiki < create-compressed-tables.sql
..
now()
2013-10-31 11:31:24
now()
2013-10-31 22:13:32

morgo@shuttle ~/sandboxes/msb_5_7_2/data/mediawiki $ ls -lS text_4*ibd
-rw-rw---- 1 morgo morgo 20858273792 Oct 31 22:24 text_4K_compressed.ibd

Ten ways to improve the performance of large tables in MySQL

Today I wanted to take a look at improving the performance of tables that cause performance problems based largely on their size. Some of this advice also applies to databases that are large in-aggregate over many tables, but I always find the individually large table a special-case that is problematic.

What you will normally find is that the speed that the table can be modified will trend down as the size increases. Here is what I am going to call the typical B+Tree index performance over time:

This graph taken from a post by MySQL@Facebook. It shows the performance degradation of inserting one billion rows into a table with insert buffer disabled (not recommended, and used for demonstration purposes only). Note that this is in log scale!

The benchmark is called iibench, and was designed by TokuTek.

So we should expect degradation of performance due to the structure of the index, but there are actually some ways that we can try and stretch out the curve, and not degrade as quickly.

Ten potential ways to reduce large table impact:

  1. Make sure to use InnoDB instead of MyISAM. MyISAM can be faster at inserts to the end of a table, but it has both table locking (limiting updates and deletes) and uses a single lock to protect the key buffer when loading data to/from disk, resulting in contention. It also does not have the change buffering feature described just below.

  2. InnoDB has change buffering (previously called the insert buffer), which is a feature to delay building secondary indexes that are not unique, and merge writes. It's further described by Facebook here. It's not shown in the graph above, but it can boost insert performance by quite a lot, and it's enabled by default. It was greatly improved in MySQL 5.5, so it is time to upgrade if you haven't.

  3. Partitioning may reduce the size of indexes, effectively reducing the table
    into many smaller tables. It also reduces internal index->lock contention, something that has been greatly improved in the MySQL 5.7.2 DMR.

  4. Use innodb page compression. For some workloads (particularly those with lots of char/varchar/text data types) compression will allow the data to be more compact, stretching out that performance curve for longer. It may also allow you to more easily justify SSDs which are typically smaller in capacity. InnoDB page compression was improved a lot in MySQL 5.6, courtesy of Facebook providing a series of patches.

  5. Sort and bulk load data into tables. Inserting in order will result in fewer page splits (which will perform worse on tables not in memory), and the bulk loading is not specifically related to the table size, but it will help reduce redo log pressure.

  6. Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.

  7. Related to the points 5 & 6, the type of primary key also matters. It is much better to use either an INT or BIGINT datatype than say a GUID, which will have a curve that degrades much faster. Having no PRIMARY KEY will also affect performance negatively.

  8. If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them once all data is loaded, then InnoDB is able to apply a pre-sort and bulk load process which is both faster and results in typically more compact indexes. This optimization became true in MySQL 5.5.

  9. More memory can actually help here too. I frequently see people under spec memory on new database servers compared to what it actually costs these days. Simple advice: If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more (assuming you have sized innodb_buffer_pool_size correctly on your server. See here.)

  10. As well as memory, SSDs can help too. Much of the performance drop shown on the curve can be attributed to additional IO which is created as the table gets bigger. While a hard drive can do 200 operations per second (IOPS), a typical SSD will do 20K+

Converting MyISAM to InnoDB and a lesson on variance

I'm about to start working through converting a large MySQL installation to InnoDB. For the particular use case durability is desired, and with MyISAM a loss of power almost certainly guarantees data loss. Durability is not available as an option.

So the normal question people ask is.. okay, what do I pay for this feature?

I think if you are looking at the workload in question I actually believe InnoDB will perform better overall. Row-level locking, multiversion concurrency control, and the correct selection of hardware (fast-disks with a RAID controller + batter backed write cache) will really help it shine.

But my problem is that InnoDB could be better in 9 situations, and worse in 1. What matters more than anything else is performance regressions. That is...

"Our Customers Feel the Variance, Not the Mean" - a key concept in Six Sigma.

The way I like to explain this, is that if we were to go to a restaurant and order the same meal - or go to Toyota and order the same car, we should receive the same product. I can't get the equivalent to a race car and you get a Trabant. Nor can we allow some queries that were already fast enough get faster, and other queries become intolerably slower.

So what I am actively trying to do is identify situations where MyISAM will be faster than InnoDB. And so far I have on my watch list:

  1. Table Scans - MyISAM data is far more compact, and typically table scans need to examine far less data.
  2. Index Scans - In InnoDB, indexes will contain multiple versions and typically be much larger.
  3. Inserts - MyISAM has a special optimization where it can just insert into the end of a table. I think even with the battery backed write cache this will be hard to beat.
  4. Single-threaded workloads - maybe there's an hourly cron that in InnoDB will take significantly longer. InnoDB really performs better in concurrent workloads.

Facebook and Percona have previously written about the importance of variance reduction as well.

I should also note that durability is not the only reason to use InnoDB. Operational tasks like backups become much easier. But that's the subject for a different post :)

When does MySQL perform IO?

In my previous post, I wrote about when data gets loaded in and out of cache. The goal of caching is to avoid having to perform disk IO, which if we are talking about hard drives, each operation is about 20 million times slower than CPUs or about 100 thousand times slower than memory.

So today I want to cover the subject of when are we required to do IO? But before I get there I first have to describe how IO works on all our modern operating systems.

An Introduction to Buffered IO

When you read or write to a file, by default you are speaking to a layer in between you and your disks called buffered IO. This layer is designed to increase performance and adds important features such as caching. Without going into this in detail, it's important to note that:

  • When you READ data, you may be reading from the disk or a cache provided by the operating system.

  • When you WRITE data you are not writing to the disks, but to a queue which will be flushed in the background. This queue is a little bit different to how a message queue in an application may behave: it usually has a hard limit to the number of outstanding requests, and it does not always complete requests in purely FIFO order - but an optimized order that still makes sure there is a maximum time window a request can take (example).

    • WRITE operations are not crash-safe by default! The operating system does provide a way of syncing a file to make sure all of the buffers are flushed down to the disk; an fsync.

    • An fsync is usually very expensive because it forces the queue to empty, and reduces the amount of request reordering and merging. Sequential IO is really important for hard drives.

So now lets look at IO created by a MySQL Server using the InnoDB storage engine (default since MySQL 5.5).

Reading pages into memory

If a page is not located in cache already, InnoDB will load it into memory in the foreground while queries are waiting, unless the page loading was triggered via a read ahead operation in which case it will use one of its innodb_read_io_threads to make this happen in the background.

Writing modified pages back to disk

When you make a modification to an InnoDB page, InnoDB does not write it immediately to disk. Instead a delta is written to the transaction log (see below).

These modified-only-in-memory pages are typically reffered to as dirty pages. Writing down dirty pages happens as part of InnoDB's background process which continually performs work every 1 second, and it adjusts well to peaks and troughs in load with a feature introduced in MySQL 5.5 and further improved in MySQL 5.6 called adaptive flushing. However, if there are too many dirty pages and there is not enough free space, InnoDB may be forced to synchronously flush a dirty page to make free space.

This delaying allows InnoDB to perform fewer random writes, since there will usually be multiple changes to the same page, or by default pages in the same extent (1MB allocation unit on disk). MySQL also 5.6 allows you to disable this for SSDs.

(Note: Domas makes a comment about the doublewrite buffer below, and how I might be over-selling request merging :) It's outside of the scope covered here, but described in Facebook's blog and the MySQL manual).

Writing the transaction log

Any time you modify an InnoDB page a record of this modification is made to InnoDB's transaction log system (ib_logfile1 and ib_logfile0 on disk). This transaction log is designed to be sequentially written, and unless the server crashes the server will only ever write and not read from it.

The log files are cyclical by design. You can think of them as a concatentation of two (by default) that behave similar to tread on a tank. So the first one is filled, then the second, then InnoDB will start filling the first again. The writes to the transaction log will be 512B aligned and using the operating system's buffered IO. InnoDB actually relies on the operating system here to provide a cache of the log (which is not guaranteed). Since these writes are 512B - and a typical page is 4K what will happen without enough cache, is that the page will neeed to be read first, to make the partial-modification, then written back.

To reduce log file IO, InnoDB also has a log buffer (configured by innodb_log_buffer_size) that is filled with pending modifications which will be written and synced on any COMMIT by default.

Writing the slow log and general log

The general log is written out as each event is generated. It will be a sequential write, and is never synced or never read back (should be fairly low cost).

The slow query log is written out after query execution has finished. Similiar to the general log, it should also be very log cost.

Why I say "should" is that this also depends on filesystem choice. An fsync under ext3 does not flush one file, but all files. This has become known as the great fsync() bug. Also, the operating system will usually have a limit the amount of modifications queued up waiting to be written, so too much log activity could cause synchronous writing and impact other operations (see Domas' comment below).

IO on a Replication Master (aka binary log enabled)

The binary log is sequential IO written to as part of a COMMIT operation, but by default it is never flushed, so it is not crash-safe! It requires setting sync_binlog=1.

The binary log does not have its own in memory buffers, so it's possible if an out of date slave came online that the master server may need to read old binary log files in order to send previous server events. However in practice the master pushes changes to online slaves as they are committed so it should be write-only.

The server maps changes in the binary log to InnoDB's transaction log to ensure consistency by default.

IO on a Replication Slave

The relay logs are in the same format as the binary logs, but the IO semantics are a little different. The slave's IO_THREAD retrieves new events from the master and writes sequential IO relay log files on the slave. Then the SQL_THREAD then reads the changes from the relay logs and applies them. This read operation does not have any caches inside of MySQL to ensure no IO is performed, and instead relies on the operating system's buffered IO to assist. It's possible to decrease the amount of relay logs kept on the slave to assist in a better 'cache fit', but this also means that if the master fails there may be changes the slave was not aware of.

Writing the relay logs is not crash-safe by default, and requires setting sync_relay_log=1.

Also, the status files which maintain a mapping between the relay logs and master's binary logs (which are not a 1:1) are also not crash-safe. A new option in MySQL 5.6 allows you to use an internal table to maintain these status files for added safety, but it is not enabled by default. The new options are master-info-repository and relay-log-info-repository.

Summary

I am sure I made a couple of simplifications (for example not describing the doublewrite buffer), but hopefully it's a good start. There are a couple of key take away points I want to make:

  • Reads are foregrounded (except read-ahead) and don't provide the same opportunity reorder and merge - so that makes them typically random IO and very expensive. If you run a diagnostic command like iostat and you see a lot of reads on a warmed up server - this is something adding memory can usually solve.

  • Most writes are sequential or backgrounded. Both InnoDB and the operating system try to produce sequential IO and merge requests.

  • While InnoDB has always been crash safe by default, MySQL has not. There are new features in MySQL 5.6 which makes replication more reliable.

When does MySQL data get loaded in and out of cache?

A cold cache, or a poorly tuned cache can be responsible for a number of performance problems. If we look at the data and indexes of InnoDB, the cache responsible is called the InnoDB buffer pool.

In earlier versions of MySQL the default value for the setting innodb_buffer_pool_size was 8M - which is a little out of date when you consider the recommended value to be 50-80% of system memory. In MySQL 5.5 the default value was increased to 128M, which is a comprimise made for users that may install MySQL on a development machine and not have it running as a dedicated server. In production, it is unlikely that you will buy a new server with less than 64GB of RAM, so it is quite typical that this setting is 50GB+

So lets talk about the behaviour of the InnoDB buffer pool -

Up until and including MySQL 5.5

When MySQL starts up, the buffer pool is completely empty. As queries are executed, MySQL will determine which pages are required - and if they are not in memory, they will be loaded from disk.

Initially, performance will be very poor - since there will be 100% cache misses, but over time as the buffer pool fills this should reduce.

Provided there is enough memory - MySQL will just keep loading pages into cache, and warm to a state where it will not have to perform any reads to the disk at all. Only writes will go to disk.

From MySQL 5.6 onwards

Because server warm time is becoming a more serious problem (it always existed, but it is exacerbated by us now having an abundance of RAM, but if we are using hard drives, they are not really any faster), Oracle introduced a feature in MySQL 5.6 - buffer pool warming.

What it does is on demand or on shutdown saves the addresses (space_id + page_id) of InnoDB pages to permanent storage.

Upon startup, these pages can then automatically be loaded back into memory straight away so that we can have cache misses for less time.

This feature is not turned on by default, but I suspect in a future version it may be. There are a number of other advantages to pre-warming such as being able to sort and merge read requests and load the data in much faster. Also, since it is only the addresses of pages being saved it only takes 8 bytes to point to each 16KB page, and there are no risks if there have been modifications since the last address saving operation ran.

Cache evictions

So we've described the "best case" so far, which is that data only gets loaded into cache and performance keeps getting better and better. However in practice, our cache is never unlimited. When our buffer pool gets to the point of being completely full it will need to free space in order to load new pages into memory. The behavior of this 'eviction' is version specific.

Up until MySQL 5.5

InnoDB previously used a classic Least Recently Used (LRU) algorithm. What this means is that each page has an order in a list. I like to think of it as a "Hot or not" score. Each time a page is accessed it gets promoted up the list to be considered "more hot". When it comes to freeing space, InnoDB just picks the least hot page and frees it from memory to make space for the next page.

In typical operation MySQL will not free pages from memory unless it needs to make space for another page. I say 'typical', because pages will be freed if a table is dropped for example.

MySQL 5.5 onwards

There are a number of operational limitations with using a classic LRU implementation on a database which has many users, and typically more than one workload. For example lets say that everything is working fine and we've figured out what is "hot or not", noting that we do not need as much memory as we do data because in most cases there will be pages that are not accessed frequently. Now imagine that a series of queries come in from mysqldump that want to run tablescans to export the data. What can happen, is the pages loaded in from the tablescans push out some of the good pages in our buffer pool, and even worse, as soon as the mysqldump operation is complete, they will no longer be required. So now post-mysqldump we have random IO as we try and re-settle cache-contents again.

So in MySQL 5.5 the LRU was split into two sublists:
- A 'young' sublist for our frequently accessed pages (default: 63%)
- An 'old' sublist for our table scan page accesses (default: 37%)

The configuration variable innodb_old_blocks_pct was introduced to configure the old/new split, and a variable innodb_old_blocks_time (default in 5.5: 0, in 5.6: 1000) was introduced to specify a minimum amount of milliseconds that a freshly loaded page must stay in the old sublist before it is ellible to be promoted to the young sublist.

As mentioned in the manual, and in blog posts it frequently makes sense to configure the innodb_old_blocks_time to a larger value, for example 1000 (ms).

A note on page churn

If you do not have enough memory to hold all of your working set in memory, what will happen is that the buffer pool will start juggling as it churns pages out of memory only to load them back in soon after. A small amount of this is probably okay, but it can start to degrade performance. The traditional rule of thumb is called "the 5 minute rule". This means that if you load a page into memory and are going to need it again in the next five minutes - it should not need to be churned out.