In my previous post, I wrote about when data gets loaded in and out of cache. The goal of caching is to avoid having to perform disk IO, which if we are talking about hard drives, each operation is about 20 million times slower than CPUs or about 100 thousand times slower than memory.
So today I want to cover the subject of when are we required to do IO? But before I get there I first have to describe how IO works on all our modern operating systems.
An Introduction to Buffered IO
When you read or write to a file, by default you are speaking to a layer in between you and your disks called buffered IO. This layer is designed to increase performance and adds important features such as caching. Without going into this in detail, it’s important to note that:
When you READ data, you may be reading from the disk or a cache provided by the operating system.
When you WRITE data you are not writing to the disks, but to a queue which will be flushed in the background. This queue is a little bit different to how a message queue in an application may behave: it usually has a hard limit to the number of outstanding requests, and it does not always complete requests in purely FIFO order – but an optimized order that still makes sure there is a maximum time window a request can take (example).
WRITE operations are not crash-safe by default! The operating system does provide a way of syncing a file to make sure all of the buffers are flushed down to the disk; an fsync.
fsyncis usually very expensive because it forces the queue to empty, and reduces the amount of request reordering and merging. Sequential IO is really important for hard drives.
So now lets look at IO created by a MySQL Server using the InnoDB storage engine (default since MySQL 5.5).
Reading pages into memory
If a page is not located in cache already, InnoDB will load it into memory in the foreground while queries are waiting, unless the page loading was triggered via a read ahead operation in which case it will use one of its
innodb_read_io_threads to make this happen in the background.
Writing modified pages back to disk
When you make a modification to an InnoDB page, InnoDB does not write it immediately to disk. Instead a delta is written to the transaction log (see below).
These modified-only-in-memory pages are typically reffered to as dirty pages. Writing down dirty pages happens as part of InnoDB’s background process which continually performs work every 1 second, and it adjusts well to peaks and troughs in load with a feature introduced in MySQL 5.5 and further improved in MySQL 5.6 called adaptive flushing. However, if there are too many dirty pages and there is not enough free space, InnoDB may be forced to synchronously flush a dirty page to make free space.
This delaying allows InnoDB to perform fewer random writes, since there will usually be multiple changes to the same page, or by default pages in the same extent (1MB allocation unit on disk). MySQL also 5.6 allows you to disable this for SSDs.
(Note: Domas makes a comment below about the doublewrite buffer below, and how I might be over-selling request merging It’s outside of the scope covered here, but described in Facebook’s blog and the MySQL manual).
Writing the transaction log
Any time you modify an InnoDB page a record of this modification is made to InnoDB’s transaction log system (
ib_logfile0 on disk). This transaction log is designed to be sequentially written, and unless the server crashes the server will only ever write and not read from it.
The log files are cyclical by design. You can think of them as a concatentation of two (by default) that behave similar to tread on a tank. So the first one is filled, then the second, then InnoDB will start filling the first again. The writes to the transaction log will be 512B aligned and using the operating system’s buffered IO. InnoDB actually relies on the operating system here to provide a cache of the log (which is not guaranteed). Since these writes are 512B – and a typical page is 4K what will happen without enough cache, is that the page will neeed to be read first, to make the partial-modification, then written back.
Writing the slow log and general log
The general log is written out as each event is generated. It will be a sequential write, and is never synced or never read back (should be fairly low cost).
The slow query log is written out after query execution has finished. Similiar to the general log, it should also be very log cost.
Why I say “should” is that this also depends on filesystem choice. An
fsync under ext3 does not flush one file, but all files. This has become known as the great fsync() bug. Also, the operating system will usually have a limit the amount of modifications queued up waiting to be written, so too much log activity could cause synchronous writing and impact other operations (see Domas’ comment below).
IO on a Replication Master (aka binary log enabled)
The binary log is sequential IO written to as part of a
COMMIT operation, but by default it is never flushed, so it is not crash-safe! It requires setting
The binary log does not have its own in memory buffers, so it’s possible if an out of date slave came online that the master server may need to read old binary log files in order to send previous server events. However in practice the master pushes changes to online slaves as they are committed so it should be write-only.
The server maps changes in the binary log to InnoDB’s transaction log to ensure consistency by default.
IO on a Replication Slave
The relay logs are in the same format as the binary logs, but the IO semantics are a little different. The slave’s
IO_THREAD retrieves new events from the master and writes sequential IO relay log files on the slave. Then the
SQL_THREAD then reads the changes from the relay logs and applies them. This read operation does not have any caches inside of MySQL to ensure no IO is performed, and instead relies on the operating system’s buffered IO to assist. It’s possible to decrease the amount of relay logs kept on the slave to assist in a better ‘cache fit’, but this also means that if the master fails there may be changes the slave was not aware of.
Writing the relay logs is not crash-safe by default, and requires setting
Also, the status files which maintain a mapping between the relay logs and master’s binary logs (which are not a 1:1) are also not crash-safe. A new option in MySQL 5.6 allows you to use an internal table to maintain these status files for added safety, but it is not enabled by default. The new options are
I am sure I made a couple of simplifications (for example not describing the doublewrite buffer), but hopefully it’s a good start. There are a couple of key take away points I want to make:
Reads are foregrounded (except read-ahead) and don’t provide the same opportunity reorder and merge – so that makes them typically random IO and very expensive. If you run a diagnostic command like iostat and you see a lot of reads on a warmed up server – this is something adding memory can usually solve.
Most writes are sequential or backgrounded. Both InnoDB and the operating system try to produce sequential IO and merge requests.
While InnoDB has always been crash safe by default, MySQL has not. There are new features in MySQL 5.6 which makes replication more reliable.