In my previous post, I wrote about when data gets loaded in and out of cache. The goal of caching is to avoid having to perform disk IO, which if we are talking about hard drives, each operation is about 20 million times slower than CPUs or about 100 thousand times slower than memory.

So today I want to cover the subject of when are we required to do IO? But before I get there I first have to describe how IO works on all our modern operating systems.

An Introduction to Buffered IO

When you read or write to a file, by default you are speaking to a layer in between you and your disks called buffered IO. This layer is designed to increase performance and adds important features such as caching. Without going into this in detail, it’s important to note that:

When you READ data, you may be reading from the disk or a cache provided by the operating system.
When you WRITE data you are not writing to the disks, but to a queue which will be flushed in the background. This queue is a little bit different to how a message queue in an application may behave: it usually has a hard limit to the number of outstanding requests, and it does not always complete requests in purely FIFO order – but an optimized order that still makes sure there is a maximum time window a request can take (example).
- WRITE operations are not crash-safe by default! The operating system does provide a way of syncing a file to make sure all of the buffers are flushed down to the disk; an fsync.
- An fsync is usually very expensive because it forces the queue to empty, and reduces the amount of request reordering and merging. Sequential IO is really important for hard drives.

So now lets look at IO created by a MySQL Server using the InnoDB storage engine (default since MySQL 5.5).

Reading pages into memory

If a page is not located in cache already, InnoDB will load it into memory in the foreground while queries are waiting, unless the page loading was triggered via a read ahead operation in which case it will use one of its innodb_read_io_threads to make this happen in the background.

Writing modified pages back to disk

When you make a modification to an InnoDB page, InnoDB does not write it immediately to disk. Instead a delta is written to the transaction log (see below).

These modified-only-in-memory pages are typically reffered to as dirty pages. Writing down dirty pages happens as part of InnoDB’s background process which continually performs work every 1 second, and it adjusts well to peaks and troughs in load with a feature introduced in MySQL 5.5 and further improved in MySQL 5.6 called adaptive flushing. However, if there are too many dirty pages and there is not enough free space, InnoDB may be forced to synchronously flush a dirty page to make free space.

This delaying allows InnoDB to perform fewer random writes, since there will usually be multiple changes to the same page, or by default pages in the same extent (1MB allocation unit on disk). MySQL also 5.6 allows you to disable this for SSDs.

(Note: Domas makes a comment about the doublewrite buffer below, and how I might be over-selling request merging 🙂 It’s outside of the scope covered here, but described in Facebook’s blog and the MySQL manual).

Writing the transaction log

Any time you modify an InnoDB page a record of this modification is made to InnoDB’s transaction log system (ib_logfile1 and ib_logfile0 on disk). This transaction log is designed to be sequentially written, and unless the server crashes the server will only ever write and not read from it.

The log files are cyclical by design. You can think of them as a concatentation of two (by default) that behave similar to tread on a tank. So the first one is filled, then the second, then InnoDB will start filling the first again. The writes to the transaction log will be 512B aligned and using the operating system’s buffered IO. InnoDB actually relies on the operating system here to provide a cache of the log (which is not guaranteed). Since these writes are 512B – and a typical page is 4K what will happen without enough cache, is that the page will neeed to be read first, to make the partial-modification, then written back.

To reduce log file IO, InnoDB also has a log buffer (configured by innodb_log_buffer_size) that is filled with pending modifications which will be written and synced on any COMMIT by default.

Writing the slow log and general log

The general log is written out as each event is generated. It will be a sequential write, and is never synced or never read back (should be fairly low cost).

The slow query log is written out after query execution has finished. Similiar to the general log, it should also be very log cost.

Why I say “should” is that this also depends on filesystem choice. An fsync under ext3 does not flush one file, but all files. This has become known as the great fsync() bug. Also, the operating system will usually have a limit the amount of modifications queued up waiting to be written, so too much log activity could cause synchronous writing and impact other operations (see Domas’ comment below).

IO on a Replication Master (aka binary log enabled)

The binary log is sequential IO written to as part of a COMMIT operation, but by default it is never flushed, so it is not crash-safe! It requires setting sync_binlog=1.

The binary log does not have its own in memory buffers, so it’s possible if an out of date slave came online that the master server may need to read old binary log files in order to send previous server events. However in practice the master pushes changes to online slaves as they are committed so it should be write-only.

The server maps changes in the binary log to InnoDB’s transaction log to ensure consistency by default.

IO on a Replication Slave

The relay logs are in the same format as the binary logs, but the IO semantics are a little different. The slave’s IO_THREAD retrieves new events from the master and writes sequential IO relay log files on the slave. Then the SQL_THREAD then reads the changes from the relay logs and applies them. This read operation does not have any caches inside of MySQL to ensure no IO is performed, and instead relies on the operating system’s buffered IO to assist. It’s possible to decrease the amount of relay logs kept on the slave to assist in a better ‘cache fit’, but this also means that if the master fails there may be changes the slave was not aware of.

Writing the relay logs is not crash-safe by default, and requires setting sync_relay_log=1.

Also, the status files which maintain a mapping between the relay logs and master’s binary logs (which are not a 1:1) are also not crash-safe. A new option in MySQL 5.6 allows you to use an internal table to maintain these status files for added safety, but it is not enabled by default. The new options are master-info-repository and relay-log-info-repository.

Summary

I am sure I made a couple of simplifications (for example not describing the doublewrite buffer), but hopefully it’s a good start. There are a couple of key take away points I want to make:

Reads are foregrounded (except read-ahead) and don’t provide the same opportunity reorder and merge – so that makes them typically random IO and very expensive. If you run a diagnostic command like iostat and you see a lot of reads on a warmed up server – this is something adding memory can usually solve.
Most writes are sequential or backgrounded. Both InnoDB and the operating system try to produce sequential IO and merge requests.
While InnoDB has always been crash safe by default, MySQL has not. There are new features in MySQL 5.6 which makes replication more reliable.

Domas Mituzas

Morgan,

1. innodb_read_io_threads are for read-ahead only (so, not used much), most of InnoDB reads are done by query threads. you were kinda correct in your summary, but that “foregrounded” collides with “reading pages into memory” section.

2. InnoDB overwrites transaction log pages (and fsyncs in between overlapping writes), so, it is not really sequential.

3. pro-tip for transaction log would be issuing 4k writes (percona build has that option)

4. the amount of merges done by InnoDB and especially by OS are limited by double-write buffer choke-point.

5. Excessive writing by logs (general query log, slow query log) can hit OS foreground checkpointing and slow everything down in unexpected place.
- Morgan Tocker
  
  Domas, thank you for your reviewing 🙂 I will definitely change the note about read_io_threads.
  
  For the transaction log writes – I am trying to write my blog about 5.5/5.6 so it applies to Oracle releases.. I also didn’t want to explain the motivation (O_DIRECT) or doublewrite buffer 🙂
  
  I will include something in my description of “An Introduction to Buffered IO” that says that the queue depth is usually limited. By saying excessive log writing chokes performance, you are talking about things like an IO scheduler limit on nr_requests?
  - Domas Mituzas
    
    Well, double-write buffer introduces see-saw pattern in InnoDB write I/O (as double-write buffer write is exclusive from other innodb write activity).
    
    Regarding log writes, no, I mean more the land of vm.dirty*
    
    nr_requests of course is another narrow area that does not allow too much of merging to happen (though I don’t believe in ordered I/O, apart from direct merges).
- repls
  
  Domas,
  although InnoDB overwrites redo log pages, but i think most of time the writes are sequential other than when start the overwrite. then the next time is really sequential(assume transaction is small), then that is to say only overwrites lead to some random write,but the proportion of overwrite in total IO maybe small. what do you think about it?
  - Domas Mituzas
    
    Well, you can always measure. Usually it overwrites in same area that it has written just before, but if you’re waiting for fsync to happen the sequentiality doesn’t help much there (assuming you’re on same disk as everything else and head movement is not accounted).
    - repls
      
      if redo log file and data file on the same disk, fsync redo log file did not happen completly sequential,because mysql also need to fsync data file when doing redo log fsync(one disk have only one head for HDD). but if the data file and redo log file not on the same disk ,then i think fsync to redo log file is almost sequentiality, in fact we do it in our product environment.
Dimitri

I’m also usually disabling neighbor flushing whatever SSD is used or not.. – in most cases if storage level is able to follow required flushing rate (in fact REDO rate), then your workload remains stable. If it cannot follow, then even if you’ll flush neighbors, you’ll hit a QPS drop once reached out of free space in REDO.. (while if you’re flushing neighbors when flushing is ok, then you’re not flushing all required amount of pages to follow the REDO rate, and finally reaching out of free space in REDO as well)..

Don’t know if I’m explaining well, but in few words it’s so 😉

Rgds,
-Dimitri
- Morgan Tocker
  
  I skipped sizing advice for the transaction (REDO) logs, and when they run out of space. Baron wrote a post on this in 2008: http://www.mysqlperformanceblog.com/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/
  
  This advice was pre-5.5 so the wording is a little strong against multi-GB log files (which is now much faster).
  
  Thank you for contributing!
  - Dimitri
    
    Morgan, my point was about neighbor pages flushing switch option (I’m switching it off even I’m not using SSD (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_neighbors))..
    
    For REDO log sizing – since 5.6 use as big as you want (there are a lot of monitoring counters now there to see how REDO is used, etc.)..
    
    Rgds,
    -Dimitri
rwinkel

So I have a related question. I do backups every night using rsnapshot and I use mysqldump to create the db dump for backing up. But if the db hasn’t changed I don’t want to do the dump because rnsapshot can save tons of space and time if it realizes the file hasn’t changed (by looking at the size and timestamp). To check for any db updates I look at the date stamp on the dump file and compare it to the date stamps of the files in /var/lib/mysql/db_name. I thought if I did a “flush tables” first it would force a write to any files that have changed but apparently this isn’t always true. Can anyone tell me how to force mysql to flush any pending updates to their final destination on disk? I know about the –flush option but I don’t want to dog things down with that. Any help would be much appreciated!
- Morgan Tocker
  
  To answer your question – there is no way to force flushing all pending changes to disk. FLUSH TABLES gets close, but InnoDB may have pending changes and does not offer a mechanism to force a hard checkpoint.
  
  May I suggest you take a look at MySQL Enterprise Backup, or xtrabackup? Both offer an incremental backup.
Andhraite

Morgan, We have mysql Master slave setup.

We are having issues with slave lag & most of times its read io lagging.

How can we improve it.

When does MySQL perform IO?