An easy way to describe MySQL's Binary Log Group Commit

It struck me today; there is an easy way to describe MySQL's Binary Log group commit improvements from MySQL 5.0-5.7 by using the example of a single ferry trying to ship passengers from point A to point B:

Ferry from Point A to point B

MySQL 5.0 Behaviour

In MySQL 5.0, the ferry will pick up the next passenger in line from point A, and transfer them to point B. The trip between A and B takes about 10 minutes return trip, so it's possible that several new passengers will arrive while the ferry is in transit. That doesn't matter; when the ferry arrives back at point A, it will only pick up the very next passenger in line.

MySQL 5.6 Behaviour

In MySQL 5.6, the ferry will pick up all passengers from the line at point A, and then transfer them to point B. Each time it returns to point A to pick up new passengers, it will collect everyone who is waiting and transfer them across to point B.

This is measurably better performance in real-life situations where many passengers tend to arrive while waiting for the ferry to arrive back at point A, and the trip between A and B tends to take some time. It is not so measurable in naive benchmarks that run in a single-thread.

There is no configuration necessary to enable group commit in 5.6. It works by default.

MySQL 5.7 Behaviour

MySQL 5.7 behaves similarly to 5.6 in that it will pick up all waiting passengers from point A and transfer them to point B, but with one notable enhancement!

When the ferry arrives back at point A to pick up waiting passengers, it can be configured to wait just a little bit longer with the knowledge that new passengers will likely arrive. For example: if you know the trip between point A and point B is 10 minutes in duration, why not wait an extra 30 seconds at point A before departing? This may save you on roundtrips and improve the overall number of passengers that can be transported.

The configuration variables for artificial delay are binlog-group-commit-sync-delay (delay in microseconds) and binlog-group-commit-sync-no-delay-count (number of transactions to wait for before deciding to abort waiting).

Conclusion

In this example passengers are obviously transactions, and the ferry is an expensive fsync operation. It's important to note that there is just one ferry in operation (a single set of ordered binary logs), so being able to tune this in 5.7 provides a nice level of advanced configuration.

Semi-sync replication is not slow!

If you read Yoshinori's post about Semi-sync at Facebook, he lists the objective of using semi-sync as an alternative to running full durability on a master. That is to say that once you can guarantee writes have safely been shipped across the network, you may not strictly need to guarantee that they are safe locally.

This is something that I have been wanting to benchmark for a long time, and reading Jay's post about Semi-sync replication performance in MySQL 5.7 DMR4 and a conversation in last week's #dbhangops inspired me to explore this in more detail. For my tests, I will be using Master-Slave replication and three alternative definitions of durability:

  • Local Durability. By ensuring that sync_binlog=1 and innodb_flush_log_at_trx_commit=1 any changes that are committed locally will be crash-safe (provided that underlying hardware does not lie). A slave will be connected, but using asynchronous replication.
  • Network Durability. By enabling semi-sync replication, each change will be guaranteed on more than one machine. However, each machine will take a relaxed view to how safe the changes are locally (sync_binlog=0, innodb_flush_log_at_trx_commit=0). This is the MySQL Cluster definition of durability, and also some NoSQL systems. It requires that each replica has physical isolation (in the case of AWS: an availability zone).
  • Local + Network Durability. This is a strict definition where both semi-sync and network durability is employed. While it might seem excessively strict, what I like about this option, is that it has the potential to reduce administrative costs during failure events.

Testing

Basic setup details:

  • AWS EC2 m3.medium instances
  • Master in us-east-1b, slave in us-east-1d
  • Latest Amazon Linux AMI
  • MySQL 5.7 DMR4
  • Datadir stored on 40GB Provisioned IOPS EBS volume (900 IOPS)
  • sysbench 0.5 update_index.lua test

I measured both the network latency (ping) and disk latency (dd bs=512 count=1000 if=/dev/zero of=testfile oflag=dsync) prior to starting the test. Interestingly, both were the same at about 2ms.

My first test was to compare single threaded performance:

singlethreaded

I would say that the results are somewhat as expected:

  • Performance when not durable is awesome! (but probably not useful)
  • Local durability is quite expensive in single-threaded workloads. There is not much opportunity for group-commit, and the 2ms latency I have for an fsync hurts to be able to safely write the InnoDB redo logs + the binary logs.
  • Network durability actually performs better than local durability.
  • The combination of Local+Network durability is worse than network only or local only.

Now to repeat the same test, but in 32-threads:

32threads

What is interesting to note is that:

  • No durability still performs the best, but by a much smaller margin.
  • The improved group-commit in MySQL 5.6+ is really helping the local durability throughput increase.
  • Network durability is marginally ahead of local durability.
  • Network+Local is about 15% worse than just local durability.

Comparing Local versus Network Durability

So network durability scored better than local in both tests!?

I think at this point it's important to explain the difference between local and network durability in the event of a crash, using our current set of tools that are available. If we are using asynchronous replication with a durable local master, and crash safe slaves, then resuming replication is actually possible in the event of a crash. This is not true in the case of network durability, as writes could be lost on the failing machine. This is an important loss of functionality that is difficult to justify.

I would also point out that I think it is easier to hardware-accelerate local durability than it is network durability. The local fsync speed can be improved with an SSD or a RAID Controller with a battery-backed write cache. Network latency can be lowered too, but it might be harder to do so while keeping physical isolation.

Conclusion

For the reasons I listed, I still think local durability is still more useful when choosing just one option for durability. But at the same time, semi-sync is no slouch and it's great to see new developments in this area in MySQL 5.7.

What I did not test, is configuring MySQL to be mostly-durable locally, and fully network durable. There is a lot of potential to meet somewhere in the middle and very selectively disable features such as the InnoDB doublewrite buffer.

I do not think it is fair to label semi-sync as "slow", and I am hoping to convince more users that they should take a look :)

Faking a slave: Subscribing to mysql row-based-replication changes

In complex systems it's often useful to be able to receive notification when rows have been modified in MySQL so that you can invalidate various external caches or indexes. For example: memcached, Sphinx, Lucene.

In the case of MySQL's default statement-based replication this can be quite tricky to do, as it would likely require an SQL parser to determine what was intended to be modified. However, this is made much simpler with Row-based Replication (something I recommend switching to).

A C++ library exists to be able to listen to MySQL replication, but what I want to demonstrate is that it is also very simple to be able to do this with mysqlbinlog:


shell> mysqlbinlog --read-from-remote-server --stop-never --host localhost.localdomain --port 5616 -u msandbox -pmsandbox --verbose mysql_sandbox5616-bin.000004 | grep '^### '

To explain how this command works:

  • --read-from-remote-server tells mysqlbinlog to act like a slave, and fetch remote files rather than local (new option to 5.6).
  • --stop-never makes mysqlbinlog block waiting for a continual stream of updates, rather than exiting. Perfect!
  • --verbose rewrites row-based replication events to be pseudo SQL statements. The pseudo statements are very easy to parse. For example:
    ### DELETE FROM `test2`.`a`
    ### WHERE
    ###   @1=1
    ### INSERT INTO `test2`.`a`
    ### SET
    ###   @1=1
    ### UPDATE `test2`.`a`
    ### WHERE
    ###   @1=10
    ### SET
    ###   @1=20
    

    Noting that a multi-row statement will appear as individual statements via --verbose. Row-based events also default to sending all columns in the row, not just those that are changed.

  • The grep '^### ' statement is just a lazy way of stripping out everything except pseudo SQL statements.

Advanced Usage

You can fairly easily extend the above to track your progress reading through the master's binary logs. This will make it easier to resume from where you left off if there is a crash. Just remove the grep, and keep track of:

  • Positional markers just before DML events. These are marked in bold here:
    # at 191
    #140526 15:28:27 server id 10  end_log_pos 239 CRC32 0x559a84a8     GTID [commit=yes]
    SET @@SESSION.GTID_NEXT= '9f0ce61c-bb92-11e3-89fd-f056da47d247:17'/*!*/;
    # at 239
    #140526 15:28:27 server id 10  end_log_pos 312 CRC32 0xff074c19     Query   thread_id=1 exec_time=0 error_code=0
    SET TIMESTAMP=1401143307/*!*/;
    BEGIN
    /*!*/;
    # at 312
    #140526 15:28:27 server id 10  end_log_pos 357 CRC32 0x74b1ad7f     Table_map: `test2`.`a` mapped to number 71
    # at 357
    #140526 15:28:27 server id 10  end_log_pos 397 CRC32 0x2c6f8b8d     Write_rows: table id 71 flags: STMT_END_F
    
    BINLOG '
    C8CDUxMKAAAALQAAAGUBAAAAAEcAAAAAAAEABXRlc3QyAAFhAAEDAAF/rbF0
    C8CDUx4KAAAAKAAAAI0BAAAAAEcAAAAAAAEAAgAB//4KAAAAjYtvLA==
    '/*!*/;
    ### INSERT INTO `test2`.`a`
    ### SET
    ###   @1=10
    
  • Events that show that the binary log file itself is being rotated. For example, the result of FLUSH LOGS is:
    # at 4
    #691231 16:00:00 server id 10  end_log_pos 0 CRC32 0x7800af55   Rotate to mysql_sandbox5616-bin.000006  pos: 4
    

Conclusion

This might not be as robust in all cases as using the C++ API, but it sure beats the alternative. Gone are my days of writing complex sets of triggers to write to an "events" table, which I poll continually from an external script.

Proposal to change MySQL replication defaults

In the MySQL team @ Oracle, we've been discussing changing the default values for MySQL replication so that they are safer to use.

Specifically:

Setting Current Value Proposed New Value
sync_binlog 0 1
master-info-repository FILE TABLE
relay-log-info-repository FILE TABLE

I have written about all of these settings before in my post on deciding whether or not to make MySQL durable. The short version is that:

  • By default, if MySQL crashes you could lose data.
  • You don't get to chose what you lose, it could be a $10 order, or a $10M order.
  • Independent of losing orders, DBA time is expensive. Having to perform post-failure investigation to handle discrepancies has a high cost.
  • The cost of durability is much lower with SSDs. In a comment, Daniël van Eeden also correctly said that the cost is much lower in 5.6 with group commit (thanks Daniël!).
  • I recommend most users should be running with MySQL as fully durable.

Now to explain each of these proposed changes:

Sync_binlog

By setting sync_binlog=1 you are ensuring that any changes written to the binary log will not be lost in the event of a crash. This is a problem because with the current default of sync_binlog=0, a replication master crashing may lose events and require all slaves to be re-imaged with a fresh copy of data to be consistent.

It's important to note that changing this setting has a downside: syncing the binary log causes a performance impact. As I mentioned above, this was reduced significantly in MySQL 5.6 with group commit, but it will still be evident in some workloads. The most unfortunate may be in single-threaded performance on systems with hard-drives and no RAID controller with a battery backed write cache.

Master-info-repository/relay-log-info-repository

These two options were first introduced in MySQL 5.6 as part of a feature called transactional replication. When both set to TABLE, they store the internal replication coordinates in an InnoDB table, and COMMIT changes as part of the same transaction as the replication events being applied.

What this means in practical terms, is that slaves are in a consistent state when they crash. Replication can just resume processing events from the point of failure, and data does not need to be re-imaged from the master or another slave.

Conclusion

As with previous proposals, we are seeking feedback on how these changes will impact you.

  • Do you currently use sync_binlog and transactional replication?
  • Have you considered these features, but decided not to switch? (This is a case where we'd really love to hear from you).
  • Do you agree that these changes will make MySQL safer, and easier to administer?

Please leave a comment, or get in touch!

The slow decline of using MySQL read slaves

Once upon a time, it was a very common MySQL choice to have an application split its queries between two connections, sending reads to a slave and writes directly to the master.


I won't say this technique is no longer used (it certainly is) but I have been observing its popularity decline slowly over the years. Today I wanted to try and provide insight into some of the possible reasons:

  1. InnoDB as default. MyISAM table locks are very high impact for high throughput systems - worse still, MyISAM does not have MVCC, an algorithm that allows InnoDB reads to be non blocking. Running with multiple replication slaves was very important to distribute read-locks across an array of servers. With InnoDB, for most people this should now be a non-issue.
  2. Single-threaded replication. When hardware has multiple CPUs and multiple disks in RAID (and many SSDs in singular) it often requires multiple concurrent threads to be able to exploit all of the capacity that it can offer. Master servers can exploit this capacity very well, but they then replicate to slaves through a single thread. This creates a situation where the problem is not read scaling (which splitting reads/writes solves very well) but write scaling. The slaves very quickly become the bottleneck as they can not keep up.
  3. Increased desire for ACID behaviour. MySQL replication slaves are asynchronous[1], which means that if the application does an immediate read-after-write, then there is a good chance stale data might be served from a slave. Actually, it's worse than that: with single-threaded slaves, if there is a 10 second update on the master, there is at least a 10 second delay on all slaves.

    Since the data is in an inconsistent state, mixing reads and writes between masters and slaves like this is not ACID compliant and introduces what I would call "asynchronous failures". This leads to more time spent debugging errors.

    MySQL has also been reducing the performance hit to run ACID (including durability) in recent versions. I recommend it for most people.

    [1] Semi-sync introduced in MySQL 5.5 only ensures one slave is up to date.

  4. Increased performance-per-server. From MySQL 5.5 onwards there has been a huge focus on performance. Add to that increased memory in new servers, fast SSDs, it is possible that for some users their lifetime performance needs can be answered by a single server without the need for architectural changes.

Of these points, I suspect 1, 2, & 3 are the most likely, and to some degree 2 (single-threaded replication) makes 3 (consistency) worse because of the single threaded blips. There has been a lot of progress made to improve this in new versions:

  • MySQL 5.6 introduced parallel replication slaves across schemas. This increases the write throughput of slaves significantly, you just have to split tables across multiple databases then set --slave-parallel-workers to take advantage.
  • MySQL 5.7 DMR2 takes this one step further and offers intra-schema parallel replication. This offers parallel slaves and can keep all tables in the same schema.

So it is possible that we may actually see a return of read/write split in coming years. But it is also possible that as MySQL Fabric matures we may see users just rely on sharding instead, even though these two solutions should be orthogonal. It is so hard to predict the future.

I should also point out: I still see slaves immensely valuable for HA/DR as well as a place to send very heavy reporting queries to. This hasn't changed.

What do you think about read/write split?

Row-based Replication

Row-based Replication was introduced in MySQL 5.1. It's not the default (yet), but I really like it. I wanted to tell you what limitations you may face with statement-based, and why you should join me in making the switch.

Write scalability of certain statements

Statement based binary logging requires additional locking to ensure slave-consistency. To illustrate, here is a naive example:

CREATE TABLE users (
 user_id INT NOT NULL primary key,
 user_name varchar(60) NOT NULL,
 is_deleted TINYINT DEFAULT 0,
 UNIQUE KEY(user_name)
);

INSERT INTO users (user_id, user_name, is_deleted) VALUES
 (1, 'ted', 0),
 (2, 'bill', 0),
 (3, 'fred', 0),
 (4, 'mike', 1),
 (5, 'barry', 0);

session1> START TRANSACTION;
session1> DELETE FROM users WHERE is_deleted = 1; # 1 row matches (user_id 4), deleted.
session2> START TRANSACTION;
session2> UPDATE users SET is_deleted = 1 WHERE user_id = 5; # 1 row matches.
session2> COMMIT;
session1> COMMIT;

Because statements are only written to the binary log once committed, on the slave session #2 would apply first, and result in two rows being updated by session #1, leading to incorrect results!

InnoDB prevents against this race condition by setting additional locking. After session #1 is modified no other sessions will be updated/insert into the range of is_deleted=1 until session #1 commits.

Actually.. it gets even worse than that, because it will lock the rows based on whatever index can be used. In my case, there's no index so I will lock every rows in the table (eek).

Even with good indexing, statements that have the pattern WHERE col BEWEEN x AND y or WHERE id > N will require gap locking to prevent data being inserted into the range, again to prevent the same inconsistency on the slave.

InnoDB has always had a setting to disable this locking, called innodb_locks_unsafe_for_binlog. As the name suggests, this is totally unsupported and you will likely get replication drift.

However, astute readers will notice that since row-based replication sends the actual row images of changes to the slaves, this race condition problem no longer exists. And indeed, once you switch to row-based replication binlog-format=ROW, and change the isolation level to transaction-isolation=READ-COMMITTED, these additional locks are no longer set!

Batched insert scalability

When you enable row-based replication, it is safe to set innodb_autoinc_lock_mode = 2. This improves auto-increment scalability with batch INSERT statements.

With the default of innodb_autoinc_lock_mode = 1, the per-table auto_increment generator needs to be locked for the duration of batch insert statements. This is because as rows are written to the binary log, only the first INSERT_ID is seeded to the slave, and locking gurantees deterministic slave replay. i.e.

# Binary logged multi-insert
SET INSERT_ID=1;
INSERT into my_table (id, value) VALUES
(NULL, 'Row 1'),
(NULL, 'Row 2'),
(NULL, 'Row 3');

Can you see the race-condition? In between each row we need to protect the auto-increment value so nobody could grab one of the IDs, leading to a conflict on the slave as it tries to just use the next number.

This locking is batched-insert only. It previously also applied to single-row inserts, but that was fixed in MySQL 5.1. You can enable the 5.0 and below behaviour by setting innodb_autoinc_lock_mode = 0, but I can't think of a good reason to do this.

Reasons to not use Row-based replication

I think the two biggest complaints have always been:

  1. Row-based binary logging produces much larger binary log files. This is largely addressed in 5.6 by setting binlog_row_image=minimal.

  2. Row-based events are harder to debug, since you could not previously see the actual statement that generated the row events. This is addressed in MySQL 5.6 with
    binlog-rows-query-log-events. MySQL 5.7 will also make this easier because the need to hand-audit events is reduced with idempotent binary log replay.

Summary

Statement-based replication was an okay choice in the past where we had single-core machines, but today scaling requires us to lock less, and perform more work in parallel. It's all about concurrency, concurrency, concurrency.

Have you made the switch? Something else holding you back? Let me know your thoughts in the comments.