MySQL Certification Study – DBA Topics

For the MySQL DBA Exam, I wrote that I needed to brush up on Security and High Availability techniques. So let’s take a look at these two in more detail.

Security

There were quite a lot of enhancements to security introduced in MySQL 5.6, and I want to make sure I’m not caught off guard by new questions. The major changes (with links to manual pages) are:

  • Password expiration support
  • Password policy enforcement, such as minimum-length or special-character requirements (via the validate_password plugin).
  • sha256_password for stronger password hashing.
  • Secure-auth enabled by default. This means clients using passwords stored in pre-MySQL 4.1 format won’t be able to connect.
  • mysql_config_editor, a new utility intended to stop users from entering passwords as command-line arguments.
  • Passwords are no longer written to log files.
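
As a quick example, the new expiration support is exposed via ALTER USER (the account name here is hypothetical):

ALTER USER 'app_user'@'localhost' PASSWORD EXPIRE;
-- The next login succeeds, but the user must set a new password
-- (SET PASSWORD) before running other statements.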

Enterprise Edition:
  • PAM module for providing external authentication such as Unix passwords or an LDAP directory (actually this change was in MySQL 5.5).
  • Native Windows authentication is also supported.
  • Audit log plugin for policy-based monitoring and logging (technically also introduced in MySQL 5.5).

Additionally, it’s worthwhile looking at these manual pages:
* http://dev.mysql.com/doc/refman/5.6/en/security.html
* http://dev.mysql.com/doc/refman/5.6/en/general-security-issues.html

High Availability Techniques

I wrote about availability a couple of days ago, but this is only part of the problem. The certification wants you to know about the technologies that have been certified as officially supported:

  • DRBD – Most people describe DRBD as “Network level RAID1 for Linux”. It replicates disk-block changes from one server to another, and is typically configured as two servers: one active, one passive, with a virtual IP address pointing to the current ‘active server’.

    Actually, that’s a bit of a simplification: DRBD only provides the disk-block replication. It needs a cluster manager such as Pacemaker (typically with Corosync as the messaging layer) to handle the virtual IP address and failover.

  • Windows Failover Clustering – This solution is more analogous to Pacemaker than it is to DRBD. WFC allows you to share a virtual IP address between at least two servers that access MySQL data files via shared storage, and it ensures that only one server has the MySQL data files open at a time.

  • Solaris Cluster – MySQL provides an agent to work with the Solaris cluster framework.

  • OVM Templates for MySQL Enterprise – This solution runs MySQL as a virtual machine. If the underlying physical host fails, the MySQL VM can be migrated to another physical host. Live migrations are also supported if the host hardware requires maintenance. This kind of setup relies on shared storage.

Ten ways to improve the performance of large tables in MySQL

Today I wanted to take a look at improving the performance of tables that cause problems based largely on their size. Some of this advice also applies to databases that are large in aggregate across many tables, but I always find the individually large table to be a problematic special case.

What you will normally find is that the speed at which the table can be modified trends down as its size increases. Here is what I am going to call the typical B+Tree index performance over time:

This graph is taken from a post by MySQL@Facebook. It shows the performance degradation of inserting one billion rows into a table with the insert buffer disabled (not recommended; used for demonstration purposes only). Note that this is in log scale!

The benchmark is called iibench, and was designed by TokuTek.

So we should expect degradation of performance due to the structure of the index, but there are ways we can try to stretch out the curve so it does not degrade as quickly.

Ten potential ways to reduce large table impact:

  1. Make sure to use InnoDB instead of MyISAM. MyISAM can be faster at inserts to the end of a table, but it has table-level locking (limiting concurrent updates and deletes) and uses a single lock to protect the key buffer when loading data to/from disk, resulting in contention. It also does not have the change buffering feature described below.

  2. InnoDB has change buffering (previously called the insert buffer), which is a feature to delay building secondary indexes that are not unique, and merge writes. It’s further described by Facebook here. It’s not shown in the graph above, but it can boost insert performance by quite a lot, and it’s enabled by default. It was greatly improved in MySQL 5.5, so it is time to upgrade if you haven’t.

  3. Partitioning may reduce the size of indexes, effectively splitting the table into many smaller tables. It also reduces internal index->lock contention, something that has been greatly improved in the MySQL 5.7.2 DMR. (A sketch follows this list.)

  4. Use InnoDB page compression. For some workloads (particularly those with lots of CHAR/VARCHAR/TEXT data types) compression will allow the data to be more compact, stretching out that performance curve for longer. It may also allow you to more easily justify SSDs, which are typically smaller in capacity. InnoDB page compression was improved a lot in MySQL 5.6, courtesy of Facebook providing a series of patches. (A sketch also follows this list.)

  5. Sort and bulk load data into tables. Inserting in order will result in fewer page splits (which are more expensive on tables that do not fit in memory), and bulk loading, while not specifically related to table size, will help reduce redo log pressure.

  6. Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes as these disable change buffering. Don’t use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.

  7. Related to points 5 and 6, the type of primary key also matters. It is much better to use an INT or BIGINT datatype than, say, a GUID, which will have a curve that degrades much faster. Having no PRIMARY KEY will also affect performance negatively.

  8. If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them once all data is loaded, InnoDB is able to apply a pre-sort and bulk-load process, which is both faster and typically results in more compact indexes. This optimization was introduced in MySQL 5.5.

  9. More memory can actually help here too. I frequently see people under-spec memory on new database servers compared to what it actually costs these days. Simple advice: if SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY, and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly on your server; see the illustrative output after this list).

  10. As well as memory, SSDs can help too. Much of the performance drop shown on the curve can be attributed to the additional IO that is created as the table gets bigger. While a hard drive can do about 200 operations per second (IOPS), a typical SSD will do 20,000 or more.
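
For point 3, a hedged sketch of range partitioning (the table and column names are hypothetical). Note that the partitioning column must be part of every unique key, which is why it is added to the PRIMARY KEY:

CREATE TABLE metrics (
  id BIGINT NOT NULL AUTO_INCREMENT,
  created DATE NOT NULL,
  value INT NOT NULL,
  PRIMARY KEY (id, created)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(created)) (
  PARTITION p2012 VALUES LESS THAN (2013),
  PARTITION p2013 VALUES LESS THAN (2014),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);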
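
For point 4, page compression is enabled per table; a minimal sketch (in MySQL 5.5/5.6 it also requires innodb_file_per_table=ON and innodb_file_format=Barracuda):

CREATE TABLE documents (
  id INT NOT NULL PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
KEY_BLOCK_SIZE=8;  -- compress the default 16K pages down to 8K on disk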
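
For point 9, the relevant part of SHOW ENGINE INNODB STATUS looks roughly like this (abridged; the values shown are illustrative):

SHOW ENGINE INNODB STATUS\G
...
----------------------
BUFFER POOL AND MEMORY
----------------------
...
Free buffers       0
...
152.30 reads/s, 3.20 creates/s, 80.10 writes/s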

EXPLAIN PARTITIONS and EXPLAIN EXTENDED deprecation

In MySQL 5.7 we are planning to deprecate the syntax:

EXPLAIN PARTITIONS <insert query here>
EXPLAIN EXTENDED <insert query here>

.. and enable these two options by default.

The rationale is that:

  • Simple and consistent is always better. EXPLAIN FORMAT=JSON already behaves like these two flags are enabled, and if you have a partitioned table for example, it is unlikely that you would not want the PARTITIONS option. Having to remember two more flags makes the product harder to use.

  • The optimizer team has been busy refactoring and improving code quality. These two flags are supported by many if-statements, increasing complexity by more than we would like.

The intended deprecation plan is to automatically turn both flags on in MySQL 5.7 and issue deprecation warnings when using the older syntax. In MySQL 5.8, the syntax will be removed.

What do I expect will break?

That’s a good question, since EXPLAIN is really a DBA tool, and is unlikely to affect running applications. I expect it to be:

  • Automated tools that depend on column order of EXPLAIN, or EXPLAIN not producing a warning (as of MySQL 5.7).

  • Automated tools that explicitly use EXPLAIN PARTITIONS or EXPLAIN EXTENDED (as of MySQL 5.8).

Both, I hope, should be fairly simple fixes.

How can I make my tools more resilient?

While we’re on the subject of EXPLAIN and automated tools, it is a good segue into EXPLAIN FORMAT=JSON, which is very well suited here. I would expect that with JSON output, applications are more likely to tolerate changes to the information returned, and in MySQL 5.7 the JSON is already much more detailed.
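
For example, the JSON format already includes the partition and extended information without any additional flags (the table here is from the world sample database):

EXPLAIN FORMAT=JSON SELECT * FROM City WHERE CountryCode = 'AUS'\G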

Top 10 advances to availability since MySQL 5.5

Recently I found myself talking to a company that was a big user of MySQL, but just hadn’t followed all of the latest developments so closely. When the conversation went to availability I was actually not as prepared as I would have liked, and forgot about some of the (quite large) improvements that have been made to MySQL recently.

So here is me trying to write it all down for the next time I’m in the same situation. Note that I am using the definition of continuous availability, which covers any service disruption that impacts the business, not just reduced redundancy.

#1 – InnoDB as default

When: MySQL 5.5
Why:
* Row-level locks instead of table-level locks mean there is less contention and fewer sudden stalls as applications become more loaded (i.e. performance degrades much more gracefully).
* InnoDB also features multi-version concurrency control, which means that queries that read data do not have to set locks – further increasing concurrency and reducing potential stalls.
* If MySQL crashes, InnoDB is able to perform crash recovery very quickly and come back online (it just replays through its log file). For MyISAM, crash recovery takes much longer, as the whole table must be examined.

#2 – Crash safe Replication

When: MySQL 5.6
Why:
* Replication slaves are crash safe. This means when a slave crashes you no longer need to re-image its data from the master or another slave. The slave can resume from where it was.
* This results in less time without HA coverage, and reduced operations-team overhead.
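
A hedged sketch of the MySQL 5.6 settings involved (relay_log_recovery can only be set at server startup):

-- Store replication positions in InnoDB tables rather than files,
-- so they are updated transactionally with the applied events.
SET GLOBAL master_info_repository = 'TABLE';
SET GLOBAL relay_log_info_repository = 'TABLE';
-- relay_log_recovery = ON (a startup option) completes the crash-safe setup.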

#3 – Semi-sync replication

When: MySQL 5.5 (will get even better performance in MySQL 5.7).
Why:
* Semi-sync is an option that ensures at least one slave has received a copy of each transaction before the commit returns to the client.
* This means less chance of lost transactions if a master fails.
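
A hedged sketch of enabling it (the plugin library names are as shipped on Linux):

-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- On each slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;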

#4 – Improved Group-commit

When: MySQL 5.6
Why:
* This means that transactions being committed near the same time in InnoDB will merge together and write to transaction logs as one operation.
* This is important because it means running MySQL in fully durable mode no longer carries as big a performance penalty.
* With the addition of changes like this (and fast SSDs) I recommend durability to a lot of users.
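
A hedged sketch of the fully durable settings that group commit makes more affordable:

SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- flush the redo log on every commit
SET GLOBAL sync_binlog = 1;                     -- sync the binary log on every commit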

#5 – Replication with GTIDs

When: MySQL 5.6
Why:
* Previously it was very difficult when you had a master fail with many slaves. You could not simply pick a new master from one of the slaves and reposition the other slaves under it.
* With GTIDs, failover and maintenance operations that change the topology are much easier.
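
For example, with GTIDs enabled, repointing a slave no longer requires finding binary log file names and positions (the host name here is hypothetical):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-master.example.com',
  MASTER_AUTO_POSITION = 1;
START SLAVE;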

#6 – Online DDL

When: MySQL 5.6
Why:
* A large number of DDL operations (such as adding indexes) no longer block other queries reading or writing to the table. The manual documents well which operations allow concurrent DML (i.e. do not block writes).
* Syntax is also provided so that if the operation cannot run without locking, it will refuse to execute. For example: ALTER TABLE my_table ADD INDEX a (a), LOCK=NONE;.

#7 – Improved InnoDB crash recovery

When: MySQL 5.5
Why:
* It was discovered that InnoDB crash recovery was taking a long time on new servers with large amounts of RAM.
* The performance was greatly improved with a new crash recovery algorithm in MySQL 5.1 (InnoDB plugin edition only) and MySQL 5.5 (by default).
* This is now a very serious issue if you are running an older version of MySQL on some of the hardware available today.

#8 – Improved adaptive flushing

When: MySQL 5.5 (improved in MySQL 5.6)
Why:
* For performance InnoDB delays writes by first writing to a transaction log in the foreground, and then writing to destined page locations in the background.
* In earlier versions of MySQL it is possible that the log files can ‘fill up’ and have no free space available for new modifications. This can result in sudden stalls as the server quickly frees up and makes log space available.
* In MySQL 5.5 an adaptive flushing algorithm was introduced (default: on) to flush pages more aggressively as free log space runs low. This feature results in more consistent and stable response times for all queries.

#9 – LRU not victim to side workloads

When: Introduced MySQL 5.5, MySQL 5.6 (by default)
Why:
* When MySQL needs to perform sudden IO (such as in the case of a mysqldump) it is possible that the ‘good’ contents of the cache could be unloaded from memory.
* In MySQL 5.5, the cache algorithm was changed from a classic LRU to a young sublist (default: 63%) and old sublist (default: 37%) so that these expensive side-load queries could have a fixed resource limit applied to them.
* The configuration variable innodb_old_blocks_time was introduced to specify the minimum time in milliseconds that a page must be in the old sublist before it can be promoted to the young sublist. In 5.6 it defaults to 1000.
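
A hedged example of adjusting this behaviour (the values shown are the MySQL 5.6 defaults):

SET GLOBAL innodb_old_blocks_time = 1000;  -- ms a page must wait before promotion
SET GLOBAL innodb_old_blocks_pct = 37;     -- percentage of the pool used for the old sublist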

#10 – MySQL Utilities

When: Independent Release
Why:
* MySQL now releases an official set of MySQL Utilities – command-line tools for maintaining and administering MySQL servers.
* Notably mysqlfailover provides a scripted way to failover when using Replication with GTIDs. MySQL Fabric (Not yet GA; Labs Release Only) provides sharding with High-Availability groups.

Heads up – Implicit sorting by GROUP BY is deprecated in MySQL 5.6

For those who were unaware, in MySQL the following statements are currently identical:

SELECT MAX(Population), CountryName FROM City GROUP BY CountryName;
SELECT MAX(Population), CountryName FROM City GROUP BY CountryName ORDER BY CountryName;

That is to say that regardless of whether or not you asked for it, whenever you use GROUP BY, the data is also sorted and returned in that order.

The problem with this is that it can result in worse-performing queries. Sorting either reduces the number of execution plans possible or requires an additional step to sort the data, which is why many DBAs advocate writing GROUP BY queries with ORDER BY NULL, i.e.

SELECT MAX(Population), CountryName FROM City GROUP BY CountryName;

Should be written as:

SELECT MAX(Population), CountryName FROM City GROUP BY CountryName ORDER BY NULL;

But, as Roland Bouman notes in BUG #30477, there is no standard that requires MySQL to order data in this way, and this behaviour is not present in other databases. So in MySQL 5.6, it was decided that relying on this implicit sorting should be deprecated. From the manual:

“Relying on implicit GROUP BY sorting in MySQL 5.6 is deprecated. To achieve a specific sort order of grouped results, it is preferable to use an explicit ORDER BY clause. GROUP BY sorting is a MySQL extension that may change in a future release; for example, to make it possible for the optimizer to order groupings in whatever manner it deems most efficient and to avoid the sorting overhead.”

So nothing has changed yet, but please make sure you are using GROUP BY CountryName ORDER BY CountryName if your application requires it.

MySQL Certification Study – Stored Programs

I previously wrote that I am studying for the MySQL 5.6 exam, and that I’m less
confident in my skills as a database developer. When I went through the list of topics in the exam, one thing I knew I would have to study was stored programs.

So first, some definitions. From the manual:

Stored programs include these objects:

  • Stored routines, that is, stored procedures and functions. A stored procedure is invoked using the CALL statement. A procedure does not have a return value, but can modify its parameters for later inspection by the caller. It can also generate result sets to be returned to the client program. A stored function is used much like a built-in function; you invoke it in an expression and it returns a value during expression evaluation.
  • Triggers. A trigger is a named database object that is associated with a table and that is activated when a particular event occurs for the table, such as an insert or update.
  • Events. An event is a task that the server runs according to schedule.

So a stored program is the broader name applied to code that runs in the database. A procedure is different from a function because it is executed via a CALL statement and can modify input parameters rather than just returning a value. Check.

The next point I have to look at is creating and executing stored procedures and functions. That’s covered on this manual page:

mysql> delimiter //

mysql> CREATE PROCEDURE dorepeat(p1 INT)
    -> BEGIN
    ->   SET @x = 0;
    ->   REPEAT SET @x = @x + 1; UNTIL @x > p1 END REPEAT;
    -> END
    -> //
Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;

mysql> CALL dorepeat(1000);
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @x;
+------+
| @x   |
+------+
| 1001 |
+------+
1 row in set (0.00 sec)

mysql> CREATE FUNCTION hello (s CHAR(20))
    -> RETURNS CHAR(50) DETERMINISTIC
    -> RETURN CONCAT('Hello, ',s,'!');
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT hello('world');
+----------------+
| hello('world') |
+----------------+
| Hello, world!  |
+----------------+
1 row in set (0.00 sec)

I also need to know how to implement error handling within a stored procedure. This is actually much improved in 5.6 with the addition of GET DIAGNOSTICS. Using the example from the manual page:

CREATE PROCEDURE do_insert(value INT)
BEGIN
  -- Declare variables to hold diagnostics area information
  DECLARE code CHAR(5) DEFAULT '00000';
  DECLARE msg TEXT;
  DECLARE rows INT;
  DECLARE result TEXT;
  -- Declare exception handler for failed insert
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    BEGIN
      GET DIAGNOSTICS CONDITION 1
        code = RETURNED_SQLSTATE, msg = MESSAGE_TEXT;
    END;

  -- Perform the insert
  INSERT INTO t1 (int_col) VALUES(value);
  -- Check whether the insert was successful
  IF code = '00000' THEN
    GET DIAGNOSTICS rows = ROW_COUNT;
    SET result = CONCAT('insert succeeded, row count = ',rows);
  ELSE
    SET result = CONCAT('insert failed, error = ',code,', message = ',msg);
  END IF;
  -- Say what happened
  SELECT result;
END;

So we declare a CONTINUE HANDLER (manual pages here and here) and then optionally use GET DIAGNOSTICS for more fine-grained error information. We can also return an error using the SIGNAL functionality (manual page here).
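
A minimal sketch of SIGNAL (the procedure and message text are hypothetical):

CREATE PROCEDURE check_positive(val INT)
BEGIN
  IF val <= 0 THEN
    -- SQLSTATE '45000' is the generic user-defined error class
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'value must be positive';
  END IF;
END;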

Optimizer Enhancements in MySQL 5.7

The MySQL optimizer is getting better. MySQL 5.6 introduced:

  • File sort optimizations with small limit
  • Index Condition Pushdown
  • Batched Key Access and Multi Range Read
  • Postponed Materialization
  • Improved Subquery execution
  • EXPLAIN for Insert, Update, and Delete
  • Optimizer Traces
  • Structured EXPLAIN in JSON format

This was in addition to the InnoDB storage engine now offering improved statistics collection, leading to more stable query plans.

In Evgeny Potemkin’s session at MySQL Connect titled “MySQL’s EXPLAIN Command New Features”, two new features for 5.7 were announced. They are both incredibly useful, so I wanted to write a little about them.

EXPLAIN FOR CONNECTION

Normally with EXPLAIN, you find the execution plan of a query you intend to run, and then interpret the output as you see fit.

What MySQL 5.7 will do is give you the ability to see the execution plan of a query running in another connection, i.e.:

EXPLAIN FORMAT=JSON FOR CONNECTION 2;

Why it’s useful:
* Plans can change depending on input parameters. i.e. WHERE mydate BETWEEN '2013-01-01' and '2013-01-02' may use an index, but WHERE mydate BETWEEN '2001-01-01' and '2013-10-17' may not.
* Plans can change as data changes.
* Plans can also change depending on the context of a transaction, with InnoDB offering multi-version concurrency control.
* Optimizer statistics can change, and it’s not impossible that the reason for the executing query being slow has something to do with it. It’s great to have conclusive proof and be able to rule this out.
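
To find the connection id to pass in, SHOW PROCESSLIST is one option (the id below is illustrative):

SHOW PROCESSLIST;  -- the Id column is the connection id
EXPLAIN FORMAT=JSON FOR CONNECTION 42;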

Execution cost in EXPLAIN

MySQL uses cost based optimization to pick the best query execution plan when there are multiple choices available. It is very similar to how a GPS navigator adds up estimated time and picks the best route to a destination.

What this feature does is expose the cost as a numeric value when running EXPLAIN FORMAT=JSON. To take an example using the world sample database:

mysql [localhost] {msandbox} (world) > EXPLAIN FORMAT=JSON SELECT City.* 
FROM City INNER JOIN Country ON City.countrycode=Country.code 
ORDER BY City.NAME ASC LIMIT 100\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "4786.00"
    },
    "ordering_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "2151.00"
      },
      "nested_loop": [
        {
          "table": {
            "table_name": "country",
            "access_type": "index",
            "possible_keys": [
              "PRIMARY"
            ],
            "key": "PRIMARY",
            "used_key_parts": [
              "Code"
            ],
            "key_length": "3",
            "rows_examined_per_scan": 239,
            "rows_produced_per_join": 239,
            "filtered": 100,
            "using_index": true,
            "cost_info": {
              "read_cost": "6.00",
              "eval_cost": "47.80",
              "prefix_cost": "53.80",
              "data_read_per_join": "61K"
            },
            "used_columns": [
              "Code"
            ]
          }
        },
        {
          "table": {
            "table_name": "City",
            "access_type": "ref",
            "possible_keys": [
              "CountryCode"
            ],
            "key": "CountryCode",
            "used_key_parts": [
              "CountryCode"
            ],
            "key_length": "3",
            "ref": [
              "world.country.Code"
            ],
            "rows_examined_per_scan": 9,
            "rows_produced_per_join": 2151,
            "filtered": 100,
            "cost_info": {
              "read_cost": "2151.00",
              "eval_cost": "430.20",
              "prefix_cost": "2635.00",
              "data_read_per_join": "151K"
            },
            "used_columns": [
              "ID",
              "Name",
              "CountryCode",
              "District",
              "Population"
            ]
          }
        }
      ]
    }
  }
}

Why it’s useful:

  • This exposes more transparency into optimizer decisions. DBAs can better understand what part of a query is considered expensive, and try to optimize. I think this is important, because I have heard a lot of DBAs make blanket recommendations like “joins are bad” or “sorting is bad”, but there needs to be context on how much data needs to be sorted. It makes us all speak the same language: estimated cost.
  • Cost refinement is an ongoing effort. As well as the introduction of fast SSD storage, MySQL is introducing new optimizations (such as index condition pushdown). Not all of these optimizations will be the best choice every time, and MySQL should ideally be able to make the right choice in all situations.

How do you use the Federated Storage Engine?

We are looking for community feedback on the use-cases for the Federated Storage Engine in MySQL.

Specifically, I would be interested to hear which users find that the introduction of multi-source replication meets (or does not meet) their requirements.

For a bit of background:

The Federated Storage Engine was introduced in 5.0, but has long been disabled by default. It is useful for ad-hoc queries across MySQL servers, but it misses some of MySQL’s newer optimizations and does not perform as well as we would like. Among its limitations, it also does not support transactions.
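
For reference, a hedged sketch of how a Federated table is defined (the connection details are hypothetical):

CREATE TABLE remote_orders (
  id INT NOT NULL,
  total DECIMAL(10,2),
  PRIMARY KEY (id)
) ENGINE=FEDERATED
CONNECTION='mysql://app_user:app_pass@remote-host:3306/shop/orders';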

By using multi-source replication, many of the downsides mentioned above are negated, since the tables can be made available to query locally. The disadvantage, of course, is the need to provision more storage (however, given the performance limitations of Federated, it is not certain it could fill this use-case either).

So please leave a comment, or get in touch. I would love to hear how you are using the Federated Engine, and what you plan to use multi-source replication for.

MySQL Certification – Filling gaps in my knowledge

Like many others, I’ll be taking the MySQL 5.6 certification exam at the end of this month. As Todd blogged, the quick reference pages are available online for the Developer Exam and the DBA Exam. I’ve long considered myself more of a DBA, so I know that for the developer exam I need to study up on these topics:

But there are also a few DBA topics I could benefit from a refresh on:

I’ll be writing more on my experiences studying these topics (and which manual pages helped me out) in the coming days.

MySQL 5.6 – Now with fewer gripes

I often find myself in conversations with long-time MySQL users who tell me about little annoyances in MySQL that make the product harder to use. Sometimes these annoyances can stay in a product seemingly forever, because they require a change in behaviour (breaking backwards compatibility) or are the result of architectural limitations that are not easily fixed.

In either case, I think it’s important to commend the MySQL engineering team at Oracle because, as well as introducing a number of big-ticket features for MySQL 5.6, they also invested a lot of time in closing out old bugs.

Here is my list of historical gripes, now resolved:

  1. innodb-file-per-table is ON by default, which makes it easier to reclaim space from run-away tables.
  2. When you change innodb_log_file_size in configuration, InnoDB will now resize logs to match automatically.
  3. Temporal data types can now support microseconds.
  4. Silent storage engine substitution is disabled by default (sql_mode=NO_ENGINE_SUBSTITUTION).
  5. Subqueries are now optimized.
  6. Stored procedures offer a way to get more information on errors.
  7. Timestamps can be maintained on more than one column.
  8. Replication slaves are crash safe. This means when a slave crashes you no longer need to re-image its data from the master or another slave. It can resume from where it was.
  9. Replication failover is much easier with GTIDs. Previously it was very difficult when you had a master fail with many slaves. You could not simply pick a new master from one of the slaves and reposition the other slaves under it.
  10. Indexes can now be added online, which means less downtime for a lot of people.
  11. There’s a standard way to find unused indexes (performance_schema; see the sketch below).
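
A hedged sketch of that query against performance_schema (the schema filter is illustrative):

SELECT object_schema, object_name, index_name
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
  AND index_name != 'PRIMARY'
  AND count_star = 0
  AND object_schema NOT IN ('mysql', 'performance_schema')
ORDER BY object_schema, object_name;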

I should point out that I don’t think MySQL is alone in having these nuances – I think the politically correct name is a “legacy feature” 🙂

In MySQL 5.7, my gripe about Control-C not working in the client has already been fixed. The Bug System also has the ability to vote for a bug as impacting you. So if you see something annoying, please click ‘Affects me’ to leave feedback.

Did I miss something? Please leave a comment.