The MySQL 5.7 Optimizer Challenge

In the MySQL team, we have been working really hard on refactoring the optimizer and improving the cost model. The old hacks where storage engines lie to the optimizer are being rolled back, and your chances of getting an optimal query plan should now be much higher than in prior releases of MySQL.

The optimizer team has also made cost constants configurable on both a server-wide and a per-storage-engine basis, and we are confident that the default InnoDB engine will now work at least as well as MyISAM (which has a natural advantage, in that the optimizer was originally largely built around it).
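
For reference, in 5.7 these cost constants live in the mysql.server_cost and mysql.engine_cost tables and can be inspected (and tuned) with plain SQL. A minimal sketch of what that looks like:

SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT engine_name, cost_name, cost_value FROM mysql.engine_cost;
-- after editing a cost_value, reload the constants with:
-- FLUSH OPTIMIZER_COSTS;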

Today, I want to issue a challenge:

Find an example where the optimizer picks the wrong execution plan for InnoDB tables but the correct one for MyISAM. If you can demonstrate a reproducible test case, I have a polo shirt with "MySQL 5.7 Community Contributor" on it waiting for you.

Supplies of this special edition polo shirt are limited, but I will ship it to you anywhere in the world 🙂

The MySQL 5.7 Community Contributor Polo, as modeled by Daniël van Eeden. I’m the guy on the left.

Optimizer Trace and EXPLAIN FORMAT=JSON in 5.7

I stumbled upon this Stack Overflow question this morning:

I am wondering if there is any difference in regards to performance between the following:

SELECT ... FROM ... WHERE someFIELD IN(1,2,3,4);
SELECT ... FROM ... WHERE someFIELD between  0 AND 5;
SELECT ... FROM ... WHERE someFIELD = 1 OR someFIELD = 2 OR someFIELD = 3 ...;

It is an interesting question because there was no good way to answer it when it was asked in 2009. All of the queries resolve to the same output in EXPLAIN. Here is an example using the sakila schema:

mysql> EXPLAIN SELECT * FROM film WHERE film_id BETWEEN 1 AND 5\G
mysql> EXPLAIN SELECT * FROM film WHERE film_id IN (1,2,3,4,5)\G
mysql> EXPLAIN SELECT * FROM film WHERE film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5\G
********* 1. row *********
           id: 1
  select_type: SIMPLE
        table: film
   partitions: NULL
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using where

Times have changed though. There are now a couple of useful features to show the difference 🙂

Optimizer Trace

Optimizer trace is a new diagnostic tool introduced in MySQL 5.6 to show how the optimizer is working internally. It is similar to EXPLAIN, with a few notable differences:

  • It doesn’t just show the intended execution plan, it shows the alternative choices.
  • You enable the optimizer trace, then you run the actual query (a short workflow sketch follows this list).
  • It is far more verbose in its output.
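
As a quick sketch of that workflow (using the sakila film table from above), capturing a trace looks roughly like this:

SET optimizer_trace="enabled=on";
SELECT * FROM film WHERE film_id BETWEEN 1 AND 5;
-- the trace of the most recent statement is available here:
SELECT trace FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
SET optimizer_trace="enabled=off";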

Here are the outputs for the three versions of the query:

  1. SELECT * FROM film WHERE film_id BETWEEN 1 AND 5
  2. SELECT * FROM film WHERE film_id IN (1,2,3,4,5)
  3. SELECT * FROM film WHERE film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5

What is the difference?

The optimizer trace output shows that the first query executes as a single range, while the second and third execute as five separate single-value ranges:

                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "PRIMARY",
                      "rows": 5,
                      "ranges": [
                        "1 <= film_id <= 1",
                        "2 <= film_id <= 2",
                        "3 <= film_id <= 3",
                        "4 <= film_id <= 4",
                        "5 <= film_id <= 5"
                      ]
                    },
                    "rows_for_plan": 5,
                    "cost_for_plan": 6.0168,
                    "chosen": true
                  }

This can also be confirmed with the handler counts from SHOW STATUS:

BETWEEN 1 AND 5:
 Handler_read_key: 1
 Handler_read_next: 5
IN (1,2,3,4,5):
 Handler_read_key: 5
film_id =1 or film_id=2 or film_id=3 or film_id=4 or film_id=5:
 Handler_read_key: 5
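
If you want to reproduce these counters yourself, the pattern is to reset the session status counters, run the query, then inspect the handler counters. A minimal sketch:

FLUSH STATUS;   -- reset the session status counters
SELECT * FROM film WHERE film_id BETWEEN 1 AND 5;
SHOW SESSION STATUS LIKE 'Handler_read%';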

So I would say that BETWEEN 1 AND 5 is the cheapest query, because it finds one key and then says next, next, next until finished. The optimizer seems to agree with me. A single range access plus next five times costs 2.0168 instead of 6.0168:

                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "PRIMARY",
                      "rows": 5,
                      "ranges": [
                        "1 <= film_id <= 5"
                      ]
                    },
                    "rows_for_plan": 5,
                    "cost_for_plan": 2.0168,
                    "chosen": true
                  }

For context, one cost unit is a logical representation of approximately one random IO, and the scale is stable, so the costs of different execution plans can be compared directly.

Ranges are not all equal

Perhaps a better example to demonstrate this is the difference between these two range queries:

  • SELECT * FROM film WHERE film_id BETWEEN 1 and 20
  • SELECT * FROM film WHERE (film_id BETWEEN 1 and 10) or (film_id BETWEEN 911 and 920)

It's pretty obvious that the second one needs to execute in two separate ranges. EXPLAIN will not show this difference, and both queries appear the same:

********* 1. row *********
           id: 1
  select_type: SIMPLE
        table: film
   partitions: NULL
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: NULL
         rows: 20
     filtered: 100.00
        Extra: Using where

Two separate ranges may touch two separate pages, and thus have different cache efficiency in the buffer pool, so it should be possible to distinguish between the two plans.

EXPLAIN FORMAT=JSON

EXPLAIN FORMAT=JSON was introduced in MySQL 5.6 along with OPTIMIZER TRACE, but where it really becomes useful is MySQL 5.7. The JSON output will now include cost information (as well as showing separate ranges as attached_condition):
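
Here is the two-range query from the previous section run through the new format:

mysql> EXPLAIN FORMAT=JSON SELECT * FROM film
WHERE (film_id BETWEEN 1 AND 10) OR (film_id BETWEEN 911 AND 920)\G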

********* 1. row *********
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "10.04"
    },
    "table": {
      "table_name": "film",
      "access_type": "range",
      "possible_keys": [
        "PRIMARY"
      ],
      "key": "PRIMARY",
      "used_key_parts": [
        "film_id"
      ],
      "key_length": "2",
      "rows_examined_per_scan": 20,
      "rows_produced_per_join": 20,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "6.04",
        "eval_cost": "4.00",
        "prefix_cost": "10.04",
        "data_read_per_join": "15K"
      },
      "used_columns": [
        "film_id",
        "title",
        "description",
        "release_year",
        "language_id",
        "original_language_id",
        "rental_duration",
        "rental_rate",
        "length",
        "replacement_cost",
        "rating",
        "special_features",
        "last_update"
      ],
      "attached_condition": "((`film`.`film_id` between 1 and 10) or (`film`.`film_id` between 911 and 920))"
    }
  }
}

With the FORMAT=JSON output also showing cost, we can see that the two-range query costs 10.04, versus 9.04 for the single big range (output not shown). The queries are not identical in cost, even though they look identical in traditional EXPLAIN output.

Conclusion

I have heard many users say "joins are slow", but a broad statement like this ignores magnitude. By including cost information in EXPLAIN we get all users speaking the same language. We can now say "this join is expensive", which is a much more useful distinction 🙂

It is time to start using OPTIMIZER TRACE, and, particularly in 5.7, to ditch plain EXPLAIN in favour of EXPLAIN FORMAT=JSON.

Testing the UNION ALL Optimization in MySQL 5.7 DMR3

When MySQL 5.7 DMR3 was released, I couldn’t wait to try out the new UNION ALL optimization that no longer requires data to be materialized in a temporary table.
This new optimization can be seen via EXPLAIN, but it is also nice to run it through SHOW PROFILES, which breaks down query execution step by step. However, SHOW PROFILES is now deprecated (since it overlaps with performance_schema) and will be removed in a future version.
So today I wanted to show you a combination of:
* What a UNION ALL statement looks like in MySQL 5.6 (EXPLAIN, SHOW PROFILES).
* How it is improved in MySQL 5.7 (EXPLAIN, SHOW PROFILES).
* How you can easily emulate the SHOW PROFILES feature with performance_schema + ps_helper 🙂

Initial Setup

The UNION ALL query I am going to demonstrate uses the following fake sample data:

use test;
CREATE TABLE `table_a` (
  `col1` int(11) NOT NULL AUTO_INCREMENT,
  `col2` char(255) DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB;
CREATE TABLE `table_b` (
  `col1` int(11) NOT NULL AUTO_INCREMENT,
  `col2` char(255) DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB;
INSERT INTO table_a (col2) VALUES ('A'), ('AA'), ('AAA');
INSERT INTO table_b (col2) VALUES ('B'), ('BB'), ('BBB');

The query that I want to use is this one:

mysql> SELECT * FROM table_a UNION ALL SELECT * FROM table_b;
+------+------+
| col1 | col2 |
+------+------+
|    1 | A    |
|    2 | AA   |
|    3 | AAA  |
|    1 | B    |
|    2 | BB   |
|    3 | BBB  |
+------+------+
6 rows in set (0.00 sec)

MySQL 5.6

We can see from EXPLAIN that MySQL 5.6 requires a step to insert the rows into a temporary table, from which the results can be returned:

mysql> EXPLAIN SELECT * FROM table_a UNION ALL SELECT * FROM table_b\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: table_a
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: NULL
*************************** 2. row ***************************
           id: 2
  select_type: UNION
        table: table_b
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
        Extra: NULL
*************************** 3. row ***************************
           id: NULL
  select_type: UNION RESULT
        table: <union1,2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Using temporary
3 rows in set (0.00 sec)

Via SHOW PROFILES, we can also see there are multiple stages of Sending data. I believe the best way to describe this state is ‘shipping rows around in the server and between storage engines’:

mysql> set profiling=1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show profiles;
+----------+------------+-------------------------------------------------------+
| Query_ID | Duration   | Query                                                 |
+----------+------------+-------------------------------------------------------+
|        1 | 0.00039200 | SELECT * FROM table_a UNION ALL SELECT * FROM table_b |
+----------+------------+-------------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)
mysql> show profile for query 1;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000059 |
| checking permissions | 0.000005 |
| checking permissions | 0.000005 |
| Opening tables       | 0.000091 |
| System lock          | 0.000011 |
| optimizing           | 0.000004 |
| statistics           | 0.000012 |
| preparing            | 0.000009 |
| optimizing           | 0.000002 |
| statistics           | 0.000005 |
| preparing            | 0.000004 |
| executing            | 0.000003 |
| Sending data         | 0.000053 |
| executing            | 0.000002 |
| Sending data         | 0.000029 |
| optimizing           | 0.000006 |
| statistics           | 0.000006 |
| preparing            | 0.000004 |
| executing            | 0.000002 |
| Sending data         | 0.000019 |
| removing tmp table   | 0.000007 |
| Sending data         | 0.000002 |
| query end            | 0.000007 |
| closing tables       | 0.000010 |
| freeing items        | 0.000019 |
| cleaning up          | 0.000016 |
+----------------------+----------+
26 rows in set, 1 warning (0.00 sec)

MySQL 5.7

The output from EXPLAIN now shows the query executing in two steps instead of three. The warning here is benign; it is caused by EXPLAIN EXTENDED now being permanently enabled:

mysql> EXPLAIN SELECT * FROM table_a UNION ALL SELECT * FROM table_b\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: table_a
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 2
  select_type: UNION
        table: table_b
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
     filtered: 100.00
        Extra: NULL
2 rows in set, 1 warning (0.00 sec)

And using SHOW PROFILES we can see that there are now only two Sending data steps:

mysql> set profiling=1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show profiles;
+----------+------------+-------------------------------------------------------+
| Query_ID | Duration   | Query                                                 |
+----------+------------+-------------------------------------------------------+
|        1 | 0.00038900 | SELECT * FROM table_a UNION ALL SELECT * FROM table_b |
+----------+------------+-------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> show profile for query 1;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000094 |
| checking permissions | 0.000013 |
| checking permissions | 0.000008 |
| Opening tables       | 0.000068 |
| System lock          | 0.000012 |
| optimizing           | 0.000015 |
| statistics           | 0.000012 |
| preparing            | 0.000010 |
| optimizing           | 0.000003 |
| statistics           | 0.000005 |
| preparing            | 0.000006 |
| executing            | 0.000002 |
| Sending data         | 0.000046 |
| executing            | 0.000003 |
| Sending data         | 0.000014 |
| query end            | 0.000010 |
| removing tmp table   | 0.000012 |
| query end            | 0.000003 |
| closing tables       | 0.000010 |
| freeing items        | 0.000027 |
| cleaning up          | 0.000016 |
+----------------------+----------+
21 rows in set, 1 warning (0.00 sec)

You will notice that there is in fact still a temporary table, denoted by the step removing tmp table. The release notes explain this point very carefully:

"The server no longer uses a temporary table for UNION statements that
meet certain qualifications. Instead, it retains from temporary table
creation only the data structures necessary to perform result column
typecasting. The table is not fully instantiated and no rows are
written to or read from it; rows are sent directly to the client. The
result is reduced memory and disk requirements, and smaller delay
before the first row is sent to the client because the server need
not wait until the last query block is executed. EXPLAIN and
optimizer trace output will change: The UNION RESULT query block
will not be present because that block is the part that reads from
the temporary table."

Example using ps_helper

And now to demonstrate how to replace SHOW PROFILES with performance_schema! To do this, I decided to base my scripts on ps_helper by Mark Leith.
There is no specific reason you need to do this, but I admire the way ps_helper works and it’s something I’ve wanted to try extending for a while. It also includes useful helper functions to intelligently truncate/format SQL and convert times to a human-readable format.

One small difference with my script (not specifically a limitation of performance_schema) is that profiling is enabled globally, as follows:

call ps_helper.enable_profiling();
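
Under the hood, enabling stage-level profiling in performance_schema amounts to switching on the right instruments and consumers. The following is only a sketch of the idea, not the actual ps_helper implementation:

-- enable timing of statement and stage events
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'stage/%' OR NAME LIKE 'statement/%';
-- record them into the history tables that the profiling procedures read from
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME LIKE 'events_stages_%' OR NAME LIKE 'events_statements_%';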

In any new session, you can then use the profiling feature as follows:

mysql> SELECT * FROM table_a UNION ALL SELECT * FROM table_b;
+------+------+
| col1 | col2 |
+------+------+
|    1 | A    |
|    2 | AA   |
|    3 | AAA  |
|    1 | B    |
|    2 | BB   |
|    3 | BBB  |
+------+------+
6 rows in set (0.00 sec)
mysql> call ps_helper.show_profiles();
+----------+-----------+-------------------------------------------------------+
| Event_ID | Duration  | Query                                                 |
+----------+-----------+-------------------------------------------------------+
|       58 | 58.57 us  | select @@version_comment limit 1                      |
|       70 | 73.19 us  | select USER()                                         |
|       82 | 297.47 us | SELECT * FROM table_a UNION ALL SELECT * FROM table_b |
+----------+-----------+-------------------------------------------------------+
3 rows in set (0.04 sec)
Query OK, 0 rows affected (0.04 sec)
mysql> call ps_helper.show_profile_for_event_id(82);
+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| init                 | 55.55 us  |
| checking permissions | 2.15 us   |
| checking permissions | 3.29 us   |
| Opening tables       | 68.30 us  |
| System lock          | 7.43 us   |
| optimizing           | 2.34 us   |
| statistics           | 9.97 us   |
| preparing            | 7.53 us   |
| optimizing           | 606.00 ns |
| statistics           | 2.99 us   |
| preparing            | 2.62 us   |
| executing            | 591.00 ns |
| Sending data         | 73.23 us  |
| executing            | 484.00 ns |
| Sending data         | 23.93 us  |
| query end            | 7.46 us   |
| removing tmp table   | 3.37 us   |
| closing tables       | 7.68 us   |
| freeing items        | 15.27 us  |
| cleaning up          | 804.00 ns |
+----------------------+-----------+
20 rows in set (0.00 sec)
Query OK, 0 rows affected (0.00 sec)

I renamed the command SHOW PROFILE FOR QUERY x to show_profile_for_event_id, since this better maps to performance_schema naming, but otherwise it behaves pretty much the same.
For some reason, the SHOW PROFILES version has a second query end state after removing the temporary table. I’m not sure why that is, or whether not capturing it will be an issue.
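
As for how show_profile_for_event_id works under the hood: stage events in performance_schema carry a nesting_event_id pointing back to their parent statement event, so the core of the procedure is a query along these lines (a sketch, not the exact ps_helper code, which also formats times and truncates SQL):

SELECT event_name AS Status,
       ROUND(timer_wait / 1000000, 2) AS Duration_us   -- picoseconds to microseconds
  FROM performance_schema.events_stages_history_long
 WHERE nesting_event_id = 82   -- the statement event id passed in
 ORDER BY event_id;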

Conclusion

It’s great to see these optimizations introduced into MySQL – I can think of a number of users who can benefit from a better UNION ALL.
I implemented my own SHOW PROFILES feature really as an educational step for myself to make sure I knew performance_schema, and it was actually quite a quick and joyful experience. I plan to work with the maintainer of ps_helper and see if this can be included in future versions.

EXPLAIN PARTITIONS and EXPLAIN EXTENDED deprecation

In MySQL 5.7 we are planning to deprecate the syntax:

EXPLAIN PARTITIONS <insert query here>
EXPLAIN EXTENDED <insert query here>

... and enable these two options by default.
The rationale is that:

  • Simple and consistent is always better. EXPLAIN FORMAT=JSON already behaves as if these two flags are enabled, and if you have a partitioned table, for example, it is unlikely that you would not want the PARTITIONS information. Having to remember two more flags makes the product harder to use.

  • The optimizer team has been busy refactoring and improving code quality. These two flags are supported by many if-statements, increasing complexity by more than we would like.

The intended deprecation plan is to automatically turn both flags on in MySQL 5.7 and issue deprecation warnings when using the older syntax. In MySQL 5.8, the syntax will be removed.
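
In practice, on a 5.7 server the two statements below (using the sakila film table purely as an example) return the same columns, partition and filtered information included, with the first one additionally raising a deprecation warning:

EXPLAIN EXTENDED SELECT * FROM film WHERE film_id = 1;   -- deprecated syntax, warns
EXPLAIN SELECT * FROM film WHERE film_id = 1;            -- same output in 5.7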

What do I expect will break?

That’s a good question, since EXPLAIN is really a DBA tool and is unlikely to affect running applications. I expect the breakage to be limited to:

  • Automated tools that depend on column order of EXPLAIN, or EXPLAIN not producing a warning (as of MySQL 5.7).

  • Automated tools that explicitly use EXPLAIN PARTITIONS or EXPLAIN EXTENDED (as of MySQL 5.8).

Both, I hope, should be fairly simple fixes.

How can I make my tools more resilient?

While we’re on the subject of EXPLAIN and automated tools, it is a good segue into EXPLAIN FORMAT=JSON, which is very well suited here. I would expect that with JSON output, applications are more likely to tolerate changes to the information returned, and in MySQL 5.7 the JSON is already much more detailed.

Optimizer Enhancements in MySQL 5.7

The MySQL optimizer is getting better. MySQL 5.6 introduced:

  • File sort optimizations with small limit
  • Index Condition Pushdown
  • Batched Key Access and Multi Range Read
  • Postponed Materialization
  • Improved Subquery execution
  • EXPLAIN for Insert, Update, and Delete
  • Optimizer Traces
  • Structured EXPLAIN in JSON format

This was in addition to the InnoDB storage engine now offering improved statistics collection, leading to more stable query plans.
In Evgeny Potemkin’s session at MySQL Connect titled "MySQL’s EXPLAIN Command New Features", two new features for 5.7 were announced. They are both incredibly useful, so I wanted to write a little about them.

EXPLAIN FOR CONNECTION

Normally with EXPLAIN, you find the execution plan of a query you intend to run, and then interpret the output as you see fit.
What MySQL 5.7 adds is the ability to see the execution plan of a query currently running in another connection, i.e.:

EXPLAIN FORMAT=JSON FOR CONNECTION 2;
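
The number is the connection id of the session running the statement you want to inspect, which you can take from the Id column of SHOW PROCESSLIST. A minimal sketch of the flow:

SHOW PROCESSLIST;                        -- note the Id of the long-running query
EXPLAIN FORMAT=JSON FOR CONNECTION 2\G   -- inspect its execution plan while it runs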

Why it’s useful:
* Plans can change depending on input parameters. i.e. WHERE mydate BETWEEN '2013-01-01' and '2013-01-02' may use an index, but WHERE mydate BETWEEN '2001-01-01' and '2013-10-17' may not.
* Plans can change as data changes.
* Plans can also change depending on the context of a transaction, with InnoDB offering multi-version concurrency control.
* Optimizer statistics can change, and it’s not impossible that a change in statistics is the reason the executing query is slow. It’s great to have conclusive proof and be able to rule this out.

Execution cost in EXPLAIN

MySQL uses cost based optimization to pick the best query execution plan when there are multiple choices available. It is very similar to how a GPS navigator adds up estimated time and picks the best route to a destination.
What this feature does is expose the cost as a numeric value when running EXPLAIN FORMAT=JSON. To take an example using the world sample database:

mysql [localhost] {msandbox} (world) > EXPLAIN FORMAT=JSON SELECT City.*
FROM City INNER JOIN Country ON City.countrycode=Country.code
ORDER BY City.NAME ASC LIMIT 100\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "4786.00"
    },
    "ordering_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "2151.00"
      },
      "nested_loop": [
        {
          "table": {
            "table_name": "country",
            "access_type": "index",
            "possible_keys": [
              "PRIMARY"
            ],
            "key": "PRIMARY",
            "used_key_parts": [
              "Code"
            ],
            "key_length": "3",
            "rows_examined_per_scan": 239,
            "rows_produced_per_join": 239,
            "filtered": 100,
            "using_index": true,
            "cost_info": {
              "read_cost": "6.00",
              "eval_cost": "47.80",
              "prefix_cost": "53.80",
              "data_read_per_join": "61K"
            },
            "used_columns": [
              "Code"
            ]
          }
        },
        {
          "table": {
            "table_name": "City",
            "access_type": "ref",
            "possible_keys": [
              "CountryCode"
            ],
            "key": "CountryCode",
            "used_key_parts": [
              "CountryCode"
            ],
            "key_length": "3",
            "ref": [
              "world.country.Code"
            ],
            "rows_examined_per_scan": 9,
            "rows_produced_per_join": 2151,
            "filtered": 100,
            "cost_info": {
              "read_cost": "2151.00",
              "eval_cost": "430.20",
              "prefix_cost": "2635.00",
              "data_read_per_join": "151K"
            },
            "used_columns": [
              "ID",
              "Name",
              "CountryCode",
              "District",
              "Population"
            ]
          }
        }
      ]
    }
  }
}

Why it’s useful:

  • This exposes more transparency into optimizer decisions. DBAs can better understand what part of a query is considered expensive, and try to optimize. I think this is important, because I have heard a lot of DBAs make blanket recommendations like "joins are bad" or "sorting is bad", but there needs to be context on how much data needs to be sorted. It makes us all speak the same language: estimated cost.
  • Cost refinement is an ongoing effort. As well as accounting for new, fast SSD storage, MySQL keeps introducing new optimizations (such as index condition pushdown). Not all of these optimizations will be the best choice every time, and MySQL should ideally be able to make the right choice in all situations.

Optimizing IN Subqueries in MySQL 5.6

I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.

Typical IN subquery

This is a very typical query developers run, which historically has performed very poorly on MySQL:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Notice that in MySQL 5.6 the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize the subquery as a constant, and instead executed it as a DEPENDENT SUBQUERY (a.k.a. correlated subquery) for each row it found in the City table (an estimated 3984 rows)!
MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY,Name
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ref
possible_keys: PRIMARY,Name
          key: Name
      key_len: 52
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

The index doesn’t help MySQL 5.5, which still executes a DEPENDENT SUBQUERY, but take a look at MySQL 5.6: 1 row from the Country table (via an index!) and then 1 row from the City table. This optimizes beautifully!
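
If you want to confirm how much of this comes from the new subquery transformations, MySQL 5.6 lets you switch them off per session for comparison. A quick sketch (flag names as in 5.6):

SET SESSION optimizer_switch = 'semijoin=off,materialization=off';
EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
SET SESSION optimizer_switch = 'default';   -- restore the defaults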

More complex IN example

In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I’ll have to wait a bit longer:

mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So here is my second attempt:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 219
        Extra:
3 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: NULL
3 rows in set (0.00 sec)

MySQL 5.5 could always optimize the population = (scalar subquery), but not the IN subquery. Similar to the example above, I would expect the subqueries here to be unraveled into constants as well. If I add an index on population you can really see this happen:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY,Population
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE country add index (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ref
possible_keys: PRIMARY,Population
          key: Population
      key_len: 4
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)

This is looking really good in MySQL 5.6. I had a bit of a ‘huh?’ moment when trying to work out what the Select tables optimized away in step #3 meant. This led me to try EXPLAIN EXTENDED, where I discovered a little gem:

mysql5.5.31 > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`city` where
 <in_optimizer>(`world`.`city`.`CountryCode`,
 <exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in
 country on PRIMARY where ((`world`.`country`.`Population` =
 (select max(`world`.`country`.`Population`) from `world`.`country`)) and
 (<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`)))))
1 row in set (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city`
where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`)
and (`world`.`country`.`Population` = (/* select#3 */
select max(`world`.`country`.`Population`) from `world`.`country`)))
1 row in set (0.00 sec)

EXPLAIN EXTENDED (via the SHOW WARNINGS that follows it) shows the approximate query that MySQL is going to execute after the optimizer has applied its optimizations and transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in query execution (the IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:

mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

I believe what’s happening here is that during optimization MySQL opens the Population index and reads the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I’ve seen it do this before; here is a more common example of this cheating in action:

mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra:
1 row in set, 1 warning (0.00 sec)
mysql5.5.31 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,
'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS
 `CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city`
WHERE 1
1 row in set (0.00 sec)

Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has FORMAT=JSON, which looks much nicer to me and includes more information:

mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "country",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "Population"
          ],
          "key": "Population",
          "used_key_parts": [
            "Population"
          ],
          "key_length": "4",
          "ref": [
            "const"
          ],
          "rows": 1,
          "filtered": 100,
          "using_index": true,
          "attached_condition": "(`world`.`country`.`Population` = (/* select#3 */
          select max(`world`.`country`.`Population`) from `world`.`country`))",
          "attached_subqueries": [
            {
              "dependent": false,
              "cacheable": true,
              "query_block": {
                "select_id": 3,
                "table": {
                  "message": "Select tables optimized away"
                }
              }
            }
          ]
        }
      },
      {
        "table": {
          "table_name": "City",
          "access_type": "ref",
          "possible_keys": [
            "CountryCode"
          ],
          "key": "CountryCode",
          "used_key_parts": [
            "CountryCode"
          ],
          "key_length": "3",
          "ref": [
            "world.country.Code"
          ],
          "rows": 1,
          "filtered": 100
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

While it’s possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn’t hurt either 😉 I’m really looking forward to what tools can be built to take advantage of the JSON explain output.