Optimizing IN Subqueries in MySQL 5.6

I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.

Typical IN subquery

This is a very typical query developers run, which historically has performed very poorly on MySQL:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT code FROM Country WHERE name = 'United States');
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

Notice that in MySQL 5.6 – the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize this as a constant, and instead executed this as a DEPENDENT SUBQUERY (aka Correlated subquery) for each row it found in the city table (an estimated 3984 rows)!
MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: Country
         type: unique_subquery
possible_keys: PRIMARY,Name
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE Country ADD INDEX (name);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
 (SELECT code FROM Country WHERE name = 'United States')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Country
         type: ref
possible_keys: PRIMARY,Name
          key: Name
      key_len: 52
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.Country.Code
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

The index doesn’t affect MySQL 5.5 – which still executes as a DEPENDENT SUBQUERY, but take a look at MySQL 5.6 – 1 row from the Country table (from an index!) and then 1 row from the City table. This optimizes great!

More complex IN example

In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I’ll have to wait a bit longer:

mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So here is my second attempt:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 219
        Extra:
3 rows in set (0.00 sec)
mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: Country
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 239
        Extra: NULL
3 rows in set (0.00 sec)

MySQL 5.5 could always optimize the population = scalar subquery, but not the IN subquery. Similar to the above example, I would expect the subqueries here should be unraveled as constants as well. If I add an index on population you can really see this happen:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN
(SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3984
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: country
         type: unique_subquery
possible_keys: PRIMARY,Population
          key: PRIMARY
      key_len: 3
          ref: func
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)
mysql5.6.11 > ALTER TABLE country add index (population);
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: country
         type: ref
possible_keys: PRIMARY,Population
          key: Population
      key_len: 4
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: City
         type: ref
possible_keys: CountryCode
          key: CountryCode
      key_len: 3
          ref: world.country.Code
         rows: 1
        Extra: NULL
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
3 rows in set (0.00 sec)

This is looking really good in MySQL 5.6. I had a bit of a huh? moment when trying to read what the Select tables optimized away step #3 meant. This led me to try using EXPLAIN EXTENDED where I discovered a little gem:

mysql5.5.31 > SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`city` where
 <in_optimizer>(`world`.`city`.`CountryCode`,
 <exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in
 country on PRIMARY where ((`world`.`country`.`Population` =
 (select max(`world`.`country`.`Population`) from `world`.`country`)) and
 (<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`)))))
1 row in set (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`,
`world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`,
`world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city`
where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`)
and (`world`.`country`.`Population` = (/* select#3 */
select max(`world`.`country`.`Population`) from `world`.`country`)))
1 row in set (0.00 sec)

EXPLAIN EXTENDED writes the approximate query that MySQL is going to execute after the optimizer has applied any optimizations or transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in the query execution (IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:

mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

I believe what’s happening here, is during optimization MySQL opens the index population and looks at the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I’ve seen it do this before, here is a more common example of this cheating happening:

mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra:
1 row in set, 1 warning (0.00 sec)
mysql5.5.31 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,
'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1
1 row in set (0.00 sec)
mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: City
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)
mysql5.6.11 > show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS
 `CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city`
WHERE 1
1 row in set (0.00 sec)

Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has FORMAT=JSON, which looks much nicer to me and it includes more information:

mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN
(SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "country",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "Population"
          ],
          "key": "Population",
          "used_key_parts": [
            "Population"
          ],
          "key_length": "4",
          "ref": [
            "const"
          ],
          "rows": 1,
          "filtered": 100,
          "using_index": true,
          "attached_condition": "(`world`.`country`.`Population` = (/* select#3 */
          select max(`world`.`country`.`Population`) from `world`.`country`))",
          "attached_subqueries": [
            {
              "dependent": false,
              "cacheable": true,
              "query_block": {
                "select_id": 3,
                "table": {
                  "message": "Select tables optimized away"
                }
              }
            }
          ]
        }
      },
      {
        "table": {
          "table_name": "City",
          "access_type": "ref",
          "possible_keys": [
            "CountryCode"
          ],
          "key": "CountryCode",
          "used_key_parts": [
            "CountryCode"
          ],
          "key_length": "3",
          "ref": [
            "world.country.Code"
          ],
          "rows": 1,
          "filtered": 100
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

While it’s possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn’t hurt either 😉 I’m really looking forward to what tools can be built to take advantage of the JSON explain output.