How important is using the correct datatype in MySQL?

Frequently in performance talks I hear speakers talk about the importance of using the correct datatypes for storing values in columns. i.e. representing a number with an INT or BIGINT, storing IP addresses as INT UNSIGNED, and VARCHAR(60) instead of VARCHAR(255).

This advice is correct, but today I thought I would try my best to go into a bit more detail đŸ™‚

The Reasons

I can think of three reasons why this optimization is true:

  1. Using numeric data types as strings incurs some added CPU overhead performing character-set and collation work. i.e. it’s not free to make 'MontrĂ©al' = 'Montreal' = 'MONTREAL', but MySQL behaves this way by default.

  2. Using correct data types will save space. By ‘space’ usually memory-fit is more important than disk fit, as it can improve cache efficiency. Disk fit can also be important with smaller SSDs.

  3. Some wire protocol and client library buffers are not variable in length. This is a little out of scope of my knowledge area, but you should expect more memory consumption with larger VARCHAR values – so going to just a little bit of effort judging expected length can help.

Memory Fit

I would rate reason (2) above as the most likely reason for why this optimization is correct. In fact, I would say that a large majority of optimizations made by DBAs are to try and stretch out memory fit for as long as possible. i.e. indexes prevent table scans,
and allow you to focus on just the data you need.

Lets take a look at memory fit improvements by concentrating on data types:

  • Short primary keys. There is a lot of space to be saved by keeping your PRIMARY KEY as short as possible. InnoDB uses a clustered index, with the value of the PRIMARY KEY also included in each secondary index as an internal row pointer.

  • Smaller values for indexed columns. Indexed values are essentially duplicated because they appear in both the index and the table row, so there is an amplification created by any inefficiency.

    How much of a memory fit improvement you will gain depends on how the index is accessed. For example:

    • SELECT DISTINCT(indexed_col) FROM my_table requires all of the index in memory to be efficient.
    • Ranged access such as SELECT * FROM my_table WHERE indexed_col BETWEEN x AND y may also require more memory fit if the ranges are wide. However, this can also be true for non ranged access (i.e. SELECT * FROM my_table WHERE indexed_col = x) if indexed_col is not unique and matches a number of rows.
    • If the index has good cardinality (i.e. a high distribution of values), and the queries executed focus on a smaller percentage of values, then memory fit may be less important.
  • Smaller values for un-indexed columns. It is also important to make sure that any un-indexed columns are small, but to a lesser degree. The case where I expect it to matter the most is if your workload includes table scans, or if there is the situation where ‘hot’ rows are highly scattered amongst a large number of pages.

    Small side-note: MySQL 5.6 allows you to change the InnoDB page size, which may help improve memory fit in this case too.

That’s all I can think of. What am I missing?

Published by

morgo

I joined MySQL AB in 2006, left, and am now back at Oracle working on the MySQL team. I’ve also worked at Percona and InPowered.

  • ronaldbradford

    A feature of MySQL is the use of UNSIGNED. So a AUTO_INCREMENT primary key of INT UNSIGNED can contain twice the rows of INT (default SIGNED). INT UNSIGNED means 4.3 billon rows are possible. People sometimes like to argue, but what if I want more then 4.3 billon rows in a table, to which I respond you will have other problems. NOTE: I am referring to AUTO_INCREMENT keys keys, there are distinct reasons for using BIGINT in other columns.

    I agree VARCHAR(15) compared to INT UNSIGNED is far better for an IP address, but everybody still does it. Using VARCHAR also eliminates the ability to range search across different masks.

    I often see people putting a MD5 into a VARCHAR(32) column. First it should be CHAR(32) as there is an overhead for VARCHAR(). Not to mention you can UNHEX() and store in BINARY(16). The question is why are you using a MD5 value. If you never view it, it’s say hash or unique constraint, save the space.

    Likewise there is an overhead for not specifying NOT NULL. My recommendation is a column should ALWAYS BE NOT NULL unless you can state a reason why the value is unknown, other than the default of NULL.

    There are lot more tricks in optimizing space, and as a byproduct introducing some data integrity, including for example ENUM. However, SQL_MODE is more important considering (and rarely set).

    I would further recommend two important things about data types.

    1. Ensure a column is the same datatype between joins in tables. This introduces issues in performance if the data types (including collations) differ.

    2. Don’t use literals for storing and/or retrieving numeric values in SQL, i.e. ColA = 1 and ColA = ‘1’ may result in the same answer,but internally the later is quite inefficient in comparison.

    • These are great additions – thank you for commenting!

      UNSIGNED vs SIGNED is an interesting argument. I actually don’t mind users only having signed, since the SQL standards and Java only have signed integers.

      NOT NULL is an important attribute, which I would consider most important on indexed columns.

      MD5s is a great example besides IP addresses of something easily compressible.

      • ronaldbradford

        While I agree the UNSIGNED/SIGNED discussion has many views, in essence; when combined with SQL_MODE; it is a form of referential integrity, i.e. said number field cannot contain a negative value. While it is not truly ANSI standard SQL it’s a product feature that does provide data integrity.