How important is it to use 2-byte and 3-byte integers?

One interesting feature of MySQL is that it supports a very large number of integer data types. From the MySQL manual:

Type       Storage (Bytes)   Minimum Value (Signed/Unsigned)   Maximum Value (Signed/Unsigned)
TINYINT    1                 -128 / 0                          127 / 255
SMALLINT   2                 -32768 / 0                        32767 / 65535
MEDIUMINT  3                 -8388608 / 0                      8388607 / 16777215
INT        4                 -2147483648 / 0                   2147483647 / 4294967295
BIGINT     8                 -9223372036854775808 / 0          9223372036854775807 / 18446744073709551615
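
As a quick refresher on why these ranges matter: under a strict sql_mode, inserting a value outside a type's range is an error rather than a silent truncation. A minimal sketch (the table here is made up):

CREATE TABLE t (signed_col TINYINT, unsigned_col TINYINT UNSIGNED);
INSERT INTO t VALUES (127, 255); /* OK: the top of both ranges */
INSERT INTO t VALUES (128, 256); /* fails under strict sql_mode: */
/* ERROR 1264 (22003): Out of range value for column 'signed_col' at row 1 */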

I have previously written on how important it is to use the correct datatype in MySQL, but I find that between integer types I don’t always follow my own advice. I predominantly use TINYINT, INT, and BIGINT, usually with the default of SIGNED rather than UNSIGNED.

Am I lazy? I think it is best to run through examples…

Primary Key Integer

In this example, let’s say that we know we will never have more than 65535 customers, so we have the choice of using a SMALLINT UNSIGNED at 2 bytes, or, lazily, an INT, which will accommodate 2147483647 customers at 4 bytes:

Option #1:
CREATE TABLE customers (
  id SMALLINT UNSIGNED PRIMARY KEY auto_increment,
  company_name VARCHAR(20),
  ..
  ..
  last_order_date DATETIME
);
Option #2:
CREATE TABLE customers (
  id INT PRIMARY KEY auto_increment,
  company_name VARCHAR(20),
  ..
  ..
  last_order_date DATETIME
);

I will project that there are 50000 rows in the table, which is about as close as you would want to get to the maximum range of a SMALLINT UNSIGNED. Using SMALLINT provides a 2 byte saving per row:

2 bytes * 50000 rows = 100000 bytes ≈ 97KB saved

In InnoDB the primary key is included in every other index on the table, so technically we will save an additional ~97KB per index, but I would say that this saving is too small to matter in most cases.
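
If you want to sanity-check a saving like this on a real table, the statistics in information_schema give a rough (sampled, approximate) view of data and index size. A sketch, assuming the two options were created as separate tables with hypothetical names:

SELECT table_name,
       data_length / 1024 AS data_kb,   /* clustered index (the rows) */
       index_length / 1024 AS index_kb  /* secondary indexes */
FROM information_schema.TABLES
WHERE table_schema = 'test'
  AND table_name IN ('customers_smallint', 'customers_int');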

Attribute on a large table

Option #1:
CREATE TABLE page_views (
  id BIGINT PRIMARY KEY auto_increment,
  site_id SMALLINT UNSIGNED NOT NULL,
  ..
  ..
);
Option #2:
CREATE TABLE page_views (
  id BIGINT PRIMARY KEY auto_increment,
  site_id INT NOT NULL,
  ..
  ..
);

In this example, I will say that there are 1 billion rows in the page_views table:

1 billion rows * 2 bytes saved per row = 2 billion bytes ≈ 1.8GB

So the choice of a SMALLINT over an INT has a more measurable saving in this case: around 2GB per unindexed column. With five SMALLINT columns on the same row, I could save close to 10GB.

Is 10GB worth it? It is perhaps easier to answer this question when it is expressed as a percentage saving for the table. To use some optimistic overhead numbers (that assume very little fragmentation) for InnoDB:

  • ~13 bytes per row (MVCC features)
  • ~1KB per 16K page (page fill factor)

I’ll look at two table definitions: one with the best case of all SMALLINT columns in place of INT, and one with a VARCHAR column that is assumed to average 50 bytes:

Option #1:
CREATE TABLE page_views (
  id BIGINT PRIMARY KEY auto_increment,
  site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  page_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
Option #2:
CREATE TABLE page_views (
  id BIGINT PRIMARY KEY auto_increment,
  site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  page_uri VARCHAR(2048) NOT NULL, /* assume average length of 50 bytes */
  category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
  time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
# INT/SMALLINT only:
(13B (overhead) + 8B (pkey) + 5*2) * 1.06 * 1 billion = 30.60 GB
(13B (overhead) + 8B (pkey) + 5*4) * 1.06 * 1 billion = 40.47 GB

# One VARCHAR column:
(13B (overhead) + 8B (pkey) + 4*2 + 50 (varchar)) * 1.06 * 1 billion = 77.99 GB
(13B (overhead) + 8B (pkey) + 4*4 + 50 (varchar)) * 1.06 * 1 billion = 85.89 GB
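
(As a side note, these estimates can be reproduced with a quick query — the constants are just the assumptions above:)

SELECT (13 + 8 + 5*2) * 1.06 * 1e9 / POW(1024, 3) AS all_smallint_gb, /* ~30.60 */
       (13 + 8 + 5*4) * 1.06 * 1e9 / POW(1024, 3) AS all_int_gb;      /* ~40.47 */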

So in the INT/SMALLINT table there is about a 25% saving by using smaller data types, and about a 10% saving in the case of the single VARCHAR column.

I do think that 25% is worth caring about, so maybe in this case I could follow my own advice a little more closely 🙂 However, I would also consider this close to a best-case situation that I have used to illustrate a point. I would expect typical space savings to be less than 10%.

One other point to consider is that changing the data type of a column is not supported by Online DDL, so it is important not to cut data type choices too fine and have to make changes later.
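
To illustrate, requesting an in-place change of an integer column is refused, and the default falls back to a full table copy (the error text is approximate and varies by version):

ALTER TABLE page_views MODIFY site_id INT UNSIGNED NOT NULL, ALGORITHM=INPLACE;
/* ERROR 1846 (0A000): ALGORITHM=INPLACE is not supported.
   Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY. */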

Did I miss anything? Please leave a comment.


  • When you have a small table, it may seem like the difference is inconsequential. Who cares about saving 97KB here or there in the era of Big Data? But it makes much more difference if you consider how many other big tables have foreign keys referencing a small table.

    That’s where I thought you were going with the large page_views table, which might have a foreign key to the users table. So the choice of PK data type in the users table would be really important when it’s referenced by a billion page_views rows.

    • I debated whether to discuss foreign keys and how the data types need to match on both sides 🙂

      So in a sense this post describes the ‘motivations’ for minimizing data types, rather than the other restrictions that may have to be imposed.
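
      (To sketch the restriction being referred to: InnoDB requires integer foreign key columns to match in size and signedness, so this hypothetical pair fails:)

      CREATE TABLE customers (id SMALLINT UNSIGNED PRIMARY KEY);
      CREATE TABLE orders (
        customer_id INT UNSIGNED NOT NULL, /* wider than customers.id */
        FOREIGN KEY (customer_id) REFERENCES customers (id)
      ); /* fails: referencing and referenced columns must match in type
            and signedness (errno 150 on older versions) */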

  • Arjen Lentz

    Good advice for INT vs SMALLINT. We teach that in all our training.
    However, you’ll want to explain MEDIUMINT, as it doesn’t get mentioned beyond the title.

    Short version: InnoDB stores a MEDIUMINT in 4 bytes anyway, so you might as well make it an INT and be done with it.

    • There was some discussion on Facebook about whether that was the case – but it appears that MEDIUMINT is indeed 3 bytes. Here is the first example with 1 million rows (instead of 1 billion):

      morgo@rbook:~/sandboxes/msb_5_6_17/data/test$ ls -lhS
      total 278600
      -rw-rw---- 1 morgo staff 52M 30 May 17:51 page_views_int_1m.ibd
      -rw-rw---- 1 morgo staff 44M 30 May 17:51 page_views_mediumint_1m.ibd
      -rw-rw---- 1 morgo staff 40M 30 May 17:51 page_views_smallint_1m.ibd
      -rw-rw---- 1 morgo staff 8.5K 30 May 17:51 page_views_int_1M.frm
      -rw-rw---- 1 morgo staff 8.5K 30 May 17:51 page_views_mediumint_1M.frm
      -rw-rw---- 1 morgo staff 8.5K 30 May 17:51 page_views_smallint_1M.frm

      This can also be used to project to 1 billion rows and see how close I was:
      – INT: 50.78 GB actual size for 1 billion rows (I estimated 40.47 GB)
      – SMALLINT: 39.06 GB actual size for 1 billion rows (I estimated 30.60 GB)

      I’m reasonably happy with how close I was – but there’s probably more per-row overhead than I had planned.
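
      (For reference, a sketch of how a comparison like this can be reproduced — the names are illustrative, the layout mirrors the page_views example above, and the recursive CTE used to generate rows requires MySQL 8.0:)

      SET SESSION cte_max_recursion_depth = 1000000;

      CREATE TABLE page_views_smallint_1m (
        id BIGINT PRIMARY KEY auto_increment,
        site_id SMALLINT UNSIGNED NOT NULL,
        page_id SMALLINT UNSIGNED NOT NULL,
        category_id SMALLINT UNSIGNED NOT NULL,
        region_id SMALLINT UNSIGNED NOT NULL,
        time_id SMALLINT UNSIGNED NOT NULL
      );

      INSERT INTO page_views_smallint_1m
        (site_id, page_id, category_id, region_id, time_id)
      WITH RECURSIVE seq (n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 1000000
      )
      SELECT n MOD 1000, n MOD 1000, n MOD 100, n MOD 100, n MOD 10000 FROM seq;

      /* Repeat with MEDIUMINT and INT variants, then compare the .ibd files on disk. */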

      • Some additional discussion here:
        https://github.com/github/gh-ost/issues/353

        For some semi-proof that the InnoDB buffer pool is not var-length in memory, you can look at the pfs memory instrumentation in 5.7. The buffer pool is all one allocation, just larger than innodb_buffer_pool_size.

  • I tend to think first of RAM, then disk, and finally CPU. RAM first because that still tends to be the resource in most limited supply.

  • Will choosing a smaller-byte type have any impact on the performance (speed) of MySQL? Or is it just disk size we are talking about?

    • The 3-byte integers get converted to 4-byte integers in memory, so in theory they should be slower. I tried testing this by running SELECT sum(site_id) FROM page_views_int_1million, but I couldn’t see any measurable difference.