How important is it to use 2-byte and 3-byte integers?

One interesting feature of MySQL, is that it supports a very large number of integer data types. From the MySQL manual:

Type Storage Minimum Value Maximum Value
(Bytes) (Signed/Unsigned) (Signed/Unsigned)
TINYINT 1 -128 127
0 255
SMALLINT 2 -32768 32767
0 65535
MEDIUMINT 3 -8388608 8388607
0 16777215
INT 4 -2147483648 2147483647
0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807
0 18446744073709551615

I have previously written on How important it is to using the correct datatype in MySQL, but I find that between integer types I don’t always follow my own advice. I predominantly use TINYINT, INT, and BIGINT, and usually as the default of SIGNED rather than UNSIGNED.

Am I lazy? I think it is best to run through examples…

Primary Key Integer

In this example lets say that we know that we will never have more than 65535 customers, so we have the choice of using a SMALLINT UNSIGNED at 2 bytes, or lazily an INT will accommodate 2147483647 customers at 4 bytes:

Option #1:
CREATE TABLE customers (
id SMALLINT UNSIGNED PRIMARY KEY auto_increment,
company_name VARCHAR(20),
..
..
last_order_date DATETIME
);
Option #2:
CREATE TABLE customers (
id INT PRIMARY KEY auto_increment,
company_name VARCHAR(20),
..
..
last_order_date DATETIME
);

I will project that there are 50000 rows in the table, which is about as close as you would want to get to the maximum range possible of SMALLINT. Using SMALLINT provides a 2 byte saving per row:

2 * 50000 = 100000 = A saving of 97KB

In InnoDB the primary key is included in every other index on the table, so technically we will save an additional ~97KB per-index, but I would say that this saving is too small to matter for most cases.

Attribute on a large table

Option #1:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL,
..
..
);
Option #2:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id INT NOT NULL,
..
..
);

In this example, I will say that there are 1 billion rows in the page_views table:

1 billion * 2 bytes saving per row = 1.8GB

So the choice of using a SMALLINT over an INT has more measurable savings in this case. On a per unindexed column basis I could be saving around 2GB. So with five SMALLINT columns on the same row, I could save 10GB.

Is 10GB worth it? It is maybe easier to answer this question when expressed as a percentage saving for a table. To use some optimistic overhead numbers (that assume very little fragmentation) for InnoDB:

  • ~13 bytes per row (MVCC features)
  • ~1KB per 16K page (page fill factor)

I’ll look at two table definitions. One with the best case of all SMALLINT columns in place of INT, and one with a VARCHAR column that is assumed to be 50 bytes on average:

Option #1:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
page_id SMALLINT UNSIGNED NOT NULL, /* or INT */
category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
Option #2:
CREATE TABLE page_views (
id BIGINT PRIMARY KEY auto_increment,
site_id SMALLINT UNSIGNED NOT NULL, /* or INT */
page_uri VARCHAR(2048) NOT NULL, /* assume average length of 50 bytes */
category_id SMALLINT UNSIGNED NOT NULL, /* or INT */
region_id SMALLINT UNSIGNED NOT NULL, /* or INT */
time_id SMALLINT UNSIGNED NOT NULL /* or INT */
);
# INT/SMALLINT only:
(13B (overhead) + 8B (pkey) + 5*2) * 1.06 * 1 billion = 30.60 GB
(13B (overhead) + 8B (pkey) + 5*4) * 1.06 * 1 billion = 40.47 GB
# One VARCHAR column:
(13B (overhead) + 8B (pkey) + 4*2 + 50 (varchar)) * 1.06 * 1 billion = 77.9GB
(13B (overhead) + 8B (pkey) + 4*4 + 50 (varchar)) * 1.06 * 1 billion = 85.9GB

So in the INT/SMALLINT table there is about a 25% saving by using smaller data types, and about a 10% saving in the case of the single VARCHAR column.

I do think that 25% is worth caring about, so maybe in this case I could follow my own advice a little closer 🙂 However, I would also consider this a close to best-case situation that I have used to illustrate a point. I would expect typical space savings to be less than 10%.

One other point to consider, is that changing the datatype of a column is not supported by Online DDL. It is important not to sub-optimize data type choices and have to make changes later.

Did I miss anything? Please leave a comment.