Commentary on MySQL slow query collection sources

The other day it struck me that MySQL applications have no fewer than four sources to be able to collect potentially slow queries for analysis, and that I actually find myself using 3/4 methods available.

Here are the methods listed, and my commentary on where I will use each of them:

Application Logging/Monitoring

I enjoy using Application-level tools such as New Relic or xhprof to find slow experiences rather than just slow queries.

What I mean by this, is that there is not a 1:1 between pages and queries, so sometimes just heading straight to the database can be the wrong decision. As I wrote on Monday, latency is a factor with N+1 queries, and I typically want to focus on queries in aggregate that exceed page generation goals. On the same note, I also pay less attention to queries that are part of background tasks and do not impact user experience.

I also like monitoring application tools like a hawk whenever I am running any operational changes (i.e. removing indexes that I am 99.99% confident are unused, disabling the query cache). New Relic has about a 3 minute delay over real-time, but this is usually good enough for me, since it increases my confidence on top of whatever DB dashboards I am already looking at.

Performance Schema

In MySQL 5.6, the performance_schema can now instrument statements. In combination with ps_helper I find this to be easiest way to identify what has been running on a particular database server.

What I love about this, is that it’s enabled by default and can stay enabled full time with a zero second slow query threshold.

Maybe it’s just anecdotal, but I also find with performance_schema that I can react very fast, and won’t miss anything. There used to be this situation where I would need to stalk the PROCESSLIST or make sure I lower the long_query_time in time to see a particular problem.

(I should also note, that the MySQL Enterprise Monitor 3.0 uses the PERFORMANCE_SCHEMA to show statement performance. The Enterprise Tools team is one of the driving forces behind PERFORMANCE_SCHEMA – it’s great to see that the interfaces they use are available to all.)

Slow query log file

I use this for more of a long play analysis, by setting the long_query_time=0, then letting it run for between 20 minutes and a few hours. When I have enough data in the log file, I will restore the long_query_time to a high value and then copy my slow log file across the network to another server for processing.

The best tool to process this file is pt-query-digest – I routinely give it a 20GB+ file, and I love the output format it provides, since it has a histogram with useful statistics.

Even with the introduction of performance_schema, there will probably always be a use for the slow query log in some form, but in 5.6 I find I am using this method less for ‘routine’ investigations. Here is why:

  • I counted it out – what I do is an eight step process.
    1. Rotate the slow query log file (to avoid historical data skew)
    2. lower the long_query_time
    3. bounce applications that use connection pooling
    4. record for 20 minutes (watching dashboards for impact)
    5. increase the long_query_time
    6. rotate the log file again
    7. copy file to a different host
    8. run pt-query-digest
  • I am worried that in one of those eight steps, I will screw up and leave the long_query_time at zero seconds. I’ll find out when I have a midnight page saying I’m critically low on disk space.

What I think would make me use this method more, is if I were able to specify a maximum size per slow log file (i.e. 2GB) and then a maximum number of log files to retain (i.e. 10) and have MySQL manage rotation and deletion of old files for me.

Slow query log table

I don’t use this method. Since the queries in the slow log table do not include the digest, it’s not possible to aggregate them in any meaningful way. What you usually have to do, is write a script to extract the slow query log table into the slow query log file format(!), then run pt-query-digest.

This feature may be helpful for cloud providers who do not want to provide local filesystem access to customers. However, at least in the case of Amazon RDS there is a way that you can access the slow query log files.

Update: A commenter on Google+ let me know that I forgot to mention network traffic inspection as an collection source. An oversight on my part. The best use case I have for this method is to workaround step (3) I mentioned in slow query log analysis – you don’t need to bounce connections to start collecting queries.