The other day it struck me that MySQL applications have no fewer than four sources to be able to collect potentially slow queries for analysis, and that I actually find myself using 3/4 methods available.
Here are the methods listed, and my commentary on where I will use each of them:
What I mean by this, is that there is not a 1:1 between pages and queries, so sometimes just heading straight to the database can be the wrong decision. As I wrote on Monday, latency is a factor with N+1 queries, and I typically want to focus on queries in aggregate that exceed page generation goals. On the same note, I also pay less attention to queries that are part of background tasks and do not impact user experience.
I also like monitoring application tools like a hawk whenever I am running any operational changes (i.e. removing indexes that I am 99.99% confident are unused, disabling the query cache). New Relic has about a 3 minute delay over real-time, but this is usually good enough for me, since it increases my confidence on top of whatever DB dashboards I am already looking at.
In MySQL 5.6, the
performance_schema can now instrument statements. In combination with
ps_helper I find this to be easiest way to identify what has been running on a particular database server.
What I love about this, is that it’s enabled by default and can stay enabled full time with a zero second slow query threshold.
Maybe it’s just anecdotal, but I also find with
performance_schema that I can react very fast, and won’t miss anything. There used to be this situation where I would need to stalk the
PROCESSLIST or make sure I lower the
long_query_time in time to see a particular problem.
(I should also note, that the MySQL Enterprise Monitor 3.0 uses the
PERFORMANCE_SCHEMA to show statement performance. The Enterprise Tools team is one of the driving forces behind
PERFORMANCE_SCHEMA – it’s great to see that the interfaces they use are available to all.)
Slow query log file
I use this for more of a long play analysis, by setting the
long_query_time=0, then letting it run for between 20 minutes and a few hours. When I have enough data in the log file, I will restore the
long_query_time to a high value and then copy my slow log file across the network to another server for processing.
The best tool to process this file is
pt-query-digest – I routinely give it a 20GB+ file, and I love the output format it provides, since it has a histogram with useful statistics.
Even with the introduction of
performance_schema, there will probably always be a use for the slow query log in some form, but in 5.6 I find I am using this method less for ‘routine’ investigations. Here is why:
- I counted it out – what I do is an eight step process.
- Rotate the slow query log file (to avoid historical data skew)
- lower the
- bounce applications that use connection pooling
- record for 20 minutes (watching dashboards for impact)
- increase the
- rotate the log file again
- copy file to a different host
- I am worried that in one of those eight steps, I will screw up and leave the
long_query_timeat zero seconds. I’ll find out when I have a midnight page saying I’m critically low on disk space.
What I think would make me use this method more, is if I were able to specify a maximum size per slow log file (i.e. 2GB) and then a maximum number of log files to retain (i.e. 10) and have MySQL manage rotation and deletion of old files for me.
Slow query log table
I don’t use this method. Since the queries in the slow log table do not include the digest, it’s not possible to aggregate them in any meaningful way. What you usually have to do, is write a script to extract the slow query log table into the slow query log file format(!), then run
This feature may be helpful for cloud providers who do not want to provide local filesystem access to customers. However, at least in the case of Amazon RDS there is a way that you can access the slow query log files.
Update: A commenter on Google+ let me know that I forgot to mention network traffic inspection as an collection source. An oversight on my part. The best use case I have for this method is to workaround step (3) I mentioned in slow query log analysis – you don’t need to bounce connections to start collecting queries.