Commentary on MySQL slow query collection sources

The other day it struck me that MySQL applications have no fewer than four sources to be able to collect potentially slow queries for analysis, and that I actually find myself using 3/4 methods available.

Here are the methods listed, and my commentary on where I will use each of them:

Application Logging/Monitoring

I enjoy using Application-level tools such as New Relic or xhprof to find slow experiences rather than just slow queries.

What I mean by this, is that there is not a 1:1 between pages and queries, so sometimes just heading straight to the database can be the wrong decision. As I wrote on Monday, latency is a factor with N+1 queries, and I typically want to focus on queries in aggregate that exceed page generation goals. On the same note, I also pay less attention to queries that are part of background tasks and do not impact user experience.

I also like monitoring application tools like a hawk whenever I am running any operational changes (i.e. removing indexes that I am 99.99% confident are unused, disabling the query cache). New Relic has about a 3 minute delay over real-time, but this is usually good enough for me, since it increases my confidence on top of whatever DB dashboards I am already looking at.

Performance Schema

In MySQL 5.6, the performance_schema can now instrument statements. In combination with ps_helper I find this to be easiest way to identify what has been running on a particular database server.

What I love about this, is that it’s enabled by default and can stay enabled full time with a zero second slow query threshold.

Maybe it’s just anecdotal, but I also find with performance_schema that I can react very fast, and won’t miss anything. There used to be this situation where I would need to stalk the PROCESSLIST or make sure I lower the long_query_time in time to see a particular problem.

(I should also note, that the MySQL Enterprise Monitor 3.0 uses the PERFORMANCE_SCHEMA to show statement performance. The Enterprise Tools team is one of the driving forces behind PERFORMANCE_SCHEMA – it’s great to see that the interfaces they use are available to all.)

Slow query log file

I use this for more of a long play analysis, by setting the long_query_time=0, then letting it run for between 20 minutes and a few hours. When I have enough data in the log file, I will restore the long_query_time to a high value and then copy my slow log file across the network to another server for processing.

The best tool to process this file is pt-query-digest – I routinely give it a 20GB+ file, and I love the output format it provides, since it has a histogram with useful statistics.

Even with the introduction of performance_schema, there will probably always be a use for the slow query log in some form, but in 5.6 I find I am using this method less for ‘routine’ investigations. Here is why:

I counted it out – what I do is an eight step process.
1. Rotate the slow query log file (to avoid historical data skew)
2. lower the long_query_time
3. bounce applications that use connection pooling
4. record for 20 minutes (watching dashboards for impact)
5. increase the long_query_time
6. rotate the log file again
7. copy file to a different host
8. run pt-query-digest
I am worried that in one of those eight steps, I will screw up and leave the long_query_time at zero seconds. I’ll find out when I have a midnight page saying I’m critically low on disk space.

What I think would make me use this method more, is if I were able to specify a maximum size per slow log file (i.e. 2GB) and then a maximum number of log files to retain (i.e. 10) and have MySQL manage rotation and deletion of old files for me.

Slow query log table

I don’t use this method. Since the queries in the slow log table do not include the digest, it’s not possible to aggregate them in any meaningful way. What you usually have to do, is write a script to extract the slow query log table into the slow query log file format(!), then run pt-query-digest.

This feature may be helpful for cloud providers who do not want to provide local filesystem access to customers. However, at least in the case of Amazon RDS there is a way that you can access the slow query log files.

Update: A commenter on Google+ let me know that I forgot to mention network traffic inspection as an collection source. An oversight on my part. The best use case I have for this method is to workaround step (3) I mentioned in slow query log analysis – you don’t need to bounce connections to start collecting queries.

Published by

morgo

I joined MySQL AB in 2006, left, and am now back at Oracle working on the MySQL team. I’ve also worked at Percona and InPowered.

View all posts by morgo

Bill Karwin

You might be interested in a script I wrote to automate your eight-step process for collecting full slow logs. You can specify a finite period of time for it to run, then my script carefully sets all variables back to their initial values when it’s done — it even traps Ctrl-C and does the right thing. And while it’s running, it directs log output to a distinct log file.

My script is GPL and it’s called “full-slow-log” at this github repo: https://github.com/billkarwin/bk-tools

I also wrote a script to dump the contents of the slow log table in a format that mimics the file-based log. It’s called “export-slow-log-table” in the same repo. It’s not rocket science, but it may be handy to have it already done.

I don’t use the table-based logging either. It causes a high amount of overhead if you have a lot of query traffic and you drop long_query_time down to zero (up to 38% performance cost to the MySQL instance, measured by Giuseppe http://bugs.mysql.com/bug.php?id=30414).

Also, the table-based log stores query times as integers, so it truncates (i.e. floors) fractional time figures. That’s pretty useless for producing digest reports. The thing I never understood was that the function that writes to the log table actually takes a float as input, but deliberately casts it to an integer. I checked 5.6, and although the logging code has been refactored, it still truncates query times.
- Morgan Tocker
  
  Thanks for the links! Love that the scripts are just shell scripts – it certainly makes it easier to install on foreign systems.
  
  Small note on the overhead w/table-based logging: it was greatly reduced in 5.5. The 5.1 manual has a special note in it making reference to the overhead.
- Vitaly Karasik
  
  Bill, is your “export-slow-log-table” script may be used for converting CSV [export of slow table] to logfile?
  thank you
  Vitaly
  - Bill Karwin
    
    Hi Vitaly, my script is just an SQL query that you can use against the slow log table. One would use my script *instead* of exporting to CSV.
    
    It does not read CSV files. If one were to write a script to do that, I’d recommend to use perl or even awk for that.
Keith W.

Morgan, how do you go about getting access to the slow log files on Amazon RDS? I was talking to someone in the office just yesterday about the slow log table truncating the values and how I wanted access to our ‘raw’ slow log files.
- Morgan Tocker
  
  It was released in March this year: http://aws.typepad.com/aws/2013/03/amazon-rds-access-to-database-log-files.html
  - Keith W.
    
    Ah, I had seen that earlier this year but was under the impression that if you had the table active those logs would not be available to you. Thank you.
    - Morgan Tocker
      
      Normally you would have one or the other enabled (log or table), but I was just checking the manual and it suggests you can have both: http://dev.mysql.com/doc/refman/5.6/en/log-destinations.html
Arjen Lentz

Oh dude you so know better.
You don’t just want to know about a query, but also WHY it might be slow.
Information on whether it did a table scan on first table, secondary tables, did a file sort in memory or on disk, used temporary tables in memory or on disk, etc.

Grabbing such info out of the server through an intermediate tool (such as the enterprise query analyser does), or via other tweaks like NewRelic, it’s unnecessary overhead. Don’t get me wrong – we use NewRelic, but not specifically for that purpose. One should use the correct tool for the job.

For years already there have been the patches, by Google, Percona, Open Query and others, in MySQL 5.0 OurDelta builds and now MariaDB and Percona server, to log such information directly with the slow queries themselves. Since the server already has the information, there is no additional overhead.

I regard it as downright rude to insist that users install extra tools or incur additional overhead on their production environment, just because the vendor of the server software sees a commercial benefit in not properly tooling the server. This applied to MySQL AB long ago when they started the query analyser stuff, and the situation still exists today at Oracle.

As community person at Oracle now, don’t disqualify yourself as a viable discussion partner by merely sprouting marketing fluff and “good news” in this way. I’d be extremely disappointed if that’s what we get to see from you. You’re so much better and smarter than that. Don’t be used as a tool by the corporation you represent.
- Morgan Tocker
  
  In the post I said I use 3 different sources because I have 3 different modes of usage. New Relic solves a need I can not from within MySQL – which is slow experiences. Many people optimize for throughput only (and not response) which is a mistake in my opinion.
  
  If the aim is to improve throughput, then certainly the slow query log is a good source. But performance_schema can also show queries that have filesorts, temp tables on disk.
  
  No instrumentation has zero overhead. By using instrumentation you are also not looking for what is expensive, but what does not make sense. So in that regard, not all users are receiving the performance they are entitled to.
  
  I retain all editorial control of my blogging despite being an Oracle employee. Opinions stated here are my own.
  - Arjen Lentz
    
    Morgan, the key issue is, and always has been, overhead.
    
    Can you compare the server overhead induced by using PERFORMANCE_SCHEMA for the task described, compared to the ~0 overhead approach that the slow log patch delivers?
    
    Of course we can instrument the server either internally or externally in any way we like, but if it significantly hinders production environments, it’s useless.
    Sometimes there are no other options – in this particular example there are. That makes the expensive options unjustifiable for the task. Unfortunately, the ~0 option is not available in the Oracle build, is it?