We’ve all gotten the dreaded email or call from a user stating that a query is “slow sometimes”. If you’re lucky, the “sometimes” actually turns out to be fairly consistent and you can easily determine what’s happening (an errant cron job, for example). All too often, though, the issue really is sporadic, fleeting, and nondeterministic. So how do you track these down? And, more importantly, what do you do about them once found?
For starters, you as the DBA should have your PostgreSQL logging configured to log these slow-performing queries. After all, you, the devs, and the users can agree that every query should complete within some measure of time (1 second, 1 minute, etc.). Once you know what this acceptable elapsed time is, you can easily log any query that runs longer by setting this in your postgresql.conf:
log_min_duration_statement = 1000 # log anything running longer than 1s
And now you have all queries with long run times logged automatically. These show up nicely in your pgBadger reports too!
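To give a sense of what that produces, a slow statement then shows up in the server log roughly like this (an illustrative, made-up entry; the table name and exact prefix depend on your schema and log_line_prefix):
LOG: duration: 4236.118 ms statement: SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at;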
If you’re lucky, you’ll be able to use EXPLAIN to see why the query is behaving poorly. However, if your life is like mine, the explain plan will be reasonable and won’t have any smoking guns to speak of. Which means the performance is either load dependent or being influenced by other processes (something is blowing out your caches, for example). In these cases, what you really need is the EXPLAIN output from the very instant that it performed poorly. However, you can’t go back in time to get it. But what you can do is make use of the auto_explain module that ships with PostgreSQL.
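As a reminder, the manual plan-gathering described above is just something like this (a sketch; the table and filter are placeholders):
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at;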
In case the name wasn’t obvious enough, the auto_explain module causes PostgreSQL to automatically run EXPLAIN on queries according to thresholds that you configure. These automatically generated plans are then logged into the normal PostgreSQL logs. Let’s walk through setting it up and see how it works.
First, in your postgresql.conf we want to enable the module:
shared_preload_libraries = 'auto_explain' # change requires restart
As stated, you will have to restart the postmaster to get the module to load. However, let’s configure it in postgresql.conf first:
# Add settings for extensions here
#
# auto_explain
# http://www.postgresql.org/docs/current/static/auto-explain.html
auto_explain.log_analyze = true
auto_explain.log_timing = true
auto_explain.log_verbose = true
auto_explain.log_min_duration = '1000ms'
auto_explain.log_nested_statements = true
auto_explain.log_buffers = true
# auto_explain
What we’ve done here is configure auto_explain to:
- run EXPLAIN ANALYZE rather than plain EXPLAIN¹ (auto_explain.log_analyze)
- include the TIMING option of EXPLAIN (auto_explain.log_timing)
- include the VERBOSE option of EXPLAIN (auto_explain.log_verbose)
- only log plans for queries running longer than 1000ms (auto_explain.log_min_duration, just like log_min_duration_statement, above)
- also log plans for statements run inside functions (auto_explain.log_nested_statements)
- include the BUFFERS option of EXPLAIN (auto_explain.log_buffers)
As with most GUCs in PostgreSQL, these can all be changed using SET in a given session, but we’re setting the defaults here. Now that we have them set up, let’s see what it looks like in practice.
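If you just want to experiment, these can also be flipped for a single session rather than changing the server-wide defaults. A minimal sketch (it assumes a superuser session; the LOAD line is only needed when the module isn’t already in shared_preload_libraries):
LOAD 'auto_explain';
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
SET auto_explain.log_min_duration = '250ms'; -- lower threshold for this session only
Anything this session runs past that threshold will then show up in the server log with its plan, just like the examples below.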
> CREATE TABLE x(t text);
CREATE TABLE
Time: 6.022 ms
> INSERT INTO x(t) SELECT generate_series(1,10000);
INSERT 0 10000
Time: 23.565 ms
We connected to PostgreSQL, created a test table, and then used generate_series to insert 10k rows. In our logs, the following entries were added:
2016-11-28 13:20:30 EST [28838]: [18-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 33.987 ms statement: CREATE TABLE x(t text);
2016-11-28 13:20:59 EST [28838]: [19-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 16.461 ms plan:
Query Text: INSERT INTO x(t) SELECT generate_series(1,10000);
Insert on public.x (cost=0.00..50.02 rows=1000 width=32) (actual time=16.459..16.459 rows=0 loops=1)
Buffers: shared hit=10085 read=47 dirtied=45
I/O Timings: read=0.012
-> Subquery Scan on "*SELECT*" (cost=0.00..50.02 rows=1000 width=32) (actual time=0.010..4.755 rows=10000 loops=1)
Output: "*SELECT*".generate_series
-> Result (cost=0.00..15.02 rows=1000 width=4) (actual time=0.007..1.364 rows=10000 loops=1)
Output: generate_series(1, 10000)
2016-11-28 13:20:59 EST [28838]: [20-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 23.374 ms statement: INSERT INTO x(t) SELECT generate_series(1,10000);
2016-11-28 13:21:00 EST [30079]: [1-1] user=,db=,app=,client= LOG: automatic analyze of table "doug.public.x" system usage: CPU 0.00s/0.11u sec elapsed 0.14 sec
(Note that for illustrative purposes, I issued SET auto_explain.log_min_duration = '0ms'.)
So, you can see that the CREATE TABLE didn’t log anything through the auto_explain module, but the INSERT INTO did. This is a boring example, so let’s try a SELECT against our table:
> SELECT * FROM x ORDER BY t LIMIT 10;
┌───────┐
│ t │
├───────┤
│ 1 │
│ 10 │
│ 100 │
│ 1000 │
│ 10000 │
│ 1001 │
│ 1002 │
│ 1003 │
│ 1004 │
│ 1005 │
└───────┘
(10 rows)
Time: 11.982 ms
and the logs look like:
2016-11-28 13:27:38 EST [322]: [7-1] user=,db=,app=,client= LOG: checkpoint starting: time
2016-11-28 13:27:46 EST [322]: [8-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 75 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=7.569 s, sync=0.092 s, total=7.920 s; sync files=23, longest=0.092 s, average=0.004 s; distance=685 kB, estimate=685 kB
2016-11-28 13:28:48 EST [28838]: [21-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 11.120 ms plan:
Query Text: SELECT * FROM x ORDER BY t LIMIT 10;
Limit (cost=561.10..561.12 rows=10 width=4) (actual time=11.073..11.073 rows=10 loops=1)
Output: t
Buffers: shared hit=45
-> Sort (cost=561.10..586.10 rows=10000 width=4) (actual time=11.072..11.072 rows=10 loops=1)
Output: t
Sort Key: x.t
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=45
-> Seq Scan on public.x (cost=0.00..345.00 rows=10000 width=4) (actual time=0.018..1.224 rows=10000 loops=1)
Output: t
Buffers: shared hit=45
2016-11-28 13:28:48 EST [28838]: [22-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 11.813 ms statement: SELECT * FROM x ORDER BY t LIMIT 10;
(You can safely ignore the checkpoint lines at the top there.)
There you have both the statement we ran and the full EXPLAIN plan. You can see we did a sequential scan on the table (looks like it was all in shared_buffers too), then we passed that up to a sort node for an in-memory sort, and then passed that result set up to the limit node.
While this is a stupid simple example, I hope you can see that having this in production for large, complicated queries will allow you to better diagnose issues. For example, simply doing a manual EXPLAIN ANALYZE on the same query and seeing that you get a different plan is potentially enough to rule out (or in) certain culprits for the intermittent performance issue.
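To make that comparison apples-to-apples, run EXPLAIN by hand with the same options auto_explain was configured with above (a sketch, reusing the toy table from this post) and diff it against the plan captured in the logs:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT * FROM x ORDER BY t LIMIT 10;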
1 - This option causes the postmaster to collect info on every statement executed, even if auto_explain isn’t going to log it. It has a measurable impact on overall performance. Please test on your workload and decide for yourself if the overhead is worth the trade-off.
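If that overhead proves too costly for your workload, one compromise worth testing (a sketch of the relevant postgresql.conf lines, not part of the setup above) is to keep log_analyze on for real row counts but skip the per-node timing:
auto_explain.log_analyze = true
auto_explain.log_timing = false # keep actual row counts, but skip the per-node timing overhead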