We’ve all gotten the dreaded email/call from a user stating that a query is “slow sometimes”. If you’re lucky, the “sometimes” actually ends up being fairly consistent and you can fairly easily determine what’s happening (an errant cron job, for example). All too often though, the issue really is sporadic, fleeting, and indeterministic. So how do you track these down? And more importantly what do you do about them once found?
For starters, you as the DBA should have your PostgreSQL logging configured to log these slow performing queries. After all, you and the devs and the users can agree that all queries should complete in some measure of time (1 sec, 1 minute, etc). So, once you know what this acceptable elapsed time is, you can easily log any query that runs longer by just setting this in your
log_min_duration_statement = 1000 # log anything running longer than 1s
And now, you have all queries with long run times logged automatically. And these show up nicely in your pgBadger reports too!
If you’re lucky, you’ll be able to use
EXPLAIN to see why the query is behaving poorly. However, if your life is like mine, the explain plan will be reasonable and won’t have any smoking guns to speak of. Which means the performance is either load dependent or being influenced by other processes (something is blowing out your caches, for example). In these cases, what you really need is the
EXPLAIN output from the very instant that it performed poorly. However, you can’t go back in time to get it. But what you can do is make use of the
auto_explain module that ships with PostgreSQL.
In case the name wasn’t obvious enough, the
auto_explain module causes PostgreSQL to automatically run
EXPLAIN on queries according to thresholds that you configure. These automatically generated plans are then logged into the normal PostgreSQL logs. Let’s walk through setting it up and see how it works.
First, in your
postgresql.conf we want to enable the module:
shared_preload_libraries = 'auto_explain' # change requires restart
As stated, you will have to restart the postmaster to get the module to load. However, let’s configure it in
# Add settings for extensions here # # auto_explain # http://www.postgresql.org/docs/current/static/auto-explain.html auto_explain.log_analyze = true auto_explain.log_timing = true auto_explain.log_verbose = true auto_explain.log_min_duration = '1000ms' auto_explain.log_nested_statements = true auto_explain.log_buffers = true # auto_explain
What we’ve done here is configure
- to use the
- to use the
- to log the plan for anything running longer than 1 second (matches
- to include statements inside a function to also be logged
- to use the
As with most
GUC in PostgreSQL, these can all be changed using
SET in a given session, but we’re setting the defaults here. Now that we have them setup, let’s see what it looks like in practice.
(doug@[local]:5432/doug) # CREATE TABLE x(t text); CREATE TABLE Time: 6.022 ms (doug@[local]:5432/doug) # INSERT INTO x(t) SELECT generate_series(1,10000); INSERT 0 10000 Time: 23.565 ms (doug@[local]:5432/doug) #
We connected to PostgreSQL, created a test table, and then used
generate_series to insert 10k rows. In our logs, the following were added:
2016-11-28 13:20:30 EST : [18-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 33.987 ms statement: CREATE TABLE x(t text); 2016-11-28 13:20:59 EST : [19-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 16.461 ms plan: Query Text: INSERT INTO x(t) SELECT generate_series(1,10000); Insert on public.x (cost=0.00..50.02 rows=1000 width=32) (actual time=16.459..16.459 rows=0 loops=1) Buffers: shared hit=10085 read=47 dirtied=45 I/O Timings: read=0.012 -> Subquery Scan on "*SELECT*" (cost=0.00..50.02 rows=1000 width=32) (actual time=0.010..4.755 rows=10000 loops=1) Output: "*SELECT*".generate_series -> Result (cost=0.00..15.02 rows=1000 width=4) (actual time=0.007..1.364 rows=10000 loops=1) Output: generate_series(1, 10000) 2016-11-28 13:20:59 EST : [20-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 23.374 ms statement: INSERT INTO x(t) SELECT generate_series(1,10000); 2016-11-28 13:21:00 EST : [1-1] user=,db=,app=,client= LOG: automatic analyze of table "doug.public.x" system usage: CPU 0.00s/0.11u sec elapsed 0.14 sec
(Note that for illustrative purposes, I issued
SET auto_explain.log_min_duration = '0ms')
So, you can see that the
CREATE TABLE didn’t log anything through the
auto_explain module, but the
INSERT INTO did. This is a boring example, so let’s try a
SELECT against our table:
(doug@[local]:5432/doug) # SELECT * FROM x ORDER BY t LIMIT 10; ┌───────┐ │ t │ ├───────┤ │ 1 │ │ 10 │ │ 100 │ │ 1000 │ │ 10000 │ │ 1001 │ │ 1002 │ │ 1003 │ │ 1004 │ │ 1005 │ └───────┘ (10 rows) Time: 11.982 ms
and the logs look like:
2016-11-28 13:27:38 EST : [7-1] user=,db=,app=,client= LOG: checkpoint starting: time 2016-11-28 13:27:46 EST : [8-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 75 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=7.569 s, sync=0.092 s, total=7.920 s; sync files=23, longest=0.092 s, average=0.004 s; distance=685 kB, estimate=685 kB 2016-11-28 13:28:48 EST : [21-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 11.120 ms plan: Query Text: SELECT * FROM x ORDER BY t LIMIT 10; Limit (cost=561.10..561.12 rows=10 width=4) (actual time=11.073..11.073 rows=10 loops=1) Output: t Buffers: shared hit=45 -> Sort (cost=561.10..586.10 rows=10000 width=4) (actual time=11.072..11.072 rows=10 loops=1) Output: t Sort Key: x.t Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=45 -> Seq Scan on public.x (cost=0.00..345.00 rows=10000 width=4) (actual time=0.018..1.224 rows=10000 loops=1) Output: t Buffers: shared hit=45 2016-11-28 13:28:48 EST : [22-1] user=doug,db=doug,app=psql,client=[local] LOG: duration: 11.813 ms statement: SELECT * FROM x ORDER BY t LIMIT 10;
(You can safely ignore the checkpoint lines at the top there)
There you have both the statement we ran, and the full
EXPLAIN plan. You can see we did a sequential scan on the table (looks like it was all in the
shared_buffers too) and then we passed that up to a
sort node for an in-memory sort, and then passed that result set up to the
While this is a stupid simple example, I hope you can see that having this in production for large, complicated queries will allow you to better diagnose issues. For example, simply doing a manual
EXPLAIN ANALYZE on the same query and seeing that you get a different plan is potentially enough to rule out (or in) certain culprits for the intermittent performance issue.
1 - This option causes the postmaster to collect info on every statement executed, even if
auto_explain isn’t going to log it. It has a measurable impact on overall performance. Please test on your workload and decide for yourself if the overhead is worth the trade-off