Tuning database performance is something developers usually either love or hate. I enjoy it and want to share some of the methods I have used recently to tune poorly performing queries in PostgreSQL. My methods are not exhaustive; think of this as a primer for those who are just wading into tuning.
Finding Slow Queries
The first obvious place to start tuning is to find the specific statements that are performing poorly.
pg_stat_statements
The pg_stat_statements module is a great place to start. It simply tracks execution statistics for SQL statements and can be an easy way to find inefficient queries.
Once you have installed this module, a system view called pg_stat_statements will be available with all of its goodness. Once it has had a chance to collect enough data, look for queries that have a relatively high total_time value. Focus on these statements first.
SELECT * FROM pg_stat_statements ORDER BY total_time DESC;
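If the module is not installed yet, a minimal setup sketch (this assumes you can edit postgresql.conf and restart the server):
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
-- then, in the database you want to inspect:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;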
auto_explain
The auto_explain module is also useful for finding slow queries, and it has two distinct advantages: it logs the actual execution plan, and it supports logging nested statements via the log_nested_statements option. Nested statements are statements that run inside a function. If your application uses a lot of functions, auto_explain is invaluable for getting detailed execution plans.
The log_min_duration option controls which query execution plans are logged, based on how long they take to run. For example, if you set it to 1000, all statements that run longer than one second will be logged.
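A minimal configuration sketch (the threshold here is just an example; session_preload_libraries or LOAD 'auto_explain' also work if you cannot restart the server):
# postgresql.conf
shared_preload_libraries = 'auto_explain'   # load the module (restart required)
auto_explain.log_min_duration = 1000        # log plans for statements slower than 1 second
auto_explain.log_analyze = on               # include actual run times, not just estimates
auto_explain.log_nested_statements = on     # also log statements executed inside functions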
Index Tuning
Another important tuning strategy is to make sure your indexes are being used correctly. As a prerequisite, we need to enable the statistics collector.
The Postgres statistics collector is a first-class subsystem that gathers all kinds of useful performance statistics.
With this collector enabled, you get tons of pg_stat_... views that hold all of this data. In particular, I have found it especially useful for finding missing and unused indexes.
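The scan counters used below come from the track_counts setting, which is on by default; a quick sanity check:
SHOW track_counts;   -- must be 'on' for the pg_stat_* table and index views to be populated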
Missing Indexes
Missing indexes are one of the easiest fixes for improving query performance. However, they are not a silver bullet and should be used correctly (more on this later). If you have the statistics collector enabled, you can run the following query (source).
SELECT
  relname,
  seq_scan - idx_scan AS too_much_seq,
  CASE WHEN seq_scan - coalesce(idx_scan, 0) > 0 THEN 'Missing Index?' ELSE 'OK' END,
  pg_relation_size(relname::regclass) AS rel_size,
  seq_scan,
  idx_scan
FROM pg_stat_all_tables
WHERE schemaname = 'public'
  AND pg_relation_size(relname::regclass) > 80000
ORDER BY too_much_seq DESC;
The query finds tables that have had more sequential scans (Seq Scan) than index scans - a clear indication that an index will help. It will not tell you which columns to create the index on, so that takes a bit more work. However, knowing which tables need indexes is a good first step.
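Once you know the table and the columns that appear in its frequent WHERE clauses, adding the index is simple. A hypothetical example (the table and column names are made up; CONCURRENTLY avoids blocking writes while the index is built):
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);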
Unused Indexes
Index all the things, right? Did you know that unused indexes can negatively affect write performance? The reason is that when an index exists, Postgres has the added task of updating that index after write operations (INSERT / UPDATE / DELETE). So adding an index is a balancing act: it can speed up reads (if it is created correctly), but it will slow down writes. To find unused indexes, you can run the following query.
SELECT
  indexrelid::regclass AS index,
  relid::regclass AS table,
  'DROP INDEX ' || indexrelid::regclass || ';' AS drop_statement
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND indisunique IS false;
Note about development environment statistics
Relying on statistics from a local development database can be problematic. Ideally, you would get the statistics above from your production machine, or generate them from a restored production backup. Why? Environmental factors can change the behavior of the Postgres query optimizer. Two examples:
- When the machine has less memory, PostgreSQL may not be able to perform a Hash Join that it otherwise could (and would) use, which would be faster.
- When the table doesn't have as many rows (as in a development database), PostgreSQL may prefer a sequential scan of the table over an available index. When table sizes are small, a Seq Scan can be faster. (Note: you can run SET enable_seqscan = OFF in a session so that the optimizer favors indexes even when a sequential scan might be faster. This is useful when working with development databases that do not have much data.)
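For example, a quick way to see the index-based plan in a sparsely populated development database (the table and filter are hypothetical):
SET enable_seqscan = OFF;   -- session-only: discourage sequential scans
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'someone@example.com';
RESET enable_seqscan;       -- restore the default planner behavior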
Understanding Execution Plans
Now that you've found a few slow queries, it's time for the fun to begin.
EXPLAIN
The EXPLAIN command is an absolute must when tuning queries. It tells you what is really going on. To use it, simply prepend EXPLAIN to your query and run it. PostgreSQL will show you the execution plan it used.
When using EXPLAIN for tuning, I recommend always using the ANALYZE option (EXPLAIN ANALYZE), as it gives you more accurate results. The ANALYZE option actually executes the statement (rather than just estimating it) and then explains it.
Let's dive in and start understanding the output of EXPLAIN. Here is an example:
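(The query and table below are hypothetical; the top node's cost and timing figures are the ones picked apart in the sections that follow.)
EXPLAIN ANALYZE SELECT category_id, count(*) FROM products GROUP BY category_id;
HashAggregate  (cost=146.63..148.65 rows=202 width=12) (actual time=55.009..55.012 rows=7 loops=1)
  Group Key: category_id
  ->  Seq Scan on products  (cost=0.00..121.50 rows=5025 width=4) (actual time=0.011..29.431 rows=5025 loops=1)
Planning time: 0.152 ms
Execution time: 55.061 ms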
Nodes
The first thing to understand is that each indented block preceded by an "->" (along with the top line) is called a node. A node is a logical unit of work (a "step", if you like) with an associated cost and execution time. The cost and time shown on each node are cumulative and roll up all of their child nodes. This means that the topmost line (node) shows the cumulative cost and actual time for the entire statement. This is important because you can easily drill down to determine which nodes are the bottleneck.
Cost
cost=146.63..148.65
The first number is the startup cost (the cost to retrieve the first record), and the second number is the cost of running the whole node (the total cost from start to finish).
This is effectively how much work PostgreSQL estimates it will have to do to run the statement. The number does not tell you how long the statement will take to run, although there is usually a direct correlation. Cost is a combination of five work components used to estimate the required work: sequential page fetches, non-sequential (random) page fetches, row processing, operator (function) processing, and index entry processing. The cost represents I/O and CPU activity, and the important thing to know here is that a relatively higher cost means PostgreSQL thinks it will have to do more work. The optimizer decides which execution plan to use based on cost, and it prefers lower costs.
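Those five components correspond to the planner's cost parameters, which you can inspect in any session (the values in the comments are the defaults):
SHOW seq_page_cost;         -- sequentially fetched disk page (default 1.0)
SHOW random_page_cost;      -- non-sequentially fetched disk page (default 4.0)
SHOW cpu_tuple_cost;        -- processing one row (default 0.01)
SHOW cpu_index_tuple_cost;  -- processing one index entry (default 0.005)
SHOW cpu_operator_cost;     -- processing one operator or function call (default 0.0025)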
Actual time
actual time=55.009..55.012
In milliseconds, the first number is the startup time (the time to retrieve the first record), and the second number is the time it took to run the whole node (the total time from start to finish). Easy to understand, right?
In the above example, it took 55.009 ms to get the first record and 55.012 ms to complete the entire node.
Learning more about execution plans
There are some really good articles on understanding EXPLAIN output. Instead of trying to retell them here, I recommend taking the time to really understand them by going to these two great resources:
Query Tuning
Now that you know which statements are performing poorly and can see their execution plans, it's time to start tweaking your queries to get better performance. This is where you make changes to your queries and/or add indexes to try to get a better execution plan. Start with the bottlenecks and see if there are changes you can make to reduce the cost and/or execution time.
A Note on Data Caching and Costs
When making changes and evaluating execution plans to see whether things have improved, it is important to know that subsequent executions may rely on cached data, which can create the impression of an improvement. If you run a query once, make a fix, and run it a second time, it will most likely run much faster even if the execution plan is no better. This is because PostgreSQL may have cached the data used on the first run and can reuse it on the second run. Therefore, run each query at least three times and average the results when comparing costs.
Things I've learned that can help improve execution plans:
- Indexes
- Eliminate sequential scans (Seq Scan) by adding indexes (unless the table size is small)
- When using a multi-column index, pay attention to the order in which you define the included columns - More Information (a sketch follows this group)
- Try to use indexes that are highly selective on commonly filtered data. This makes their use more efficient.
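A hypothetical sketch of the column-order point: if queries filter on customer_id with equality and then restrict or sort by created_at, the equality column should lead (the names are made up):
CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);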
- WHERE clauses
- Avoid LIKE
- Avoid function calls in the WHERE clause (an example follows this group)
- Avoid large IN () lists
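A hypothetical illustration of the function-call point: wrapping the column in a function defeats a plain index on it, so either rewrite the predicate or index the expression (names are made up):
-- A plain index on users(email) cannot be used here:
SELECT * FROM users WHERE lower(email) = 'someone@example.com';
-- An expression index makes this predicate indexable:
CREATE INDEX idx_users_email_lower ON users (lower(email));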
- JOINs
- When joining tables, try to use a simple equality condition in the ON clause (i.e. a.id = b.person_id). This allows more efficient join methods to be used (i.e. a Hash Join rather than a Nested Loop Join)
- Convert subqueries to JOINs whenever possible, as this usually lets the optimizer understand the intent and possibly choose a better plan (see the sketch after this group)
- Use JOINs correctly: are you using GROUP BY or DISTINCT only because you are getting duplicate results? This usually indicates improper use of JOINs and may result in higher costs
- If the execution plan uses a Hash Join, it can be very slow if the table size estimates are wrong. So make sure your table statistics are accurate by reviewing your vacuuming strategy
- Avoid correlated subqueries where possible; they can significantly increase query cost
- Use EXISTS when checking for the existence of rows based on a criterion, because it short-circuits (it stops processing once it finds at least one match)
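A hypothetical sketch of the subquery-to-JOIN and EXISTS points (table and column names are made up):
-- Instead of: SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE country = 'DE')
-- a JOIN expresses the same intent and often gives the planner more options:
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE';
-- EXISTS stops as soon as one matching row is found:
SELECT c.*
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);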
- General recommendations