SQL Guide: How to Write Queries Better (Part 2)

Continued from SQL Guide: How to Write Queries Better (Part 1).



From query to execution plans



Antipatterns are not static: they evolve as you grow as a SQL developer, and there is a lot to keep in mind when weighing alternatives, which means that avoiding antipatterns and rewriting queries can be quite a difficult task. Any help comes in handy, which is why a more structured approach to query optimization, supported by some tools, may be the most effective.



It should also be noted that some of the antipatterns mentioned in the last section are rooted in performance issues, such as the AND, OR, and NOT operators and the fact that they can prevent indexes from being used. Thinking about performance requires not only a more structured, but also a deeper approach.



However, this structured and in-depth approach will mainly be based on the query plan, which, as you may recall, is the result of the query first being parsed into a “parse tree” and which determines exactly which algorithm is used for each operation and how the execution of the operations is coordinated.



Query optimization



As you read in the introduction, you may need to examine and manually tune the plans produced by the query optimizer. In such cases, you will need to analyze your query again by looking at the query plan.



To access this plan, you need to use the tools that your database management system provides. Which tools are at your disposal depends on the DBMS; the examples that follow use PostgreSQL.





Please note that in PostgreSQL you can make a distinction between EXPLAIN, where you simply get a description of how the planner intends to execute the query without running it, and EXPLAIN ANALYZE, which actually executes the query and returns both the estimated and the actual query plan. Generally speaking, an actual execution plan is one where the query was really executed, while an estimated execution plan works out what the query would do without executing it. Although the two are logically equivalent, the actual execution plan is much more useful, as it contains additional details and statistics about what really happened when the query was executed.



In the rest of this section, you will learn more about EXPLAIN and ANALYZE and how to use them to learn more about your query plan and its likely performance. To do this, we will start with a few examples that work with two tables: one_million and half_million.
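The article does not show how these two tables were built; a plausible sketch in PostgreSQL uses generate_series (the counter column appears in the join examples below, while the data filler column is an assumption to give the rows some width):

 CREATE TABLE one_million AS
   SELECT generate_series(1, 1000000) AS counter,
          md5(random()::text)         AS data;  -- filler column so rows have some width

 CREATE TABLE half_million AS
   SELECT generate_series(1, 500000) AS counter;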



You can retrieve the query plan for the one_million table using EXPLAIN; be sure to place it directly before the query. After you execute it, the query plan is returned:



 EXPLAIN SELECT * FROM one_million;

                        QUERY PLAN
 ____________________________________________________
  Seq Scan on one_million  (cost=0.00..18584.82 rows=1025082 width=36)
 (1 row)





In this case, you see that the estimated cost of the query is 0.00..18584.82, that the estimated number of rows is 1025082, and that the estimated average row width is 36 bytes.



In addition, you can update the statistics with ANALYZE:



 ANALYZE one_million;
 EXPLAIN SELECT * FROM one_million;

                        QUERY PLAN
 ____________________________________________________
  Seq Scan on one_million  (cost=0.00..18334.00 rows=1000000 width=37)
 (1 row)
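ANALYZE refreshes the per-column statistics that the planner bases these estimates on. If you are curious, PostgreSQL exposes a readable view of those statistics; a small sketch:

 -- Per-column statistics gathered by ANALYZE, as seen through the pg_stats view.
 SELECT attname, n_distinct, null_frac
 FROM pg_stats
 WHERE tablename = 'one_million';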





Besides EXPLAIN and ANALYZE, you can also retrieve the actual runtime with EXPLAIN ANALYZE:



 EXPLAIN ANALYZE SELECT * FROM one_million;

                        QUERY PLAN
 ___________________________________________________________
  Seq Scan on one_million  (cost=0.00..18334.00 rows=1000000 width=37)
                           (actual time=0.015..1207.019 rows=1000000 loops=1)
  Total runtime: 2320.146 ms
 (2 rows)





The disadvantage of using EXPLAIN ANALYZE is that the query is actually executed, so be careful with this!
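If you ever need the actual timings for a statement that modifies data, one common safeguard in PostgreSQL is to wrap it in a transaction that you roll back; a minimal sketch, using an arbitrary filter on the one_million table:

 BEGIN;
 EXPLAIN ANALYZE DELETE FROM one_million WHERE counter < 100;
 ROLLBACK;  -- the DELETE really ran (and was timed), but its effects are undone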



So far, all the algorithms you have seen perform a Seq Scan (sequential scan), also known as a full table scan: a scan in which each row of the table is read in sequential order and checked against the query's conditions. In terms of performance, a sequential scan is certainly not the best execution plan, because you still read the entire table. However, it is not so bad when the table does not fit into memory anyway: sequential reads are quite fast even on slow disks.



You will learn more about this later, when we talk about index scans.



However, there are other algorithms. Take, for example, this query plan for a join:



 EXPLAIN ANALYZE SELECT * FROM one_million
 JOIN half_million ON (one_million.counter=half_million.counter);

                        QUERY PLAN
 _________________________________________________________________
  Hash Join  (cost=15417.00..68831.00 rows=500000 width=42)
             (actual time=1241.471..5912.553 rows=500000 loops=1)
    Hash Cond: (one_million.counter = half_million.counter)
    ->  Seq Scan on one_million  (cost=0.00..18334.00 rows=1000000 width=37)
                                 (actual time=0.007..1254.027 rows=1000000 loops=1)
    ->  Hash  (cost=7213.00..7213.00 rows=500000 width=5)
              (actual time=1241.251..1241.251 rows=500000 loops=1)
          Buckets: 4096  Batches: 16  Memory Usage: 770kB
          ->  Seq Scan on half_million  (cost=0.00..7213.00 rows=500000 width=5)
                                        (actual time=0.008..601.128 rows=500000 loops=1)
  Total runtime: 6468.337 ms





You can see that the query optimizer chose a Hash Join here! Remember this operation, as you will need it to estimate the time complexity of your query. For now, note that there is no index on half_million.counter; we add one in the following example:



 CREATE INDEX ON half_million(counter);
 EXPLAIN ANALYZE SELECT * FROM one_million
 JOIN half_million ON (one_million.counter=half_million.counter);

                        QUERY PLAN
 ________________________________________________________________
  Merge Join  (cost=4.12..37650.65 rows=500000 width=42)
              (actual time=0.033..3272.940 rows=500000 loops=1)
    Merge Cond: (one_million.counter = half_million.counter)
    ->  Index Scan using one_million_counter_idx on one_million
          (cost=0.00..32129.34 rows=1000000 width=37)
          (actual time=0.011..694.466 rows=500001 loops=1)
    ->  Index Scan using half_million_counter_idx on half_million
          (cost=0.00..14120.29 rows=500000 width=5)
          (actual time=0.010..683.674 rows=500000 loops=1)
  Total runtime: 3833.310 ms
 (5 rows)





You can see that, by creating the index, the query optimizer has now decided to use a Merge Join, with Index Scans on both tables.



Note the difference between an index scan and a full table scan, or sequential scan: the former scans the data or index pages to find the matching records, while the latter scans each row of the table.



You can see that the total runtime has decreased and that performance should be better, but there are two index scans, which makes memory more important here, especially if the table does not fit into it. In such cases, you first have to perform a full index scan, which consists of fast sequential reads and is not a problem, but then you have many random reads to fetch rows by index value. Those random reads are usually several orders of magnitude slower than sequential ones. In such cases, a full table scan can indeed be faster than a full index scan.
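To judge how much of this work is served from cache rather than disk, PostgreSQL (9.0 and later) offers the BUFFERS option of EXPLAIN; a sketch using the join from above:

 -- The "Buffers: shared hit=... read=..." lines in the output distinguish pages
 -- found in the buffer cache (hit) from pages that had to be read from disk (read).
 EXPLAIN (ANALYZE, BUFFERS)
 SELECT * FROM one_million
 JOIN half_million ON (one_million.counter = half_million.counter);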



Tip: If you want to learn more about EXPLAIN or study the examples in more detail, consider reading Guillaume Lelarge's Understanding Explain.



Time Complexity and Big O



Now that you’ve briefly reviewed the query plan, you can start digging deeper and thinking about performance in more formal terms, using the theory of computational complexity. This is a field of theoretical computer science that, among other things, focuses on classifying computational problems according to their difficulty; these computational problems can be algorithms, but also queries.



For queries, however, the classification is not necessarily by difficulty, but rather by the time it takes to run them and get results back. This is called time complexity, and you can use big O notation to express or measure this type of complexity.



With big O notation, you express the runtime in terms of how quickly it grows relative to the input, as the input becomes arbitrarily large. Big O notation leaves out coefficients and lower-order terms so that you can focus on the important part of your query's runtime: its growth rate. For example, a runtime of 3n^2 + 10n + 4 is written simply as O(n^2). Expressed this way, with coefficients and lower-order terms dropped, the time complexity is said to be described asymptotically, that is, as the input size goes to infinity.



In database terms, complexity measures how much longer a query takes to run as the size of the data tables, and therefore of the database, grows.



Please note that the size of your database grows not only as the amount of data in the tables increases; the mere presence of indexes also plays a role in its size.
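A quick way to see how much indexes contribute in PostgreSQL is its built-in size functions; a sketch for the one_million table from earlier:

 SELECT pg_size_pretty(pg_relation_size('one_million'))       AS table_size,
        pg_size_pretty(pg_indexes_size('one_million'))        AS index_size,
        pg_size_pretty(pg_total_relation_size('one_million')) AS total_size;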



Estimating the time complexity of your query plan



As you saw earlier, the execution plan determines, among other things, which algorithm is used for each operation, which makes it possible to logically express every query's execution time as a function of the size of the tables involved in the query plan; this function is called the complexity function. In other words, you can use big O notation and your execution plan to estimate a query's complexity and performance.



In the following sections, you will get an overview of the four most common types of time complexity, and you will see some examples of how the time complexity of a query can vary depending on the context in which it runs.



Hint: indexes are part of this story!



However, it should be noted that there are different types of indexes, different execution plans, and different implementations across databases, so the time complexities listed below are very general and may vary depending on your specific setup.



Constant Time: O(1)



An algorithm is said to run in constant time if it needs the same amount of time regardless of the size of its input. For a query, this means that it runs in constant time if it requires the same amount of time regardless of the size of the table.



This type of query is not really common, but here is one such example:



 SELECT TOP 1 t.* FROM t;





The time complexity is constant because a single arbitrary row is selected from the table, so the amount of time taken should not depend on the size of the table.
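Note that TOP is SQL Server syntax; in PostgreSQL, which the earlier examples use, the equivalent query would be written with LIMIT:

 SELECT * FROM t LIMIT 1;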



Linear Time: O(n)



An algorithm is said to run in linear time if its execution time is directly proportional to the size of its input, that is, the time grows linearly with the input size. For databases, this means that the execution time is directly proportional to the size of the table: as the number of rows in the table grows, so does the query's execution time.



An example is a query with a WHERE clause on an unindexed column: a full table scan or Seq Scan will be required, which leads to O(n) time complexity. This means that every row must be read to find the row with the right identifier (ID). You have no limit at all, so every row does need to be read, even if the first row already matches the condition.
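A minimal sketch of such a query, assuming the item table used in the examples below and no index on its i_stock column:

 -- Without an index on i_stock, the planner falls back to a Seq Scan and
 -- checks the condition against every row: O(n).
 EXPLAIN SELECT * FROM item WHERE i_stock < 10;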



Consider also the following example query, which has O(n) complexity if there is no index on the i_id field:



 SELECT i_id FROM item;
      
      







Linear runtime is also closely connected to the runtime of plans that join tables. Here are some examples:

  • A hash join of two tables with M and N rows has an expected complexity of O(M + N).
  • A merge join of two inputs that are already sorted, for example through indexes, also runs in O(M + N).
  • A nested join, by contrast, has a complexity of O(M * N) in the general case.

Remember: a nested join is a join that compares each record in one table with each record in another.



Logarithmic Time: O(log(n))



An algorithm is said to run in logarithmic time if its execution time is proportional to the logarithm of the input size; for queries, this means that a query runs in logarithmic time if its execution time is proportional to the logarithm of the database size.



This logarithmic time complexity holds for query plans in which an Index Scan or a clustered index scan is used. A clustered index is an index whose leaf level contains the actual rows of the table. In other respects, a clustered index is like any other index: it is defined on one or more columns, which form the index key; the clustering key consists of the key columns of a clustered index. A clustered index scan is, essentially, your DBMS reading the row or rows from top to bottom in the clustered index.
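For the next example to run as an index scan, an index on i_id has to exist; a sketch, with a hypothetical index name:

 -- A point lookup on i_id can then be answered in O(log(n)) via the index.
 CREATE INDEX item_i_id_idx ON item (i_id);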



Consider the following example query, which uses the index on i_id and usually results in O(log(n)) complexity:



 SELECT i_stock FROM item WHERE i_id = N;
      
      





Note that without the index, the time complexity would be O(n).



Quadratic Time: O(n^2)



An algorithm is said to run in quadratic time if its execution time is proportional to the square of the input size. Again, for databases, this means that a query's execution time is proportional to the square of the database size.



A possible example of a quadratic time complexity query is the following:



 SELECT * FROM item, author WHERE item.i_a_id=author.a_id
      
      





The minimum complexity may be O(n log(n)), but the maximum may be O(n^2), depending on the index information of the join attributes.
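If you want to observe how the choice of join algorithm affects this, PostgreSQL lets you disable individual join strategies for the current session; a sketch meant for experimentation only, not for production:

 SET enable_hashjoin = off;
 SET enable_mergejoin = off;
 -- With hash and merge joins disabled, the planner typically falls back to a
 -- Nested Loop, whose cost without an index grows on the order of O(M * N).
 EXPLAIN SELECT * FROM item, author WHERE item.i_a_id = author.a_id;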



To summarize, you can also take a look at the following cheat sheet to estimate query performance based on time complexity and how well each complexity class scales:

[Cheat sheet image: time complexity classes and their relative performance]

SQL tuning



Given the query execution plan and the time complexity, you can tune your SQL query further. You can start by focusing on the following points (the first one is sketched in the example after this list):

  • Replace unnecessary full scans of large tables with index scans.
  • Make sure that you are applying the optimal table join order.
  • Make sure that indexes are being used optimally.
  • Cache full scans of small tables.
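A minimal sketch of the first point, assuming the one_million table from earlier and that the index from the join example has not yet been created:

 EXPLAIN SELECT * FROM one_million WHERE counter = 42;  -- expect a Seq Scan
 CREATE INDEX one_million_counter_idx ON one_million (counter);
 ANALYZE one_million;                                   -- refresh the statistics
 EXPLAIN SELECT * FROM one_million WHERE counter = 42;  -- an Index Scan is now possible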

Further use of SQL



Congratulations! You have come to the end of this article, which has given you a small taste of SQL query performance. I hope you now know more about antipatterns, the query optimizer, and the tools you can use to analyze, estimate, and interpret the complexity of your query plan. However, there is still so much more to discover! If you want to learn more, read the book “Database Management Systems” by R. Ramakrishnan and J. Gehrke.



Finally, I also do not want to withhold this quote from StackOverflow from you:

My favorite antipattern is not testing your queries.



This applies when:



  • Your query involves more than one table.
  • You think you have an optimal design for the query, but never test your assumptions.
  • You accept the first query that works, with no idea of how close it is to optimal.


