Learn about antipatterns, execution plans, time complexity, query tuning, and SQL optimization
Structured Query Language (SQL) is an indispensable skill in the software industry, and, generally speaking, it is relatively easy to learn. However, many forget that SQL is not only about writing queries; that is just the first step down the road. Ensuring that your queries perform well, or that they match the context you are working in, is a completely different thing.
This is why this SQL guide will give you a small overview of some of the steps you can go through to evaluate your queries:
- First, you will start with a brief overview of why learning SQL matters for working in the field of data science;
- Next, you will learn how SQL queries are processed and executed, so you understand the importance of writing quality queries. More specifically, you will see that a query is parsed, rewritten, optimized and finally evaluated;
- With that in mind, you will move on to some of the query antipatterns that beginners fall into when writing queries, and you will learn about alternatives and solutions to these possible mistakes; you will also learn more about the set-based approach to querying;
- You will also see that these antipatterns tie into performance concerns and that, besides the “manual” approach to improving SQL queries, you can analyze your queries in a more structured, in-depth way with tools that show you the query plan;
- You will briefly learn about time complexity and big O notation, to get an idea of an execution plan’s time complexity before running a query; and
- You will briefly learn how to tune your query further.
Why should you learn SQL to work with data?
SQL is far from dead: it is one of the most in-demand skills you will find in job descriptions from the data industry, whether you are applying for a data analyst, data engineer, data scientist or any other role. This is confirmed by 70% of the respondents to O’Reilly’s 2016 Data Science Salary Survey, who indicate that they use SQL in their professional context. Moreover, in this survey, SQL stands out well above the programming languages R (57%) and Python (54%).
You get the picture: SQL is a must-have skill when you are working toward getting a job in the data industry.
Not bad for a language that was developed in the early 1970s, right?
But why is it used so often? And why hasn’t it died out, even though it has been around for so long?
There are several reasons: one of the first is that companies mostly store data in relational database management systems (RDBMS) or in relational data stream management systems (RDSMS), and SQL is needed to access that data. SQL is the lingua franca of data: it makes it possible to interact with almost any database, or even to build your own locally!
If this is still not enough, keep in mind that there are quite a few SQL implementations that are incompatible between vendors and do not necessarily follow the standards. Knowing standard SQL is therefore a requirement for finding your way around in the (computer science) industry.
In addition, it is safe to say that SQL has also been embraced by newer technologies, such as Hive, a SQL-like query language interface for querying and managing large data sets, or Spark SQL, which you can use to run SQL queries. Again, the SQL you find there will differ from the standard you might have learned, but the learning curve will be considerably easier.
If you want a comparison, think of it as learning linear algebra: by putting all that effort into this one subject, you know that you will be able to use it to master machine learning as well!
In short, this is why you should learn this query language:
- It is quite easy to learn, even for beginners. The learning curve is gentle and gradual, so you will be writing queries in no time.
- It follows the “learn once, use anywhere” principle, so it is a great investment of your time!
- It is an excellent complement to programming languages; in some cases, writing a query is even preferable to writing code, because it is more performant!
- ...
What are you still waiting for? :)
SQL processing and query execution
To improve the performance of your SQL queries, you first need to know what happens internally when you press the shortcut to run a query.
First, the query is parsed into a parse tree; the query is analyzed to see whether it satisfies the syntactic and semantic requirements. The parser creates an internal representation of the input query. This output is then passed on to the rewrite engine.
It is then the task of the optimizer to find the optimal execution or query plan for the given query. The execution plan defines exactly which algorithm is used for each operation, and how the execution of the operations is coordinated.
To find the most optimal execution plan, the optimizer enumerates all possible execution plans, determines the quality or cost of each plan, takes in information about the current database state, and then chooses the best one as the final execution plan. Because query optimizers can be imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance.
Now you are probably wondering what counts as a “good query plan”.
As you just read, the cost quality of the plan plays a huge role. More specifically, things such as the number of disk I/Os needed to evaluate the plan, the plan’s CPU cost, and the overall response time that can be observed by the database client, as well as the total execution time, are essential. That is where the notion of time complexity comes in. You will read more about this later on.
The selected query plan is then executed, evaluated by the system’s execution engine, and the query results are returned.
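Most databases let you inspect the plan the optimizer chose before actually running a query. As a minimal sketch of what that looks like (using Python’s built-in sqlite3 module and a hypothetical Drivers table; the exact plan output varies per database and version):

```python
import sqlite3

# In-memory database with a hypothetical Drivers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?)",
                 [(123456, "Ada"), (678910, "Grace")])

# Ask the engine for its chosen execution plan instead of the query results
plan_rows = list(conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name FROM Drivers WHERE driverslicensenr = 123456"))
for row in plan_rows:
    print(row)  # a SEARCH step using the primary key, rather than a full scan
```

Relational databases expose this through statements such as EXPLAIN (PostgreSQL, MySQL) or EXPLAIN QUERY PLAN (SQLite); you will see more about these tools later on.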
Writing SQL Queries
From the previous section, it might not be obvious that the Garbage In, Garbage Out (GIGO) principle naturally applies to query processing and execution: the one who formulates the query also holds the keys to the performance of your SQL queries. If the optimizer receives a poorly formulated query, it can only do so much...
This means that there are some things you can do when you write a query. As you already saw in the introduction, the responsibility is twofold: it is not only about writing queries that live up to a certain standard, but also about gathering an idea of where performance problems might be lurking in your query.
An ideal starting point is to think of the “spots” in your queries where problems may sneak in. In general, there are four clauses and keywords where beginners can expect performance problems to occur:
- The WHERE clause;
- Any INNER JOIN or LEFT JOIN keywords; and
- The HAVING clause.
Of course, this approach is simple and naive, but as a beginner, these spots are excellent pointers, and it is safe to say that when you are just starting out, these are the places where mistakes happen and, oddly enough, where they are also hard to spot.
You should also understand, however, that performance needs context to become meaningful: simply stating that these clauses and keywords are bad is not what you need when thinking about SQL performance. Having a WHERE or HAVING clause in a query does not necessarily make it a bad query...
Check out the next section to learn more about antipatterns and alternative approaches to building your queries. These tips and tricks are meant as a guide. Whether and how you really need to rewrite your query depends, among other things, on the amount of data, the database, and the number of times you need to execute the query. It entirely depends on the goal of your query, and having some prior knowledge of the database you will be working with is crucial!
1. Retrieve only the necessary data
The idea that “the more data, the better” is not one you should follow when writing SQL: you risk not only clouding your insights by retrieving more data than you actually need, but performance may also suffer because your query pulls up too much data.
This is why, as a rule, you should watch out for the SELECT statement, the DISTINCT clause, and the LIKE operator.
SELECT
The first thing you can check when writing a query is whether the SELECT statement is as compact as possible. Your goal here should be to remove unnecessary columns from the SELECT. That way, you force yourself to retrieve only the data that serves your query’s goal.
If you have correlated subqueries with EXISTS, try to use a constant in the SELECT of that subquery instead of selecting the value of an actual column. This is especially handy when you are only checking for existence.
Remember that a correlated subquery is a subquery that uses values from the outer query. And note that, even though NULL can work as a “constant” in this context, it is very confusing!
Consider the following example to understand what is meant by using a constant:
SELECT driverslicensenr, name FROM Drivers WHERE EXISTS (SELECT '1' FROM Fines WHERE fines.driverslicensenr = drivers.driverslicensenr);
Tip: it is useful to know that having a correlated subquery is not always a good idea. You can consider getting rid of it, for example, by rewriting it with an INNER JOIN:
SELECT driverslicensenr, name FROM drivers INNER JOIN fines ON fines.driverslicensenr = drivers.driverslicensenr;
The DISTINCT clause
The SELECT DISTINCT statement is used to return only distinct values. DISTINCT is a clause you should definitely avoid if you can. As in other examples, adding this clause to a query only increases execution time. It is therefore always worth considering whether you really need the DISTINCT operation to get the results you are after.
The LIKE operator
When you use the LIKE operator in a query, the index is not used if the pattern starts with % or _. This prevents the database from using the index (if one exists). Of course, from another point of view, you could also argue that this type of query potentially leaves the door open to retrieving too many records that do not necessarily satisfy the query’s goal.
Again, knowing the data stored in the database can help you formulate a pattern that will filter through all the data correctly, to find only the rows that really matter for your query.
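You can see the effect of a leading wildcard for yourself by comparing query plans. A small sketch (SQLite via Python’s sqlite3 module, with a hypothetical index on name; SQLite additionally requires case_sensitive_like for its LIKE optimization, and other databases behave differently in the details):

```python
import sqlite3

# A table with an index on name, so the engine could use it for lookups
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT)")
conn.execute("CREATE INDEX idx_name ON Drivers (name)")
conn.execute("PRAGMA case_sensitive_like = ON")  # needed for SQLite's LIKE optimization

def plan(query):
    # Return the textual query plan for a statement
    return " ".join(str(r) for r in conn.execute("EXPLAIN QUERY PLAN " + query))

# Leading wildcard: the index cannot narrow anything down, so everything is scanned
scan_plan = plan("SELECT name FROM Drivers WHERE name LIKE '%son'")
# Literal prefix: the engine can narrow the search using the index
search_plan = plan("SELECT name FROM Drivers WHERE name LIKE 'And%'")
print(scan_plan)
print(search_plan)
```

Comparing the two printed plans shows a full scan for the leading-wildcard pattern, while the literal prefix lets the engine restrict its search.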
2. Limit your results
If you cannot avoid filtering down your SELECT statement, you can limit your results in other ways. This is where approaches such as the LIMIT clause and data type conversions come in.
TOP, LIMIT and ROWNUM
You can add the LIMIT or TOP clause to your queries to set a maximum number of rows for the result set (which clause is available depends on your database: TOP in SQL Server, LIMIT in MySQL and PostgreSQL, and ROWNUM in Oracle). Here are some examples:
SELECT TOP 3 * FROM Drivers;
Note that you can additionally specify PERCENT, for example, by changing the first line of the query to SELECT TOP 50 PERCENT *.
SELECT driverslicensenr, name FROM Drivers LIMIT 2;
Alternatively, you can add the ROWNUM clause, which is equivalent to using LIMIT in your query:
SELECT * FROM Drivers WHERE driverslicensenr = 123456 AND ROWNUM <= 3;
Data type conversions
You should always use the most efficient, that is, the smallest, data types possible. There is always a risk when you use a huge data type where a smaller one would be more than sufficient.
However, when you add a data type conversion to your query, you only increase execution time. An alternative is simply to avoid data type conversions as much as possible. Note that it is not always possible to remove or omit data type conversions from your queries, but when you do include them, you should be careful, and test the effect of the conversion before you run the query.
3. Don't make queries more complicated than they should be
Data type conversions bring you to the next point: you should not over-engineer your queries. Try to keep them simple and efficient. This might seem too simple or even silly to be a tip, mostly because queries can get complex.
However, in the examples mentioned in the following sections, you will see that you can easily start making simple queries more complex than they should be.
The OR operator
When you use the OR operator in your query, it is likely that you are not using an index.
Remember that an index is a data structure that improves the speed of data lookup in a database table, but it comes at a cost: additional writes and additional storage space are needed to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table each time that table is accessed. Indexes can be created using one or more columns of a database table.
If you do not make use of the indexes that the database includes, your query will inevitably take longer to run. That is why it is best to look for alternatives to using the OR operator in your query.
Consider the following query:
SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 123456 OR driverslicensenr = 678910 OR driverslicensenr = 345678;
The operator can be replaced by:
- A condition with IN:
SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr IN (123456, 678910, 345678);
- Or two SELECT statements with a UNION.
Tip: here, you need to be careful not to introduce an unnecessary UNION operation, because then you go over the same table multiple times. At the same time, you should realize that when you use UNION in your query, its execution time increases. Alternatives to the UNION operation are: reformulating the query so that all conditions are placed in one SELECT, or using an OUTER JOIN instead of UNION.
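To make the rewrites concrete, here is a small sketch (SQLite via Python’s sqlite3 module, with hypothetical data) showing that the OR version, the IN version, and the UNION version of the query all return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?)",
                 [(123456, "Ada"), (678910, "Grace"),
                  (345678, "Edgar"), (999999, "Other")])

with_or = ("SELECT driverslicensenr, name FROM Drivers "
           "WHERE driverslicensenr = 123456 OR driverslicensenr = 678910 "
           "OR driverslicensenr = 345678")
with_in = ("SELECT driverslicensenr, name FROM Drivers "
           "WHERE driverslicensenr IN (123456, 678910, 345678)")
with_union = ("SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 123456 "
              "UNION "
              "SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 678910 "
              "UNION "
              "SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 345678")

# All three formulations describe the same result set
results = [sorted(conn.execute(q)) for q in (with_or, with_in, with_union)]
print(results[0])  # [(123456, 'Ada'), (345678, 'Edgar'), (678910, 'Grace')]
```

Which formulation runs fastest depends on your database and its optimizer, which is exactly why checking the query plan matters.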
Tip: keep in mind that, while OR (and the other operators mentioned in the following sections) most likely does not use an index, an index lookup is not always preferable!
The NOT operator
When your query contains the NOT operator, it is likely that the index is not used, just as with the OR operator. This will inevitably slow your query down. If you do not know what is meant here, consider the following query:
SELECT driverslicensenr, name FROM Drivers WHERE NOT (year > 1980);
This query will certainly run slower than you might expect, mainly because it is formulated much more complexly than it needs to be: in cases like this, it is best to look for an alternative. Consider replacing NOT with comparison operators such as >, <> or !>; the example above might then be rewritten to something like this:
SELECT driverslicensenr, name FROM Drivers WHERE year <= 1980;
It already looks better, right?
The AND operator
The AND operator is another operator that does not use an index, and it can slow your query down when used in an overly complex and inefficient way, as in the following example:
SELECT driverslicensenr, name FROM Drivers WHERE year >= 1960 AND year <= 1980;
It is better to rewrite this query using the BETWEEN operator:
SELECT driverslicensenr, name FROM Drivers WHERE year BETWEEN 1960 AND 1980;
The ANY and ALL operators
In addition, the ANY and ALL operators are ones you should be careful with, because when you include them in your queries, the index is not used. Aggregation functions such as MIN or MAX are useful alternatives here.
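As an illustration of the rewrite, a query with ALL, such as ... WHERE year > ALL (SELECT year FROM Drivers WHERE state = 'GA'), can be expressed with MAX instead. A minimal sketch (SQLite via Python, with hypothetical columns; SQLite itself does not support ANY/ALL, so only the rewritten form runs here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers "
             "(driverslicensenr INTEGER, name TEXT, year INTEGER, state TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?, ?, ?)",
                 [(1, "Ada", 1975, "GA"),
                  (2, "Grace", 1960, "GA"),
                  (3, "Edgar", 1985, "TX")])

# Instead of: ... WHERE year > ALL (SELECT year FROM Drivers WHERE state = 'GA')
# compute the boundary once with MAX:
rewritten = ("SELECT name FROM Drivers "
             "WHERE year > (SELECT MAX(year) FROM Drivers WHERE state = 'GA')")
result = list(conn.execute(rewritten))
print(result)  # [('Edgar',)] - the only driver later than every GA driver
```

The MAX subquery produces a single scalar, so the outer comparison becomes a plain range condition that an index can serve.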
Tip: when you use the proposed alternatives, be aware that all aggregation functions, such as SUM, AVG, MIN and MAX, over many rows can result in a long-running query. In such cases, you can try either to minimize the number of rows to handle or to pre-calculate these values. Once again, you see that it is important to know your environment and your query goal when you decide which query to use!
Isolate columns in conditions
Also, in cases where a column is used in a calculation or in a scalar function, the index is not used. A possible solution is to isolate the specific column so that it is no longer part of the calculation or the function. Consider the following example:
SELECT driverslicensenr, name FROM Drivers WHERE year + 10 = 1980;
It looks funny, huh? Instead, try revising the calculation and rewrite the query like this:
SELECT driverslicensenr, name FROM Drivers WHERE year = 1970;
4. No brute forcing
This last tip means that you should not try to restrict the query too much, as this can affect its performance. This is especially true for joins and for the HAVING clause.
Table order in joins
When joining two tables, it may be important to consider the order of the tables in the join. If you see that one table is significantly larger than the other, you might need to rewrite the query so that the largest table is placed last in the join.
Redundant join conditions
When you add too many conditions to your joins, you basically obligate SQL to choose a certain path. It could be, though, that this path is not always the more performant one.
The HAVING clause
The HAVING clause was originally added to SQL because the WHERE keyword could not be used with aggregate functions. HAVING is typically used with GROUP BY to restrict the groups of returned rows to only those that meet certain conditions. However, if you use this clause in your query, the index is not used, which, as you already know, can mean that your query does not really perform all that well.
If you are looking for an alternative, consider using the WHERE clause instead.
Consider the following queries:
SELECT state, COUNT(*) FROM Drivers WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state
SELECT state, COUNT(*) FROM Drivers GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state
The first query uses the WHERE clause to restrict the number of rows that need to be summed, whereas the second query sums all the rows in the table and then uses HAVING to throw away the sums it calculated. In these types of cases, the alternative with the WHERE clause is obviously the better one, as you are not wasting any resources.
It can be seen that this is not about limiting the result set, but about limiting the intermediate number of records in the query.
Note that the difference between the two clauses is that the WHERE clause introduces a condition on individual rows, while the HAVING clause introduces a condition on aggregations or results of a selection, where a single result, such as MIN, MAX or SUM, has been produced from multiple rows.
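The equivalence of the two example queries above can be checked in a small sketch (SQLite via Python, with hypothetical data): both return the same groups, but the WHERE version filters rows before grouping rather than discarding groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT, state TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?, ?)",
                 [(1, "Ada", "GA"), (2, "Grace", "GA"),
                  (3, "Edgar", "TX"), (4, "Alan", "CA")])

# Filter rows first, then group only what is left
where_rows = list(conn.execute(
    "SELECT state, COUNT(*) FROM Drivers "
    "WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state"))
# Group everything, then throw away the groups you do not want
having_rows = list(conn.execute(
    "SELECT state, COUNT(*) FROM Drivers "
    "GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state"))

print(where_rows)  # [('GA', 2), ('TX', 1)]
```

Both lists are identical; the difference lies in how much intermediate work the engine does to produce them.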
You see, evaluating the quality of, writing and rewriting queries is not an easy job, considering that they need to be as performant as possible; avoiding antipatterns and considering alternatives will also be part of your responsibility when you write queries that need to run on databases in a professional environment.
This list was just a small overview of some antipatterns and tips that will hopefully help beginners; if you would like to get an insight into what more senior developers consider the most frequent antipatterns, check out this discussion.
Set-based versus procedural approaches to writing queries
The antipatterns described above implicitly boil down to the difference between the set-based and the procedural approach to building your queries.
A procedural approach to queries is an approach very similar to programming: you tell the system what to do and how to do it.
Examples of this are the redundant conditions on joins, or cases where you abuse the HAVING clause, as in the examples above, where you query the database by performing a function and then calling another function, or where you use logic that contains conditions, loops, user-defined functions (UDFs), cursors, etc. to get the final result. In this approach, you will often find yourself asking for a subset of the data, then requesting another subset of the data, and so on.
Unsurprisingly, this approach is often called “step-by-step” or “row-by-row” querying.
The other approach is the set-based approach, where you simply specify what to do. Your role consists of specifying the conditions or requirements for the result set that you want to obtain from the query. How your data is retrieved, you leave to the internal mechanisms that determine the implementation of the query: you let the database engine determine the best algorithms or processing logic to execute your query.
Since SQL is set-based, it is hardly surprising that this approach is more efficient than the procedural one, and it also explains why, in some cases, SQL can work faster than code.
Tip: the set-based approach to querying is also the one that most top employers in the data industry will ask you to master! You will often need to switch between the two types of approaches.
Note that if you ever find yourself working with a procedural query, you should consider rewriting or refactoring it.
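The two approaches can be contrasted in a small sketch (SQLite via Python, with a hypothetical Fines table): the procedural version loops over the rows and issues one UPDATE per row, while the set-based version states the whole change as a single statement and lets the engine do the work.

```python
import sqlite3

def make_db():
    # Fresh in-memory database with a hypothetical Fines table
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Fines (driverslicensenr INTEGER, amount REAL)")
    conn.executemany("INSERT INTO Fines VALUES (?, ?)",
                     [(1, 100.0), (2, 50.0), (3, 80.0)])
    return conn

# Procedural ("row-by-row"): fetch the rows, then update them one at a time
conn = make_db()
for (nr,) in conn.execute("SELECT driverslicensenr FROM Fines").fetchall():
    conn.execute("UPDATE Fines SET amount = amount * 1.1 "
                 "WHERE driverslicensenr = ?", (nr,))
procedural = sorted(conn.execute("SELECT driverslicensenr, amount FROM Fines"))

# Set-based: one statement describing the desired result for the whole set
conn = make_db()
conn.execute("UPDATE Fines SET amount = amount * 1.1")
set_based = sorted(conn.execute("SELECT driverslicensenr, amount FROM Fines"))

print(procedural == set_based)  # True: same outcome, far less back-and-forth
```

The results are identical, but the set-based version avoids the repeated round trips between your code and the database, which is where the performance difference comes from on real workloads.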
The next part will cover the plan and query optimization.