Learn about antipatterns, execution plans, time complexity, query tuning, and SQL optimization
Structured Query Language (SQL) is an indispensable skill in the software industry, and, generally speaking, it is relatively easy to learn. However, many forget that SQL is not only about writing queries; that is just the first step down the road. Ensuring that your queries perform well, or that they match the context you are working in, is a completely different thing.
This is why this SQL guide will give you a small overview of some of the steps you can go through to evaluate your queries:
- First, you will start with a brief overview of why learning SQL matters for working in the field of data science;
- Next, you will learn how SQL queries are processed and executed, so you understand the importance of writing quality queries. More specifically, you will see that a query is parsed, rewritten, optimized and finally evaluated;
- With that in mind, you will move on to some of the query antipatterns that beginners fall into when writing queries, and you will learn about alternatives and solutions to these possible mistakes; you will also learn more about the set-based approach to querying;
- You will also see that these antipatterns tie into performance concerns and that, besides the “manual” approach to improving SQL queries, you can analyze your queries in a more structured, in-depth way with tools that show you the query plan;
- You will briefly learn about time complexity and big O notation, to get an idea of an execution plan’s time complexity before running a query; and
- You will briefly learn how to tune your query further.
Why should you learn SQL to work with data?
SQL is far from dead: it is one of the most in-demand skills you will find in job descriptions from the data industry, whether you are applying for a data analyst, data engineer, data scientist or any other role. This is confirmed by 70% of the respondents to O’Reilly’s 2016 Data Science Salary Survey, who indicate that they use SQL in their professional context. Moreover, in this survey, SQL stands out well above the programming languages R (57%) and Python (54%).
You get the picture: SQL is a must-have skill when you are working toward getting a job in the data industry.
Not bad for a language that was developed in the early 1970s, right?
But why is it used so often? And why hasn’t it died out, even though it has been around for so long?
There are several reasons: one of the first is that companies mostly store data in relational database management systems (RDBMS) or in relational data stream management systems (RDSMS), and SQL is needed to access that data. SQL is the lingua franca of data: it makes it possible to interact with almost any database, or even to build your own locally!
If this is still not enough, keep in mind that there are quite a few SQL implementations that are incompatible between vendors and do not necessarily follow the standards. Knowing standard SQL is therefore a requirement for finding your way around in the (computer science) industry.
In addition, it is safe to say that SQL has also been embraced by newer technologies, such as Hive, a SQL-like query language interface for querying and managing large data sets, or Spark SQL, which you can use to run SQL queries. Again, the SQL you find there will differ from the standard you might have learned, but the learning curve will be considerably easier.
If you want a comparison, think of it as learning linear algebra: by putting all that effort into this one subject, you know that you will be able to use it to master machine learning as well!
In short, this is why you should learn this query language:
- It is quite easy to learn, even for beginners. The learning curve is gentle and gradual, so you will be writing queries in no time.
- It follows the “learn once, use anywhere” principle, so it is a great investment of your time!
- It is an excellent complement to programming languages; in some cases, writing a query is even preferable to writing code, because it is more performant!
- ...
What are you still waiting for? :)
SQL processing and query execution
To improve the performance of your SQL queries, you first need to know what happens internally when you press the shortcut to run a query.
First, the query is parsed into a parse tree; the query is analyzed to see whether it satisfies the syntactic and semantic requirements. The parser creates an internal representation of the input query. This output is then passed on to the rewrite engine.
It is then the task of the optimizer to find the optimal execution or query plan for the given query. The execution plan defines exactly which algorithm is used for each operation, and how the execution of the operations is coordinated.
To find the most optimal execution plan, the optimizer enumerates all possible execution plans, determines the quality or cost of each plan, takes in information about the current database state, and then chooses the best one as the final execution plan. Because query optimizers can be imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance.
Now you are probably wondering what counts as a “good query plan”.
As you just read, the cost quality of the plan plays a huge role. More specifically, things such as the number of disk I/Os needed to evaluate the plan, the plan’s CPU cost, and the overall response time that can be observed by the database client, as well as the total execution time, are essential. That is where the notion of time complexity comes in. You will read more about this later on.
The selected query plan is then executed, evaluated by the system’s execution engine, and the query results are returned.
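Most databases let you inspect the plan the optimizer chose before actually running a query. As a minimal sketch of what that looks like (using Python’s built-in sqlite3 module and a hypothetical Drivers table; the exact plan output varies per database and version):

```python
import sqlite3

# In-memory database with a hypothetical Drivers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?)",
                 [(123456, "Ada"), (678910, "Grace")])

# Ask the engine for its chosen execution plan instead of the query results
plan_rows = list(conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name FROM Drivers WHERE driverslicensenr = 123456"))
for row in plan_rows:
    print(row)  # a SEARCH step using the primary key, rather than a full scan
```

Relational databases expose this through statements such as EXPLAIN (PostgreSQL, MySQL) or EXPLAIN QUERY PLAN (SQLite); you will see more about these tools later on.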
Writing SQL Queries
From the previous section, it might not be obvious that the Garbage In, Garbage Out (GIGO) principle naturally applies to query processing and execution: the one who formulates the query also holds the keys to the performance of your SQL queries. If the optimizer receives a poorly formulated query, it can only do so much...
This means that there are some things you can do when you write a query. As you already saw in the introduction, the responsibility is twofold: it is not only about writing queries that live up to a certain standard, but also about gathering an idea of where performance problems might be lurking in your query.
An ideal starting point is to think of the “spots” in your queries where problems may sneak in. In general, there are four clauses and keywords where beginners can expect performance problems to occur:
- The WHERE clause;
- Any INNER JOIN or LEFT JOIN keywords; and
- The HAVING clause.
Of course, this approach is simple and naive, but as a beginner, these spots are excellent pointers, and it is safe to say that when you are just starting out, these are the places where mistakes happen and, oddly enough, where they are also hard to spot.
You should also understand, however, that performance needs context to become meaningful: simply stating that these clauses and keywords are bad is not what you need when thinking about SQL performance. Having a WHERE or HAVING clause in a query does not necessarily make it a bad query...
Check out the next section to learn more about antipatterns and alternative approaches to building your queries. These tips and tricks are meant as a guide. Whether and how you really need to rewrite your query depends, among other things, on the amount of data, the database, and the number of times you need to execute the query. It entirely depends on the goal of your query, and having some prior knowledge of the database you will be working with is crucial!
1. Retrieve only the necessary data
The idea that “the more data, the better” is not one you should follow when writing SQL: you risk not only clouding your insights by retrieving more data than you actually need, but performance may also suffer because your query pulls up too much data.
This is why, as a rule, you should watch out for the SELECT statement, the DISTINCT clause, and the LIKE operator.
SELECT
The first thing you can check when writing a query is whether the SELECT statement is as compact as possible. Your goal here should be to remove unnecessary columns from the SELECT. That way, you force yourself to retrieve only the data that serves your query’s goal.
If you have correlated subqueries with EXISTS, try to use a constant in the SELECT of that subquery instead of selecting the value of an actual column. This is especially handy when you are only checking for existence.
Remember that a correlated subquery is a subquery that uses values from the outer query. And note that, even though NULL can work as a “constant” in this context, it is very confusing!
Consider the following example to understand what is meant by using a constant:
SELECT driverslicensenr, name FROM Drivers WHERE EXISTS (SELECT '1' FROM Fines WHERE fines.driverslicensenr = drivers.driverslicensenr);
Tip: it is useful to know that having a correlated subquery is not always a good idea. You can consider getting rid of it, for example, by rewriting it with an INNER JOIN:
SELECT driverslicensenr, name FROM drivers INNER JOIN fines ON fines.driverslicensenr = drivers.driverslicensenr;
The DISTINCT clause
The SELECT DISTINCT statement is used to return only distinct values. DISTINCT is a clause you should definitely avoid if you can. As in other examples, adding this clause to a query only increases execution time. It is therefore always worth considering whether you really need the DISTINCT operation to get the results you are after.
The LIKE operator
When you use the LIKE operator in a query, the index is not used if the pattern starts with % or _. This prevents the database from using the index (if one exists). Of course, from another point of view, you could also argue that this type of query potentially leaves the door open to retrieving too many records that do not necessarily satisfy the query’s goal.
Again, knowing the data stored in the database can help you formulate a pattern that will filter through all the data correctly, to find only the rows that really matter for your query.
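You can see the effect of a leading wildcard for yourself by comparing query plans. A small sketch (SQLite via Python’s sqlite3 module, with a hypothetical index on name; SQLite additionally requires case_sensitive_like for its LIKE optimization, and other databases behave differently in the details):

```python
import sqlite3

# A table with an index on name, so the engine could use it for lookups
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT)")
conn.execute("CREATE INDEX idx_name ON Drivers (name)")
conn.execute("PRAGMA case_sensitive_like = ON")  # needed for SQLite's LIKE optimization

def plan(query):
    # Return the textual query plan for a statement
    return " ".join(str(r) for r in conn.execute("EXPLAIN QUERY PLAN " + query))

# Leading wildcard: the index cannot narrow anything down, so everything is scanned
scan_plan = plan("SELECT name FROM Drivers WHERE name LIKE '%son'")
# Literal prefix: the engine can narrow the search using the index
search_plan = plan("SELECT name FROM Drivers WHERE name LIKE 'And%'")
print(scan_plan)
print(search_plan)
```

Comparing the two printed plans shows a full scan for the leading-wildcard pattern, while the literal prefix lets the engine restrict its search.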
2. Limit your results
If you cannot avoid filtering down your SELECT statement, you can limit your results in other ways. This is where approaches such as the LIMIT clause and data type conversions come in.
TOP, LIMIT and ROWNUM
You can add the LIMIT or TOP clause to your queries to set a maximum number of rows for the result set (which clause is available depends on your database: TOP in SQL Server, LIMIT in MySQL and PostgreSQL, and ROWNUM in Oracle). Here are some examples:
SELECT TOP 3 * FROM Drivers;
Note that you can additionally specify PERCENT, for example, by changing the first line of the query to SELECT TOP 50 PERCENT *.
SELECT driverslicensenr, name FROM Drivers LIMIT 2;
Alternatively, you can add the ROWNUM clause, which is equivalent to using LIMIT in your query:
SELECT * FROM Drivers WHERE driverslicensenr = 123456 AND ROWNUM <= 3;
Data type conversions
You should always use the most efficient, that is, the smallest, data types possible. There is always a risk when you use a huge data type where a smaller one would be more than sufficient.
However, when you add a data type conversion to your query, you only increase execution time. An alternative is simply to avoid data type conversions as much as possible. Note that it is not always possible to remove or omit data type conversions from your queries, but when you do include them, you should be careful, and test the effect of the conversion before you run the query.
3. Don't make queries more complicated than they should be
Data type conversions bring you to the next point: you should not over-engineer your queries. Try to keep them simple and efficient. This might seem too simple or even silly to be a tip, mostly because queries can get complex.
However, in the examples mentioned in the following sections, you will see that you can easily start making simple queries more complex than they should be.
The OR operator
When you use the OR operator in your query, it is likely that you are not using an index.
Remember that an index is a data structure that improves the speed of data lookup in a database table, but it comes at a cost: additional writes and additional storage space are needed to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table each time that table is accessed. Indexes can be created using one or more columns of a database table.
If you do not make use of the indexes that the database includes, your query will inevitably take longer to run. That is why it is best to look for alternatives to using the OR operator in your query.
Consider the following query:
SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 123456 OR driverslicensenr = 678910 OR driverslicensenr = 345678;
The operator can be replaced by:
- A condition with IN:
SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr IN (123456, 678910, 345678);
- Or two SELECT statements with a UNION.
Tip: here, you need to be careful not to introduce an unnecessary UNION operation, because then you go over the same table multiple times. At the same time, you should realize that when you use UNION in your query, its execution time increases. Alternatives to the UNION operation are: reformulating the query so that all conditions are placed in one SELECT, or using an OUTER JOIN instead of UNION.
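To make the rewrites concrete, here is a small sketch (SQLite via Python’s sqlite3 module, with hypothetical data) showing that the OR version, the IN version, and the UNION version of the query all return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?)",
                 [(123456, "Ada"), (678910, "Grace"),
                  (345678, "Edgar"), (999999, "Other")])

with_or = ("SELECT driverslicensenr, name FROM Drivers "
           "WHERE driverslicensenr = 123456 OR driverslicensenr = 678910 "
           "OR driverslicensenr = 345678")
with_in = ("SELECT driverslicensenr, name FROM Drivers "
           "WHERE driverslicensenr IN (123456, 678910, 345678)")
with_union = ("SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 123456 "
              "UNION "
              "SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 678910 "
              "UNION "
              "SELECT driverslicensenr, name FROM Drivers WHERE driverslicensenr = 345678")

# All three formulations describe the same result set
results = [sorted(conn.execute(q)) for q in (with_or, with_in, with_union)]
print(results[0])  # [(123456, 'Ada'), (345678, 'Edgar'), (678910, 'Grace')]
```

Which formulation runs fastest depends on your database and its optimizer, which is exactly why checking the query plan matters.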
Tip: keep in mind that, while OR (and the other operators mentioned in the following sections) most likely does not use an index, an index lookup is not always preferable!
The NOT operator
When your query contains the NOT operator, it is likely that the index is not used, just as with the OR operator. This will inevitably slow your query down. If you do not know what is meant here, consider the following query:
SELECT driverslicensenr, name FROM Drivers WHERE NOT (year > 1980);
This query will certainly run slower than you might expect, mainly because it is formulated much more complexly than it needs to be: in cases like this, it is best to look for an alternative. Consider replacing NOT with comparison operators such as >, <> or !>; the example above might then be rewritten to something like this:
SELECT driverslicensenr, name FROM Drivers WHERE year <= 1980;
It already looks better, right?
The AND operator
The AND operator is another operator that does not use an index, and it can slow your query down when used in an overly complex and inefficient way, as in the following example:
SELECT driverslicensenr, name FROM Drivers WHERE year >= 1960 AND year <= 1980;
It is better to rewrite this query using the BETWEEN operator:
SELECT driverslicensenr, name FROM Drivers WHERE year BETWEEN 1960 AND 1980;
The ANY and ALL operators
In addition, the ANY and ALL operators are ones you should be careful with, because when you include them in your queries, the index is not used. Aggregation functions such as MIN or MAX are useful alternatives here.
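As an illustration of the rewrite, a query with ALL, such as ... WHERE year > ALL (SELECT year FROM Drivers WHERE state = 'GA'), can be expressed with MAX instead. A minimal sketch (SQLite via Python, with hypothetical columns; SQLite itself does not support ANY/ALL, so only the rewritten form runs here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers "
             "(driverslicensenr INTEGER, name TEXT, year INTEGER, state TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?, ?, ?)",
                 [(1, "Ada", 1975, "GA"),
                  (2, "Grace", 1960, "GA"),
                  (3, "Edgar", 1985, "TX")])

# Instead of: ... WHERE year > ALL (SELECT year FROM Drivers WHERE state = 'GA')
# compute the boundary once with MAX:
rewritten = ("SELECT name FROM Drivers "
             "WHERE year > (SELECT MAX(year) FROM Drivers WHERE state = 'GA')")
result = list(conn.execute(rewritten))
print(result)  # [('Edgar',)] - the only driver later than every GA driver
```

The MAX subquery produces a single scalar, so the outer comparison becomes a plain range condition that an index can serve.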
Tip: when you use the proposed alternatives, be aware that all aggregation functions, such as SUM, AVG, MIN and MAX, over many rows can result in a long-running query. In such cases, you can try either to minimize the number of rows to handle or to pre-calculate these values. Once again, you see that it is important to know your environment and your query goal when you decide which query to use!
Isolate columns in conditions
Also, in cases where a column is used in a calculation or in a scalar function, the index is not used. A possible solution is to isolate the specific column so that it is no longer part of the calculation or the function. Consider the following example:
SELECT driverslicensenr, name FROM Drivers WHERE year + 10 = 1980;
It looks funny, huh? Instead, try revising the calculation and rewrite the query like this:
SELECT driverslicensenr, name FROM Drivers WHERE year = 1970;
4. No brute forcing
This last tip means that you should not try to restrict the query too much, as this can affect its performance. This is especially true for joins and for the HAVING clause.
Table order in joins
When joining two tables, it may be important to consider the order of the tables in the join. If you see that one table is significantly larger than the other, you might need to rewrite the query so that the largest table is placed last in the join.
Redundant join conditions
When you add too many conditions to your joins, you basically obligate SQL to choose a certain path. It could be, though, that this path is not always the more performant one.
The HAVING clause
The HAVING clause was originally added to SQL because the WHERE keyword could not be used with aggregate functions. HAVING is typically used with GROUP BY to restrict the groups of returned rows to only those that meet certain conditions. However, if you use this clause in your query, the index is not used, which, as you already know, can mean that your query does not really perform all that well.
If you are looking for an alternative, consider using the WHERE clause instead.
Consider the following queries:
SELECT state, COUNT(*) FROM Drivers WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state
SELECT state, COUNT(*) FROM Drivers GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state
The first query uses the WHERE clause to restrict the number of rows that need to be summed, whereas the second query sums all the rows in the table and then uses HAVING to throw away the sums it calculated. In these types of cases, the alternative with the WHERE clause is obviously the better one, as you are not wasting any resources.
It can be seen that this is not about limiting the result set, but about limiting the intermediate number of records in the query.
Note that the difference between the two clauses is that the WHERE clause introduces a condition on individual rows, while the HAVING clause introduces a condition on aggregations or results of a selection, where a single result, such as MIN, MAX or SUM, has been produced from multiple rows.
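The equivalence of the two example queries above can be checked in a small sketch (SQLite via Python, with hypothetical data): both return the same groups, but the WHERE version filters rows before grouping rather than discarding groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Drivers (driverslicensenr INTEGER, name TEXT, state TEXT)")
conn.executemany("INSERT INTO Drivers VALUES (?, ?, ?)",
                 [(1, "Ada", "GA"), (2, "Grace", "GA"),
                  (3, "Edgar", "TX"), (4, "Alan", "CA")])

# Filter rows first, then group only what is left
where_rows = list(conn.execute(
    "SELECT state, COUNT(*) FROM Drivers "
    "WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state"))
# Group everything, then throw away the groups you do not want
having_rows = list(conn.execute(
    "SELECT state, COUNT(*) FROM Drivers "
    "GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state"))

print(where_rows)  # [('GA', 2), ('TX', 1)]
```

Both lists are identical; the difference lies in how much intermediate work the engine does to produce them.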
You see, evaluating the quality of, writing and rewriting queries is not an easy job, considering that they need to be as performant as possible; avoiding antipatterns and considering alternatives will also be part of your responsibility when you write queries that need to run on databases in a professional environment.
This list was just a small overview of some antipatterns and tips that will hopefully help beginners; if you would like to get an insight into what more senior developers consider the most frequent antipatterns, check out this discussion.
Set-based versus procedural approaches to writing queries
The antipatterns described above implicitly boil down to the difference between the set-based and the procedural approach to building your queries.
A procedural approach to queries is an approach very similar to programming: you tell the system what to do and how to do it.
Examples of this are the redundant conditions on joins, or cases where you abuse the HAVING clause, as in the examples above, where you query the database by performing a function and then calling another function, or where you use logic that contains conditions, loops, user-defined functions (UDFs), cursors, etc. to get the final result. In this approach, you will often find yourself asking for a subset of the data, then requesting another subset of the data, and so on.
Unsurprisingly, this approach is often called “step-by-step” or “row-by-row” querying.
The other approach is the set-based approach, where you simply specify what to do. Your role consists of specifying the conditions or requirements for the result set that you want to obtain from the query. How your data is retrieved, you leave to the internal mechanisms that determine the implementation of the query: you let the database engine determine the best algorithms or processing logic to execute your query.
Since SQL is set-based, it is hardly surprising that this approach is more efficient than the procedural one, and it also explains why, in some cases, SQL can work faster than code.
Tip: the set-based approach to querying is also the one that most top employers in the data industry will ask you to master! You will often need to switch between the two types of approaches.
Note that if you ever find yourself working with a procedural query, you should consider rewriting or refactoring it.
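The two approaches can be contrasted in a small sketch (SQLite via Python, with a hypothetical Fines table): the procedural version loops over the rows and issues one UPDATE per row, while the set-based version states the whole change as a single statement and lets the engine do the work.

```python
import sqlite3

def make_db():
    # Fresh in-memory database with a hypothetical Fines table
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Fines (driverslicensenr INTEGER, amount REAL)")
    conn.executemany("INSERT INTO Fines VALUES (?, ?)",
                     [(1, 100.0), (2, 50.0), (3, 80.0)])
    return conn

# Procedural ("row-by-row"): fetch the rows, then update them one at a time
conn = make_db()
for (nr,) in conn.execute("SELECT driverslicensenr FROM Fines").fetchall():
    conn.execute("UPDATE Fines SET amount = amount * 1.1 "
                 "WHERE driverslicensenr = ?", (nr,))
procedural = sorted(conn.execute("SELECT driverslicensenr, amount FROM Fines"))

# Set-based: one statement describing the desired result for the whole set
conn = make_db()
conn.execute("UPDATE Fines SET amount = amount * 1.1")
set_based = sorted(conn.execute("SELECT driverslicensenr, amount FROM Fines"))

print(procedural == set_based)  # True: same outcome, far less back-and-forth
```

The results are identical, but the set-based version avoids the repeated round trips between your code and the database, which is where the performance difference comes from on real workloads.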
The next part will cover the plan and query optimization.