In the examples, I will use a data set containing 52,608,000 records.
Using simple analytical queries as an example, I will demonstrate that even a weak computer can be turned into a decent tool for analyzing a "decent" amount of data without much effort.
After setting up a few simple experiments, we will see that a regular table is not a suitable source for analytical queries.
If the reader can easily decipher the abbreviations OLTP and OLAP, it may make sense to skip directly to the Columnstore section.
Two approaches to working with data
I will be brief here, because there is more than enough information on this topic on the Internet.
So, at the highest level, there are only two approaches to working with data: OLTP and OLAP.
OLTP stands for Online Transaction Processing. In essence, it is about online processing of short transactions that touch a small amount of data, for example creating, updating or deleting an order. In the vast majority of cases an order is an extremely small amount of data, and while processing it you do not have to fear the long locks imposed by modern RDBMSs.
OLAP stands for Online Analytical Processing, that is, analytical processing of large volumes of data at a time. Any report relies on this approach, because in the vast majority of cases a report produces consolidated, aggregated figures broken down by certain dimensions.
Each approach has its own technology: for OLTP it is, for example, PostgreSQL, and for OLAP it is Microsoft SQL Server Analysis Services. While PostgreSQL uses the well-known row-based format for storing data in tables, several different formats have been invented for OLAP: multidimensional tables, key-value stores and my favorite, columnstore. More on the latter below.
Why are two approaches needed?
It has long been noticed that any data warehouse sooner or later faces two types of load: frequent reads (and, of course, writes and updates) of extremely small amounts of data, and rare reads of very large amounts of data. In essence, this is the activity of, say, a cash register and a manager. The cash register, working all day long, fills the storage with small chunks of data, and by the end of the day the accumulated volume, if business is going well, reaches an impressive size. In turn, at the end of the day the manager wants to know how much money the cash register earned during the day.
So, in OLTP we have tables and indexes. These two tools are great for recording cash register activity with all the details. Indexes provide a quick lookup of a previously recorded order, so changing an order is easy. But to satisfy the manager's needs, we have to scan the entire volume of data accumulated during the day. Moreover, as a rule, the manager does not need all the details of every order. What he really needs to know is how much money the cash registers made overall. It does not matter where a register was located, when the lunch break was, who operated it, and so on. OLAP exists so that the system can answer, in a short period of time, the question of how much the company has earned as a whole, without sequentially reading every order and all of its details. Can OLAP use the same tables and indexes as OLTP? The answer is no, or at least it should not. Firstly, OLAP simply does not need all the details recorded in the tables; this problem is solved by storing data in formats other than two-dimensional tables. Secondly, the analyzed information is often scattered across different tables, which entails joining them many times, including self-joins. To solve this problem, a special database schema is usually designed. This schema is optimized for OLAP load, just as a normal, normalized schema is optimized for OLTP load.
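To make the contrast concrete, here is a minimal sketch with a hypothetical orders table (the table and column names are invented for this illustration and do not appear in the experiments below):

-- OLTP: work with a single order found by its key
select * from orders where order_id = 123;
update orders set amount = 150.00 where order_id = 123;
-- OLAP: aggregate the whole day across all cash registers
select sum(amount) from orders where order_date = '2019-10-01';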
What happens when OLAP uses an OLTP schema
In fact, I introduced this section so that this article clearly meets my own requirements for the format of such material, i.e. problem, solution, conclusion.
Let's list a number of disadvantages of using an OLTP schema for data analysis (a query sketch illustrating some of them follows the list).
- Too many indexes.
Often you have to create special indexes just to support reports. These indexes effectively implement an OLAP storage scheme on top of the OLTP one. They are not used by the OLTP part of the application, yet they put load on it, require constant maintenance and take up disk space.
- The amount of data read exceeds what is required.
- Lack of a clear data schema.
Information that reports present in a single form is often spread across different tables and requires constant transformation on the fly. The simplest example is the amount of revenue, which consists of cash and non-cash money. Another striking example is data hierarchies. Since application development proceeds incrementally and it is not always known what will be needed in the future, hierarchies with the same meaning may end up stored in different tables. And although on-the-fly transformation is actively used in OLAP as well, these are slightly different things.
- Excessive query complexity.
Since an OLTP schema differs from an OLAP one, a tightly coupled software layer is needed to bring the OLTP data into the required shape.
- Complexity of support, debugging and development.
In general, the more complex the code base, the harder it is to keep it in working order. This is an axiom.
- Complexity of test coverage.
Much ink has been spilled in discussions about how to get a database populated with all the test scenarios, but suffice it to say that with a simpler data schema the task of covering it with tests becomes many times easier.
- Endless performance debugging.
There is a high probability that a user will order a report that brings the database server to a crawl, and this probability grows over time. It should be noted that OLAP is also prone to this problem, but unlike OLTP, OLAP has a much larger resource margin in this respect.
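As a sketch of the "lack of a clear data schema" and "excessive query complexity" points, imagine revenue split between hypothetical cash_payments and card_payments tables (all names here are invented for the illustration):

select coalesce(c.total, 0) + coalesce(n.total, 0) as revenue
  from (select sum(amount) as total from cash_payments where pay_date = '2019-10-01') as c
 cross join (select sum(amount) as total from card_payments where pay_date = '2019-10-01') as n;

Every report that needs total revenue has to repeat this kind of on-the-fly transformation, which is exactly the complexity a dedicated OLAP schema is meant to remove.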
Columnstore
This article will focus on the columnstore storage format, but without low-level details. Other formats mentioned above also deserve attention, but this is a topic for another article.
The columnstore format has actually been known for about 30 years, but until recently it was not implemented in RDBMSs. The essence of columnstore is that data is stored not by rows but by columns. That is, on a single page (the well-known 8 KB) the server stores values of only one field, and so on for each field of the table in turn. This is done so that extra information does not have to be read. Imagine a table with 10 fields and a query whose SELECT list names only one of them. If it were a regular table stored in row-based format, the server would be forced to read all 10 fields and return only one; it would read 9 times more information than necessary. Columnstore solves this problem completely, because its storage format allows reading only the requested field. All of this follows from the fact that the unit of storage in an RDBMS is the page: the server always writes and reads at least one page, and the only question is how many fields are present on it.
How Columnstore Can Really Help
To answer this, we need exact numbers. Let's get them. But which numbers can give an accurate picture?
- The amount of disk space.
- Query performance.
- Fault tolerance.
- Ease of implementation.
- What new skills should a developer have to work with new structures.
Disk space
Let's create a simple table, fill it with data and check how much space it takes.
create foreign table cstore_table
(
    trd date,
    org int,
    op int,
    it int,
    wh int,
    m1 numeric(32, 2),
    m2 numeric(32, 2),
    m3 numeric(32, 2),
    m4 numeric(32, 2),
    m5 numeric(32, 2)
)
server cstore_server
options(compression 'pglz');
As you may have noticed, I created a foreign table. The fact is that PostgreSQL has no built-in columnstore support, but it does have a powerful extension mechanism, and one of the extensions makes it possible to create columnstore tables (links are at the end of the article). The options and fields mean the following; a minimal setup sketch for the extension itself follows the list.
- pglz - tells the extension that the data should be compressed using the built-in algorithm in PostgreSQL;
- trd - transaction date;
- org, op, it, wh - analytical dimensions;
- m1, m2, m3, m4, m5 - numeric indicators, or measures.
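For completeness, the one-time setup of the extension looks roughly like this. This is a sketch based on the cstore_fdw documentation rather than part of the measurements; cstore_server is the server name referenced in the table definition above.

-- postgresql.conf must contain: shared_preload_libraries = 'cstore_fdw'
create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;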
Let's insert a "decent" amount of data and see how much space it takes on disk, and at the same time measure the insert performance. Since I run my experiments on a home laptop, I am somewhat limited in the amount of data. In addition, and this is even a good thing, I will use an HDD, on which the guest OS, Fedora 30, runs. The host OS is Windows 10 Home Edition, the processor is an Intel Core 7, and the guest OS received 4 GB of RAM. The PostgreSQL version is PostgreSQL 10.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.1.1 20190503 (Red Hat 9.1.1-1), 64-bit. I will experiment with a data set of 52,608,000 records.
explain (analyze)
insert into cstore_table
select '2010-01-01'::date + make_interval(days => d) as trd
     , op
     , org
     , wh
     , it
     , 100 as m1
     , 100 as m2
     , 100 as m3
     , 100 as m4
     , 100 as m5
  from generate_series(0, 1) as op
 cross join generate_series(1, 2) as org
 cross join generate_series(1, 3) as wh
 cross join generate_series(1, 4000) as it
 cross join generate_series(0, 1095) as d;
The execution plan is as follows:
Insert on cstore_table (cost = 0.01..24902714242540.01 rows = 1000000000000000 width = 150) (actual time = 119560.456..119560.456 rows = 0 loops = 1)
----> Nested Loop (cost = 0.01..24902714242540.01 rows = 1000000000000000 width = 150) (actual time = 1.823..22339.976 rows = 52608000 loops = 1)
----------> Function Scan on generate_series d (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.151..2.198 rows = 1096 loops = 1)
----------> Materialize (cost = 0.01..27284555030.01 rows = 1000000000000 width = 16) (actual time = 0.002..3.196 rows = 48000 loops = 1096)
----------------> Nested Loop (cost = 0.01..17401742530.01 rows = 1000000000000 width = 16) (actual time = 1.461..15.072 rows = 48000 loops = 1)
----------------------> Function Scan on generate_series it (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 1.159..2.007 rows = 4000 loops = 1)
----------------------> Materialize (cost = 0.01..26312333.01 rows = 1000000000 width = 12) (actual time = 0.000..0.001 rows = 12 loops = 4000)
----------------------------> Nested Loop (cost = 0.01..16429520.01 rows = 1000000000 width = 12) (actual time = 0.257..0.485 rows = 12 loops = 1)
----------------------------------> Function Scan on generate_series wh (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.046..0.049 rows = 3 loops = 1)
----------------------------------> Materialize (cost = 0.01..28917.01 rows = 1000000 width = 8) (actual time = 0.070..0.139 rows = 4 loops = 3)
---------------------------------------> Nested Loop (cost = 0.01..20010.01 rows = 1000000 width = 8) (actual time = 0.173..0.366 rows = 4 loops = 1)
-------------------------------------------> Function Scan on generate_series op (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.076..0.079 rows = 2 loops = 1)
---------------------------------------------> Function Scan on generate_series org (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.043..0.047 rows = 2 loops = 2)
Planning time: 0.439 ms
Execution time: 119692.051 ms
Total time - 1.994867517 minutes
Dataset creation time - 22.339976 seconds
Insertion time - 1.620341333 minutes
I did not manage to measure the disk space occupied by the table using PostgreSQL functions. I am not sure why, but they returned 0; perhaps this is the standard behavior for foreign tables. So I used a file manager instead. The disk space occupied is 226.2 MB. To judge whether that is a lot or a little, let's compare it with a regular table.
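For reference, the check that returned 0 was presumably something along these lines; pg_total_relation_size is a standard PostgreSQL function, it simply has nothing to report for a cstore_fdw foreign table:

select pg_size_pretty(pg_total_relation_size('cstore_table'));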
explain (analyze)
create table rbstore_table as
select '2010-01-01'::date + make_interval(days => d) as trd
     , op
     , org
     , wh
     , it
     , 100 as m1
     , 100 as m2
     , 100 as m3
     , 100 as m4
     , 100 as m5
  from generate_series(0, 1) as op
 cross join generate_series(1, 2) as org
 cross join generate_series(1, 3) as wh
 cross join generate_series(1, 4000) as it
 cross join generate_series(0, 1095) as d;
The execution plan is as follows:
Nested Loop (cost = 0.01..22402714242540.01 rows = 1000000000000000 width = 44) (actual time = 0.585..23781.942 rows = 52608000 loops = 1)
---> Function Scan on generate_series d (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.091..2.130 rows = 1096 loops = 1)
---> Materialize (cost = 0.01..27284555030.01 rows = 1000000000000 width = 16) (actual time = 0.001..3.574 rows = 48000 loops = 1096)
----------> Nested Loop (cost = 0.01..17401742530.01 rows = 1000000000000 width = 16) (actual time = 0.489..14.044 rows = 48000 loops = 1)
----------------> Function Scan on generate_series it (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.477..1.352 rows = 4000 loops = 1 )
----------------> Materialize (cost = 0.01..26312333.01 rows = 1000000000 width = 12) (actual time = 0.000..0.001 rows = 12 loops = 4000)
----------------------> Nested Loop (cost = 0.01..16429520.01 rows = 1000000000 width = 12) (actual time = 0.010..0.019 rows = 12 loops = 1)
----------------------------> Function Scan on generate_series wh (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.003..0.003 rows = 3 loops = 1)
----------------------------> Materialize (cost = 0.01..28917.01 rows = 1000000 width = 8) (actual time = 0.002..0.004 rows = 4 loops = 3)
----------------------------------> Nested Loop (cost = 0.01..20010.01 rows = 1000000 width = 8) (actual time = 0.006..0.009 rows = 4 loops = 1)
----------------------------------------> Function Scan on generate_series op (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.002..0.002 rows = 2 loops = 1)
----------------------------------------> Function Scan on generate_series org (cost = 0.00..10.00 rows = 1000 width = 4) (actual time = 0.001..0.001 rows = 2 loops = 2)
Planning time: 0.569 ms
Execution time: 378883.989 ms

The time spent executing this plan does not interest us, because in real life such inserts are not expected. What we are interested in is how much disk space this table occupies. A query against the system functions returned 3.75 GB.
So, cstore_table takes 226 MB and rbstore_table takes 3.75 GB. The 16.99-fold difference is striking, but it is unlikely that the same ratio will be achieved in production, primarily because of how real data is distributed. As a rule, the difference will be smaller, around 5 times.
But wait a minute: nobody uses raw data in row-based format for analysis. For reporting, indexed data is used, for example. And since the "raw" data will always be there, we need to compare the columnstore size with the size of the indexes. Let's create at least one index: an index on the date field and the operation type, trd + op.
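The statement itself is not shown here, but judging by the index name that appears in the plans below (trd_op_ix), it was roughly the following, with the size checked via the standard pg_relation_size function:

create index trd_op_ix on rbstore_table (trd, op);
select pg_size_pretty(pg_relation_size('trd_op_ix'));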
So, I indexed only two fields, and the index took 1583 MB, which is far more than the entire cstore_table. And as a rule, an OLAP load requires far more than one index. It is worth noting here that cstore_table needs no additional indexing: the table itself acts as an index covering any query.
From all of the above, a simple conclusion follows: columnstore tables can significantly reduce the amount of disk space used.
Query Performance
To evaluate performance, let's run a query that returns summary data for a specific day and a specific type of operation.
explain (analyze, costs, buffers) select sum(m1) from cstore_table where trd = '2011-01-01' and op = 1;
The execution plan is as follows:
Aggregate (cost = 793602.69..793602.70 rows = 1 width = 32) (actual time = 79.708..79.708 rows = 1 loops = 1)
--Buffers: shared hit = 44226
---> Foreign Scan on cstore_table (cost = 0.00..793544.70 rows = 23197 width = 5) (actual time = 23.209..76.628 rows = 24000 loops = 1)
-------- Filter: ((trd = '2011-01-01' :: date) AND (op = 1))
-------- Rows Removed by Filter: 26000
-------- CStore File: /var/lib/pgsql/10/data/cstore_fdw/14028/16417
-------- CStore File Size: 120818897
-------- Buffers: shared hit = 44226
Planning time: 0.165 ms
Execution time: 79.887 ms
explain (analyze, costs, buffers) select sum(m1) from rbstore_table where trd = '2011-01-01' and op = 1;
The execution plan is as follows:
Aggregate (cost = 40053.80..40053.81 rows = 1 width = 8) (actual time = 389.183..389.183 rows = 1 loops = 1)
--Buffers: shared read = 545
---> Index Scan using trd_op_ix on rbstore_table (cost = 0.56..39996.70 rows = 22841 width = 4) (actual time = 55.955..385.283 rows = 24000 loops = 1)
-------- Index Cond: ((trd = '2011-01-01 00:00:00' :: timestamp without time zone) AND (op = 1))
-------- Buffers: shared read = 545
Planning time: 112.175 ms
Execution time: 389.219 ms

389.219 ms vs 79.887 ms. Here we see that even on a relatively small amount of data, a columnstore table is significantly faster than an index on a row-based table.
Let's change the query and try to get the aggregate for the whole of 2011.
explain (analyze, costs, buffers) select sum(m1) from cstore_table where trd between '2011-01-01' and '2011-12-31' and op = 1;
The execution plan is as follows:
Aggregate (cost = 946625.58..946625.59 rows = 1 width = 32) (actual time = 3123.604..3123.604 rows = 1 loops = 1)
--Buffers: shared hit = 44226
---> Foreign Scan on cstore_table (cost = 0.00..925064.70 rows = 8624349 width = 5) (actual time = 21.728..2100.665 rows = 8760000 loops = 1)
-------- Filter: ((trd >= '2011-01-01' :: date) AND (trd <= '2011-12-31' :: date) AND (op = 1))
-------- Rows Removed by Filter: 8760000
-------- CStore File: /var/lib/pgsql/10/data/cstore_fdw/14028/16411
-------- CStore File Size: 120818897
-------- Buffers: shared hit = 44226
Planning time: 0.212 ms
Execution time: 3123.960 ms
explain (analyze, costs, buffers) select sum(m1) from rbstore_table where trd between '2011-01-01' and '2011-12-31' and op = 1;
The execution plan is as follows:
Finalize Aggregate (cost = 885214.33..885214.34 rows = 1 width = 8) (actual time = 98512.560..98512.560 rows = 1 loops = 1)
--Buffers: shared hit = 2565 read = 489099
---> Gather (cost = 885214.12..885214.33 rows = 2 width = 8) (actual time = 98427.034..98523.194 rows = 3 loops = 1)
-------- Workers Planned: 2
-------- Workers Launched: 2
-------- Buffers: shared hit = 2565 read = 489099
---------> Partial Aggregate (cost = 884214.12..884214.13 rows = 1 width = 8) (actual time = 97907.608..97907.608 rows = 1 loops = 3)
-------------- Buffers: shared hit = 2565 read = 489099
---------------> Parallel Seq Scan on rbstore_table (cost = 0.00..875264.00 rows = 3580047 width = 4) (actual time = 40820.004..97405.250 rows = 2920000 loops = 3)
--------------------- Filter: ((trd >= '2011-01-01 00:00:00' :: timestamp without time zone) AND (trd <= '2011-12-31 00:00:00' :: timestamp without time zone) AND (op = 1))
-------------------- Rows Removed by Filter: 14616000
-------------------- Buffers: shared hit = 2565 read = 489099
Planning time: 7.899 ms
Execution time: 98523.278 ms

98523.278 ms vs 3123.960 ms. Perhaps a partial index would help here, but it is better not to take the risk and instead build a suitable row-based structure that stores precomputed values.
Manual Aggregates
A suitable structure for manual aggregates can be a regular row-based table containing precomputed values. For example, it can contain a record for 2011 with operation type 1, where the fields m1, m2, m3, m4 and m5 store values aggregated precisely for those analytical dimensions. With a sufficient set of aggregates and indexes, analytical queries acquire unprecedented performance. Interestingly, Microsoft SQL Server Analysis Services has a special wizard that lets you configure the number and depth of precomputed values.
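As a minimal sketch of such a structure (the agg_by_year table and its index are hypothetical and not part of the experiments above):

-- precompute yearly totals per operation type
create table agg_by_year as
select date_trunc('year', trd)::date as trd_year
     , op
     , sum(m1) as m1
     , sum(m2) as m2
     , sum(m3) as m3
     , sum(m4) as m4
     , sum(m5) as m5
  from rbstore_table
 group by date_trunc('year', trd), op;

create index agg_by_year_ix on agg_by_year (trd_year, op);

-- the yearly report now reads a handful of rows instead of millions
select m1 from agg_by_year where trd_year = '2011-01-01' and op = 1;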
This solution has the following advantages and disadvantages:
- Real-time analytics.
Please do not confuse this with the strict meaning of the term "real-time analytics". Here we are only talking about the fact that, in the vast majority of cases, the aggregates are incremented within an acceptable period of time. This advantage is actually debatable, but let's not dwell on it; the fact remains that the architecture of the solution keeps the aggregates "fresh" almost all the time.
- Complete independence from data volume.
This is a very serious advantage. No matter how much data comes in, sooner or later it will be processed and the aggregates obtained.
- Relative complexity.
To achieve real-time analytics and independence from data volume, the solution has to use advanced techniques such as multithreading and manual lock management at the DBMS level.
- Difficulty of testing.
This applies to both unit testing and manual testing. I think there is no need to explain to the reader that catching multithreading bugs is not an easy task.
- Increased disk space requirements.
The actual use of columnstore
Here we must again plunge into theory and look more closely at what analytical data actually is.
Take the average head of an enterprise. As a rule, he or she is concerned with two global questions: "How are things going at the moment?" and "What has changed lately?"
To answer the question "How are things going at the moment?" we do not need historical data at all. That is, it does not matter how things were going a month ago.
To keep a finger on the pulse, this question is asked often. This type of data analysis is called operational.
To answer the question "What has changed lately?" we do need historical data. Moreover, the analysis is usually performed over equal time intervals: a month is compared with a month, a year with a year, and so on. Of course, the system should not prevent the user from comparing arbitrary periods, but such a case must be recognized as rare, because comparing a closed year with an unfinished half-year makes little sense. A distinctive feature of comparative analysis is that it is not needed as often as operational analysis. We will call this type of analysis historical.
Obviously, operational analysis must happen quickly, so it places high demands on performance. Historical analysis does not impose such strict requirements, although its performance should still remain at a fairly high level, if only so that the analytics system itself stays competitive.
So, in accordance with the two types of analysis, we can distinguish two types of analytical data: operational and historical. From the user's point of view, it should not be noticeable which kind of data they are working with at any given moment.
It is from these considerations that database servers gained the ability to partition tables into separate sections.
With regard to columnstore, it is possible to mix partitions in row-based and columnstore formats. Operational data is subject to frequent changes, which rules out storing it in columnstore format; and since there is never very much operational data, it can be stored in row-based format.
Historical data does not change. There is a lot of it, so the columnstore format suits it better. Recall that the performance of heavy queries against a columnstore source is higher than against a row-based source.
Let's look at an example of all of the above.
Below I create the main warehouse table and attach the operational and historical partitions to it.
create table warehouse
(
    trd date,
    org int,
    op int,
    it int,
    wh int,
    m1 numeric(32, 2),
    m2 numeric(32, 2),
    m3 numeric(32, 2),
    m4 numeric(32, 2),
    m5 numeric(32, 2)
)
partition by range(trd);

create foreign table historycal_data
(
    trd date,
    org int,
    op int,
    it int,
    wh int,
    m1 numeric(32, 2),
    m2 numeric(32, 2),
    m3 numeric(32, 2),
    m4 numeric(32, 2),
    m5 numeric(32, 2)
)
server cstore_server
options(compression 'pglz');

insert into historycal_data
select '2010-01-01'::date + make_interval(days => d) as trd
     , op
     , org
     , wh
     , it
     , 100 as m1
     , 100 as m2
     , 100 as m3
     , 100 as m4
     , 100 as m5
  from generate_series(0, 1) as op
 cross join generate_series(1, 2) as org
 cross join generate_series(1, 3) as wh
 cross join generate_series(1, 4000) as it
 cross join generate_series(0, (1095 - 31)) as d;

analyze historycal_data;

create table operational_data as
select ('2012-12-01'::date + make_interval(days => d))::date as trd
     , op
     , org
     , wh
     , it
     , 100::numeric(32, 2) as m1
     , 100::numeric(32, 2) as m2
     , 100::numeric(32, 2) as m3
     , 100::numeric(32, 2) as m4
     , 100::numeric(32, 2) as m5
  from generate_series(0, 1) as op
 cross join generate_series(1, 2) as org
 cross join generate_series(1, 3) as wh
 cross join generate_series(1, 4000) as it
 cross join generate_series(0, 30) as d;

create index trd_op_ix on operational_data (trd, op);

analyze operational_data;

alter table warehouse attach partition operational_data for values from ('2012-12-01') to ('2112-01-01');
alter table warehouse attach partition historycal_data for values from ('2010-01-01') to ('2012-12-01');
Everything is ready. Let's try to run a couple of reports. We'll start by requesting data for one day of the current month.
explain (analyze, costs, buffers) select sum(m1) from warehouse where trd = '2012-12-01' and op = 1;
Aggregate (cost = 15203.37..15203.38 rows = 1 width = 32) (actual time = 17.320..17.320 rows = 1 loops = 1)
--Buffers: shared hit = 3 read = 515
---> Append (cost = 532.59..15140.89 rows = 24991 width = 5) (actual time = 1.924..13.838 rows = 24000 loops = 1)
------- Buffers: shared hit = 3 read = 515
---------> Bitmap Heap Scan on operational_data (cost = 532.59..15140.89 rows = 24991 width = 5) (actual time = 1.924..11.992 rows = 24000 loops = 1)
--------------- Recheck Cond: ((trd = '2012-12-01' :: date) AND (op = 1))
--------------- Heap Blocks: exact = 449
--------------- Buffers: shared hit = 3 read = 515
----------------> Bitmap Index Scan on trd_op_ix (cost = 0.00..526.34 rows = 24991 width = 0) (actual time = 1.877..1.877 rows = 24000 loops = 1 )
--------------------- Index Cond: ((trd = '2012-12-01' :: date) AND (op = 1))
--------------------- Buffers: shared hit = 2 read = 67
Planning time: 0.388 ms
Execution time: 100.941 ms

Now let's request data for the whole of 2012, which contains 8,784,000 transactions.
explain (analyze, costs, buffers) select sum(m1) from warehouse where trd between '2012-01-01' and '2012-12-31' and op = 1;
Aggregate (cost = 960685.82..960685.83 rows = 1 width = 32) (actual time = 4124.681..4124.681 rows = 1 loops = 1)
--Buffers: shared hit = 45591 read = 11282
---> Append (cost = 0.00..938846.60 rows = 8735687 width = 5) (actual time = 66.581..3036.394 rows = 8784000 loops = 1)
--------- Buffers: shared hit = 45591 read = 11282
----------> Foreign Scan on historycal_data (cost = 0.00..898899.60 rows = 7994117 width = 5) (actual time = 66.579..2193.801 rows = 8040000 loops = 1)
--------------- Filter: ((trd >= '2012-01-01' :: date) AND (trd <= '2012-12-31' :: date) AND (op = 1))
--------------- Rows Removed by Filter: 8040000
--------------- CStore File: /var/lib/pgsql/10/data/cstore_fdw/14028/16448
--------------- CStore File Size: 117401470
--------------- Buffers: shared hit = 42966
----------> Seq Scan on operational_data (cost = 0.00..39947.00 rows = 741570 width = 5) (actual time = 0.019..284.824 rows = 744000 loops = 1)
--------------- Filter: ((trd >= '2012-01-01' :: date) AND (trd <= '2012-12-31' :: date) AND (op = 1))
--------------- Rows Removed by Filter: 744000
--------------- Buffers: shared hit = 2625 read = 11282
Planning time: 0.256 ms
Execution time: 4125.239 ms

Finally, let's see what happens if a user, without any malicious intent, orders a report on all transactions in the system, of which there are 52,608,000.
explain (analyze, costs, buffers) select sum(m1) from warehouse
Aggregate (cost = 672940.20..672940.21 rows = 1 width = 32) (actual time = 15907.886..15907.886 rows = 1 loops = 1)
--Buffers: shared hit = 17075 read = 11154
---> Append (cost = 0.00..541420.20 rows = 52608000 width = 5) (actual time = 0.192..9115.144 rows = 52608000 loops = 1)
--------- Buffers: shared hit = 17075 read = 11154
----------> Foreign Scan on historycal_data (cost = 0.00..512633.20 rows = 51120000 width = 5) (actual time = 0.191..5376.449 rows = 51120000 loops = 1)
--------------- CStore File: /var/lib/pgsql/10/data/cstore_fdw/14028/16448
--------------- CStore File Size: 117401470
--------------- Buffers: shared hit = 14322
----------> Seq Scan on operational_data (cost = 0.00..28787.00 rows = 1488000 width = 5) (actual time = 0.032..246.978 rows = 1488000 loops = 1)
--------------- Buffers: shared hit = 2753 read = 11154
Planning time: 0.157 ms
Execution time: 15908.096 ms

Please note that I am still writing this article as if nothing had happened. I did not even have to reboot my not-so-powerful laptop with an HDD and 4 GB of RAM, although the issue of resource consumption deserves a more careful study.
Fault tolerance
Fault tolerance was partially tested right at the time of writing this article: my laptop is alive, and, in general, I did not notice any slowdowns in its work beyond the usual ones.
I ask the reader to forgive me for not having had time to study the fault tolerance question in detail, but I can say that the extension in question does have fault tolerance: backups are possible.
Ease of implementation
As it turned out, when creating a table that stores data in columnstore format, there are practically no options other than the compression algorithm, and compression itself is essential here.
The format itself has a certain internal structure. By setting the appropriate parameters, you can speed up analytical queries to some extent or adjust the degree of compression; a sketch of what that looks like is given below.
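As a sketch, assuming the stripe_row_count and block_row_count options documented for cstore_fdw (they were not used in the experiments above, and the values shown are the documented defaults), tuning the format could look like this:

create foreign table cstore_table_tuned
(
    trd date,
    m1 numeric(32, 2)
)
server cstore_server
options(compression 'pglz', stripe_row_count '150000', block_row_count '10000');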
As demonstrated above, creating a columnstore table is no hassle at all. The extension can work with 40 PostgreSQL data types; the webinars on the extension covered all the supported types.
What new skills should a developer have to work with new structures
An SQL developer does not need any special skills to write queries against columnstore tables. Such a table behaves in every query like a regular row-based table, although this does not remove the need for query optimization.
Conclusion
In this article I showed how a table with columnstore storage can be useful: it saves disk space and provides high performance for analytical queries. The ease of working with such a table automatically reduces the cost of building a full-fledged analytical data warehouse, because using it does not require developing complex, hard-to-debug algorithms, and testing becomes simpler.
Although the experiments above inspire optimism, many questions remain unexplored, for example, which query plan will be generated when a columnstore table is joined with other tables. I hope to continue this work in the next part; how many parts there will be depends on how cstore_fdw behaves on more or less real data.
Links to additional materials
Short review cstore_fdw
cstore_fdw on github
Roadmap cstore_fdw