In this article, we describe how we solved the problem of getting Tableau to work with ClickHouse.
General statement of the problem
We faced a classic challenge. We have people. They love fruit. Some people like one fruit, some like all fruits, and the rest can love any combination of fruits.
So we needed to let the user of a dashboard built in Tableau arbitrarily select several fruits and see how many people like at least one fruit from that set. Of course, we did not actually have fruits, but the people were real; it is just that the problem is easier to explain with "fruits".
The amount of data in our case was quite large: there were 13 thousand different "fruits", the most popular "fruit" had nearly 34 million fans, and on average 450 thousand people loved each "fruit". In total, there were 282 million fruit lovers.
The first, brute-force solution
It so happened that the data for this task lived in both PostgreSQL (PG) and ClickHouse (CH). PG held a reference table of "fruits"; CH held a large table with a simple structure: the identifier of a "fruit" and the identifier of a person who loves that "fruit". Tableau has no native connector for CH, and we did not want to move the data anywhere else, because that would have required seriously reworking the existing system.
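For illustration, here is a rough sketch of what such a CH table could look like. The table name, column types, connection parameters, and the sampling key are our assumptions rather than the actual DDL from the project; the sampling key only matters for the sampled queries described further below.

    from clickhouse_driver import Client

    # Hypothetical connection parameters, for illustration only.
    client = Client(host="ch-host", user="user", password="password", database="db")

    # Hypothetical table: one row per ("fruit", person) pair.
    client.execute("""
        CREATE TABLE IF NOT EXISTS fruit_fans
        (
            fruit_id  UInt32,   -- identifier of the "fruit"
            person_id UInt64    -- identifier of a person who loves this "fruit"
        )
        ENGINE = MergeTree
        ORDER BY (fruit_id, cityHash64(person_id))
        SAMPLE BY cityHash64(person_id)   -- allows SAMPLE queries over people
    """)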
We tried to connect Tableau to CH using the ODBC driver and see what happens.
- Not all ODBC drivers are equally useful. You need a specific version in which the part of the functionality you rely on works, and there is no guarantee that the rest will work if you suddenly need it.
- We were not able to pull all the data into the Tableau extract, because it is 13,000 * 450,000 = 5,850,000,000 records.
Then we decided to use sampling inside the query to the CH database, that is, to estimate the number of lovers of the selected combination of "fruits" not on all people but on a five percent sample, in order to make the extract smaller. In the same step we added an inner join between the CH data and the PG "fruit" directory to get the "fruit" names. It helped: the extract could now be generated in 5 hours.
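As a sketch, the CH side of that extract could look roughly like the snippet below, assuming a table with a sampling key like the hypothetical one above; the real query text and the inner join with the PG "fruit" directory are not shown here.

    from clickhouse_driver import Client

    client = Client(host="ch-host", user="user", password="password", database="db")

    # SAMPLE 0.05 reads roughly a five percent sample of people instead of all of them,
    # which is what made the extract small enough to build.
    rows = client.execute("""
        SELECT fruit_id, person_id
        FROM fruit_fans
        SAMPLE 0.05
    """)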
We needed to refresh the data in the dashboard once a day, so 5 hours to rebuild the extract seemed acceptable: we would refresh it at night. But going forward we would need extra capacity: there would be more "fruits", and accordingly the number and size of the groups of people whose intersections we had to calculate would grow too. So a long extract refresh was not an option for us at all.
In addition, sampling introduced another problem. In different parts of the dashboard, numbers that were supposed to match ended up differing: in one place we counted the number of lovers of a single fruit exactly, while in the part with a combination of fruits the count was approximate. Neither we nor our users liked this result.
Then we decided not to create the extract at all. To avoid loading a huge amount of data, we split the datasets and used a live connection for CH. The datasets were linked with Tableau's built-in Edit Relationship functionality: the PG datasource was made primary and linked to CH as secondary, using the "fruit" identifier present in both tables.
Thus, we were able to filter the secondary datasource through the primary one (data blending). But failure awaited us here too: after passing the filter from one datasource to the other, we needed to count distinct people in the resulting subset (COUNTD), and data blending has a limitation that simply does not allow this. Such a function does not work with this kind of connection in principle.
There is a workaround that helps circumvent this Tableau limitation, but it can only be used on relatively small datasets, which is clearly not our case.
After that, we tried another option. The datasets were still split, with a live connection for CH, but this time the filter from the "fruit" description dataset to the "fruit" fans dataset was passed inside Tableau using a set action. This option ultimately did not fit either because of the inconvenient UI: instead of a familiar filter, the user would have to look through the whole list and select "fruits" with Ctrl+click, and there was no Apply option to apply all the selected values at once.
As a result, after all our attempts, we had to return to the option with the extract and sampling: terribly slow and giving only an approximate answer.
Solution found
Obviously, we do not need to pull all the data into a Tableau extract. The user does not need to see everything at once, that is, the number of people who love every single "fruit"; on average, they need a set of about 10 "fruits" at a time. It is a pity that Tableau does not know how to do this on its own.
There are people on our team who write Python, so we decided to look in that direction and found TabPy.
TabPy is a web service that lets you get the result of executing Python scripts inside Tableau calculations.
How it works:
- Tableau interacts with TabPy, which in turn talks to Python, using so-called Script Functions. A Script Function contains the Python script itself, the required data type of the result, and the arguments that we pass to the function. In our case, the arguments were the identifiers of the "fruits" whose lovers we wanted to count.
- TabPy converts the received Script Function text into a script and passes it to the interpreter. We set up the connection to the CH database inside the script itself.
- Next, TabPy returns the result of the executed script back to Tableau.
In Script Functions, arguments are always passed as arrays; the result is also returned as an array.
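As a toy illustration of that contract (the function body and values below are made up): the arguments arrive as plain Python lists with one element per row of the partition, and the script is expected to return either a single value or a list of the same length.

    # _arg1 and _arg2 are the names TabPy gives to the arguments passed from Tableau,
    # e.g. SUM([people]) and ATTR([fruits_id]); everything else here is illustrative.
    def toy_people_count(_arg1, _arg2):
        total = sum(_arg1)
        # Return a list of the same length so every row in the partition gets a value.
        return [total] * len(_arg1)

    print(toy_people_count([3, 5, 2], [101, 102, 103]))  # -> [10, 10, 10]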
Not everything worked right away. The main thing we understood was that writing a Python script directly in a Tableau calculated field is not a good idea, for two reasons:
- Inside Script Functions, it is sometimes difficult to use the familiar Python syntax. For example, multiple quotes are not accepted.
- Thinking about future support of the dashboard, we realized that if we ever needed to change the script, we would have to change it inside the Tableau workbook itself every time. That is clearly not the best way, because we do everything we can to avoid manual maintenance of dashboards.
Therefore, we used something else: TabPy Client.
TabPy Client is a library that lets you publish Python scripts on the TabPy server and then call them from inside Tableau. With it, instead of writing the script inside Tableau, we call a .py file deployed on the TabPy server with the parameters specified in it, pass arguments to it, and execute it.
This approach solved our problems with using TabPy and Tableau: the script is written and tested in a familiar development environment and stored separately from the workbook, which no longer requires manual maintenance.
To solve our specific problem, we had to do the following.
At first we tried to solve it without TabPy Client. In that case, a calculated field of the following form was created inside Tableau:
    IF FIRST() == 0
    THEN SCRIPT_INT("
        from clickhouse_driver import Client
        client = Client(host=host_name, database=database_name, user=user_name, password=password)
        ----- script_text -----
    ", SUM([people]), ATTR([fruits_id]))
    END

With TabPy Client, the same logic was split into a short calculated field in Tableau and a .py file deployed on the TabPy server.

Calculated field:

    SCRIPT_INT("
    return tabpy.query('people_count_test', _arg1, _arg2)['response']
    ", SUM([people]), ATTR([fruits_id]))

.py file:

    from clickhouse_driver import Client
    import tabpy_client

    connection = tabpy_client.Client('http://localhost:9004/')

    def unique_people_count(people, fruits_id):
        client = Client(host=host_name, database=database_name, user=user_name, password=password)
        ----- script_text -----

    # Publish the function on the TabPy server under the name used in the calculated field.
    connection.deploy('people_count_test', unique_people_count, 'comment', override=True)
And in the end, it was this approach with TabPy Client that suited us completely.
Summary
Users are satisfied: they can arbitrarily choose a combination of "fruits" and quickly get the number of fans of at least one of them, and the numbers in different parts of the dashboard now match.
BI developers are pleased that they can work with ClickHouse from Tableau without having to connect to it directly.
Our Tableau Server is pleased that it no longer has to build a huge extract at night.
In general, TabPy gives BI developers more freedom to work with data when Tableau does not have a suitable solution out of the box, for example, to embed data science models directly into Tableau. But that is another story altogether...
This article was written jointly with my colleagues Dimitry Scherbenko (dima_vs) and Ivan Sukhoveev (suho_v) of R&D Dentsu Aegis Network Russia.