How did we update Zabbix

Why do we love Prometheus ? He has a config - he looked and everything is clear, the program does what she was told. You can automate monitoring settings, store them in VCS, and review the command. Constricted your MR , worked pipeline, a new config applied to prometheus. In general, IaC in all its glory.

Speaking of prometheus. Do you use it for your iron infrastructure? So we do not use.

Like many who have been monitoring for a long time and who have “bare” hardware, we use Zabbix , which, by the way, is located on that hardware. Alas, at the moment, Zabbix and IaC are unrelated things. Zabbix can be configured either manually or through the API.

Background

In October 2018, Zabbix-4.0 was released - a new LTS branch. And in mid-March, we started planning to update our installation of version 3.4 on it.

There were almost no special problems with 3.4:

Sometimes some LLD didn’t work somewhere and the Impossible happened, which is unclear how to debug on a version unsupported by the developer
The memory of HTTP pullers constantly flowed - as a result of which a carefully configured systemd nailed monitoring and started it again. The problem was masked by a decent amount of server memory. The problem is well-known, documented .

And in 4.0, interesting features appeared, like native HTTP items and service periods, not for the whole host.

And where is it seen, sitting on an irrelevant version of monitoring, not even on LTS? We must keep up to date.

Moreover, when planning the update, an interesting detail was found out: progress does not stand still, you can take faster cars at a lower price. And along the way, there was a way to save on the already unnecessary hoster service in several projects of colleagues. As they say, we successfully entered it.

Forehead Update

There is nothing particularly complicated in updating the zabbix now. Order a server, configure it, merge a copy of the database. Put monitoring packages and show them the base, run zabbiks - and he will update everything for you himself, roll all the migrations. Well, yes, you probably know how easy the Zabbix upgrade has become.

In total, the database migration took about 15 minutes, and even without much abuse. And everything seems to be fine. Yes? No matter how! Despite the fact that the IP of the new server is not listed in the white lists of agents, and it collects data from only a few test hosts, the Impossible is still happening on it.

To the credit of the developers of Zabbix, I must say, they keep their word - version 4.2 is supported at that time. After talking in the project tracker, we find out that the reason for the impossible is that it does not coincide with the expected structure of one of the database tables.

Vague doubts creep in. It will be useful to recall that historically the “thickest” Zabbix database tables have been partitioned. First of all, for performance reasons, in order to somehow cover up your favorite zabbix callus - deleting historical data in the RDBMS . We compare the structures of all the tables in a row in a freshly updated database and a control one - created by the server itself from scratch. The fears were confirmed. In addition to the absence of some constants in the database, in many tables, many digital columns are of the wrong type.

That is, in fact, we have not a base scheme supported by the developers, but our own “fork”. Another type of column data is, potentially:

other metric storage cost
different accuracy of numbers
different sampling / recording speed of metrics

Think for the better? It is doubtful. According to past experience with technical support and zabbix developers, they can tune DBMSs.

And this type of column data is possible, but difficult and long to change. And it is impossible without a long downtime of monitoring. Without guarantees of success, without support from the developer in the future. Need another way.

And Zabbix has it. Because in April 2019, zabbix-4.2 is coming out

The second approach to the projectile

The main feature of 4.2 for us is support for partitioning “out of the box” through TimescaleDB . After talking with the representatives of Zabbix and familiarizing ourselves with the results of testing its technical support for this feature ( translation on the hub), we decide to test the installation with timescaledb and, based on the results, make a decision about the transition. More specifically: for some time, all monitoring data will be written in parallel to both the old and the new version. And then we just switch the DNS entry.

Of course, this approach does not allow you to save historical data and trends - the new database is populated from scratch. But are they really needed? History matters only here and now, it will accumulate quite quickly again (look at the same prometheus). There remains only the undoubted utility of trends for capacity planning. In any case, the archive with the already collected data remains with us (looking ahead - it was useful to some clients). Another feature of timescaledb support in zabbix is that individual periods of history / trend storage are no longer valid.

We have clients who insist on "eternal" storage of all collected data at all costs. We can offer them to consider the installation / support of a separate monitoring installation with specific settings. Our main task is to ensure the stable operation of client projects / servers while maintaining an acceptable cost of services, which includes monitoring as well.

In total, the following steps will be required for migration:

Install and configure a second monitoring installation
Get in it all the same as in the first installation
Switch!

Sounds easy, right? Indeed, the first is not very difficult, because during the previous approach we wrote a role for installing a zabbix server, just upload the configuration. The third item also looks simple - switch DNS and all zabbiks agents, proxies, API clients and live people get to the new version. But how to make the second point?

At first we tried a naive approach. Imported from the current monitoring a couple of the most commonly used templates. Using the already written scripts for working with the API, we started the same projects in the new monitoring as in the current one, pushed edits through the SCM systems, adding the IP of the new machine to the packet filter and to the Server / ServerActive agents directives. It even worked - a lot of hosts began to register in two monitoring at once, a new one assigned them templates and started collecting data in parallel with the current one.

Alas, this is precisely the naive approach to migration, suitable only for the test. The resulting load (in nvps ) could not be compared with the current installation, being lower by orders of magnitude. It is understandable. In our case, monitoring is literally the years of work of many people and scripts, the quintessence of experience in operating heterogeneous projects.

For example, what about manually wound up users and their passwords that are generated randomly when creating projects, monitoring templates hung on hosts (with their custom macro values), manually created items, complex screens, graphs, dashboards, service periods, proxies? All this and much more needs to be transferred for smooth migration.

Fortunately, the zabbix has a built-in functionality for exporting / importing objects - also available through the API. Alas, it covers no more than half of all existing facilities. And the code for its use also needs to be written. In general, you can’t just take and import the configuration of one zabbiks into another.

Or is it possible?

Here the brain helpfully recalls the task from the backlog that it would be nice to organize the storage of the monitoring configuration history by external means. Alas, this is a sore spot of Zabbix. With reference to the article on the hub and the repository with code. But there are nuances:

the code does not export all monitoring objects to human-readable YAML files (in particular, not all we need)
the code does not support importing objects

Fortunately, there are people who know a little the project language (python) and have experience with the Zabbix API. The only business is to import objects from ready-made YAML dumps. For how long, for short, but after three weeks of work and one and a half hundred commits, a fork was quite suitable for our purposes. Actually, for the sake of whose idea the whole article is written:

https://github.com/centosadmin/zabbix-review-export-import

What is done:

Added support for many new objects
The format of YAML export of most of the existing objects has been changed so that they can be imported
Added the ability to import most of the exported objects
Added limited functionality for converting objects between different versions of zabbix (as in our case)

Import is supported almost exclusively by the creation of new objects. If an object exists, it will not be modified. This allowed us to keep the complexity of the code at least in some frameworks, save time and coolly increase the speed of work - when importing thousands of objects. Using import is very simple:

./zabbix-import.py /path/to/file.yaml

(it is assumed that the target monitoring parameters are specified in the environment variables, for more details, see the --help output)

In general, you can specify any number of input YAML files - and all of them will be processed. But taking into account that there are many dependencies between objects, it makes more sense to import objects type by type, starting with the simplest and most basic ones. Plus, if you import one object from one file, it may make sense to explicitly specify its type in order to speed up the import a little - not all caches are loaded, but only the necessary ones.

Thus, two repositories appeared in our hitlab with periodically updated YAML dumps of two monitoring versions, the current and the new. And, of course, the ability to restore almost any monitoring object at any time.

Continuous monitoring deployment and migration itself

As a result, we came to the conclusion that the gitlab on schedule launches a pipeline on the new monitoring repository, which, step by step, hierarchically imports from the old monitoring one type of objects after another. This allowed us to import the vast majority of objects and give our teams of administrators time to calmly fix the problems that got out - and not so few have accumulated over the years. "Extra" objects were not deleted.

The issue with user passwords - they are also exported / imported, but a random password is assigned during creation - could be solved by converting the SQL dump of the table with the credentials of the current monitoring into SQL statements to set the correct passwords in the new monitoring.

In order not to receive a double portion of tasks during parallel operation, all actions in the new monitoring were immediately turned off and were no longer deleted.

Thus, the switch was pretty easy and came down to the following points:

delete all hosts in the new monitoring (a couple of scripts over the API are written for this)
pull SCMs to update zabbix-proxy version and switch proxies to a new server
wait for the import of hosts from the dump of the old monitoring
switch DNS records

(plan shortened to simplify)

What's next?

Of course, the code is not perfect and not particularly beautiful. It does not import everything, in particular, there are problems with some templates - look for FIXME in the code. But that was enough for us. Perhaps this fork is useful to someone else. A logical continuation is the development of a similar operation of the Terraform utility, when the target monitoring is completely reduced to the form specified, for example, by a directory with YAML dumps. Including with the reduction of existing facilities to the desired form.

This will allow you to calmly wait for the "native" HA support in zabbix, having two servers, the settings are synchronized between them automatically. Now you have to keep a replica, proxies and write scripts.

Camo coming?

After studying the materials of conferences and meetings, the official roadmap, the error tracker and the (modest) experience of personal communication with the developers of the zabbix, it seems that they perfectly understand the situation they are in now. When the zabbix started, the authors did not think about any IaC while solving their problems. A decade later, the product matured and flourished. The flip side to success was the mass of clients of the company, whose monitoring solves their problems. And who do not really like the revolution. In modern conditions, they, on the one hand, are against breaking everything and starting from scratch. On the other hand, they sometimes lack monitoring capabilities and look “to the side”, not forgetting to voice the Wishlist to the developers of the Zabbix. The company is not going to risk them, despite any sympathy for the new, convenient, fashionable, youth.

We will not see a new, correct prometheus from Zabbix in the near future. No matter how I would like. But the work is clearly going on - and if you, like the Zabbix, are thorough and patient, the future also expects a cloudless future.

Used sources:

All Articles