postgres_exporter and monitoring PostgreSQL instances with multiple databases

Good afternoon, Habr readers!







Prometheus and its ecosystem of exporters (agents) is a good tool for any administrator or developer: it is easy to deploy, (relatively) simple to configure, and can take advantage of service discovery.

But this post is not so much about Prometheus itself as about one of its notable agents, postgres_exporter, which collects metrics from PostgreSQL. If only everything were that simple...







Unfortunately, the postgres_exporter documentation is rather spartan and covers only the general cases. But what if you have a cluster instance with several databases, and/or you want to collect metrics from several cluster instances at once?







Goal



A few words about the purpose of this article, or rather this note. Let me say right away that I will not describe how to build or configure Prometheus and postgres_exporter, or how they interact; all of that is covered in the documentation and in many other sources.







I would like to touch on some special cases of using postgres_exporter to solve specific problems, namely having a single agent collect metrics from:







  1. multiple databases in one instance;
  2. several instances;
  3. multiple databases on different instances.


postgres_exporter



Subjectively, here are the pros and cons.







The pros:







  1. The first and most important advantage is ease of delivery and configuration. The agent is a single executable file (plus, optionally, a yaml file with a set of user-defined metrics): a self-contained application compiled for the required distribution and architecture that does not require installing any additional packages on the server. The agent can run either on the same node as the cluster instance or on a separate node;
  2. The agent connects to the DBMS as a regular SQL client, either over TCP/IP or through a unix socket (see the sketch after this list);
  3. A single agent can collect metrics from several cluster instances and/or from several databases of one instance;
  4. Metrics are collected exactly as often as Prometheus (or another collector) requests them;
  5. Metrics can be fetched with a simple HTTP request;
  6. The agent can automatically discover the list of databases on a PostgreSQL instance; the --auto-discover-databases option is available since postgres_exporter 0.5.0.
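To illustrate points 2 and 5, here is a minimal sketch of running the agent over a unix socket and pulling its metrics with a plain HTTP request. The socket directory /var/run/postgresql and the default port 9187 are assumptions; adjust them to your setup.

 # Start the agent over a unix socket instead of TCP
 # (the socket directory is an assumption; check unix_socket_directories in postgresql.conf).
 $ DATA_SOURCE_NAME="user=postgres host=/var/run/postgresql sslmode=disable" \
     ./postgres_exporter --log.format=logger:stdout

 # Fetch the metrics with a plain HTTP request (9187 is the exporter's default port).
 $ curl -s http://localhost:9187/metrics | head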


The cons:







  1. No authorization;
  2. Data is transferred only over plain HTTP, so all metrics go over the wire in clear text. This is bad: an attacker who intercepts the traffic gets an accurate list of databases and roles;
  3. Metrics are not cached, so if the agent becomes unreachable over the network, Prometheus simply loses the data for that period;
  4. With the --auto-discover-databases option it is not possible to exclude certain databases from the list. This should be temporary, since the next release is expected to bring the --exclude-databases option.


Multiple Databases in One Instance



Well, let's move on to practice. Suppose we have a PostgreSQL instance with several databases, and we need to collect metrics for both the instance and all of its databases.

Why do I separate the collection of database metrics from instance metrics? The answer is simple: when postgres_exporter works with several databases of the same cluster, it executes the same set of SQL queries in each database. As a result, when it queries views that are common to the whole cluster (pg_stat_replication, pg_stat_activity, pg_stat_statements, etc.), it gets, from its point of view, the same set of metrics every time, which leads to duplicate keys and values and therefore to an error.

Let's see how it looks in practice.







We have a test instance with a set of databases:







                              List of databases
    Name    |  Owner   | Encoding |  Collate   |   Ctype    | Access privileges
  ----------+----------+----------+------------+------------+-------------------
   dbtest1  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
   dbtest2  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
   dbtest3  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
   postgres | postgres | UTF8     | en_US.utf8 | en_US.utf8 |





We start postgres_exporter with the --auto-discover-databases option (if no database name is specified in the connection string, the connection is made to the database whose name matches the user name):







 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable" ./postgres_exporter --auto-discover-databases --log.format=logger:stdout
      
      







In the agent's output we see an idyllic picture: it has started and was able to connect to all the databases in the cluster (although it does not say which databases exactly; let's hope that gets fixed):







 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Starting Server: :9187  source="postgres_exporter.go:1490"





I think an attentive reader will notice that there are four databases in the cluster (postgres, dbtest1, dbtest2 and dbtest3; template0 and template1 are ignored), yet there are five connections. In our case postgres_exporter opens two connections to the postgres database, and you need to be very careful with this behaviour. Why? We will find out a little further on.
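If you want to see the doubled connection from the server side, a quick check with psql does the job (this is my own sketch, not part of the original walkthrough; note that the psql session itself adds one more connection to the postgres database):

 # Count client connections per database; with --auto-discover-databases running,
 # the postgres database shows two sessions opened by the exporter.
 # (backend_type requires PostgreSQL 10 or newer; the test instance here is 11.5.)
 $ psql -h 127.0.0.1 -p 5432 -U postgres -c \
     "SELECT datname, count(*) FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY datname ORDER BY datname;"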







Well, let's continue and try to get the metrics:







 $ curl http://localhost:9187/metrics
      
      





As a result, the output contains duplicate errors, "was collected before with the same name and label values" (while the postgres_exporter log itself shows no warnings):







 ...
 * collected metric pg_stat_activity_max_tx_duration label:<name:"datname" value:"dbtest1" > label:<name:"server" value:"127.0.0.1:5432" > label:<name:"state" value:"fastpath function call" > gauge:<value:0 > was collected before with the same name and label values
 * collected metric pg_stat_bgwriter_checkpoints_timed label:<name:"server" value:"127.0.0.1:5432" > counter:<value:1 > was collected before with the same name and label values
 ...





The only way to get rid of the errors is to disable the collection of the default metrics. There are two ways to do this: either set the environment variables PG_EXPORTER_DISABLE_DEFAULT_METRICS and PG_EXPORTER_DISABLE_SETTINGS_METRICS to true, or use the --disable-default-metrics and --disable-settings-metrics options (the environment-variable variant is sketched below).
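For completeness, here is the same launch driven by the environment variables instead of the command-line flags (the connection string repeats the one used above):

 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable" \
   PG_EXPORTER_DISABLE_DEFAULT_METRICS=true \
   PG_EXPORTER_DISABLE_SETTINGS_METRICS=true \
   ./postgres_exporter --auto-discover-databases --log.format=logger:stdout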







Restart postgres_exporter with additional options:







 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable" ./postgres_exporter --auto-discover-databases --log.format=logger:stdout --disable-default-metrics --disable-settings-metrics
      
      





Trying to get the metrics:







 $ curl http://localhost:9187/metrics
      
      





So far everything has gone according to plan, but now there is not a single PostgreSQL-related metric in the output:







 # HELP go_gc_duration_seconds A summary of the GC invocation durations.
 # TYPE go_gc_duration_seconds summary
 go_gc_duration_seconds{quantile="0"} 0
 go_gc_duration_seconds{quantile="0.25"} 0
 go_gc_duration_seconds{quantile="0.5"} 0
 go_gc_duration_seconds{quantile="0.75"} 0
 go_gc_duration_seconds{quantile="1"} 0
 go_gc_duration_seconds_sum 0
 go_gc_duration_seconds_count 0
 ...
 # HELP process_virtual_memory_bytes Virtual memory size in bytes.
 # TYPE process_virtual_memory_bytes gauge
 process_virtual_memory_bytes 1.3832192e+07





Next, in order to get a useful payload, we need to create a file describing which metrics we want to collect (remember, we can only collect database-specific metrics here).







For the test we will collect metrics from the pg_statio_user_tables view. To do this, create a queries.yaml file with the following contents:







 pg_statio_user_tables:
   query: "SELECT current_database() as datname, schemaname, relname, heap_blks_read, heap_blks_hit FROM pg_statio_user_tables"
   metrics:
     - datname:
         usage: "LABEL"
         description: "Name of database"
     - schemaname:
         usage: "LABEL"
         description: "Name of the schema that this table is in"
     - relname:
         usage: "LABEL"
         description: "Name of this table"
     - heap_blks_read:
         usage: "COUNTER"
         description: "Number of disk blocks read from this table"
     - heap_blks_hit:
         usage: "COUNTER"
         description: "Number of buffer hits in this table"





One point needs clarifying here: the query explicitly adds the name of the database it is executed in. This is a mandatory requirement, and there are at least two reasons for it (a quick way to check the query itself is shown right after the list):







  1. Databases may have tables with the same name, which will lead to an error due to duplication of metrics;
  2. Without this, you cannot identify which database the metric refers to, which will turn the collected data into garbage.
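Before wiring the query into the exporter, it is easy to sanity-check it directly with psql (my own sketch; run it against any of the test databases and make sure every row carries datname):

 # Run the custom query in dbtest1 and confirm the database name appears in each row.
 $ psql -h 127.0.0.1 -p 5432 -U postgres -d dbtest1 -c \
     "SELECT current_database() AS datname, schemaname, relname, heap_blks_read, heap_blks_hit FROM pg_statio_user_tables;"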


So, we launch our agent with the --extend.query-path option (it takes the path to the yaml file describing the metrics):







 DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432?sslmode=disable" ./postgres_exporter --log.format=logger:stdout --auto-discover-databases --disable-default-metrics --disable-settings-metrics --extend.query-path=./queries.yaml
      
      





We try to get the metrics (for clarity, we take only pg_statio_user_tables_heap_blks_hit):







 curl -s http://localhost:9187/metrics | grep pg_statio_user_tables_heap_blks_hit
      
      





As a result, we get an unambiguously interpretable set of metrics:







 # HELP pg_statio_user_tables_heap_blks_hit Number of buffer hits in this table
 # TYPE pg_statio_user_tables_heap_blks_hit counter
 pg_statio_user_tables_heap_blks_hit{datname="dbtest1",relname="t1",schemaname="public",server="127.0.0.1:5432"} 0
 pg_statio_user_tables_heap_blks_hit{datname="dbtest1",relname="t2",schemaname="public",server="127.0.0.1:5432"} 0
 pg_statio_user_tables_heap_blks_hit{datname="dbtest2",relname="t1",schemaname="public",server="127.0.0.1:5432"} 0
 pg_statio_user_tables_heap_blks_hit{datname="dbtest2",relname="t2",schemaname="public",server="127.0.0.1:5432"} 0
 pg_statio_user_tables_heap_blks_hit{datname="dbtest3",relname="t1",schemaname="public",server="127.0.0.1:5432"} 0
 pg_statio_user_tables_heap_blks_hit{datname="dbtest3",relname="t2",schemaname="public",server="127.0.0.1:5432"} 0





So, with the --auto-discover-databases option we are now able to collect metrics from all the databases of a single cluster instance. A nice bonus: when a new database is added, the agent does not need to be restarted.

But with all this we are left without instance-level metrics. For now there is only one solution: use separate agents to collect database metrics and instance metrics.
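A minimal sketch of such a split, assuming the exporter's --web.listen-address flag is used to put the two agents on different ports (the port numbers themselves are arbitrary):

 # Agent 1: per-database metrics only (custom queries, default metrics disabled).
 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable" \
     ./postgres_exporter --auto-discover-databases --disable-default-metrics --disable-settings-metrics \
     --extend.query-path=./queries.yaml --web.listen-address=":9187"

 # Agent 2: instance-wide (default) metrics, collected once through a single database.
 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/postgres?sslmode=disable" \
     ./postgres_exporter --web.listen-address=":9188"

Prometheus then simply scrapes the two ports as two separate targets.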

Of course this does not look very elegant, but the nuisance can be smoothed over by grouping agents so that one agent collects metrics from several instances. We will look at this other, rather interesting capability below.







The answer to the riddle of the 'extra' connection

Remember how, at the beginning, we drew attention to the "extra" connection? It turns out to be a feature of postgres_exporter when run with the --auto-discover-databases option.

But why can it cause so much trouble? The reason is simple and has already been described above: postgres_exporter collects metrics from the postgres database twice and therefore starts duplicating metrics. In our case only the arrival of the --exclude-databases option will help (so we are looking forward to the next release).

And yes, if you have user tables in the postgres database, then the example above will not work.
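Once that option lands, the invocation will presumably look something like the following (a sketch based only on the option name announced for the next release; I am assuming it takes a comma-separated list of database names to skip during auto-discovery):

 # Hypothetical usage of --exclude-databases (not available in 0.5.1):
 # skip the postgres database so it is not scraped twice.
 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable" \
     ./postgres_exporter --auto-discover-databases --exclude-databases=postgres \
     --disable-default-metrics --disable-settings-metrics --extend.query-path=./queries.yaml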







Multiple instances



Moving on. We have figured out how to get metrics from several databases; now let's see how to monitor several instances with a single agent. It is very simple: just list them in the DATA_SOURCE_NAME environment variable, separated by commas:







 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/postgres?sslmode=disable,postgresql://postgres@127.0.0.1:5434/postgres?sslmode=disable" ./postgres_exporter --log.format=logger:stdout
      
      





Here we connect to two different cluster instances running, in our case, on the local node. Here's what it looks like in the logs:







 INFO[0000] Established new database connection to "127.0.0.1:5432".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5432": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Established new database connection to "127.0.0.1:5434".  source="postgres_exporter.go:788"
 INFO[0000] Semantic Version Changed on "127.0.0.1:5434": 0.0.0 -> 11.5.0  source="postgres_exporter.go:1251"
 INFO[0000] Starting Server: :9187  source="postgres_exporter.go:1490"





Next, we try to get the metrics (for clarity, we restrict ourselves to the pg_stat_database_blk_read_time metric):







 curl -s http://localhost:9187/metrics | grep pg_stat_database_blk_read_time
      
      





As a result, from a single agent we get metrics for both instances:







 # HELP pg_stat_database_blk_read_time Time spent reading data file blocks by backends in this database, in milliseconds
 # TYPE pg_stat_database_blk_read_time counter
 pg_stat_database_blk_read_time{datid="1",datname="template1",server="127.0.0.1:5432"} 0
 pg_stat_database_blk_read_time{datid="1",datname="template1",server="127.0.0.1:5434"} 0
 pg_stat_database_blk_read_time{datid="13116",datname="template0",server="127.0.0.1:5432"} 0
 pg_stat_database_blk_read_time{datid="13116",datname="template0",server="127.0.0.1:5434"} 0
 pg_stat_database_blk_read_time{datid="13117",datname="postgres",server="127.0.0.1:5432"} 0
 pg_stat_database_blk_read_time{datid="13117",datname="postgres",server="127.0.0.1:5434"} 0
 pg_stat_database_blk_read_time{datid="16384",datname="dbtest1",server="127.0.0.1:5432"} 0
 pg_stat_database_blk_read_time{datid="16385",datname="dbtest2",server="127.0.0.1:5432"} 0
 pg_stat_database_blk_read_time{datid="16386",datname="dbtest3",server="127.0.0.1:5432"} 0





In this case everything turned out somewhat simpler than with several databases on one instance, and at the same time we keep the ability to receive cluster-wide metrics from all instances.







Summary



As for the third case listed among the goals, it is simply a combination of the two described above, so I see no reason to walk through it in detail; a sketch of the combined launch is shown below.
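For completeness, a combined invocation might look like this (my own sketch that simply merges the two earlier examples; whether --auto-discover-databases behaves well together with several comma-separated DSNs is worth verifying for your version):

 # Per-database custom metrics from every database of two instances at once.
 $ DATA_SOURCE_NAME="postgresql://postgres@127.0.0.1:5432/?sslmode=disable,postgresql://postgres@127.0.0.1:5434/?sslmode=disable" \
     ./postgres_exporter --auto-discover-databases --disable-default-metrics --disable-settings-metrics \
     --extend.query-path=./queries.yaml --log.format=logger:stdout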







The bottom line: postgres_exporter is, in my opinion, an interesting and promising tool for administrators who monitor PostgreSQL cluster instances and the databases deployed on them. Due to its young age it is not without flaws, but they are easy to understand and forgive.







Sources





At the time of writing, the current postgres_exporter version is 0.5.1. Supported PostgreSQL versions are 9.4+ (the source code notes a 9.1+ restriction).







