PostgreSQL monitoring basics. Alexey Lesovsky

We suggest you read this transcript of the talk by Alexey Lesovsky (Data Egret), "PostgreSQL Monitoring Basics".

In this talk, Alexey Lesovsky covers the key Postgres statistics: what they mean and why they should be included in monitoring, which charts monitoring should have, how to add them and how to interpret them. The talk will be useful for database administrators, system administrators and developers interested in Postgres troubleshooting.

My name is Alexey Lesovsky, I represent Data Egret.

A few words about myself. I started a long time ago as a system administrator.

I administered all sorts of Linux systems and did various Linux-related things: virtualization, monitoring, proxies, and so on. But at some point I became more involved with databases, with PostgreSQL. I really liked it, and eventually I was spending most of my working time on PostgreSQL. So gradually I became a PostgreSQL DBA.

And throughout my career, I have always been interested in statistics, monitoring and telemetry. When I was a system administrator, I worked a lot with Zabbix and wrote a small set of scripts called zabbix-extensions. It was quite popular in its time and made it possible to monitor very different important things, not only Linux but also various other components.

Now I work with PostgreSQL, and I am writing another tool for working with PostgreSQL statistics. It is called pgCenter (see the article on Habré - "Postgres stat without nerves and tension").

A small introduction. What kind of situations do we see with our customers, with our clients? There is some kind of incident involving the database. And when the database has already been restored, the head of the department or the head of development comes and says: "Friends, we should monitor the database, because something bad happened and we must make sure it does not happen again." And here begins the interesting process of choosing a monitoring system, or adapting an existing one, so that you can monitor your database - PostgreSQL, MySQL or something else. Colleagues begin to suggest: "I've heard there is such-and-such a monitoring system, let's use it." Colleagues start arguing with each other, and in the end we choose some monitoring system in which PostgreSQL is covered rather weakly, and we always have to finish things ourselves: take repositories from GitHub, clone them, adapt scripts, tune them somehow. In the end it turns into a lot of manual work.

Therefore, in this talk I will try to give you some knowledge on how to choose monitoring, not only for PostgreSQL but for any database, and the knowledge that will let you refine your monitoring so that you actually benefit from it: so that you can monitor your database usefully and catch upcoming emergency situations in time.

The ideas in this talk can be adapted to any database, whether a relational DBMS or NoSQL. So it is not only about PostgreSQL, although there will be many PostgreSQL-specific recipes: example queries and examples of the entities PostgreSQL exposes for monitoring. If your DBMS has similar things that can be put into monitoring, you can adapt and add them as well, and it will work just fine.

In this talk I will not discuss how metrics are delivered and stored, nor the post-processing of the data and presenting it to the user, nor alerting.

Along the way I will show screenshots of various existing monitoring systems and criticize them a bit. Nevertheless, I will try not to name brands, so as not to advertise or anti-advertise these products. All coincidences are accidental and left to your imagination.

First, let's understand what monitoring is. Monitoring is a very important thing to have; everyone understands this. But monitoring is not a business product and does not directly affect the company's profit, so time for it is always allocated on a residual basis. If we have time, we work on monitoring; if not, OK, it goes into the backlog and someday we will come back to it.

Therefore, in our practice, when we come to clients, monitoring is often underdeveloped and lacks the interesting things that would help us do a better job with the database. That is why monitoring always needs to be refined.

Databases themselves are complex things that also need to be monitored, because a database is a repository of information. That information is very important to the company and must not be lost in any way. At the same time, databases are very complex pieces of software, made up of many components, and many of those components need to be monitored.

If we are talking specifically about PostgreSQL, it can be represented as a scheme consisting of a large number of components that interact with each other. At the same time, PostgreSQL has the so-called Stats Collector subsystem, which collects statistics about the operation of these subsystems and provides an interface so the administrator or user can view these statistics.

These statistics are exposed as a set of functions and views; they can also be thought of as tables. Using an ordinary psql client, you can connect to the database, select from these functions and views, and get concrete numbers about the operation of PostgreSQL subsystems.

You can feed these numbers into your favorite monitoring system, draw charts, build additional things on top of them, and get analytics over the long run.

But in this talk I will not cover all of these functions without exception, because that could take a whole day. I will touch on literally two, three or four of them and tell you how they help make monitoring better.

If we talk about monitoring the database, what needs to be monitored? First of all, availability, because the database is a service that provides clients with access to data. We need to monitor its availability and also provide some of its qualitative and quantitative characteristics.

We also need to monitor the clients that connect to our database, because they can be both normal clients and harmful ones that can damage the database. They also need to be monitored and tracked.

When clients connect to the database, they obviously start working with our data, so we need to monitor how clients work with that data: which tables and, to a lesser extent, which indexes they access. That is, we need to evaluate the workload created by our clients.

The workload, of course, also consists of queries. Applications connect to the database and access data using queries, so it is important to evaluate which queries we have in the database, monitor their adequacy, make sure they are not written poorly, and rewrite the problematic variants so they work faster and perform better.

And since we are talking about a database, there are always background processes. Background processes keep database performance at a good level, so they need a certain amount of resources to run. At the same time, they can compete with client queries for those resources, so greedy background processes can directly affect the performance of client queries. They also need to be monitored and tracked so that nothing goes wrong on the background-process side.

That covers database monitoring itself; what remains is system metrics. Given that for the most part our entire infrastructure is moving to the clouds, the system metrics of an individual host tend to fade into the background. But for databases they are still relevant, and of course system metrics also need to be monitored.

With system metrics everything is more or less fine - all modern monitoring systems already support them - but some things are still missing and need to be added. I will touch on them too; a few slides will be about them.

The first point of the plan is availability. What is availability? Availability, as I understand it, is the database's ability to serve connections: the database is up and, as a service, accepts connections from clients. This availability can be assessed by certain characteristics, which are very convenient to display on dashboards.

Everyone knows what dashboards are: a screen where the necessary information is summarized, so that with one glance you can immediately tell whether there is a problem in the database or not.
Accordingly, database availability and other key characteristics should always be put on dashboards so that this information is at hand, always with you. Additional details that help in investigating incidents and emergency situations should go on secondary dashboards or be hidden behind drilldown links that lead to third-party monitoring systems.

An example from one well-known monitoring system. It is a very cool system and collects a lot of data, but from my point of view it has a strange concept of dashboards. There is a "Create Dashboard" link, but when you create a dashboard you get a two-column list of charts. And when you need to look at something, you start clicking and scrolling, hunting for the desired chart with the mouse. This takes time; there are no dashboards as such, only lists of charts.

What should we put on these dashboards? We can start with a characteristic like response time. PostgreSQL has the pg_stat_statements view. It is disabled by default, but it is one of the important system views that should always be enabled and used. It stores information about all the queries that have been executed in the database.

Accordingly, we can start with the simplest thing: take the total execution time of all queries (total_time) and divide it by the number of calls. But that is the average temperature across the hospital. We can build on other fields - the minimum, maximum and mean query execution time - and we can even build percentiles; PostgreSQL has the corresponding functions for this. So we get numbers that characterize the response time of our database for queries that have already completed. That is, we do not run a fake 'select 1' and watch its response time; we analyze the response time of real completed queries and plot either a single figure or a chart based on it.
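As a minimal sketch (using the pre-13 pg_stat_statements column names total_time, min_time, max_time and calls; in PostgreSQL 13+ these are split into separate plan/exec counters), such a "database response time" figure might be computed like this:

    -- Overall response time across all statements recorded by pg_stat_statements.
    -- avg_time_ms is the "average temperature in the hospital"; min/max show the spread.
    SELECT sum(total_time) / nullif(sum(calls), 0) AS avg_time_ms,
           min(min_time)                           AS fastest_ms,
           max(max_time)                           AS slowest_ms
    FROM pg_stat_statements;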

It is also important to keep track of the number of errors the system currently generates. For this you can use the pg_stat_database view; we are interested in the xact_rollback field. This field not only shows the number of rollbacks occurring in the database, it also effectively reflects errors, since a transaction aborted by an error counts as a rollback. Relatively speaking, we can display this figure on our dashboard and see how many errors we have at the moment. If there are many errors, that is already a good reason to look into the logs, see what kind of errors they are and why they occur, and then go and fix them.
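A small sketch of sampling this counter; a monitoring agent would store the previous value and plot the delta between samples:

    -- Rollbacks per database; the counter also grows when transactions abort on errors.
    SELECT datname, xact_rollback
    FROM pg_stat_database
    WHERE datname NOT IN ('template0', 'template1');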

You can add something like a tachometer: the number of transactions per second and the number of queries per second. Relatively speaking, you can treat these numbers as the current performance of your database and watch for peaks in queries or transactions, or, conversely, for the database being underloaded because some backend has fallen off. It is important to keep an eye on this figure and to know what performance is normal for your project; values noticeably above or below that are already a sign of something problematic and unclear, which means we need to find out where those numbers come from.

To estimate the number of transactions, we can again refer to the pg_stat_database view: adding the number of commits and the number of rollbacks gives the number of transactions per second.

Everyone understands that several queries can fit into one transaction, so TPS and QPS are slightly different things.

The number of queries per second can be obtained from pg_stat_statements by simply summing the calls of all recorded queries. Naturally, we compare the current value with the previous one, subtract, get a delta, and from the delta a rate.
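For illustration, the raw counters behind TPS and QPS might be sampled like this; the monitoring agent keeps the previous values and divides the delta by the sampling interval:

    -- Transactions: commits + rollbacks, summed over all databases.
    SELECT sum(xact_commit + xact_rollback) AS xact_total
    FROM pg_stat_database;

    -- Queries: total number of calls recorded by pg_stat_statements.
    SELECT sum(calls) AS queries_total
    FROM pg_stat_statements;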

If you wish, you can add further metrics that also help assess the availability of our database and track whether there was any downtime.

One of these metrics is uptime. But uptime in PostgreSQL is a bit tricky; I'll tell you why. When PostgreSQL starts up, it begins counting uptime. But if, for example, some task was running at night, the OOM killer came and forcibly terminated a PostgreSQL child process, then PostgreSQL terminates all client connections, resets the shared memory area and starts recovery from the last checkpoint. And while this recovery from the checkpoint lasts, the database does not accept connections, which can be considered downtime. But this does not reset the uptime counter, because it tracks the time since the postmaster was first started. Therefore such situations can be missed.
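A simple sketch of the uptime figure; keep the caveat above in mind, since crash recovery does not reset it:

    -- Time elapsed since the postmaster was started.
    SELECT now() - pg_postmaster_start_time() AS uptime;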

You should also monitor the number of vacuum workers. Everyone knows what autovacuum in PostgreSQL is? It is an interesting subsystem: many articles have been written about it, many talks given, and there is a lot of discussion about how vacuum should work. Many consider it a necessary evil, and that is what it is. It is essentially a garbage collector that cleans up obsolete row versions no longer needed by any transaction and frees up space in tables and indexes for new rows.

Why should it be monitored? Because vacuum sometimes hurts a lot: it consumes a large amount of resources, and client queries start to suffer as a result.

It should be monitored through the pg_stat_activity view, which I will talk about in the next section. This view shows the current activity in the database, and through it we can track the number of vacuums working right now. If we see that we have exceeded some limit, that is a reason to look into the PostgreSQL settings and somehow optimize how vacuum operates.
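One possible sketch for counting the autovacuum workers currently running; the backend_type column exists since PostgreSQL 10, on older versions you could filter on query LIKE 'autovacuum:%' instead:

    -- Autovacuum workers busy right now; compare with autovacuum_max_workers.
    SELECT count(*) AS autovacuum_workers
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';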

Another peculiarity of PostgreSQL is that it suffers badly from long transactions, especially transactions that hang around for a long time and do nothing: the so-called idle-in-transaction state. Such a transaction holds locks and prevents vacuum from working. As a result, tables bloat and grow in size, and queries that work with these tables get slower, because all the old row versions have to be shoveled from memory to disk and back. Therefore the duration of the longest transactions and the longest vacuums also needs to be monitored. And if we see processes that have been running for a very long time - more than 10-20-30 minutes for an OLTP load - we need to pay attention to them and terminate them forcibly, or optimize the application so that they are not called and do not hang that long. For an analytical workload, 10-20-30 minutes is normal, and even longer ones happen.

Next on the plan are connected clients. Once we have built a dashboard and put the key availability metrics on it, we can also add information about connected clients there.

Information about connected clients is important because, from PostgreSQL's point of view, there are different types of clients: there are good clients and there are bad clients.

A simple example. By client I mean the application. The application connects to the database and immediately starts sending queries; the database processes and executes them and returns the results to the client. These are good, proper clients.

There are situations where a client connects and holds the connection but does nothing: it sits in the idle state.

But there are bad clients. For example, a client connects, opens a transaction, does something in the database and then goes off into application code - say, to call an external source or to process the data it received - without closing the transaction. The transaction hangs in the database and holds locks on rows. This is a bad state: if the application somewhere inside falls over with an exception, the transaction can stay open for a very long time. And this directly affects PostgreSQL performance: PostgreSQL will run slower. So it is important to catch such clients in time and terminate them forcibly, and to optimize the application so that such situations do not occur.

Other bad clients are waiting clients, although they become bad through circumstance. For example, a simple idle transaction opens, takes locks on some rows, and then the application falls over somewhere in its code, leaving the transaction hanging. Another client comes, requests the same data, and runs into a lock, because the hanging transaction already holds locks on the rows it needs. The second transaction hangs waiting for the first one to complete or for an administrator to kill it forcibly. Waiting transactions can thus pile up and exhaust the database connection limit, and when the limit is reached, the application can no longer work with the database at all. That is already an emergency for the project. So bad clients need to be tracked and responded to in a timely manner.

Another monitoring example. Here is a seemingly decent dashboard. At the top there is connection information: DB connections, 8 of them. And that is all. We have no information about which clients are active, which are simply idle doing nothing, no information about hanging transactions or waiting connections. It is just a figure showing the number of connections, and then guess for yourself.

Accordingly, to add this information to monitoring, you need to refer to the pg_stat_activity system view. If you spend a lot of time with PostgreSQL, this view should become your friend: it shows the current activity in PostgreSQL, i.e. what is happening in it right now. There is a separate row for each process showing information about it: from which host the connection was made, under which user, under which name, when the transaction was started, which query is currently running, which query was executed last. And we can evaluate the client's state by the state field: relatively speaking, we can group by this field and get the states currently present in the database and the number of connections in each state. The resulting numbers can be sent to our monitoring and charted.
It is also important to evaluate transaction duration. I already said that it is important to evaluate the duration of vacuums; transactions are evaluated in the same way. There are the xact_start and query_start fields, which show the start time of the transaction and the start time of the query. We take the now() function, which returns the current timestamp, and subtract the transaction and query timestamps, which gives the duration of the transaction and the duration of the query.

If we see long transactions, we should terminate them. For an OLTP load, long transactions are anything over 1-2-3 minutes. For an OLAP load, long transactions are normal, but if one runs for more than two hours, that is also a sign that something is skewed somewhere.
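Two small sketches against pg_stat_activity: one groups connections by state, the other lists the longest-running transactions with their transaction and query durations:

    -- How many connections are in each state (active, idle, idle in transaction, ...).
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE state IS NOT NULL
    GROUP BY state;

    -- The oldest transactions, with transaction and query durations.
    SELECT pid, usename, state,
           now() - xact_start  AS xact_duration,
           now() - query_start AS query_duration,
           left(query, 60)     AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;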

Once clients have connected to the database, they start working with our data. They access tables and indexes to get data from the tables, and it is important to evaluate how clients work with that data.

This is needed to evaluate our workload and roughly understand which tables are the "hottest". For example, it is useful when we want to place "hot" tables on some fast SSD storage, while archive tables that have not been used in a long time can be moved to a "cold" archive on SATA disks and live there, accessed as needed.

It is also useful for detecting anomalies after releases and deployments. Say the project rolled out a new feature, for example new functionality that works with the database. If we chart table usage, we can easily spot anomalies on these charts: bursts of updates or bursts of deletes will be very visible.

It is also possible to detect anomalies caused by "drifted" statistics. What does that mean? PostgreSQL has a very strong, very good query planner, and the developers devote a lot of time to its development. How does it work? In order to build good plans, PostgreSQL periodically collects statistics about the distribution of data in tables: the most frequent values, the number of distinct values, information about NULLs in the table, and much more.

Based on these statistics, the planner builds several plans, chooses the most optimal one, and uses that plan to execute the query and return the data.

And it happens that the statistics drift. The data in the table has changed qualitatively or quantitatively, but statistics have not been collected yet, so the plans that get built may not be optimal. And if plans turn out to be suboptimal, we can see it from the monitoring we collect on the tables. For example, somewhere the data changed qualitatively, and instead of an index a sequential scan of the table started being used: if the query only needs to return 100 rows (there is a limit of 100), a full table scan is still performed for it. And that always hurts performance badly.

And we can see this in monitoring, look at the query, run explain on it, collect statistics, build a new additional index, and respond to the problem. That is why this is important.

Another monitoring example. I think many people will recognize it, because it is very popular. Who uses Prometheus in their projects? And who uses this product together with Prometheus? The fact is that the standard dashboard repository of this product contains a dashboard for working with PostgreSQL via postgres_exporter for Prometheus. But there is one bad detail.

There are several charts, and bytes are specified as the unit. There are 5 graphs - Insert data, Update data, Delete data, Fetch data and Return data - with bytes as the unit of measurement. But the thing is, statistics in PostgreSQL return data in tuples (rows). So these charts are a very good way to underestimate your workload several times, even tens of times, because a tuple is not a byte: a tuple is a row, it is many bytes and it is always of variable length. Computing the workload in bytes from tuples is an unrealistic, or at least very difficult, task. Therefore, when you use a ready-made dashboard or built-in monitoring, it is always important to check that it works correctly and returns correctly interpreted data.

How do we get statistics on tables? PostgreSQL has a family of views for this, and the main one is pg_stat_user_tables. "User tables" means tables created by users; in contrast, there are system tables used by PostgreSQL itself, covered by pg_stat_sys_tables. And there is a combined view, pg_stat_all_tables, which includes both. You can start from whichever you like best.

The fields of this view (n_tup_ins, n_tup_upd, n_tup_del) can be used to estimate the number of inserts, updates and deletes. The example dashboard I mentioned uses these same fields to characterize the workload, so we can build on them too. But remember that these are tuples, not bytes, so we cannot simply present them as bytes.

Based on these data, we can build so-called Top-N tables, for example Top-5 or Top-10, and keep track of the hot tables that are used more than others - say, the 5 "hottest" tables by inserts. From these Top-N lists we evaluate our workload and can spot workload bursts after releases, updates and deployments.

It is also important to track table size, because sometimes developers roll out a new feature and our tables start to balloon: they decided to store an additional volume of data but did not predict how that would affect the size of the database. Such cases come as surprises to us too.
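A sketch of such a Top-N list: the five most heavily modified tables together with their total on-disk size:

    -- Top-5 "hottest" tables by row modifications (tuples, not bytes!).
    SELECT relname,
           n_tup_ins,
           n_tup_upd,
           n_tup_del,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
    LIMIT 5;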

And now a small question for you. When you notice load on the database server, what question comes to your mind? What do you ask next?

The real question is: which queries cause that load? It is not very interesting to look at which processes cause the load. It is clear that if the host runs a database, then the database is running there and only the database will be consuming resources. If we open top, we will see a list of PostgreSQL processes doing something, but top will not tell us what exactly they are doing.

Accordingly, you need to find the queries that cause the most load, because query tuning, as a rule, gives more benefit than tuning the PostgreSQL configuration or the operating system, or even the hardware. By my estimate, it accounts for about 80-85-90% of the gain. And it is done much faster: it is faster to fix a query than to fix the configuration and schedule a restart (especially if the database cannot be restarted) or to add hardware. It is easier to rewrite the query or add an index to get a better result from that query.

Accordingly, we need to monitor queries and their adequacy. Let's take another monitoring example. Here, too, everything seems excellent: there is information on replication, on throughput, on locking, on resource utilization. Everything is fine, but there is no information about queries. It is not clear which queries run in our database, how long they run, or how many of them there are. We always need to have this information in monitoring.

And to get this information, we can use the pg_stat_statements module. On its basis you can build a variety of charts. For example, you can get information about the most frequent queries, i.e. those executed most often. After deploys it is also very useful to look at this and see whether there is a surge in some queries.

You can monitor the longest queries, i.e. those that run the longest, burn CPU and consume I/O. We can evaluate this by the fields total_time, mean_time, blk_write_time and blk_read_time.

We can evaluate and monitor the heaviest queries in terms of resource usage: those that read from disk, those that work hard with memory, or, on the contrary, create some kind of write load.

We can evaluate the most generous queries: those that return a large number of rows. For example, it may be a query where someone forgot to set a limit, and it simply returns the entire contents of a table.

And you can also monitor queries that use temporary files or temporary tables.
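A possible sketch for a Top-10 of the heaviest statements (again using pre-13 column names; the query text is truncated for readability):

    -- Ten statements that consumed the most total execution time.
    SELECT left(query, 60)               AS query,
           calls,
           round(total_time::numeric, 1) AS total_time_ms,
           round(mean_time::numeric, 2)  AS mean_time_ms,
           rows,
           temp_blks_written             -- statements spilling to temporary files
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;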

And we still have background processes. Background processes are, first of all, checkpoints, plus autovacuum and replication.

Another monitoring example. There is a Maintenance tab on the left; you go there hoping to see something useful, but there is only the time of the last vacuum and of the last statistics collection, nothing else. This is very poor information, so you always need to know how the background processes in your database behave and whether they are causing any problems.

Looking at checkpoints, remember that a checkpoint flushes "dirty" pages from the shared memory area to disk and then records a checkpoint. That checkpoint can later be used as the starting point for recovery if PostgreSQL was terminated abruptly.

Accordingly, to flush all the "dirty" pages to disk, a certain amount of writing has to be done, and on systems with a lot of memory that is a lot of writing. If checkpoints happen very often, over some short interval, disk performance will sag badly and client queries will suffer from the lack of resources: they will compete for resources and lose throughput.

Accordingly, through pg_stat_bgwriter and its fields, we can monitor the number of checkpoints that occur. If we get many checkpoints in a certain period of time - say 3-4-5 in 10-15-20 minutes or half an hour - that may already be a problem, and we need to look into the database and the configuration to find out what causes such an abundance of checkpoints. Maybe a big write load is coming in; we can evaluate the workload, since we have already added workload charts. We can then tweak the checkpoint parameters and make sure they do not strongly affect query performance.
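The relevant counters in pg_stat_bgwriter; a high share of requested (forced) checkpoints relative to timed ones usually means checkpoints are being triggered by write volume rather than by schedule:

    -- Checkpoint counters; the monitoring agent plots deltas of these values.
    SELECT checkpoints_timed,   -- checkpoints started on schedule (checkpoint_timeout)
           checkpoints_req,     -- checkpoints forced, e.g. by reaching max_wal_size
           buffers_checkpoint   -- buffers written during checkpoints
    FROM pg_stat_bgwriter;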

I am coming back to autovacuum again because, as I said, it is the kind of thing that can easily eat up both disk and query performance, so it is always important to watch the amount of autovacuum activity.

The number of autovacuum workers in the database is limited; by default there are three. So if we constantly have three workers busy in the database, it means autovacuum is underconfigured: we need to raise the limits, revisit the autovacuum settings and dig into the configuration.
It is also important to see which vacuum workers are running. Either a vacuum was launched manually - a DBA came and started one by hand, and that created load, so we have a problem - or these are the vacuums triggered to prevent transaction ID wraparound. In some PostgreSQL versions these are very heavy vacuums, and they can easily eat performance because they read the entire table, scanning every block in it.

And, of course, the duration of vacuums. If we have vacuums that run for a very long time, this again means we should pay attention to the vacuum configuration and perhaps reconsider its settings. Because a situation can arise where a vacuum works on a table for a long time (3-4 hours), and during that time a large number of dead rows accumulates in the table again. As soon as the vacuum finishes, it has to vacuum the same table once more, and we arrive at an endless vacuum. In that case vacuum cannot keep up with its work, and the tables gradually bloat in size even though the amount of useful data in them stays the same. So with long vacuums we always look at the configuration and try to optimize it, while making sure client query performance does not suffer.
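A sketch that classifies the vacuums currently visible in pg_stat_activity into manual vacuums, regular autovacuum and anti-wraparound autovacuum, together with how long they have been running; the text patterns rely on how autovacuum reports itself in the query column:

    -- Currently running vacuums with their type and duration.
    SELECT pid,
           now() - query_start AS duration,
           CASE
             WHEN query LIKE 'autovacuum:%(to prevent wraparound)' THEN 'anti-wraparound autovacuum'
             WHEN query LIKE 'autovacuum:%'                        THEN 'autovacuum'
             ELSE 'manual vacuum'
           END AS vacuum_type,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE query ILIKE '%vacuum%'
      AND pid <> pg_backend_pid()   -- exclude this monitoring query itself
    ORDER BY query_start;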

Nowadays there is practically no PostgreSQL installation without streaming replication. Replication is the process of transferring data from a master to a replica.

Replication in PostgreSQL works through the transaction log. The master generates the transaction log, the log travels over a network connection to the replica, and the replica replays it. Simple.

Accordingly, the pg_stat_replication view is used to monitor the replication lag. But it is not entirely straightforward. In version 10 the view underwent several changes: some fields were renamed and some were added. In version 10, fields appeared that let you evaluate the replication lag in seconds, which is very convenient. Before version 10 you could only estimate the replication lag in bytes. That option remains in version 10, so you can choose what is more convenient for you: evaluate the lag in bytes or in seconds. Many do both.

However, to evaluate the replication lag in bytes, you need to know positions in the transaction log, and those positions are right there in the pg_stat_replication view. Relatively speaking, with the pg_xlog_location_diff() function we can take two points in the transaction log, compute the delta between them and get the replication lag in bytes. It is very convenient and simple.

In version 10 this function was renamed pg_wal_lsn_diff(). In general, everywhere the word "xlog" appeared in functions, views and utilities, it was replaced with "wal", both in views and in functions. That is one of the innovations.

Plus, in version 10, fields were added that show the lag directly: write_lag, flush_lag and replay_lag. It is important to monitor these things: if we see that we have a replication lag, we need to investigate why it appeared, where it comes from, and fix the problem.
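A sketch for PostgreSQL 10+ that combines both approaches: the byte lag computed from LSN positions and the time lags added in version 10 (on 9.x you would use pg_current_xlog_location(), pg_xlog_location_diff() and the *_location column names instead):

    -- Replication lag per standby, in bytes and in time.
    SELECT application_name,
           client_addr,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           write_lag,
           flush_lag,
           replay_lag
    FROM pg_stat_replication;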

With system metrics almost everything is in order. When any monitoring is born, it starts with system metrics: CPU, memory, swap, network and disk utilization. Nevertheless, many parameters are missing by default.

While CPU utilization is usually covered fine, disk utilization is a problem. As a rule, monitoring developers add bandwidth information, in iops or bytes, but they forget about latency and disk device utilization. These are more important parameters that let us evaluate how loaded our disks are and how much they slow things down. High latency means there are problems with the disks; high utilization means the disks cannot keep up. These are better qualitative characteristics than bandwidth.

These statistics can also be obtained from the /proc file system, just as is done for CPU utilization. Why this information is not added to monitoring, I do not know, but it is still important to have it in yours.

The same goes for network interfaces. There is information about network bandwidth in packets and bytes, but still nothing about latency and nothing about utilization, although that is also useful information.

Any monitoring has its drawbacks. Whatever monitoring you take, it will always fall short of some criteria. Nevertheless, these systems keep developing and new features get added, so choose something and refine it yourself.

And in order to refine it, you must always understand what a given statistic means and how you can use it to solve problems.

And a few key points:

  • You always need to monitor availability and have dashboards so that you can quickly assess that everything is in order with the database.
  • You always need to know which clients are working with your database, in order to weed out bad clients and terminate them.
  • It is important to evaluate how these clients work with the data: you need to understand your workload.
  • It is important to evaluate how that workload is formed, i.e. with which queries. You can evaluate queries, optimize them, refactor them, build indexes for them. It is very important.
  • Background processes can negatively impact client queries, so it is important to make sure they do not use too many resources.
  • System metrics allow you to plan for scaling and for increasing the capacity of your servers, so it is important to track and evaluate them too.

If you are interested in this topic, here are some links.
http://bit.do/stats_collector - the official documentation on the statistics collector. It describes all the statistics views and all their fields. You can read them, understand them and analyze them, and on their basis build your own charts and add them to your monitoring.

Query examples:
http://bit.do/dataegret_sql
http://bit.do/lesovsky_sql

These are our company repository and my own one. They contain sample queries - not just queries of the "select * from something" kind, but ready-made queries with joins, using interesting functions that turn raw numbers into readable, convenient values such as bytes and time. You can pick them apart, study them, analyze them, add them to your monitoring, or build your own monitoring based on them.

Questions

Question: You said that you would not advertise brands, but I'm still wondering - what kind of dashboards do you use in your projects?
Answer: It varies. Sometimes we come to a customer and they already have their own monitoring, and we advise them on what to add to it. The worst situation is with Zabbix, because it cannot build Top-N charts. We ourselves use Okmeter, because we consulted those guys on monitoring; they built their PostgreSQL monitoring based on our terms of reference. I am also writing my own pet project that collects data through Prometheus and draws it in Grafana. My task is to make my own exporter for Prometheus and then draw everything in Grafana.

Question: Are there any analogues of AWR reports or ... aggregations? Are you aware of anything like that?
Answer: Yes, I know what AWR is, it is a cool thing. At the moment there are various homegrown tools that implement roughly the following model: at some interval, baselines are written to the same PostgreSQL instance or to a separate storage. You can google them, they exist. One of the developers of such a tool hangs out on the sql.ru forum in the PostgreSQL thread; you can catch him there. Yes, such things exist and can be used. Plus, in my pgCenter I am also writing a feature that lets you do the same thing.

PS1 If you are using postgres_exporter, which dashboard are you using? There are several of them, and they are already outdated. Maybe the community should create an updated template?

PS2 Removed pganalyze as it is a proprietary SaaS offering which focuses on performance monitoring and automated tuning suggestions.

Which self-hosted PostgreSQL monitoring (with dashboards) do you think is the best?

  • 30.0% (3 votes) Zabbix + additions from Alexey Lesovsky, or Zabbix 4.4, or libzbxpgsql + Zabbix
  • 0.0% (0 votes) https://github.com/lesovsky/pgcenter
  • 0.0% (0 votes) https://github.com/pg-monz/pg_monz
  • 20.0% (2 votes) https://github.com/cybertec-postgresql/pgwatch2
  • 20.0% (2 votes) https://github.com/postgrespro/mamonsu
  • 0.0% (0 votes) https://www.percona.com/doc/percona-monitoring-and-management/conf-postgres.html
  • 10.0% (1 vote) pganalyze is a proprietary SaaS - can't delete
  • 10.0% (1 vote) https://github.com/powa-team/powa
  • 0.0% (0 votes) https://github.com/darold/pgbadger
  • 0.0% (0 votes) https://github.com/darold/pgcluu
  • 0.0% (0 votes) https://github.com/zalando/PGObserver
  • 10.0% (1 vote) https://github.com/spotify/postgresql-metrics

10 users voted. 26 users abstained.

Source: habr.com
