Eating the Elephant Piece by Piece: An Application Health Monitoring Strategy with Examples

Hi all!

Our company develops software and then provides technical support for it. Support means not only fixing errors but also monitoring the health of our applications.

For example, if one of the services goes down, we need to detect the problem automatically and start working on it right away, rather than wait for dissatisfied users to contact technical support.

We are a small company and have no resources to study and maintain complex application monitoring solutions, so we needed something simple and effective.


Monitoring strategy

Checking an application's health is not a trivial task; one might even call it creative. It is especially difficult to check a complex multi-tier system.

How do you eat an elephant? Only piece by piece! We use the same approach to monitor applications.

The essence of our monitoring strategy:

  1. Break the application into components.
  2. For each component, come up with control checks.

A component is considered healthy if all of its control checks are performed without errors. An application is considered healthy if all its components are healthy.

Thus, any system can be represented as a tree of components. Complex components are broken down into simpler ones. Simple components have checks.
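
The rule above can be sketched in a few lines. This is only an illustration of the idea, not the code of any particular monitoring system: the `Component` class, the two-value `Status` enum and the sample tree are all assumptions made for the example, as is the choice to treat a component without checks or children as healthy.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    SUCCESS = 0
    ALARM = 1


@dataclass
class Component:
    name: str
    checks: dict = field(default_factory=dict)    # check name -> last Status
    children: list = field(default_factory=list)  # nested Component objects

    def status(self) -> Status:
        """A component is healthy only if all its checks and all its children are healthy."""
        own = list(self.checks.values())
        nested = [child.status() for child in self.children]
        # Assumption: a component with no checks and no children counts as healthy.
        return max(own + nested, default=Status.SUCCESS)


# Hypothetical tree, for illustration only.
site = Component("Website", checks={"main page opens": Status.SUCCESS})
task = Component("Task A", checks={"execution": Status.ALARM})
agent = Component("Agent", children=[task])
app = Component("Application", children=[site, agent])
```

Here `app.status()` is `ALARM` because one check deep in the tree has failed, while `site.status()` stays `SUCCESS`, so it is clear which branch to look at.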


Control checks are not meant to do functional testing; they are not unit tests. A control check should show how the component is doing right now: whether it has all the resources it needs to function and whether there are any problems.

Miracles do not happen: most checks have to be written by yourself. But do not be afraid, because in most cases a check takes 5-10 lines of code, and in return you can implement any logic and will clearly understand how the check works.

Monitoring system

Suppose we have split the application into components and devised and implemented checks for each of them. What do we do with the results of these checks? How do we find out that a check has failed?

We need a monitoring system. It will perform the following tasks:

  • Receive check results and use them to determine the status of components.
    Visually, this looks like highlighting the component tree: healthy components turn green, problematic ones turn red.
  • Perform general checks out of the box.
    The monitoring system can perform some checks itself, so there is no need to reinvent the wheel. For example, it can check that a site page opens or that the server responds to ping.
  • Send notifications about problems to interested parties.
  • Visualize monitoring data and provide reports, graphs and statistics.

Brief description of the ASMO system

It is best to explain with an example. Let's see how health monitoring of the ASMO system is organized.

ASMO is an automated meteorological support system. The system helps road service specialists understand where and when it is necessary to treat the road with anti-icing materials. The system collects data from road control points (MPCs). A road control point is a place on the road where equipment is installed: a weather station, a video camera, and more. To predict dangerous situations, the system receives weather forecasts from external sources.


So, the composition of the system is quite typical: website, agent, equipment. Let's start monitoring.

Breaking down the system into components

The following components can be distinguished in the ASMO system:

1. Personal account
This is a web application. At a minimum, you need to check that the application is available on the Internet.

2. Database
The database stores important data for reporting, so you need to check that database backups are created successfully.

3. Server
The server means the hardware on which the applications run. It is necessary to check the status of HDD, RAM, CPU.

4. Agent
This is a Windows service that performs many different scheduled tasks. At a minimum, you need to verify that the service is running.

5. Agent task
Just knowing that the agent is working is not enough. An agent can work, but not perform the tasks assigned to it. Let's break the agent component into tasks and check whether each task of the agent works successfully.

6. Road control points (container of all MPCs)
There are many road control points, so let's combine all MPCs in one component. This will make it easier to read monitoring data. When viewing the status of the “ASMO system” component, it will immediately be clear where the problems are: in the applications, in the hardware, or at the MPCs.

7. Road control point (one MPC)
This component will be considered serviceable if all devices on this MPC are serviceable.

8. Device
This is a video camera or weather station that is installed on the MPC. You need to check that the device is working properly.

In the monitoring system, the component tree will look like this:

[Figure: component tree of the ASMO system in the monitoring system]

Web application monitoring

So, we have divided the system into components, now we need to come up with checks for each component.

To monitor a web application, we use the following checks:

1. Checking that the main page opens
This check is performed by the monitoring system. To run it, we specify the page address, the expected response fragment and the maximum request execution time.

2. Checking the domain expiration date
A very important check. When a domain is not paid for on time, users cannot open the site. It may take several days to solve the problem, because DNS changes are not applied immediately.

3. Checking the SSL certificate
Almost all sites now use HTTPS. For the protocol to work correctly, you need a valid SSL certificate.
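
In our case the monitoring system runs these checks itself, so no code is needed. To show what the first check does with its three parameters (page address, expected fragment, maximum time), here is a rough sketch using only the Python standard library. The function names are made up for the example, and a real monitoring system may apply different rules, for instance for redirects or status codes.

```python
import time
import urllib.error
import urllib.request


def evaluate_page(status_code: int, body: str, elapsed: float,
                  expected_fragment: str, max_seconds: float) -> tuple:
    """Apply the three criteria: the page opened, contains the fragment, and was fast enough."""
    if status_code != 200:
        return False, f"unexpected HTTP status {status_code}"
    if expected_fragment not in body:
        return False, "expected fragment not found"
    if elapsed > max_seconds:
        return False, f"response took {elapsed:.1f}s, limit is {max_seconds:.1f}s"
    return True, "ok"


def check_page(url: str, expected_fragment: str, max_seconds: float) -> tuple:
    """Fetch the page and evaluate it; any network error counts as a failure."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds) as response:
            body = response.read().decode("utf-8", errors="replace")
            status_code = response.status
    except urllib.error.HTTPError as exc:
        return False, f"unexpected HTTP status {exc.code}"
    except OSError as exc:  # URLError, timeouts and connection errors are OSError subclasses
        return False, f"request failed: {exc}"
    return evaluate_page(status_code, body, time.monotonic() - started,
                         expected_fragment, max_seconds)
```

A call such as `check_page("https://example.com/", "Example Domain", 5.0)` returns a pair of a success flag and a short message that could be attached to the check result.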

Below is the "Personal account" component in the monitoring system:

[Figure: the "Personal account" component in the monitoring system]

All of the checks above will work for most applications and don't require any coding. This is great because you can start monitoring any web application in 5 minutes. The following are additional checks that can be performed on a web application, but their implementation is more complex and application-specific, so we won't cover them in this article.

What else can be checked?

To more fully monitor a web application, you can perform the following checks:

  • Number of JavaScript errors per period
  • Number of back-end errors in the web application per period
  • Number of unsuccessful web application responses (response codes 404, 500, etc.)
  • Average request execution time

Windows service (agent) monitoring

In the ASMO system, the agent acts as a task scheduler that performs tasks in the background according to a schedule.

If all of the agent's tasks complete successfully, then the agent is working properly. It turns out that in order to monitor an agent, you need to monitor its tasks. Therefore, we break the "Agent" component into tasks. Let's create a separate component for each task in the monitoring system, where the "Agent" component will be the "parent".

We split the "Agent" component into child components (tasks):

[Figure: the "Agent" component with its tasks as child components]

So, we have broken a complex component into several simple ones. Now we need to come up with checks for each simple component. Please note that the parent component "Agent" will not have any checks, because the monitoring system will calculate its state independently from the status of its child components. In other words, if all tasks are completed successfully, then the agent is also working successfully.

There are more than a hundred tasks in the ASMO system. Is it really necessary to come up with unique checks for each task? Of course, control will be better if we devise and implement special checks for each agent task, but in most cases universal checks are enough.

The ASMO system uses only universal checks for tasks, and this is enough to monitor the system's performance.

Execution check
The simplest and most effective check is the execution check. It verifies that the task runs, and runs without errors. All tasks have this check.

Check algorithm

After each run, the task sends a check result to the monitoring system: SUCCESS if the run succeeded, or ERROR if it ended with an error.
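
As a sketch, the algorithm fits into one small wrapper around the task. The `send` callback stands in for whatever client your monitoring system provides; its signature here is an assumption made for the example, not a real API.

```python
from typing import Callable

# Hypothetical sender: (check name, "SUCCESS" or "ERROR", message) -> None
SendResult = Callable[[str, str, str], None]


def run_with_execution_check(task_name: str, task: Callable[[], None], send: SendResult) -> None:
    """Run one task and report its outcome to the monitoring system."""
    try:
        task()
    except Exception as exc:
        # Report the failure instead of re-raising, so the scheduler keeps running other tasks.
        send(task_name, "ERROR", f"{type(exc).__name__}: {exc}")
    else:
        send(task_name, "SUCCESS", "")
```

Because the wrapper knows nothing about the task itself, the same few lines can serve every task in the scheduler, which is what makes the check universal.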

This check detects the following issues:

  1. The task runs but fails.
  2. The task stopped running, for example, it hung.

Let's consider how these problems are solved in more detail.

Problem 1 – Task runs but fails
Below is a case where the task is running but fails between 14:00 and 16:00.

[Figure: execution check timeline with the task failing between 14:00 and 16:00]

The figure shows that when a task ends with an error, a signal is immediately sent to the monitoring system and the status of the corresponding check in the monitoring system becomes alarm.

Please note that in the monitoring system, the status of the component also depends on the status of the check. The alarm status of the check will set all upstream components to alarm, see the figure below.

[Figure: the alarm status of a check propagating to its parent components]

Problem 2 – Task stopped running (hung)
How will the monitoring system understand that the task has hung?

The check result has a validity time, for example, 1 hour. If an hour passes and there is no new check result, the monitoring system will set the status of the check to alarm.

[Figure: execution check timeline with a power outage between 14:00 and 16:00]

In the picture above, the power went out at 14:00. At 15:00, the monitoring system detects that the check result from 14:00 is stale, because its validity time (one hour) has expired and no new result has arrived, and switches the check to the alarm status.

At 16:00 the power came back. The program runs the task and sends the result to the monitoring system, and the check status becomes success again.
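
The expiry logic on the monitoring system side can be pictured roughly like this. It is a simplified model with invented names (`CheckResult`, `effective_status`), not the implementation of any real product, and the date is arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    status: str            # "SUCCESS" or "ERROR", as sent by the task
    received_at: datetime
    validity: timedelta    # how long this result stays relevant


def effective_status(last: CheckResult, now: datetime) -> str:
    """Return the status the monitoring system would show at the moment `now`."""
    if now - last.received_at >= last.validity:
        return "ALARM"  # the result has expired and nothing newer has arrived
    return "ALARM" if last.status == "ERROR" else "SUCCESS"


# The scenario from the text: last result at 14:00, validity time of one hour.
last = CheckResult("SUCCESS", datetime(2024, 1, 1, 14, 0), timedelta(hours=1))
```

With this model the check is still green at 14:59 and turns red at 15:00, without the task having sent anything at all. A fresh SUCCESS result at 16:00 replaces `last` and the check turns green again.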

What validity time to use?

The validity time must be greater than the task execution period. I recommend setting the validity time 2-3 times longer than the task period. This avoids false notifications when, for example, a task takes longer than usual or someone restarts the program.

Progress check

The ASMO system has a Forecast Loading task that tries to load a new forecast from an external source once an hour. The exact time when a new forecast appears in the external system is not known, but it is known that this happens twice a day. So if there is no new forecast for several hours, this is normal, but if there is no new forecast for more than a day, something has broken somewhere. For example, the data format of the external forecast system may change, and ASMO will then not see the new forecast release.

Check algorithm

The task sends a SUCCESS check result to the monitoring system when it makes progress (downloads a new weather forecast). If there is no progress or an error occurs, nothing is sent to the monitoring system.

The validity time of the check should be long enough that new progress is guaranteed to arrive within it.


Please note that we will learn about the problem with a delay, because the monitoring system is waiting for the last check result to expire. Therefore, the validity time of the check should not be made too large.
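
Here is a sketch of one hourly run of such a task. The helper names (`fetch_latest_id`, `send_success`) and the idea of comparing release identifiers are assumptions made for the example; the real task may detect a new forecast in some other way.

```python
from typing import Callable, Optional


def load_forecast_task(fetch_latest_id: Callable[[], str],
                       last_seen_id: Optional[str],
                       send_success: Callable[[], None]) -> Optional[str]:
    """One hourly run: report SUCCESS only when a new forecast release appears."""
    try:
        latest_id = fetch_latest_id()
    except Exception:
        return last_seen_id          # error: send nothing, let the last result expire
    if latest_id == last_seen_id:
        return last_seen_id          # no progress: send nothing
    send_success()                   # progress: a new forecast has been loaded
    return latest_id
```

Note that the task never sends ERROR. Silence is the signal: if nothing new arrives within the validity time, the monitoring system raises the alarm on its own.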

Database monitoring

To control the database in the ASMO system, we perform the following checks:

  1. Checking for backups
  2. Checking free disk space

Checking for backups
In most applications, it is important to have up-to-date database backups so that in the event of a server failure, you can deploy the program to a new server.

ASMO creates a backup copy once a week and sends it to the storage. When this procedure completes successfully, a success check result is sent to the monitoring system. The result has a validity time of 9 days. That is, backup creation is controlled with the “progress check” mechanism that we discussed above.

Checking free disk space
If there is not enough free space on the disk, the database will not be able to work properly, so it is important to control the amount of free space.

It is convenient to use metrics to check numerical parameters.

A metric is a numeric variable whose value is passed to the monitoring system. The monitoring system checks the thresholds and calculates the status of the metric.
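
To make the division of labor concrete: the application only measures and sends the number, while the thresholds live in the monitoring system. The sketch below models both sides. The function names and the threshold values in the usage note are illustrative assumptions; the two levels, WARNING and ALARM, are the ones we use for free disk space on the server.

```python
import shutil


def free_disk_gb(path: str) -> float:
    """Application side: free space on the disk that holds `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1024 ** 3


def metric_status(value: float, warning_below: float, alarm_below: float) -> str:
    """Monitoring side: status of a metric where lower values are worse."""
    if value < alarm_below:
        return "ALARM"
    if value < warning_below:
        return "WARNING"
    return "SUCCESS"
```

For example, with hypothetical thresholds of 10 GB for WARNING and 5 GB for ALARM, `metric_status(free_disk_gb("."), 10, 5)` gives the current status of the disk that holds the working directory.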

Below is a picture of what the "Database" component looks like in the monitoring system:

[Figure: the "Database" component in the monitoring system]

Server Monitoring

To monitor the server, we use the following checks and metrics:

1. Free disk space
If the disk space runs out, the application will not be able to work. We use 2 threshold values: the first level is WARNING, the second level is ALARM.

2. Average RAM usage (%) per hour
We use an hourly average as we are not interested in rare jumps.

3. Average CPU usage (%) per hour
We use an hourly average as we are not interested in rare jumps.

4. Ping check
Checks that the server is online. This check can be performed by the monitoring system, so you do not need to write code.

Below is a picture of what the “Server” component looks like in the monitoring system:
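
The hourly average used for RAM and CPU can be kept with a small sliding window. This is one possible way to do it, sketched with an invented `HourlyAverage` class; how the samples are collected and how often is left out and would depend on your environment.

```python
from collections import deque
from datetime import datetime, timedelta


class HourlyAverage:
    """Keeps the samples from the last hour and reports their mean."""

    def __init__(self, window: timedelta = timedelta(hours=1)) -> None:
        self.window = window
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, at: datetime, value: float) -> None:
        self.samples.append((at, value))
        # Drop samples that have fallen out of the window.
        while at - self.samples[0][0] >= self.window:
            self.samples.popleft()

    def value(self) -> float:
        """Mean of the samples in the window; call only after at least one add()."""
        return sum(v for _, v in self.samples) / len(self.samples)
```

A single spike to 90% between two quiet samples of 10% and 20% gives an average of 40%, so one short jump does not push the metric over a high threshold.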

[Figure: the “Server” component in the monitoring system]

Equipment monitoring

First, a few words about how the data is received. For each road control point (MPC) there is a task in the task scheduler, for example, “Poll MPC M2 km 200”. The task receives data from all devices of the MPC every 30 minutes.

Communication channel problem
Most of the equipment is located outside the city, and data is transmitted over a GSM network, which is unstable (that is, the network is often unavailable).

Due to frequent network failures, at first, the MPC poll check in monitoring looked like this:

[Figure: status of the MPC poll check with frequent alarms caused by network failures]

It became clear that this was not a working option, because there were a lot of false notifications about problems. So we decided to use a “progress check” for each device, i.e. only a success signal is sent to the monitoring system, when the device is polled without errors. The validity time was set to 5 hours.


Now monitoring sends a notification about problems only when a device cannot be polled for more than 5 hours. With a high degree of probability, these are not false alarms, but real problems.

Below is a picture of what the equipment looks like in the monitoring system:

[Figure: the equipment components in the monitoring system]

Important!
When the GSM network stops working, none of the devices at the MPC can be polled. To reduce the number of emails from the monitoring system, our engineers subscribe to problem notifications for components of type "MPC" rather than "Device". This way you receive one notification per MPC instead of a separate notification for each device.
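
The effect of subscribing by component type can be shown with a toy filter. The `StatusChange` record and the sample outage are invented for the illustration; a real monitoring system has its own subscription settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusChange:
    component: str
    component_type: str   # for example "MPC" or "Device"
    status: str


def notifications_for(changes: list, subscribed_types: set) -> list:
    """Keep only alarms for the component types the engineer subscribed to."""
    return [c for c in changes
            if c.status == "ALARM" and c.component_type in subscribed_types]


# Hypothetical outage: the link to one MPC is down, so its devices and the MPC itself turn red.
changes = [
    StatusChange("Camera", "Device", "ALARM"),
    StatusChange("Weather station", "Device", "ALARM"),
    StatusChange("MPC M2 km 200", "MPC", "ALARM"),
]
```

A subscription to `{"MPC"}` produces one notification for this outage, while a subscription to `{"Device"}` produces two, and the gap grows with every device installed at the point.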

Final ASMO monitoring scheme

Let's put everything together and see what kind of monitoring scheme we have.

[Figure: final ASMO monitoring scheme]

Conclusion

Let's summarize.
What did ASMO health monitoring give us?

1. Reduced time to fix defects
We used to hear about bugs from users, but not all users report bugs. Sometimes we learned about a malfunction in some component of the system only a week after it appeared. Now the monitoring system notifies us as soon as a problem is detected.

2. Increased system stability
Since defects are now eliminated earlier, the system as a whole has become much more stable.

3. Fewer calls to technical support
Many issues are now fixed before users become aware of them. Users began to contact technical support less. All this is good for our reputation.

4. Increased customer and user loyalty
The customer noticed positive changes in the stability of the system. Users face fewer problems in working with the system.

5. Reduced technical support costs
We have stopped doing any manual checks; all checks are now automated. Previously, we learned about problems from users, and it was often difficult to understand what problem the user was talking about. Now the monitoring system reports most of the problems, and its notifications contain technical data that make it clear what is broken and where.

Important!
Do not install the monitoring system on the same server where your applications are running. If the server goes down, the applications will stop working, and there will be no one left to send a notification about it.

The monitoring system must run on a separate server in another data center.

If you do not want to use a dedicated server in a new data center, you can use a cloud monitoring system. Our company uses the Zidium cloud monitoring system, but you can use any other monitoring system. The cost of a cloud monitoring system is lower than renting a new server.

Recommendations:

  1. Break applications and systems down into a tree of components in as much detail as possible. This makes it easier to see what is broken and where, and the control is more complete.
  2. To check the health of a component, use checks. It is better to use many simple checks than one complex one.
  3. Set the threshold values of the metrics on the side of the monitoring system, and do not write them in the code. This saves you from recompiling, reconfiguring or restarting the application.
  4. For custom checks, set the validity time with a margin so that you don't get false notifications because some check took a little longer than usual.
  5. Try to make the components in the monitoring system turn red only when there is definitely a problem. If they turn red for nothing, you will stop paying attention to the notifications, and the monitoring system will lose its purpose.

If you are not yet using a monitoring system, get started! It's not as difficult as it seems. Enjoy looking at the green tree of components that you have grown yourself.

Good luck.

Source: habr.com
