CASE Method: Humane Monitoring

Riiiiing! It's 3 o'clock in the morning, you're in the middle of a wonderful dream, and suddenly the phone rings. You're on call this week, and something must have happened. An automated system is calling so that you can figure out what's wrong. This is an important part of running modern computer systems, but let's look at how to make alerts more humane for the people who receive them.

Meet a philosophy of monitoring born over several decades of on-call duty on various monitoring teams. It was heavily influenced by the veritable bible on the subject, Rob Ewaschuk's My Philosophy on Alerting, which was included in Google's SRE book, and by John Allspaw's Considerations for Alert Design.

Thanks to Kelly Dunn, Arijit Mukherji, and Maxime Petazzoni for helping me edit this post.

What is CASE?

I decided to come up with a catchy acronym, like Brendan Gregg's USE Method or Tom Wilkie's RED Method. I call it the CASE method. It describes four things to pay attention to when working with automated alerting:

  • Context-heavy
  • Actionable
  • Symptom-based
  • Evaluated

Using CASE means that you treat alerts with healthy skepticism and avoid waking people up at night. Alerts should be regularly evaluated for usefulness and effectiveness. When a person does get alerted, they will have a better mental model and more confidence.

To make it easier to remember, imagine that you need to make a CASE to justify each alert. 😎

Why bother with all this?

Being on call can be torment, for many reasons, and CASE won't eliminate them all. But with it, when you are woken up at night, it will at least be by better alerts. The method also covers various organizational processes that help with this.

The beauty of the RED and USE methods is that they give us not only a way of working but also a shared language. I hope the CASE method will make it easier to discuss the alerts that protect our systems but haunt our colleagues.

The bottom line is that we need to build an organizational culture in which alerts are treated with healthy skepticism. An alert may be created for a specific case, but there is no guarantee that it will keep its value later. Why did we set up this alert? When were its criteria last reviewed? With CASE, these questions can be answered.

Context-heavy

3 a.m. is not the best time to read messages full of buzzwords. To respond effectively, you need information. Ideally, it is information about a specific problem whose context is immediately clear, and alerts should be configured to make that possible. These are the "observe" and "orient" steps of the OODA loop. The time spent on this setup is well spent, because repeatedly interrupting a person costs even more. Let's respect each other.

Problems have many sources. Especially ghosts.

How can we help the on-call engineer? The alert is the first thing they see, so they build all their hypotheses on it. Next they turn to runbooks and dashboards, but do those always have data about this specific alert, and not just general information? Allspaw advises us to "think about how you can interpret or respond to a notification" (slide 29) [1]. A good alert is designed around the person on call, not just configured with a threshold.

So here are some ideas for improving the context of alerts:

  • Show the responder something useful and purpose-built, not just the usual runbook or dashboard. In the past, my teams and I have used investigative dashboards built for specific alerts. These help when the problem is a known one and only confuse otherwise, so you need to find a balance.
  • Show the alert's history: is it new? Does it fire often? Is it seasonal?
  • Show recent changes in the state of the system. Has anything changed recently (for example, a deploy or a feature being switched on or off)?
  • Show relationships and feed the mental model: system dependencies should be clearly visible, preferably with an indication of their health.
  • Quickly connect the responder with the team: can they see current incidents or find out who else in the company received the alert? Has the incident management program been activated?
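
As a rough illustration of the ideas above, here is a minimal Python sketch of an alert body that carries its own context. Everything in it is an assumption made for the example: the `AlertContext` class, its field names, the 30-day history window, and the example.com URL are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlertContext:
    """Context attached to a single alert (all field names are hypothetical)."""
    title: str
    investigation_dashboard: str   # purpose-built for this alert, not a generic one
    fired_last_30_days: int        # alert history: is it new, or does it fire often?
    recent_changes: List[str] = field(default_factory=list)     # deploys, feature switches
    dependencies: Dict[str, str] = field(default_factory=dict)  # dependency name -> health
    open_incidents: List[str] = field(default_factory=list)     # who else is already on it


def render(ctx: AlertContext) -> str:
    """Render the context as a plain-text alert body."""
    if ctx.fired_last_30_days == 0:
        history = "new alert"
    else:
        history = f"fired {ctx.fired_last_30_days}x in the last 30 days"
    lines = [
        ctx.title,
        f"History: {history}",
        f"Dashboard: {ctx.investigation_dashboard}",
        "Recent changes: " + (", ".join(ctx.recent_changes) or "none recorded"),
    ]
    for name, health in sorted(ctx.dependencies.items()):
        lines.append(f"Dependency {name}: {health}")
    lines.append("Open incidents: " + (", ".join(ctx.open_incidents) or "none"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(AlertContext(
        title="Checkout error ratio above SLO",
        investigation_dashboard="https://dashboards.example.com/checkout-errors",
        fired_last_30_days=0,
        recent_changes=["deploy checkout v42"],
        dependencies={"payments-db": "degraded"},
    )))
```

The point of the sketch is only that history, recent changes, dependency health, and current incidents travel with the alert, so the person woken up does not have to go looking for them.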

Ideally, the incident management program feeds suggestions for improving alert context back from incident investigations. There is always something to work on!

Actionable

Does the on-call engineer need to do something in response to the alert? If nothing needs to be done, or it is not clear what to do, why were they woken up? Avoid alerts that page the person on call but require no action.


What to do? What do you want?

In the past, when systems were simple and teams were small, we set up monitoring just to stay informed. An alert that heap usage has grown gives us context if the service later misbehaves. At scale, such alerts only confuse, because our systems are always running in some degraded state of varying severity. This quickly leads to alert fatigue and, of course, desensitization. The on-call engineer then ignores or even filters such alerts and does not always respond to them properly. Don't fall into this trap! Don't alert on everything only to route it all into some godforsaken mail folder.

Here is what an actionable alert looks like:

  • The alert requires action, not just awareness of some news.
  • The action is difficult or risky to automate. If the action can be automated, go ahead and automate it, and stop pestering people!
  • The alert conveys its urgency in the form of a service level agreement (SLA) or a recovery time objective (RTO). The on-call engineer can then invoke the organization's incident management program.
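
The three criteria above can be read as a small decision procedure. The following Python sketch is one possible encoding, not a prescribed implementation: the `AlertProposal` and `Route` names are hypothetical, and the `TICKET` outcome for alerts that need action but carry no stated urgency is an assumption added for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the on-call engineer"
    AUTOMATE = "automate the response"
    TICKET = "file a ticket for working hours"  # assumption, not from the article
    DROP = "do not alert"


@dataclass
class AlertProposal:
    """A proposed alert, described by the three actionability criteria."""
    requires_action: bool    # does a human have to do something?
    safe_to_automate: bool   # could a machine do it without undue risk?
    urgency_defined: bool    # is an SLA or RTO attached?


def route(proposal: AlertProposal) -> Route:
    """Decide what to do with a proposed alert."""
    if not proposal.requires_action:
        return Route.DROP        # news, not an alert
    if proposal.safe_to_automate:
        return Route.AUTOMATE    # stop pestering people
    if not proposal.urgency_defined:
        return Route.TICKET      # action is needed, but nothing says it is needed now
    return Route.PAGE


if __name__ == "__main__":
    print(route(AlertProposal(requires_action=True,
                              safe_to_automate=False,
                              urgency_defined=True)).value)
```

Only a proposal that needs a human, cannot safely be automated, and has a stated urgency ends up paging someone.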

To be clear: I'm not saying that you should alert only on the most important SLOs (service-level objectives) for your API. SLO monitoring subdivides again and again, and it calls for the same approach at every service. Of course you will track the most important SLOs for the customers who pay you. But the SLOs of infrastructure, such as databases, need tracking too. Before long you have internal customers to support as well. And so on, ad infinitum.

Symptom-based

Like it or not, you are working on a distributed system (Cavage) [2]. As a result, you use various tactics to isolate services and protect them from failures (Treynor et al.) [3]. And although a long garbage collection pause or a slow database query does indicate a problem, you need not rush to fix it if users are not going to be affected any time soon.

These are important signals, and they may well be actionable, but if they are not hurting users, they are not urgent enough to interrupt the on-call engineer. Cause-based alerts are snapshots of our mental models of how a system fails. It is better to track the important symptoms than to try to enumerate every possible cause of a failure.

For alerts to be actionable, focus on the performance indicators that matter to users. Ewaschuk calls this "monitoring for users." Remember that this philosophy has to be applied throughout the organization. If a service has urgent problems somewhere deep in the infrastructure, the responsible team will take care of them. Protecting systems from such failures is a separate topic altogether (Treynor et al., section on strategies for minimizing critical dependencies) [3].
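
To make the contrast with cause-based alerting concrete, here is a small Python sketch that looks only at what users experience, namely failed and slow requests, and ignores internal causes such as garbage collection pauses. The `Request` type, the thresholds, and the 500 ms definition of "slow" are assumptions chosen for illustration; real values would come from your own SLOs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Request:
    """One user-facing request observed in the evaluation window."""
    ok: bool
    latency_ms: float


def user_facing_symptoms(
    requests: Iterable[Request],
    max_error_ratio: float = 0.01,  # assumed SLO: at most 1% failed requests
    max_slow_ratio: float = 0.05,   # assumed SLO: at most 5% slow requests
    slow_ms: float = 500.0,         # assumed definition of "slow"
) -> List[str]:
    """Return the symptoms users are seeing; an empty list means no alert."""
    reqs = list(requests)
    if not reqs:
        return []  # no traffic in the window, nothing to judge
    total = len(reqs)
    error_ratio = sum(1 for r in reqs if not r.ok) / total
    slow_ratio = sum(1 for r in reqs if r.latency_ms > slow_ms) / total
    symptoms = []
    if error_ratio > max_error_ratio:
        symptoms.append(f"error ratio {error_ratio:.1%} exceeds {max_error_ratio:.1%}")
    if slow_ratio > max_slow_ratio:
        symptoms.append(f"slow-request ratio {slow_ratio:.1%} exceeds {max_slow_ratio:.1%}")
    return symptoms


if __name__ == "__main__":
    window = [Request(ok=True, latency_ms=120.0)] * 95 + [Request(ok=False, latency_ms=80.0)] * 5
    print(user_facing_symptoms(window))
```

A slow database query that never pushes either ratio over its threshold produces no symptom here, which is exactly the case the text says should not wake anyone up.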

Symptoms change less often

Richard Cook reminds us that complex systems are full of flaws, shortcomings, and problems [4]. Trying to enumerate every possible cause is a Sisyphean task: you try to describe the problems, and they keep changing. Cindy Sridharan argues that "systems do not have to be in perfect condition every second" and that a more humane approach is better (Distributed Systems Observability, 7) [5].

Avoid post-incident alerts

Cause-based alerts are typically set up as remediation after an incident. These narrow, after-the-fact alerts create a false sense of confidence, because the system comes up with a new way to break every time.

Don't be seduced by cause-based alerts. Ask instead:

  • Why didn't a symptom-based alert catch the problem?
  • Would better context have helped the responder?
  • How can the monitoring tools be improved to speed up diagnosis, rather than piling up alerts about what has already happened?

Diagnostic monitoring tools help only if you treat them as a way to get from symptom to resolution. Without that feedback loop, you will simply be buried in belated alerts and diagrams of past failures, with not a word about future ones. For the organization, this is a great opportunity to move from defense to offense, and developers and product managers will share the same expectations and clear goals. The case, or rather the CASE 😉, for each alert is clear.

Cause-based alerts are tolerable in moderation

Sometimes our systems leave us little or no choice but to alert on causes. Sometimes the people on call know for certain that a particular condition will inevitably lead to a failure, which makes the alert actionable. Or maybe you are just not sure what is going on and set up an alert to be on the safe side. Hopefully this is a temporary measure, needed only until the system is changed to fix the underlying issue.

Keep the other components of CASE in mind in these situations. Just because something is temporary doesn't mean you can stop thinking.

Evaluated

Any change to a system (new code, new infrastructure, new anything) expands the range of possible failures (Cook, 3) [4]. Is this alert still working as intended? Clear, up-to-date mental models of our systems and experience in responding to alerts support a proactive approach and are key traits of a learning-oriented organization. The flaws in our systems are constantly evolving, and we have to keep up with them.

You need to continually evaluate the quality of each alert to make sure it works as intended. Leaders: it will be much easier for your teams if you help them get this process up and running! Here are some ideas for evaluation:

  • Use chaos engineering, game days, or other methods of testing alerts. The team can do this on its own, without a heavyweight incident management system!
  • Make collecting data on every alert involved in an incident part of the incident management program. Mark alerts as useful, harmful, irrelevant, confusing, and so on, and use that as feedback.
  • Good alerts fire infrequently and should be tested carefully. Make sure all links work, point to the right context, and so on.
  • If an alert never fires or fires too often, something is wrong with it. Fix it or delete it. Beware of both excessive silence and excessive noise!
  • Give each alert an expiration date. Once the date has passed, evaluate the alert against CASE and set a new date. Check expiration dates regularly, just as you would with food.
  • Make alerts easy to improve. Use monitoring as code and keep alert definitions in a Git repository. Pull requests help engage the team and give you a history of past alerts. You will no longer be afraid to change alerts or feel you have to ask their owners for permission.
  • Collect feedback on alerts, even if it is just a simple Google Form where on-call engineers can flag alerts as useless or intrusive. Embed a link or call to action in the alert itself and review the feedback regularly.
  • Make it a team rule that on-call engineers work on making on-call easier whenever the workload is light. Leave things a little better than you found them.
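
Two of the ideas above, expiration dates and watching for alerts that never fire or fire too often, lend themselves to a periodic automated check. The Python sketch below shows one way such a check might look. The `AlertRecord` fields, the 90-day window, and the limit of 20 firings are assumptions for the example, not recommended values.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AlertRecord:
    """Bookkeeping for one alert definition (field names are hypothetical)."""
    name: str
    review_by: date           # the alert's "expiration date"
    fires_last_90_days: int   # how often it fired in the review window


def review_reasons(alert: AlertRecord, today: date, max_fires: int = 20) -> List[str]:
    """Return the reasons this alert should be re-evaluated against CASE."""
    reasons = []
    if today > alert.review_by:
        reasons.append("expiration date has passed")
    if alert.fires_last_90_days == 0:
        reasons.append("never fired in the last 90 days")
    elif alert.fires_last_90_days > max_fires:
        reasons.append("fires too often")
    return reasons


if __name__ == "__main__":
    stale = AlertRecord("disk-full", review_by=date(2024, 1, 1), fires_last_90_days=0)
    for reason in review_reasons(stale, today=date(2024, 6, 1)):
        print(f"{stale.name}: {reason}")
```

If alert definitions already live in a Git repository, a check like this could run in CI and open an issue for each alert that comes back with a non-empty list.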

Conclusion

I believe the CASE method helps developers and organizations talk about how automated alerts are set up and sent. A single developer can start evaluating alerts with the CASE method, and then the whole organization can join in, with other developers, management, and the incident management program all helping to keep alerts in good shape. It requires no special tools or complicated processes.

Our whole industry needs to think about the human factors of being on call, without sacrificing first-class service to customers. All of these tools and practices can and should be improved. I hope the CASE method will help with that.

Enjoy your improved alerts!

Source: habr.com
