Transcription of the webinar "SRE - hype or the future?"

The webinar has poor audio, so we've transcribed it.

My name is Eduard Medvedev. Today I will talk about what SRE is, how SRE appeared, what the work criteria for SRE engineers are, a little about reliability criteria, and a little about monitoring. We will only skim the surface, because you can't cover much in an hour, but I will point you to materials for further study, and we are all waiting for you at Slurm SRE in Moscow at the end of January.

First, let's talk about what SRE, Site Reliability Engineering, is, and how it appeared as a separate position, as a separate direction. It all started with the fact that in traditional development circles Dev and Ops are two completely different teams, usually with two completely different goals. The goal of the development team is to roll out new features and meet the needs of the business. The goal of the Ops team is to make sure everything works and nothing breaks. Obviously, these goals directly contradict each other: for everything to work and nothing to break, you should roll out new features as rarely as possible. Because of this there are many internal conflicts, which the methodology now called DevOps tries to resolve.

The problem is that we have no clear definition of DevOps and no clear implementation of DevOps. I spoke at a conference in Yekaterinburg two years ago, and to this day the DevOps track opens with a talk titled "What is DevOps". In 2017, DevOps was almost 10 years old, and we were still arguing about what it is. That is a very strange situation, which Google tried to solve a few years ago.

In 2016, Google released a book called Site Reliability Engineering. And in fact, it was with this book that the SRE movement began. SRE is a specific implementation of the DevOps paradigm in a specific company. SRE engineers are committed to ensuring that systems operate reliably. They mostly come from developers, sometimes administrators with a strong development background. And they do what system administrators used to do, but a strong background in development and knowledge of the system in terms of code leads to the fact that these people are not inclined to routine administrative work, but are inclined to automation.

It turns out that the DevOps paradigm in SRE teams is implemented by having SRE engineers who solve structural problems. Here it is, that very connection between Dev and Ops that people have been talking about for 8 years. The role of an SRE is similar to that of an architect in that newcomers do not become SREs. People at the beginning of their careers do not yet have the experience or the necessary breadth of knowledge, because SRE requires very detailed knowledge of what exactly can go wrong and when. Therefore, some experience is needed here, as a rule, both inside the company and outside.

I am asked whether the difference between SRE and DevOps will be described. It has just been described. We can also talk about the place of SRE in the organization. Unlike the classic DevOps approach, where Ops is still a separate department, SRE is part of the development team. They are involved in product development. There is even an approach where SRE is a role that passes from one developer to another. They participate in code reviews in the same way as, for example, UX designers, the developers themselves, and sometimes product managers. SREs work at the same level: their approval is needed, their review is needed, so that for each deployment the SRE says: "Okay, this deployment, this product will not negatively affect reliability. And if it does, then only within some acceptable limits." We will also talk about this.

Accordingly, the SRE has veto power over code changes. And in general, this can also lead to some small conflicts if SRE is implemented incorrectly. The same book on Site Reliability Engineering has many chapters, not just one, that tell how to avoid these conflicts.

I am asked how SRE relates to information security. SRE is not directly involved in information security. In large companies this is usually handled by dedicated specialists, testers, and analysts. But SRE interacts with them in the sense that some operations, commits, and deployments that affect security can also affect the availability of the product. So SRE as a whole interacts with any team, including security teams and analysts. SREs are mainly needed when a company tries to implement DevOps but the burden on developers becomes too large. That is, the development team itself can no longer cope with also being responsible for Ops, so a separate role appears. This role is planned in the budget; sometimes it is planned into the team size and a separate person is hired, and sometimes one of the developers takes it on. This is how the first SRE appears in a team.

The complexity that SRE deals with, the complexity that affects operational reliability, can be essential or accidental. Essential complexity is when the complexity of a product grows to the extent required by new product features. Accidental complexity is when the complexity of the system grows without product features and business requirements directly requiring it. Either the developer made a mistake somewhere, or the algorithm is not optimal, or something extra is introduced that increases the complexity of the product without real need. A good SRE should always cut this off. That is, any commit, any deployment, any pull request where complexity is increased accidentally should be blocked.

The question is why not just hire a system administrator with a lot of knowledge into the team. A developer in the role of an operations engineer, we are told, is not the best staffing decision. A developer in that role is indeed not always the best staffing decision, but the point is that a developer doing Ops has a bit more drive for automation and a bit more knowledge and skill to implement that automation. Accordingly, we reduce not only the time spent on specific operations and on routine, but also such an important business parameter as MTTR (Mean Time To Recovery). Thus, and we will talk about this a little later, we save money for the organization.

Now let's talk about the criteria for the work of SRE, and first of all about reliability. In small companies and startups it often happens that people assume that if the service is written well, if the product is written well and correctly, it will work and it will not break. That's it, we write good code, so there is nothing to break. The code is very simple, there is nothing to break. These are roughly the same people who say we don't need tests, because, look, it's just three API methods, what is there to break?

This is all wrong, of course. And in practice these people very often get bitten by such code, because things break. Things break sometimes in the most unpredictable ways. Sometimes people say: no, it will never happen. And it happens all the time; it happens often enough. And that is why no one ever strives for 100% availability, because 100% availability never happens. This is the norm. And therefore, when we talk about the availability of a service, we always talk about nines: 2 nines, 3 nines, 4 nines, 5 nines. If we translate this into downtime, then 5 nines, for example, is a little more than 5 minutes of downtime per year, and 2 nines is 3.5 days of downtime.
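To make the nines concrete, here is a minimal Python sketch (not from the webinar; the year is approximated as 365 days) that converts an availability target into allowed downtime:

```python
# Convert an availability target ("nines") into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_per_year(availability: float) -> float:
    """Allowed downtime in minutes per year for a given availability, e.g. 0.999."""
    return (1 - availability) * MINUTES_PER_YEAR

for name, availability in [("2 nines", 0.99), ("3 nines", 0.999),
                           ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    minutes = downtime_per_year(availability)
    print(f"{name} ({availability:.5f}): {minutes:8.1f} min/year "
          f"(~{minutes / 60 / 24:.2f} days)")
# 2 nines -> about 3.65 days per year, 5 nines -> a little over 5 minutes per year
```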

But it is obvious that at some point there is a decrease in ROI, return on investment. Going from two nines to three reduces downtime by more than 3 days a year. Going from four nines to five reduces downtime by 47 minutes per year. And it turns out that for the business this may not be critical. In general, the required reliability is not primarily a technical question; it is a business question, a product question. What level of downtime is acceptable for the users of the product, what do they expect, how much do they pay, how much money do they lose, how much money does the system lose.

An important question here is what is the reliability of the remaining components. Because the difference between 4 and 5 nines will not be visible on a smartphone with 2 nines of reliability. Roughly speaking, if something breaks on a smartphone in your service 10 times a year, most likely 8 times the breakdown occurred on the OS side. The user is used to this, and will not pay attention to one more time a year. It is necessary to correlate the price of increasing reliability and increasing profits.
The SRE book has a good example of going from 3 nines to 4 nines. It turns out that the increase in availability is a little less than 0.1%. And if the revenue of the service is $1 million a year, then the increase in revenue is $900. If it costs us less than $900 a year to add that nine of availability, the increase makes financial sense. If it costs more than $900 a year, it no longer makes sense, because the increase in revenue simply does not compensate for the labor and resource costs. And 3 nines will be enough for us.
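The same arithmetic as a small sketch; the revenue figure comes from the example above, while the engineering cost is a made-up number:

```python
# The book's example: is going from 3 nines to 4 nines worth it?
annual_revenue = 1_000_000          # service revenue from the example, $/year
availability_gain = 0.9999 - 0.999  # = 0.0009, a little less than 0.1%

revenue_gain = annual_revenue * availability_gain  # $900/year
cost_of_extra_nine = 1200                          # hypothetical engineering cost, $/year

print(f"Extra revenue: ${revenue_gain:.0f}/year")
if cost_of_extra_nine < revenue_gain:
    print("The extra nine pays for itself.")
else:
    print("Stay at 3 nines: the extra nine costs more than it earns.")
```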

This is of course a simplified example where all requests are equal. And while going from 3 nines to 4 nines may not pay off, going from 2 nines to 3, for example, already means a gain of 9 thousand dollars and can make financial sense. Naturally, in reality the failure of a registration request is worse than a failure to display a page; requests have different weights and may have completely different importance from a business point of view, but as a rule, unless we are talking about some very specific services, this is a fairly reliable approximation.
We received a question about whether SRE is one of the coordinators when choosing an architectural solution for the service, say in terms of integration into the existing infrastructure, so that there is no loss of stability. Yes: SRE influences architectural decisions, the introduction of new services and microservices, and the adoption of new solutions in the same way it influences pull requests, commits, and releases. This is why I said earlier that experience and qualification are needed. In fact, SRE is one of the blocking voices in any architectural and software decision. Accordingly, an SRE as an engineer must, first of all, understand how specific decisions will affect reliability and stability, understand how this relates to business needs, and understand from what point of view it is acceptable and from what point it is not.

Therefore, now we can talk about the reliability criteria, which in SRE are traditionally defined as SLA (Service Level Agreement), SLI (Service Level Indicator), and SLO (Service Level Objective). Service Level Agreement is probably a familiar term, especially if you have worked with networks, providers, or hosting. It is a general agreement that describes the performance of your entire service, penalties for errors, metrics, criteria. SLI is the availability metric itself. An SLI can be the response time of the service, or the percentage of errors. It could be throughput if it is some sort of file hosting. When it comes to recognition algorithms, the indicator can even be the correctness of the answer. SLO (Service Level Objective) is, accordingly, the combination of an SLI, its target value, and a period.

Let's say an SLA could be like this: the service is available 99.95% of the time throughout the year; or 99% of critical support tickets will be closed within 3 hours each quarter; or 85% of requests will get responses within 1.5 seconds every month. That is, we gradually come to understand that errors and failures are quite normal. This is an acceptable situation; we plan for it and even count on it to some extent. That is, SRE builds systems that can make mistakes, that must respond normally to errors and take them into account. And whenever possible, they should handle errors in such a way that the user either does not notice them, or notices but there is some kind of workaround thanks to which everything does not fall down completely.
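To illustrate how an SLI turns into an SLO check, here is a minimal sketch; the request log and thresholds are invented for the example and are not from the webinar:

```python
# Minimal SLI/SLO check: latency and success-rate indicators over a window of requests.
requests = [
    # (latency_seconds, http_status) -- hypothetical request log for one month
    (0.21, 200), (0.35, 200), (1.80, 200), (0.40, 500), (0.95, 200),
]

latency_sli = sum(1 for lat, _ in requests if lat <= 1.5) / len(requests)
success_sli = sum(1 for _, status in requests if status < 500) / len(requests)

# SLOs: 85% of requests answered within 1.5 s; 99% of requests without server errors.
print(f"latency SLI = {latency_sli:.0%}, SLO met: {latency_sli >= 0.85}")
print(f"success SLI = {success_sli:.0%}, SLO met: {success_sli >= 0.99}")
```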

For example, if you upload a video to YouTube and YouTube cannot convert it immediately, because the video is too large or the format is not optimal, the request will naturally not fail with a timeout, and YouTube will not give a 502 error. YouTube will say: "We have received everything, your video is being processed. It will be ready in about 10 minutes." This is the principle of graceful degradation, which is familiar, for example, from front-end development, if you have ever done that.

The next terms we will talk about, which are very important for working with reliability, errors, and expectations, are MTBF and MTTR. MTBF (Mean Time Between Failures) is the mean time between failures. MTTR (Mean Time To Recovery) is the mean time to recovery: how much time passes from the moment the error appears and is discovered to the moment the service is restored to full normal operation. MTBF is mainly improved by work on code quality, that is, by the fact that SREs can say "no". And the whole team needs to understand that when an SRE says "no", he says it not because he is spiteful, not because he is bad, but because otherwise everyone will suffer.

Again, there are a lot of articles, a lot of methods, a lot of ways, even in the very book that I refer to so often, to make sure that other developers do not start to hate the SRE. MTTR, on the other hand, is about working on your SLOs (Service Level Objectives), and that is mostly automation. For example, our SLO is an uptime of 4 nines per quarter. This means that in 3 months we can allow 13 minutes of downtime, and it turns out that MTTR cannot be more than 13 minutes. If it takes us 13 minutes to react to even one outage, that means we have already exhausted the entire budget for the quarter and we are breaking the SLO. 13 minutes to react to and fix a crash is a lot for a machine, but very short for a human. Because until a person receives an alert, reacts, and understands the error, several minutes have already passed. Until a person figures out how to fix it, what exactly to fix, what to do, that is a few more minutes. And in fact, even if it turns out that you just need to restart the server or bring up a new node, manual MTTR is already about 7-8 minutes. When the process is automated, MTTR very often comes down to a second, sometimes milliseconds. Google usually talks about milliseconds, but in reality, of course, everything is not so rosy.
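The 13 minutes are easy to recompute; a quick sketch, taking a quarter as 90 days:

```python
# Error budget for an SLO of 4 nines (99.99%) over one quarter.
QUARTER_MINUTES = 90 * 24 * 60  # 129,600 minutes in a 90-day quarter
slo = 0.9999

error_budget_minutes = (1 - slo) * QUARTER_MINUTES
print(f"Error budget: {error_budget_minutes:.1f} minutes per quarter")  # ~13 min

# A single incident with an MTTR of 13 minutes burns the whole quarterly budget.
mttr_minutes = 13
print(f"Budget left after one such incident: "
      f"{error_budget_minutes - mttr_minutes:.1f} minutes")
```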

Ideally, the SRE should automate its work almost completely, because this directly affects MTTR, its metrics, the SLO of the entire service, and, accordingly, the business profit. We are asked: if the time is exceeded, is the SRE at fault? Fortunately, no one is to blame. This is a separate culture called the blameless postmortem, which we will not talk about today but will analyze at Slurm. It is a very interesting topic that one can talk about a lot. Roughly speaking, if the allotted time per quarter is exceeded, then everyone is a little bit to blame, which means that blaming everyone is not productive; let's instead blame no one, correct the situation, and work with what we have. In my experience, this approach is a bit alien to most teams, especially in Russia, but it makes sense and works very well. Therefore, at the end I will recommend articles and literature that you can read on this topic. Or come to Slurm SRE.

Let me explain. If the SLO time per quarter is exceeded, if the downtime was not 13 minutes but 15, who can be to blame for this? Of course, the SRE may be to blame, because he clearly made some kind of bad commit or deployment. The administrator of the data center may be to blame, because he may have carried out some kind of unscheduled maintenance. If the data center administrator is to blame, then the person from Ops who did not account for that maintenance when agreeing on the SLO is also to blame. The manager, the technical director, or whoever signed the data center contract and did not pay attention to the fact that the data center's SLA is not designed for the required downtime is also to blame. Accordingly, everyone is a little bit to blame in this situation, and that means there is no point in pinning the blame on anyone. But of course the situation needs to be corrected. That is why there are postmortems. And if you read, for example, GitHub postmortems, and each one is always a very interesting, small, and unexpected story, you will notice that no one ever says that this particular person was to blame. The blame is always placed on specific imperfect processes.

Let's move on to the next question. Automation. When I talk about automation in other contexts, I often refer to a table that tells you how long you can work on automating a task without taking more time to automate it than you actually save. There is a snag. The catch is that when SREs automate a task, they not only save time, they save money, because automation directly affects MTTR. They save, so to speak, the morale of employees and developers, which is also an exhaustible resource. They reduce the routine. And all this has a positive effect on work and, as a result, on business, even if it seems that automation does not make sense in terms of time costs.

In fact, it almost always does, and there are very few cases where something should not be automated in the SRE role. Next we will talk about what is called the error budget, the budget for errors. In fact, it turns out that if everything is much better for you than the SLO you set for yourself, this is also not very good. It is rather bad, because the SLO works not only as a lower bound but also as an approximate upper bound. When you set yourself an SLO of 99% availability and in fact you have 99.99%, it turns out that you have some space for experiments that will not harm the business at all, because you determined all of this yourselves, and you are not using that space. You have a budget for mistakes which in your case is not being used up.

What do we do with it? We use it for literally everything: for testing in production conditions, for rolling out new features that may affect performance, for releases, for maintenance, for planned downtime. The reverse rule also applies: if the budget is exhausted, we cannot release anything new, because otherwise we will exceed the SLO. If the budget has already been exhausted and we release something that negatively affects performance, that is, something other than a fix that directly improves the SLO, then we go beyond the budget. That is a bad situation; it needs to be analyzed, a postmortem written, and possibly some processes fixed.
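Here is a hedged sketch of what an error-budget gate for releases might look like; the function names and numbers are hypothetical, not a real policy engine:

```python
# Toy error-budget gate: allow risky releases only while budget remains.
def remaining_error_budget(slo: float, window_minutes: float,
                           downtime_so_far: float) -> float:
    """Minutes of downtime still allowed in the current window."""
    return (1 - slo) * window_minutes - downtime_so_far

def can_release(slo: float, window_minutes: float, downtime_so_far: float,
                expected_risk_minutes: float) -> bool:
    """A release is allowed if its expected impact fits into the remaining budget."""
    return remaining_error_budget(slo, window_minutes, downtime_so_far) >= expected_risk_minutes

quarter = 90 * 24 * 60
print(can_release(slo=0.9999, window_minutes=quarter,
                  downtime_so_far=5.0, expected_risk_minutes=3.0))   # True
print(can_release(slo=0.9999, window_minutes=quarter,
                  downtime_so_far=12.0, expected_risk_minutes=3.0))  # False
```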

It also works the other way around: if the service itself does not work well, and the SLO and the budget are spent not on experiments or releases but simply on failures, then instead of interesting fixes, interesting features, interesting releases, instead of any creative work, you will have to deal with dull fixes just to get the budget back in order, or edit the SLO itself, and that is also a process that should not happen too often.

Therefore, it turns out that in a situation where there is more error budget left, everyone is interested: both SREs and developers. For developers, a large error budget means they can deal with releases, tests, and experiments. For SREs, the error budget and staying within that budget mean they are directly doing their job well. And this motivates joint work. If, as developers, you listen to your SREs, you will have more space for good work and much less routine.

It turns out that experiments in production are quite an important and almost integral part of SRE in large teams. And it's usually called chaos engineering, which comes from the team at Netflix that released a utility called Chaos Monkey.
Chaos Monkey connects to the CI/CD pipeline and randomly takes down a server in production. Again, in the SRE mindset, a downed server is not bad in itself; it is expected. And if it is within the budget, it is acceptable and does not harm the business. Of course, Netflix has enough redundant servers and enough replication that all of this can be fixed and the user as a whole does not even notice, and a single downed server certainly does not take them beyond any budget.
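As a toy illustration of the idea only (this is not Netflix's actual Chaos Monkey, and the terminate_instance call is a hypothetical stand-in for your cloud API):

```python
# Toy chaos experiment: randomly terminate one production instance during work hours.
import random
from datetime import datetime

def terminate_instance(instance_id: str) -> None:
    # Hypothetical call into your cloud or orchestrator API.
    print(f"Terminating {instance_id}; the rest of the fleet must absorb the load")

def chaos_monkey(instances: list[str], probability: float = 0.1) -> None:
    """Run only during office hours, so engineers are around to watch the fallout."""
    now = datetime.now()
    if now.weekday() < 5 and 10 <= now.hour < 16 and random.random() < probability:
        terminate_instance(random.choice(instances))

chaos_monkey(["web-1", "web-2", "web-3"])
```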

For a while Netflix had a whole suite of such utilities, one of which, Chaos Gorilla, completely shuts down one of Amazon's Availability Zones. Such things help reveal, first of all, hidden dependencies, when it is not entirely clear what affects what and what depends on what. If you work with microservices and the documentation is not quite perfect, this may be familiar to you. It also helps a lot to catch errors in the code that you cannot catch on staging, because any staging is not an exact simulation: the load scale is different, the load pattern is different, and the hardware is most likely different too. Peak loads can also be unexpected and unpredictable. And such testing, which again stays within the budget, is very good at catching errors in the infrastructure that staging, autotests, and the CI/CD pipeline will never catch. And as long as it is all within your budget, it does not matter that a server went down. It would seem very scary: a server went down, what a nightmare. No, that is normal, that is good, it helps catch bugs. If you have the budget, you can spend it.

I am asked what literature I can recommend. The list is at the end; there is a lot of literature, and I will also recommend a few talks. Another question: how does this work, and does SRE work in companies without their own software product or with minimal development, for example, in an enterprise where the main activity is not software? In an enterprise whose main activity is not software, SRE works exactly the same as everywhere else, because there too you need to use software products, even if you do not develop them: you need to roll out updates, change the infrastructure, grow, scale. And SREs help identify and predict possible problems in these processes and keep them under control once growth begins and business needs change. You do not need to be involved in software development at all to have an SRE, as long as you have at least a few servers and expect at least some growth.

The same goes for small projects and small organizations, because big companies have the budget and the space to experiment. But all the fruits of those experiments can be used anywhere. SRE, of course, appeared at Google, at Netflix, at Dropbox, but small companies and startups can already read the condensed material, read the books, watch the talks. They start to hear about it more often, they look at specific examples and think: this can really be useful, we need this too, great.

That is, all the main work on standardizing these processes has already been done for you. It remains for you to define the role of SRE specifically in your company and actually begin to implement all these practices, which, again, have already been described. Among the principles useful for small companies, the first is always defining SLA, SLI, and SLO. If you are not involved in software, these will be internal SLAs and internal SLOs, an internal error budget. This almost always leads to some interesting discussions within the team and within the business, because it may turn out that you spend much more than necessary on infrastructure and on organizing ideal processes and an ideal pipeline. And the 4 nines you have in the IT department are not something you actually need right now, while the time and the error budget could have been spent on something else.

Accordingly, monitoring and organization of monitoring is useful for a company of any size. And in general, this way of thinking, where mistakes are something acceptable, where there is a budget, where there are Objectives, it is again useful for a company of any size, starting from startups for 3 people.

The last technical nuance to talk about is monitoring. Because if we are talking about SLA, SLI, and SLO, we cannot understand without monitoring whether we fit into the budget, whether we comply with our Objectives, and how we influence the final SLA. I have seen many times that monitoring happens like this: there is some value, for example, the time of a request to the server, the average time, or the number of requests to the database. It has a threshold defined by an engineer. If the metric deviates from the threshold, an e-mail arrives. This is all absolutely useless, as a rule, because it leads to such a glut of alerts, a glut of messages from monitoring, that a person, firstly, has to interpret them every time, that is, determine whether the value of the metric means that some action is needed. And secondly, he simply stops noticing all these alerts, when most of them do not require any action from him. So a good monitoring rule, and the very first rule when SRE is implemented, is that a notification should arrive only when action is required.

In the standard case, there are 3 levels of events: alerts, tickets, and logs. Alerts are anything that requires you to take immediate action: everything is broken, you need to fix it right now. Tickets are what require delayed action: yes, you need to do something, you need to do something manually, automation failed, but you do not have to do it within the next few minutes. Logs are anything that does not require action, and in general, if things go well, no one will ever read them. You will only need to read the logs when, in retrospect, it turns out that something was broken for some time and we did not know about it, or when you need to do some research. But in general, everything that does not require any action goes to the logs.
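A minimal sketch of that three-level split; the event format and the page/ticket/log helpers are assumptions for the example:

```python
# Route monitoring events into the three levels: alert (page now), ticket, log.
def handle_event(severity: str, message: str) -> None:
    if severity == "alert":
        page_on_call(message)   # requires immediate human or automated action
    elif severity == "ticket":
        open_ticket(message)    # needs action, but not in the next few minutes
    else:
        write_log(message)      # no action required; read only in retrospect

def page_on_call(msg: str) -> None: print(f"PAGE: {msg}")
def open_ticket(msg: str) -> None: print(f"TICKET: {msg}")
def write_log(msg: str) -> None: print(f"log: {msg}")

handle_event("alert", "checkout error rate above SLO threshold")
handle_event("ticket", "automated certificate renewal failed, expires in 20 days")
handle_event("log", "nightly cache warm-up finished in 4m12s")
```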

As a side effect of all this, if we have defined which events require actions and have described well what these actions should be, it means the action can be automated. What happens is this: we go from the alert to the action, from the action to its description, and from the description to automation. That is, any automation begins with a reaction to an event.

From monitoring, we move on to a term called Observability. There has been a bit of hype around this word for the past few years, and few people understand what it means out of context. But the main point is that Observability is a measure of system transparency. If something went wrong, how quickly can you determine what exactly went wrong and what the state of the system was at that moment? In terms of code: which function failed, which service failed, what the state of, for example, internal variables and configuration was. In terms of infrastructure: in which availability zone the failure occurred, and if you have Kubernetes, in which pod the failure occurred and what the state of that pod was. Accordingly, Observability is directly related to MTTR. The higher the Observability of the service, the easier it is to identify the error, the easier it is to fix it, the easier it is to automate the fix, and the lower the MTTR.
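One practical side of observability is structured context attached to every failure; a small sketch, with field names chosen arbitrarily:

```python
# Structured error event: enough context to answer "what failed and where" quickly.
import json
import logging

logging.basicConfig(level=logging.ERROR, format="%(message)s")

def report_failure(service: str, function: str, pod: str, zone: str, state: dict) -> None:
    logging.error(json.dumps({
        "event": "request_failed",
        "service": service,    # which service failed
        "function": function,  # which code path failed
        "pod": pod,            # where it ran (Kubernetes pod)
        "zone": zone,          # which availability zone
        "state": state,        # relevant variables / configuration at that moment
    }))

report_failure("payments", "charge_card", "payments-7d9f-x2k4",
               "eu-west-1b", {"retries": 3, "timeout_s": 2.0})
```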

Moving on to small companies again, a very common question, even now, is what to do about team size and whether a small team needs to hire a separate SRE. I already talked about this a little earlier. At the first stages of a startup's or a team's development this is not necessary at all, because SRE can be made a rotating role. This will liven up the team a little, because there is at least some variety, and it will prepare people for the fact that with growth the responsibilities of the SRE will change very significantly. If you hire a person, he naturally has certain expectations, and those expectations will not change over time, while the requirements will change a great deal. That is why hiring an SRE is quite difficult in the early stages; growing your own is much easier. But it is worth thinking about.

The only exception, perhaps, is when there are very strict and well-defined growth requirements. In the case of a startup, for example, this may be pressure from investors or a forecast of growing several times over at once. Then hiring an SRE is in principle warranted, because it can be justified: we have growth requirements, and we need a person who will be responsible for making sure that nothing breaks with that growth.

One more question: what to do when developers have repeatedly rolled out a feature that passes the tests but breaks production, loads the database, and breaks other features; what process should be implemented? In this case it is exactly the error budget that is introduced, and some of the services and features are tested right in production. This can be a canary deployment, where a feature is rolled out to only a small number of users, but already in production, with the expectation that if something breaks, say for half a percent of all users, it will still fit within the error budget. Yes, there will be an error, for some users everything will break, but we have already said that this is normal.
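A rough sketch of the canary idea; the percentages and thresholds are made up, and a real rollout would go through your deployment tooling:

```python
# Toy canary: send a small share of users to the new version and watch its error rate.
import random

CANARY_SHARE = 0.005  # 0.5% of users, small enough to stay inside the error budget

def choose_version(user_id: int) -> str:
    random.seed(user_id)  # stable assignment per user
    return "canary" if random.random() < CANARY_SHARE else "stable"

def canary_is_healthy(canary_errors: int, canary_requests: int,
                      max_error_rate: float = 0.01) -> bool:
    """Roll back if the canary's error rate exceeds what the budget allows."""
    if canary_requests == 0:
        return True
    return canary_errors / canary_requests <= max_error_rate

print(choose_version(42))
print(canary_is_healthy(canary_errors=30, canary_requests=1000))  # False -> roll back
```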

There was a question about SRE tools, that is, whether there is something in particular that SREs would use and everyone else would not. In fact, there are some highly specialized utilities, some software that, for example, simulates load or handles canary and A/B testing. But mostly the SRE toolkit is what your developers already use, because SRE interacts directly with the development team, and if you use different tools, synchronizing them takes time. Especially when SREs work in large teams and large companies where there can be several teams, company-wide standardization helps a lot, because if 50 different utilities are used across 50 teams, the SRE must know them all, and of course that will never happen. The quality of work, the quality of oversight of at least some of the teams, will drop significantly.

Our webinar is coming to an end. I managed to tell some basic things. Of course, nothing about SRE can be told and understood in an hour. But I hope that I managed to convey this way of thinking, the main key points. And then it will be possible, if interested, to delve into the topic, learn on your own, look at how it is being implemented by other people, in other companies. And accordingly, in early February, come to us at Slurm SRE.

Slurm SRE is a three-day intensive course that covers what I have been talking about, but with much more depth, with real cases, with practice; the whole intensive is aimed at practical work. People will be divided into teams, and you will all be working on real cases. Accordingly, we have Booking.com instructors Ivan Kruglov and Ben Tyler, the wonderful Evgeny Varavva from Google in San Francisco, and I will tell you something too. So be sure to visit us.
So, the bibliography. Here are the references on SRE. First, the same book, or rather two books on SRE, written by Google. Another small article on SLA, SLI, and SLO covers the terms and their application in a bit more detail. The next three are talks on SRE in different companies. The first is Keys to SRE, a keynote by Ben Treynor of Google. The second is SRE at Dropbox. The third is again about SRE at Google. The fourth is a talk about SRE at Netflix, which has only 5 key SRE employees serving 190 countries. It is very interesting to look at all of this, because just as DevOps means very different things to different companies and even different teams, SRE carries very different responsibilities even in companies of similar size.

Two more links on the principles of chaos engineering: (1), (2). And at the end there are three lists from the Awesome Lists series: about chaos engineering, about SRE, and about SRE tools. The SRE list is incredibly huge; you do not have to go through it all, there are about 200 articles there. I highly recommend the articles about capacity planning and about blameless postmortems.

Interesting article: SRE as a life choice

Thank you for listening to me all this time. Hope you have learned something. Hope you have enough material to learn even more. And see you. Hopefully in February.
The webinar was hosted by Eduard Medvedev.

PS: for those who like to read, Eduard gave a list of references. Those who prefer to learn in practice are welcome at Slurm SRE.

Source: habr.com
