NDC London Conference. Preventing microservice disasters. Part 1

You've spent months converting your monolith into microservices, and finally everyone has gathered to flip the switch. You go to the first web page... and nothing happens. You reload it - still no good: the site is so slow it doesn't respond for several minutes. What happened?

In his keynote, Jimmy Bogard will conduct a post-mortem on a real-life microservice disaster. He will show the modeling, development, and production problems he uncovered, and how his team slowly transformed the new distributed monolith into something that finally made sense. Although design errors cannot be prevented entirely, it is at least possible to surface problems early in the design process so that the final product becomes a reliable distributed system.

Hi everyone, I'm Jimmy, and today you'll hear how to avoid mega-disasters when building microservices. This is the story of a company where I worked for about a year and a half to help keep their ship from hitting an iceberg. To tell the story properly, we have to go back in time and talk about how this company started and how its IT infrastructure grew over the years. To protect the names of the innocent in this disaster, I've changed the company's name to Bell Computers. The next slide shows what this company's IT infrastructure looked like in the mid-90s: a typical architecture of the time - a large, general-purpose, fault-tolerant HP Tandem mainframe running a computer hardware store.

They needed a system to manage all orders, sales, returns, product catalogs, and the customer base, so they chose the most common mainframe solution of the time. This giant system contained every scrap of information about the company that possibly could be stored, and every transaction went through the mainframe. They kept all their eggs in one basket and thought that was fine. The only things it did not cover were the mail-order catalog and telephone ordering.

Over time the system grew larger and larger and accumulated a huge amount of cruft. COBOL is also not the most expressive language in the world, so the whole thing ended up as a big, monolithic pile of garbage. By 2000 they saw that many companies had websites through which they conducted literally all of their business, and decided to build their first commercial dot-com site.

The initial design looked quite nice: a top-level site, bell.com, and a number of subdomains for individual applications - catalog.bell.com for the catalog, account.bell.com for accounts, orders.bell.com for orders, search.bell.com for product search. Each subdomain used the ASP.Net 1.0 framework and its own database, and they all talked to the system's backend. All orders, however, continued to be processed and fulfilled inside the single huge mainframe, where all the garbage remained, while the front end consisted of separate websites with individual applications and separate databases.

So the design of the system looked orderly and logical, but the actual system was as shown on the next slide.

Everything called everything else: components hit each other's APIs, embedded third-party DLLs, and so on. It often happened that someone grabbed another team's code from version control, stuffed it into their own project, and then everything broke. MS SQL Server 2005 had the concept of linked servers, and although I haven't drawn the arrows on the slide, each of the databases also talked to the others - because, of course, there's nothing wrong with building tables on top of data pulled from several different databases.

Since there were now at least some divisions between the different logical areas of the system, it turned into a distributed big ball of mud, with the biggest lump of garbage still sitting in the mainframe backend.

The funniest part was that this mainframe had been built by a competitor of Bell Computers and was still maintained by that competitor's technical consultants. Convinced that its applications were performing unsatisfactorily, the company decided to get rid of them and redesign the system.

The existing application had been in production for 15 years, which is a record for an ASP.Net-based application. The service accepted orders from all over the world, and the annual profit from this single application reached a billion dollars. A significant part of that profit was generated by the bell.com website. On Black Fridays, the number of orders placed through the site reached several million. However, the existing architecture allowed no further development: the rigid interconnections between system elements made it practically impossible to change the service at all.

The most serious problem was the inability to place an order in one country, pay for it in another, and ship it to a third, even though such a trading scheme is very common in global companies. The existing website allowed nothing of the kind, so such orders had to be taken and placed over the phone. This pushed the company to think more and more about changing the architecture - in particular, about moving to microservices.

They made the smart decision of looking at other companies' experience to see how they had solved a similar problem. One such solution was the Netflix architecture: microservices connected through APIs and an external database.

The management of Bell Computers decided to build just such an architecture, following a few basic principles. First, they eliminated data duplication by using a shared-database approach. No data was shipped around; instead, anyone who needed it had to go to the centralized source. Next came isolation and autonomy: each service was independent of the others. They decided to use a Web API for absolutely everything - if you wanted to get data or make changes in another system, it all went through a Web API. The last big piece was a new mainframe called "Bell on Bell", as opposed to the "Bell" mainframe based on a competitor's hardware.

So, over the course of 18 months, they built the system around those basic principles and took it to pre-production. Coming back to work after the weekend, the developers got together and switched on all the servers the new system was connected to. 18 months of work, hundreds of developers, the latest Bell hardware - and no positive result! That frustrated a lot of people, because they had run this system on their laptops many times and everything had been fine.

They did the smart thing and threw all their money at the problem. They installed the newest server racks with switches, used gigabit fiber and the most powerful server hardware with insane amounts of RAM, hooked it all up, configured it - and again, nothing! Then they began to suspect that the cause might be timeouts, so they went into every web setting and every API setting and cranked the whole timeout configuration up to its maximum values, after which all they could do was sit and wait for something to happen on the site. They waited, and waited, and waited - for 9 and a half minutes, until the website finally loaded.

At that point it dawned on them that the situation needed a thorough analysis, and they invited us in. The first thing we found out was that in the entire 18 months of development nothing truly "micro" had been created - everything had only grown bigger. After that we started writing a post-mortem - also known as a "regretrospective" (a sad retrospective) or a "blamestorm", by analogy with a brainstorm - to understand the cause of the disaster.

We had several clues, one of which was complete traffic saturation at the moment of an API call. When you use a monolithic architecture, you can understand immediately what exactly went wrong, because you have a single stack trace that reports everything that could have caused the failure. When a bunch of services are hitting the same API at the same time, there is no such trace; the only way is to use additional network monitoring tools like Wireshark, which let you examine a single request and find out what happened while it was being served. So we took one web page and spent almost two weeks putting the pieces of the puzzle together, making various calls to it and analyzing what each of them led to.
Look at this picture. It shows that one external request causes the service to make many internal calls before the response comes back. It turns out that each internal call makes additional hops of its own, because it cannot serve the request by itself and has to go somewhere else to get the information it needs. The picture looks like a meaningless cascade of calls: the outer request invokes additional services, which invoke other services, and so on almost ad infinitum.

The green semicircle in this diagram shows services calling each other in a loop: service A calls service B, service B calls service C, and service C calls service A again. The result is a "distributed deadlock". A single request generated a thousand network API calls, and since the system had no built-in failover or loop protection, the request failed if even one of those API calls failed.
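
To make this concrete, here is a minimal sketch (in Python, with hypothetical service names) of how a call graph containing a cycle and no loop protection turns one external request into an avalanche of downstream calls; the only thing stopping the traversal here is a hop limit, standing in for the timeouts that eventually killed requests in the real system:

```python
# Hypothetical call graph: each service lists the services it calls in turn.
CALL_GRAPH = {
    "frontend":  ["pricing", "inventory"],
    "pricing":   ["customer", "inventory"],
    "inventory": ["customer"],
    "customer":  ["pricing"],   # customer -> pricing -> customer closes a cycle
}

def count_calls(service: str, hops_left: int) -> int:
    """Count the downstream network calls made while serving one request."""
    if hops_left == 0:
        return 0
    total = 0
    for downstream in CALL_GRAPH.get(service, []):
        total += 1 + count_calls(downstream, hops_left - 1)
    return total

# Even capped at 10 hops, one page load triggers 96 downstream calls;
# without the cap, the cycle means the traversal never terminates at all.
print(count_calls("frontend", 10))
```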

We did some math. Each API call had an SLA of at most 150 ms and 99.9% uptime. One request made 200 different calls, so even in the best case, with strictly sequential calls, the page could not be shown faster than 200 x 150 ms = 30 seconds. Naturally, that was never going to work. And availability compounds: 99.9% raised to the power of 200 calls is roughly 82%, meaning almost one page load in five would fail outright - in practice, effectively zero chance of meeting the SLA. This architecture was doomed to failure from the very beginning.
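
Here is the same back-of-the-envelope math as a small script, assuming 200 strictly sequential calls, each with a 150 ms latency budget and 99.9% availability:

```python
calls = 200
latency_per_call_ms = 150
availability_per_call = 0.999

# Best case: every call meets its 150 ms SLA and the calls run one after another.
best_case_latency_s = calls * latency_per_call_ms / 1000   # 30.0 seconds

# Availability compounds: every one of the 200 calls has to succeed.
page_availability = availability_per_call ** calls          # ~0.819

print(f"best-case page latency: {best_case_latency_s} s")
print(f"page availability:      {page_availability:.1%}")   # ~81.9%
```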

We asked the developers: how did you fail to notice this problem in 18 months of work? It turned out they only counted the SLA for the code they themselves ran; if their service called another service, they did not count that time against their SLA. Everything that ran inside a single process stayed within the 150 ms budget, but calls out to other service processes multiplied the total delay many times over. The first lesson learned was: "Do you own your SLA, or does the SLA own you?" In our case, it was the latter.
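
Here is a hedged illustration of that mistake, with made-up handler and downstream functions: timing only your own code looks comfortably within budget, while timing the whole request, including the calls it fans out to, tells a very different story:

```python
import time

def call_downstream_service() -> None:
    # Stand-in for a network hop to another service; in production this cost
    # 150 ms or more per call, not the near-zero time seen on a developer laptop.
    time.sleep(0.150)

def handle_request() -> None:
    request_start = time.perf_counter()

    own_code_start = time.perf_counter()
    # ... local business logic: the only part counted toward the SLA ...
    own_code_ms = (time.perf_counter() - own_code_start) * 1000

    for _ in range(5):               # the handler fans out to other services
        call_downstream_service()

    total_ms = (time.perf_counter() - request_start) * 1000
    print(f"own code:      {own_code_ms:.1f} ms")   # well under 150 ms
    print(f"whole request: {total_ms:.0f} ms")      # roughly 750 ms and up

handle_request()
```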

The next thing we found was that they knew about the fallacies of distributed computing, formulated by Peter Deutsch and James Gosling, but had ignored the first part of the list. It says that the statements "the network is reliable", "latency is zero" and "bandwidth is infinite" are fallacies. Equally fallacious are the statements "the network is secure", "the topology never changes", "there is always only one administrator", "the cost of data transfer is zero" and "the network is homogeneous".
They fell into this trap because they ran their services on local machines and never hooked up external services. Developing locally and using a local cache, they never encountered network hops. In all 18 months of development they never once asked themselves what would happen once real external services were involved.
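
One way to avoid that trap - sketched below with made-up latency and failure numbers - is to wrap every "remote" dependency used during local development in simulated network behavior, so the fallacies show up on a laptop instead of in production:

```python
import random
import time

class FlakyNetwork:
    """Wraps calls with simulated latency and occasional failures."""

    def __init__(self, latency_ms=(5, 200), failure_rate=0.01):
        self.latency_ms = latency_ms
        self.failure_rate = failure_rate

    def call(self, fn, *args, **kwargs):
        time.sleep(random.uniform(*self.latency_ms) / 1000)   # latency is not zero
        if random.random() < self.failure_rate:               # the network is not reliable
            raise ConnectionError("simulated network failure")
        return fn(*args, **kwargs)

def get_customer(customer_id: int) -> dict:
    # Stand-in for what would be a remote service call in production.
    return {"id": customer_id, "name": "example"}

network = FlakyNetwork()
print(network.call(get_customer, 42))   # occasionally raises ConnectionError
```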

If you look at the boundaries of the services in the previous picture, you can see that they are all wrong. There are a lot of sources out there that advise how to define service boundaries, and most do it wrong, like Microsoft on the next slide.

This picture is from an MS blog post on how to build microservices. It shows a simple web application, a business logic block, and a database. The request comes straight in; presumably there is one web server, one business logic server, and one database server. As traffic grows, the picture changes a little.

Here a load balancer appears to distribute traffic between two web servers, a cache sits between the web tier and the business logic, and another cache sits between the business logic and the database. This is exactly the architecture Bell used in the mid-2000s for its load-balanced, blue/green-deployed application. Up to a certain point everything worked fine, since this scheme was meant for a monolithic structure.

The following picture shows how MS recommends moving from a monolith to microservices - simply split each of the main services into separate microservices. It was during the implementation of this scheme that Bell made a mistake.

They broke all their services into separate tiers, each of which consisted of many individual services. For example, the web tier included microservices for content rendering and authentication, the business logic tier consisted of microservices for order processing and account information, and the database was split into a bunch of microservices holding specialized data. The web, the business logic, and the database were all stateless services.

However, this picture was completely wrong, because it did not map to any business units outside the company's IT cluster. The scheme took no account of any connection with the outside world, so it was not clear, for example, how to obtain third-party business intelligence. I'll note they also had several services that were invented purely to further the careers of individual employees, who tried to manage as many people as possible in order to get paid more for it.

They believed that moving to microservices was as simple as taking their internal, physically layered N-tier infrastructure and stuffing Docker into it. Let's take a look at what a traditional N-tier architecture looks like.

It consists of four layers: the UI (user interface) layer, the business logic layer, the data access layer, and the database. A more progressive variant is DDD (Domain-Driven Design), a domain-oriented architecture in which the two middle layers become domain objects and a repository.
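
As a rough sketch of what those two middle layers can look like in code (illustrative names, not Bell's actual model): a domain object that carries its own behavior, and a repository that hides how it is stored:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Order:
    """Domain object: owns its data and the rules for changing it."""
    order_id: int
    lines: list[tuple[str, int]] = field(default_factory=list)

    def add_line(self, sku: str, quantity: int) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.lines.append((sku, quantity))

class OrderRepository(Protocol):
    """Repository: the only way the rest of the code reaches stored orders."""
    def get(self, order_id: int) -> Order: ...
    def save(self, order: Order) -> None: ...

class InMemoryOrderRepository:
    def __init__(self) -> None:
        self._orders: dict[int, Order] = {}

    def get(self, order_id: int) -> Order:
        return self._orders[order_id]

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

repo: OrderRepository = InMemoryOrderRepository()
order = Order(order_id=1)
order.add_line("SKU-123", 2)
repo.save(order)
```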

I tried to look at this architecture in terms of different areas of change, different areas of responsibility. In a typical N-tier application, the distinct areas of change cut through the structure vertically, from top to bottom: the Catalog, the Config settings deployed to individual machines, and Checkout, which my team worked on.

The point of this scheme is that the boundaries of these areas of change cut not only through the business logic layer but extend all the way down into the database.

Let's look at what it means to be a service. A service has six characteristic properties - it is software that:

  • is created and used by a specific organization;
  • is responsible for the content, processing and/or provision of a certain type of information within the system;
  • can be created, deployed and run independently to meet specific operational needs;
  • communicates with consumers and other services, providing information on the basis of agreements or contractual guarantees;
  • protects itself from unauthorized access, and its information from loss;
  • handles failures in such a way that they do not lead to information corruption.

All of these properties can be summed up in one word: "autonomy". Services work independently of each other, satisfy certain constraints, and define contracts through which consumers can get the information they need. Note that I did not mention specific technologies - their use goes without saying.
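
As a minimal sketch of what "defining a contract" can mean in practice (names and fields are illustrative, not from Bell's system): a small, explicit, versioned shape for the data a service agrees to provide, instead of letting consumers reach into its database:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderSummaryV1:
    """Versioned contract: the only shape consumers are promised."""
    order_id: int
    status: str        # e.g. "placed", "shipped", "cancelled"
    total_cents: int   # money as integer cents to avoid float rounding

def get_order_summary(order_id: int) -> OrderSummaryV1:
    # A real service would read from its own storage; this returns a canned value.
    return OrderSummaryV1(order_id=order_id, status="placed", total_cents=2598)

print(get_order_summary(42))
```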

Now consider the definition of microservices:

  • a microservice is small and designed to solve one specific problem;
  • a microservice is autonomous;
  • when designing a microservice architecture, the town-planning metaphor is used.

This definition comes from Sam Newman's book Building Microservices.

The definition of Bounded Context comes from Eric Evans's book Domain-Driven Design. It is the central pattern in DDD: an approach to dealing with large architectural models by dividing them into different Bounded Contexts and explicitly defining the interaction between them.

Simply put, a Bounded Context defines the scope within which a particular model applies. Inside that context there is a single, logically unified model, which you can see, for example, in your business domain. Ask the people who take orders "who is a customer" and you will get one definition; ask the sales people and you will get another; and the fulfillment people will give you a third.

So Bounded Context says: if we cannot give a single unambiguous definition of what a consumer of our services is, let's define the boundaries within which the term has a clear meaning, and then define the transition points between those different definitions. That is, when we talk about a customer in the context of placing orders it means one thing, and in the context of sales it means another.
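
Sketched in code (with illustrative field names, not Bell's actual model), the idea looks roughly like this: each context keeps its own Customer model, and the transition point between contexts is an explicit translation rather than a shared class:

```python
from dataclasses import dataclass

@dataclass
class OrderingCustomer:
    """What "customer" means in the Ordering context."""
    customer_id: int
    shipping_address: str

@dataclass
class SalesCustomer:
    """What "customer" means in the Sales context."""
    customer_id: int
    account_manager: str
    lifetime_value: float

def to_sales_customer(c: OrderingCustomer, account_manager: str) -> SalesCustomer:
    """Explicit translation at the transition point between the two contexts."""
    return SalesCustomer(
        customer_id=c.customer_id,
        account_manager=account_manager,
        lifetime_value=0.0,   # Sales enriches this from its own data later
    )

ordering_view = OrderingCustomer(customer_id=7, shipping_address="1 Main St")
sales_view = to_sales_customer(ordering_view, account_manager="Alice")
print(sales_view)
```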

The next part of the definition of a microservice is the encapsulation of all internal operations, which keeps the details of its workflow from "leaking" into the environment. Then comes the definition of explicit contracts for external interactions, or external relations - the same idea of contracts we saw coming back from the SLAs. The last is the metaphor of a biological cell: complete encapsulation of a set of operations inside the microservice, with receptors on the outside for communicating with the surrounding world.

So we told the folks at Bell Computers: "We can't fix all of the chaos you've created, because you simply don't have enough money for that, but we will fix just one service in a way that makes sense of it all." From here I'll begin the story of how we fixed a single service so that it started responding to requests in less than 9 and a half minutes.

22:30 min

To be continued very soon...


Source: habr.com
