QCon conference. Mastering Chaos: A Netflix Guide to Microservices. Part 4

Josh Evans talks about the chaotic and colorful world of Netflix microservices, starting from the very basics - the anatomy of microservices, the problems associated with distributed systems, and their benefits. Building on this foundation, he explores the cultural, architectural, and operational practices that lead to mastery of microservices.

QCon conference. Mastering Chaos: A Netflix Guide to Microservices. Part 1
QCon conference. Mastering Chaos: A Netflix Guide to Microservices. Part 2
QCon conference. Mastering Chaos: A Netflix Guide to Microservices. Part 3

Unlike operational drift, the introduction of new programming languages and of new technologies such as containers is a conscious decision to add new complexity to the environment. My operations team had standardized a "paved road" of best-of-breed technologies for Netflix, with predefined best practices baked in, based on Java and EC2, but as the business grew, developers added new components such as Python, Ruby, Node.js, and Docker.

I am very proud that we were the first to advocate that our product should work flawlessly without waiting for customer complaints. It all started simply enough: we had Python for operations and a few Ruby back-office applications, but things got much more interesting when our web developers announced that they were moving off the JVM and migrating the web application to Node.js. With the introduction of Docker, things became considerably more complicated. We followed the logic of the situation, and the technologies we adopted became reality when we rolled them out for customers, because they made a great deal of sense. I will tell you why.

The API gateway had the ability to host endpoint scripts, written in Groovy, that acted as endpoints for UI developers. The developers wrote each of these scripts so that, after making changes, they could push them to production and on to user devices, with all changes kept in sync with the endpoints running in the API gateway.

However, this recreated the monolith problem: the API service became so overloaded with code that various failure scenarios arose. For example, some endpoints were knocked out, or scripts randomly generated so many versions of something that those versions consumed all the available memory of the API service.

The logical step was to pull these endpoints out of the API service. To do this, we created Node.js components that ran as small applications in Docker containers. This allowed us to isolate any problems and crashes caused by these Node applications.
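As an illustration, here is a minimal sketch of such an extracted endpoint running as its own small Node.js service; the route names and environment variables are illustrative, not Netflix's actual code:

```typescript
// A minimal sketch (illustrative, not Netflix's actual code) of an endpoint
// pulled out of the API gateway into its own small Node.js service, so that
// a crash takes down only this container, not the shared API service.
import { createServer } from "node:http";

const PORT = Number(process.env.PORT ?? 8080); // typically injected by the container platform

const server = createServer((req, res) => {
  if (req.url === "/healthcheck") {
    // Lets the container platform detect and replace a failed instance.
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: "ok" }));
    return;
  }
  // The device-specific endpoint logic lives here, isolated from every
  // other endpoint script.
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ endpoint: "example", version: process.env.APP_VERSION ?? "dev" }));
});

server.listen(PORT, () => console.log(`endpoint service listening on ${PORT}`));
```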

The cost of these changes is quite high and consists of the following factors:

  • Productivity tooling. Managing new technologies required new tools: the UI team, which used Groovy scripts to build an effective endpoint model, should not have to spend much time managing infrastructure; it should only have to write scripts and verify that they work.
  • Insight and triage. A key example is the new tooling needed to surface performance information: you have to know how busy the CPU is and how memory is used, and collecting that information required different tools for each stack.
  • Fragmentation of base images. The simple base AMI became more fragmented and specialized.
  • Node management. There was no out-of-the-box architecture or technology for managing container nodes in the cloud, which is why we created Titus, a container management platform that provides scalable and reliable container deployment and integration with Amazon AWS.
  • Library and platform duplication. Giving new technology stacks the same core platform functionality required duplicating it, for example into Node.js versions of the cloud developer tools.
  • Learning curve and production experience. Introducing new technologies inevitably creates new problems that have to be overcome and learned from.

Thus, we could not limit ourselves to a single "paved road" and had to keep building new roads to advance our technologies. To keep costs down, we limited centralized support and focused on the JVM, Node.js, and Docker. We prioritized by degree of impact, made teams aware of the cost of their decisions, and encouraged them to look for opportunities to reuse solutions that had already proven effective. We took the same approach when bringing the service to other programming languages: relatively simple client libraries can be auto-generated, so it is fairly easy to produce a Python version, a Ruby version, a Java version, and so on.
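To illustrate why such clients are cheap to produce, here is a hedged sketch of the idea: a single machine-readable service description from which per-language clients can be emitted mechanically. The spec format, endpoints, and generator are hypothetical, not Netflix's actual tooling:

```typescript
// Hypothetical sketch: one service description, many generated clients.
interface EndpointSpec {
  name: string;
  method: "GET" | "POST";
  path: string; // path template with {placeholders}
}

const serviceSpec: EndpointSpec[] = [
  { name: "getTitleMetadata", method: "GET", path: "/titles/{id}" },
  { name: "addToQueue", method: "POST", path: "/queue" },
];

// Emit a TypeScript client from the spec; a Python or Ruby emitter would
// walk the same spec and differ only in the text it prints.
function emitTypeScriptClient(spec: EndpointSpec[]): string {
  const methods = spec
    .map(
      (e) =>
        `  async ${e.name}(params: Record<string, string> = {}): Promise<unknown> {\n` +
        `    const path = "${e.path}".replace(/\\{(\\w+)\\}/g, (_, k) => params[k]);\n` +
        `    const res = await fetch(this.baseUrl + path, { method: "${e.method}" });\n` +
        `    return res.json();\n  }`
    )
    .join("\n");
  return `export class ServiceClient {\n  constructor(private baseUrl: string) {}\n${methods}\n}`;
}

console.log(emitTypeScriptClient(serviceSpec));
```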

We were constantly looking for opportunities to take proven technologies that had worked well in one place and apply them in other, similar situations.

Let's talk about the last element: change, or variation. Look at how unevenly consumption of our product varies by day of the week and by hour within a day. You could say that at 9 a.m. the hard testing begins for Netflix, because that is when the load on the system reaches its maximum.

How can we achieve a high rate of software innovation, constantly making changes to the system, without interrupting service delivery and without inconveniencing our customers? Netflix achieved this with Spinnaker, a global cloud management and continuous delivery (CD) platform.

Critically, Spinnaker was designed to integrate our best practices, so that as components are deployed to production, those practices are applied directly as part of the delivery process.
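To make "practices baked into the pipeline" concrete, here is a hedged sketch of a delivery pipeline expressed as data, in the spirit of Spinnaker's JSON pipeline definitions; the stage names and shape are illustrative, not Spinnaker's exact schema:

```typescript
// Illustrative pipeline-as-data sketch (not Spinnaker's real schema).
interface Stage {
  name: string;
  dependsOn: string[]; // stages that must succeed first
}

const deliveryPipeline: Stage[] = [
  { name: "bake-image", dependsOn: [] },                       // build an immutable machine image
  { name: "canary-analysis", dependsOn: ["bake-image"] },      // compare new vs. old on live traffic
  { name: "deploy-region-a", dependsOn: ["canary-analysis"] }, // staged, region-by-region rollout
  { name: "deploy-region-b", dependsOn: ["deploy-region-a"] },
];

// Because the pipeline is data, best practices (canary, staged rollout) are
// enforced by the platform rather than remembered by each team.
console.log(deliveryPipeline.map((s) => s.name).join(" -> "));
```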

In the delivery pipeline we were able to use two practices that we value highly: automated canary analysis and staged deployment. Canary analysis means that we route a trickle of traffic to the new version of the code while the rest of production traffic goes through the old version, and then we check whether the new code performs better or worse than the existing code.
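Here is a minimal sketch of the canary idea, with illustrative weights, metrics, and thresholds (this is not Netflix's automated canary analysis code):

```typescript
// Canary sketch: route ~1% of traffic to the new version, compare error
// rates, and promote only if the canary is not meaningfully worse.
type Variant = "baseline" | "canary";

const CANARY_WEIGHT = 0.01; // illustrative: ~1% of traffic hits the new code

function pickVariant(): Variant {
  return Math.random() < CANARY_WEIGHT ? "canary" : "baseline";
}

const requests: Record<Variant, number> = { baseline: 0, canary: 0 };
const errors: Record<Variant, number> = { baseline: 0, canary: 0 };

function record(variant: Variant, failed: boolean): void {
  requests[variant] += 1;
  if (failed) errors[variant] += 1;
}

// Promote only if the canary's error rate stays within an illustrative
// tolerance of the baseline's error rate.
function canaryLooksHealthy(tolerance = 1.2): boolean {
  const rate = (v: Variant) => errors[v] / Math.max(requests[v], 1);
  return rate("canary") <= rate("baseline") * tolerance;
}

// Example: simulate some traffic, then decide.
for (let i = 0; i < 10_000; i++) {
  const v = pickVariant();
  record(v, Math.random() < 0.001); // illustrative failure probability
}
console.log(canaryLooksHealthy() ? "promote canary" : "roll back canary");
```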

Staged deployment means that we roll out region by region: if a deployment runs into problems in one region, we hold off on deploying to the next region and can shift traffic to a healthy one. The checklist mentioned earlier is always part of the production pipeline. To save you some time, I recommend my earlier talk, "Engineering Netflix's Global Operations in the Cloud", if you would like to dig deeper into this topic; a video recording is available at the link at the bottom of the slide.
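A hedged sketch of that region-by-region rollout logic, with illustrative region names and a stand-in health check:

```typescript
// Staged rollout sketch: deploy to one region at a time and gate each step
// on a health check. Region names and checks are illustrative.
const REGIONS = ["us-west-2", "us-east-1", "eu-west-1"];

async function deployToRegion(region: string, version: string): Promise<void> {
  console.log(`deploying ${version} to ${region}...`);
  // A real system would call the deployment platform here.
}

async function regionIsHealthy(region: string): Promise<boolean> {
  console.log(`checking health of ${region}...`);
  // Stand-in for automated checks (error rates, latency, canary results).
  return true;
}

async function stagedRollout(version: string): Promise<void> {
  for (const region of REGIONS) {
    await deployToRegion(region, version);
    if (!(await regionIsHealthy(region))) {
      // Halt here: earlier regions keep serving, traffic can be shifted
      // away from the unhealthy region while the change is rolled back.
      console.log(`rollout halted: ${region} unhealthy`);
      return;
    }
  }
  console.log(`rollout of ${version} complete in all regions`);
}

stagedRollout("v2.1.0");
```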

To close the talk, I will briefly cover the organization and architecture of Netflix. At the very beginning we had a scheme called Electronic Delivery, which was the first version of NRDP 1.x media content streaming. The term "reverse stream" could be used here, because initially the user could only download content for later playback on the device. The very first Netflix electronic delivery platform in 2009 looked something like this.

The user device ran the Netflix application, which consisted of a UI, security modules, service activation, and playback, all built on NRDP, the Netflix Ready Device Platform.

At that time the user interface was very simple: it contained a so-called Queue Reader, and the user would go to the website to add something to their queue, then view the added content on their device. On the positive side, the client team and the server team belonged to the same Electronic Delivery organization and had a close working relationship. The payload was XML-based. In parallel, the Netflix API was created for the DVD business, to encourage third-party applications to send traffic to our service.

At the same time, the Netflix API was well equipped to support an innovative user interface: it contained metadata for all content and information about which movies were available, which made it possible to generate watchlists. It had a generic REST API based on a JSON schema, HTTP response codes like those used in modern architectures, and an OAuth security model, which at the time was what an external application required. This made it possible to move from a public model of streaming content delivery to a private one.
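For illustration, here is a hedged sketch of what calling an API in this style looks like from a third-party application; the host, endpoint, and fields are hypothetical:

```typescript
// Hypothetical call in the style described above: generic REST over JSON,
// meaningful HTTP response codes, and an OAuth bearer token.
async function getCatalogTitles(accessToken: string): Promise<unknown> {
  const res = await fetch("https://api.example.com/catalog/titles", {
    headers: {
      Authorization: `Bearer ${accessToken}`, // OAuth access token
      Accept: "application/json",
    },
  });
  if (res.status === 401) throw new Error("OAuth token rejected");
  if (!res.ok) throw new Error(`unexpected HTTP status ${res.status}`);
  return res.json(); // JSON metadata used to build watchlists in the UI
}
```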

The problem with the transition was fragmentation, since our system now had two services operating on completely different principles: one based on REST, JSON, and OAuth, the other on RPC, XML, and a user security mechanism based on NTBA tokens. It was the first hybrid architecture.

There was essentially a firewall between our two teams, because initially the API did not scale well with NCCP, which led to friction between the teams. The differences lay in services, protocols, schemas, and security modules, and developers often had to switch between completely different contexts.

In this regard, I had a conversation with one of the company's senior engineers, whom I asked: "What should the right long-term architecture be? Should we integrate these things, even if that breaks what we are already well trained to do?" This relates directly to Conway's law: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." That is a very abstract definition, so I prefer a more concrete one: "Any piece of software reflects the organizational structure that produced it." Here is my favorite quote, from Eric Raymond: "If you have four development teams working on a compiler, you will end up with a four-pass compiler." Well, Netflix had a four-pass compiler, and that is how we worked.

You could say that in this case the tail wags the dog: what comes first is not the solution but the organization, and the organization is the driver of the architecture we end up with. Gradually we moved from a hodgepodge of services to an architecture we called Blade Runner, because it is about edge services: NCCP was decomposed and integrated into the Zuul proxy and the API gateway, and the corresponding functional pieces became new microservices with more advanced security, playback, data sorting, and so on.

Thus, it can be said that departmental structures and company dynamics play an important role in shaping system design and can either facilitate or hinder change. Microservice architecture is complex and organic, and its health rests on discipline and on deliberately injected chaos.


Source: habr.com
