How to sleep well when you have a cloud service: basic architectural tips

by sophiagworld

This article collects common patterns that help engineers build and operate large-scale services handling requests from millions of users.

In the author's experience, this is not an exhaustive list, but the advice is genuinely effective. So, let's begin.

Translated with the support of Mail.ru Cloud Solutions.

First level

The measures listed below are relatively easy to implement, but provide high returns. If you haven't tried them before, you'll be surprised at the significant improvements.

Infrastructure as code

The first piece of advice is to define your infrastructure as code. This means you need a programmatic way to deploy your entire infrastructure. It sounds fancy, but we are actually talking about code like the following:

Deploy 100 virtual machines:

  • with Ubuntu
  • 2 GB of RAM each
  • running the following application code
  • with these settings

You can track infrastructure changes in source control and quickly revert them.

The modernist in me says you can use Kubernetes/Docker to do all of the above, and he's right.

In addition, you can automate all of this with tools such as Chef, Puppet, or Terraform.
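The idea behind all of these tools can be sketched in a few lines: the desired infrastructure is plain data checked into source control, and the tool computes what actions are needed to reach it. This is only an illustrative sketch, not a real Terraform/Chef API; all names here are made up.

```python
# Illustrative sketch of infrastructure as code: the desired state is data,
# and a planner diffs it against reality (like `terraform plan` does).
# All names and fields here are hypothetical.

DESIRED = {f"web-{i}": {"image": "ubuntu-22.04", "ram_gb": 2} for i in range(100)}

def plan(current: dict, desired: dict) -> dict:
    """Compute the actions needed to turn `current` into `desired`."""
    to_create = [name for name in desired if name not in current]
    to_delete = [name for name in current if name not in desired]
    to_update = [name for name in desired
                 if name in current and current[name] != desired[name]]
    return {"create": to_create, "delete": to_delete, "update": to_update}

# First run: nothing exists yet, so all 100 VMs are scheduled for creation.
actions = plan(current={}, desired=DESIRED)
print(len(actions["create"]))  # 100
```

Because the spec lives in source control, reverting a bad change is just `git revert` plus re-running the planner.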

Continuous Integration and Delivery

To create a scalable service, it is important to have a build and test pipeline for every pull request. Even if the tests are trivial, they at least guarantee that the code you deploy compiles.

At this stage, you answer the same question every time: does my build compile and pass the tests? This may seem like a low bar, but it solves a lot of problems.

There is nothing more beautiful than seeing these ticks

For this, check out GitHub Actions, CircleCI, or Jenkins.

Load balancers

So, we want to run a load balancer that redirects traffic, distributes load evenly across all nodes, and keeps the service running when a node fails:

A load balancer usually does a good job of distributing traffic. A best practice is to run redundant balancers so that the balancer itself is not a single point of failure.

Typically, load balancers are configured in the cloud you use.
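Even though your cloud provider will normally run the balancer for you, the routing idea is simple enough to sketch. Here is a toy round-robin balancer with health checks; the node names are hypothetical, and real balancers add connection draining, weights, and active probes.

```python
# Toy round-robin load balancer: rotate through nodes, skipping unhealthy
# ones so a failed node does not take the service down. Illustrative only.
from itertools import cycle

class LoadBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def route(self):
        # Try each node at most once per request before giving up.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")

lb = LoadBalancer(["node-a", "node-b", "node-c"])
lb.mark_down("node-b")
print([lb.route() for _ in range(4)])  # node-b is never chosen
```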

RayID, correlation ID or UUID for requests

Have you ever encountered an application error with a message like this: "Something went wrong. Save this id and send it to our support team"?

A correlation ID, RayID, or any of their variations is a unique identifier attached to a request that lets you follow it throughout its life cycle, tracing the request's entire path through the logs.

The user makes a request to system A, then A contacts B, which contacts C, stores in X, and then the request returns to A

If you were to remotely connect to virtual machines and try to trace the path of a request (and manually correlate which calls occur), you would go crazy. Having a unique ID makes life a lot easier. This is one of the easiest things you can do to save time as your service grows.
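The mechanics are tiny: generate an ID at the edge and hand the same ID to every downstream call and log line. A minimal sketch (service names and the header convention are illustrative):

```python
# Sketch: attach a correlation ID when a request enters the system and
# include it in every log line and downstream call. Service names are made up.
import uuid
import logging

logging.basicConfig(format="%(message)s")
log = logging.getLogger("svc")

def handle_request(payload, correlation_id=None):
    # Generate an ID at the edge; reuse it if another service passed one in
    # (in real HTTP systems this usually travels in a header).
    cid = correlation_id or str(uuid.uuid4())
    log.warning("cid=%s service=A event=received", cid)
    call_downstream(payload, cid)
    return cid

def call_downstream(payload, cid):
    # The same ID travels with the request, so one grep finds the whole path.
    log.warning("cid=%s service=B event=received", cid)

cid = handle_request({"user": 42})
```

Now a single search for that one `cid` reconstructs the request's journey across A, B, C, and X.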

Average level

Here, the tips are more complex than the previous ones, but the right tools make them easy to implement, providing a return on investment even for small and medium-sized companies.

Centralized logging

Congratulations! You have deployed 100 virtual machines. The next day, the CEO comes and complains about a bug he hit while testing the service. He reports the correlation ID we talked about above, but you would have to dig through the logs of 100 machines to find the one that caused the crash. And it needs to be found before tomorrow's presentation.

While this sounds like a fun adventure, it's best to make sure you can search all the logs from one place. I solved the problem of log centralization with the ELK stack, which supports searchable log collection. It will really help you find a specific log. As a bonus, you can build charts and other fun stuff.

ELK stack functionality
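Central search works best when log lines are structured rather than free text. A sketch of emitting JSON log lines that a store like Elasticsearch can index by field (the field names here are just a convention, not an ELK requirement):

```python
# Sketch: emit logs as structured JSON so a central store (e.g. Elasticsearch
# in the ELK stack) can index and search individual fields, not raw text.
import json
import datetime

def log_line(service, level, message, **fields):
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        **fields,  # e.g. the correlation ID from the previous section
    }
    return json.dumps(record)  # ship this line to the central collector

line = log_line("checkout", "ERROR", "payment failed", cid="abc-123")
print(line)
```

With this in place, finding the CEO's crash is one field query on `cid` instead of a tour of 100 machines.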

Monitoring Agents

Now that your service is up and running, you need to make sure it runs smoothly. The best way to do this is to run several agents in parallel that periodically check that the service is alive and that basic operations succeed.

At this point, you verify that the deployed build is healthy and behaves correctly.

For small to medium projects, I recommend Postman for API monitoring and documentation. But in general, you just need to make sure that you have a way to know when a failure has occurred and be notified in a timely manner.
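The core of such an agent fits in a few lines: probe an endpoint, judge the response, and alert only on repeated failures so a single blip doesn't page anyone. The probe, thresholds, and alert hook below are all illustrative.

```python
# Sketch of a monitoring agent: periodically probe a health endpoint and
# alert on repeated failures. Probe, thresholds, and alert hook are made up.
def check_service(probe):
    """Run one health check; `probe` returns (status_code, latency_ms)."""
    status, latency_ms = probe()
    return status == 200 and latency_ms < 500

def run_agent(probe, alert, rounds=3):
    failures = 0
    for _ in range(rounds):
        if not check_service(probe):
            failures += 1
    if failures == rounds:  # alert only on repeated failures, not one blip
        alert(f"service failed {failures}/{rounds} health checks")
    return failures

alerts = []
run_agent(probe=lambda: (500, 30), alert=alerts.append)
print(alerts)
```

In a real setup, the loop would run on a schedule and `alert` would notify a pager or chat channel.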

Autoscaling based on load

It's very simple. If a VM serving requests approaches 80% memory usage, you can either increase its resources or add more VMs to the cluster. Automating these operations gives you elastic capacity under load. But you should always be careful about how much money you spend, and set reasonable limits.

With most cloud services, you can set up automatic scaling with more servers or more powerful servers.
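The scaling rule described above can be sketched as a pure decision function; the 80% trigger comes from the text, while the scale-in threshold and node limits are illustrative knobs you would tune.

```python
# Sketch of the scaling rule above: add capacity near 80% memory usage,
# remove it when load drops, and always respect a hard cost cap.
def desired_nodes(current_nodes, memory_pct, min_nodes=2, max_nodes=10):
    if memory_pct >= 80 and current_nodes < max_nodes:
        return current_nodes + 1   # scale out under pressure
    if memory_pct <= 30 and current_nodes > min_nodes:
        return current_nodes - 1   # scale in to save money
    return current_nodes           # otherwise leave things alone

print(desired_nodes(3, 85))   # 4
print(desired_nodes(3, 20))   # 2
print(desired_nodes(10, 95))  # 10 -- the cost limit wins
```

The `max_nodes` cap is the "reasonable limit" on spending: no matter how hot the metric runs, the cluster never grows past it.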

Experiment system

A good way to deploy updates safely is to be able to test something on 1% of users within an hour. You have surely seen such mechanisms in action: for example, Facebook shows part of its audience a different color or font size to see how users perceive the change. This is called A/B testing.

Even a new feature release can be run as an experiment, and you can then decide how to roll it out. You also get the ability to roll back or change the configuration on the fly to disable a feature that is degrading your service.
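The usual trick for the 1% rollout described above is deterministic hash bucketing: hash the user ID together with the experiment name, so every user lands in the same bucket on every request. A sketch (the experiment name and bucket granularity are illustrative):

```python
# Sketch: deterministically assign each user to an experiment bucket by
# hashing the user ID, so the same 1% of users see the feature every time.
import hashlib

def in_experiment(user_id: str, experiment: str, percent: float) -> bool:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in [0, 9999]
    return bucket < percent * 100         # e.g. 1% -> buckets 0..99

exposed = sum(in_experiment(f"user-{i}", "new-font", 1.0) for i in range(10000))
print(exposed)  # roughly 100 out of 10000
```

Including the experiment name in the hash keeps buckets independent across experiments, so the same users aren't always the guinea pigs.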

Advanced level

Here are tips that are quite difficult to implement. They will probably require a bit more resources, so it may be hard for a small or medium-sized company to handle them.

Blue-green deployments

This is what I call the "Erlang" way of deploying. Erlang came out of the telecom world, where softswitches route telephone calls and the main goal of their software is not to drop calls during a system upgrade. Erlang has a great mechanism for loading a new module without stopping the previous one.

This step depends on the presence of a load balancer. Let's say you have version N of your software, and then you want to deploy version N+1. 

You could just stop the service and deploy the next version at a time convenient for your users, accepting some downtime. But suppose you have really strict SLA conditions: a 99.99% SLA means you can be offline only 52 minutes per year.

If you really want to achieve such indicators, you need two deployments at the same time: 

  • the one that is right now (N);
  • next version (N+1). 

You tell the load balancer to redirect a percentage of the traffic to the new version (N+1) while you actively track regressions yourself.

Here we have a green deploy N that works fine. We are trying to move to the next version of this deployment

We first send a really small test to see if our N+1 deployment works with a small amount of traffic:

Finally, we have a set of automated checks that keep running until our deployment is complete. If you are very, very careful, you can also keep your N deployment around forever for a quick rollback in case of a bad regression:

If you want to go to an even more advanced level, let everything in the blue-green deployment be done automatically.
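The fully automated version of this ramp is easy to sketch: shift a growing share of traffic to N+1, run the regression checks at each step, and fall back to N the moment a check fails. The step schedule below is illustrative.

```python
# Sketch of an automated blue-green ramp: gradually shift traffic from the
# current deployment N to N+1, rolling back instantly on a regression.
import random

def pick_deployment(n_plus_1_weight: float) -> str:
    """Route one request; the weight is the fraction of traffic sent to N+1."""
    return "N+1" if random.random() < n_plus_1_weight else "N"

def ramp(check_healthy, steps=(0.01, 0.1, 0.5, 1.0)):
    weight = 0.0
    for step in steps:
        weight = step            # the balancer now sends `weight` of traffic to N+1
        if not check_healthy():  # automated regression checks at each step
            return 0.0           # roll back: all traffic to N again
    return weight                # 1.0 -> N+1 is fully live

print(ramp(check_healthy=lambda: True))   # 1.0
print(ramp(check_healthy=lambda: False))  # 0.0
```

Keeping the N deployment running the whole time is what makes the rollback a weight change rather than a redeploy.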

Anomaly detection and automatic mitigation

Given that you have centralized, high-quality logging, you can now aim higher: for example, proactively predicting failures. Metrics are tracked by monitors and in logs, and various charts are built from them, so you can foresee what will go wrong:

With anomaly detection, you start looking into some of the clues that the service gives out. For example, a spike in CPU load might indicate that a hard drive is failing, and a spike in requests means you need to scale. This kind of statistical data allows you to make the service proactive.

With these insights, you can scale in any dimension and proactively and reactively change the characteristics of machines, databases, connections, and other resources.
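The simplest form of anomaly detection on such metrics is a z-score check: flag a value that sits far from the recent mean. Real systems use richer models (seasonality, forecasting), but this sketch shows the idea; the CPU numbers are made up.

```python
# Sketch: flag an anomaly when a metric deviates far from its recent mean,
# e.g. a sudden CPU spike. Real systems use richer statistical models.
import statistics

def is_anomaly(history, value, threshold=3.0):
    """True if `value` is more than `threshold` standard deviations away."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # avoid division by zero
    return abs(value - mean) / stdev > threshold

cpu_history = [41, 43, 40, 42, 44, 41, 43, 42]  # steady ~42% CPU
print(is_anomaly(cpu_history, 43))  # False: normal fluctuation
print(is_anomaly(cpu_history, 95))  # True: investigate before it fails
```

Wire the `True` branch to your alerting (or directly to the autoscaler) and the service starts reacting before users notice.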

That's all!

This priority list will spare you a lot of trouble if you are building a cloud service.

The author of the original article invites readers to leave comments and suggest changes. The article is distributed as open source; the author accepts pull requests on GitHub.

What else to read on the topic:

  1. Go and CPU caches
  2. Kubernetes in the spirit of piracy with a template for implementation
  3. Our channel Around Kubernetes in Telegram

Source: habr.com
