What I learned from testing 200 lines of infrastructure code


The IaC (Infrastructure as Code) approach consists not only of the code stored in the repository, but also of the people and processes that surround that code. Can approaches from software development be reused for managing and describing infrastructure? It's worth keeping this idea in mind while you read the article.


This is a transcript of my talk at DevOpsConf, 2019-05-28. Slides and video are available.

Infrastructure as bash history


Suppose you join a new project, and they tell you: "we have Infrastructure as Code". In reality it turns out to be Infrastructure as bash history, or, say, Documentation as bash history. This is a very real situation; a similar case was described by Denis Lysenko in his talk "How to replace the entire infrastructure and start sleeping peacefully", where he told how they got a coherent infrastructure on the project out of bash history.

With some stretching, one can say that Infrastructure as bash history is like code:

  1. reproducibility: you can take the bash history and execute the commands from it; maybe you will even get a working configuration as a result.
  2. versioning: you know who logged in and what they did; again, it is not guaranteed that this leads you to a working configuration.
  3. history: the record of who did what; except you will not be able to use it if you lose the server.

What to do?

Infrastructure as Code


Even such a strange case as Infrastructure as bash history can be dragged by the ears into Infrastructure as Code, but as soon as we want to do something more complicated than the good old LAMP server, we will conclude that this code needs to be somehow modified, changed, and refined. Next, let's consider the parallels between Infrastructure as Code and software development.

DRY


On a storage development project, there was a recurring subtask: configure the SDS. We release a new build, and it needs to be rolled out for further testing. The task is extremely simple:

  • ssh in here and run the command.
  • copy the file there.
  • edit the config here.
  • start the service there.
  • ...
  • PROFIT!

For the logic described, bash is more than enough, especially at the early stages of a project, when it is just starting. It's not bad that you use bash, but over time requests appear to deploy something similar, but slightly different. The first thing that comes to mind is copy-paste. And now we have two very similar scripts that do almost the same thing. Over time the number of scripts grew, and we were faced with the fact that there is some business logic of deploying an installation that needs to be kept in sync across the different scripts, which is quite difficult.


It turns out that there is a practice for this: DRY (Don't Repeat Yourself). The idea is to reuse existing code. It sounds simple, but we didn't get there right away. In our case it was a banal idea: separate configs from scripts. I.e. the business logic of how the installation is deployed lives separately, the configs live separately.
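A hedged sketch of what this separation could look like (the file names and variables here are invented for illustration): the deployment logic lives in one script, while everything installation-specific is carried out into a config file that the script sources.

#!/usr/bin/env bash
# deploy.sh: the business logic of rolling out a release (illustrative sketch)
set -euo pipefail

source "$1"   # per-installation settings, e.g. configs/staging.env

scp "release-${VERSION}.tar.gz" "${TARGET_HOST}:/tmp/"
ssh "${TARGET_HOST}" "tar -xzf /tmp/release-${VERSION}.tar.gz -C '${INSTALL_DIR}'"
ssh "${TARGET_HOST}" "systemctl restart '${SERVICE_NAME}'"

And a config then contains only data, no logic:

# configs/staging.env: data only, no logic
TARGET_HOST=staging01.example.com
VERSION=1.2.3
INSTALL_DIR=/opt/sds
SERVICE_NAME=sds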

SOLID for CFM


Over time the project grew, and its natural continuation was the appearance of Ansible. The main reasons were the expertise present in the team and the fact that bash is not designed for complex logic. Ansible, too, began to accumulate complex logic. To keep complex logic from turning into chaos, software development has principles of code organization: SOLID. Relatedly, Grigory Petrov in his talk "Why does an IT specialist need a personal brand" raised the point that a person is built in such a way that it is easier for them to operate with social entities; in software development these are objects. If you combine these two ideas and keep developing them, you will notice that SOLID can also be used in the infrastructure description, to make that logic easier to maintain and modify in the future.

The Single Responsibility Principle


Each class performs only one task.

No need to mix everything together and create monolithic god-object spaghetti monsters. The infrastructure should consist of simple building blocks. It turns out that if you split an Ansible playbook into small pieces, read: Ansible roles, they are easier to maintain.
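For example (a sketch; the role name is hypothetical), Ansible itself encourages this decomposition: ansible-galaxy init generates a role skeleton where tasks, defaults, and handlers live in separate, single-purpose files.

# generate a skeleton for a role that does exactly one job
ansible-galaxy init sds_config
# resulting layout (abridged):
#   sds_config/tasks/main.yml      <- what to do
#   sds_config/defaults/main.yml   <- tunable variables
#   sds_config/handlers/main.yml   <- restart/reload reactions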

The Open Closed Principle


The open/closed principle.

  • Open for extension: the behavior of an entity can be extended by creating new entity types.
  • Closed for modification: extending the behavior of an entity must not require changes to the code that uses these entities.

Initially we deployed the test infrastructure on virtual machines, but because the deployment business logic was separated from the implementation, we added bare-metal rollout without any problems.
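A minimal sketch of that separation (the script names are made up): provisioning is open for extension (a new target type is just a new script), while the deployment logic that follows stays closed for modification.

#!/usr/bin/env bash
set -euo pipefail

# choose an implementation; adding bare metal meant adding a script,
# not editing the deployment logic below
case "${TARGET_TYPE:?set TARGET_TYPE to vm or baremetal}" in
    vm)        ./provision_vm.sh ;;
    baremetal) ./provision_baremetal.sh ;;
    *)         echo "unknown TARGET_TYPE: ${TARGET_TYPE}" >&2; exit 1 ;;
esac

./deploy_sds.sh   # the same business logic for both target types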

The Liskov Substitution Principle


The substitution principle of Barbara Liskov: objects in a program must be replaceable with instances of their subtypes without altering the correctness of the program.

If you look more broadly, SOLID is not a feature of one particular project; it applies to CFM (configuration management) in general. For example, on another project we had to deploy a boxed Java application on top of various Javas, application servers, databases, OSes, etc. I will use this example to illustrate the remaining SOLID principles.

In our case, there is an agreement within the infrastructure team: if we have installed the ibmjava or oraclejava role, then we have a java executable binary. This is necessary because the roles above depend on this behavior: they expect java to be present. At the same time, it allows us to replace one implementation/version of java with another without changing the application deployment logic.

The catch here is that this cannot be enforced in Ansible itself; as a result, certain agreements appear within the team.
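Since Ansible cannot enforce the contract, one way to pin the team agreement down is a tiny check (a sketch, not from the original project) that any java role must pass; as long as it passes, the implementations remain substitutable.

#!/usr/bin/env bash
# contract check: whichever java role was applied (ibmjava, oraclejava, ...),
# a java executable must end up on PATH
if ! command -v java >/dev/null 2>&1; then
    echo "contract violated: no java executable found" >&2
    exit 1
fi
java -version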

The Interface Segregation Principle


The interface segregation principle: "Many client-specific interfaces are better than one general-purpose interface."

Initially we tried to cram all the variability of deploying the application into one Ansible playbook, but it was difficult to maintain. The approach where we specify an interface to the outside world (the client expects port 443) worked better: for a specific implementation, the infrastructure can then be assembled from separate bricks.
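The outside interface can then be checked on its own (a hedged sketch; the host name is hypothetical): the client cares only that port 443 answers, not which bricks implement it.

# the only promise made to the client: HTTPS answers on port 443
if curl --fail --insecure --silent --output /dev/null https://app.example.com:443/; then
    echo "interface OK"
else
    echo "interface broken" >&2
fi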

The Dependency Inversion Principle


The dependency inversion principle. High-level modules should not depend on low-level modules. Both types of modules must depend on abstractions. Abstractions should not depend on details. Details should depend on abstractions.

Here the example will be based on an antipattern.

  1. One of the customers had a private cloud.
  2. Inside the cloud, we ordered virtual machines.
  3. But due to the peculiarities of the cloud, the deployment of the application was tied to which hypervisor the VM got to.

I.e. the high-level application deployment logic leaked dependencies down to the hypervisor level, and that meant problems when reusing this logic. Don't do it this way.

Interaction


Infrastructure as Code is not only about the code itself, but also about the relationship between the code and people, about the interactions between infrastructure developers.

Bus Factor


Let's assume you have Vasya on the project. Vasya knows everything about your infrastructure. What happens if Vasya suddenly disappears? This is a very real situation: he could be hit by a bus. It happens. If knowledge about the code, its structure, how it works, the logins and passwords, is not distributed within the team, you can face a number of unpleasant situations. To minimize these risks and spread knowledge within the team, various approaches can be used.

Pair Devopsing


It is not like the joke where admins drank beer and changed passwords, but an analogue of pair programming. I.e. two engineers sit at one computer with one keyboard and start setting up your infrastructure together: configuring a server, writing an Ansible role, and so on. It sounds nice, but it didn't work for us. However, special cases of this practice did work. A new employee arrives; his mentor takes on a real task together with him, works alongside him, and transfers knowledge.

Another special case is an incident call. During a problem, a group of on-duty engineers and everyone involved gathers; one leader is appointed who shares their screen and voices their train of thought. The other participants follow the leader's reasoning, pick up tricks from the console, check that a line in the log hasn't been missed, and learn new things about the system. This approach worked more often than not.

Code Review


Subjectively, it was more efficient to spread knowledge about the infrastructure and how it is arranged through code review:

  • The infrastructure is described by the code in the repository.
  • Changes occur in a separate branch.
  • With a merge request, you can see the delta of infrastructure changes.

The highlight here was that reviewers were selected in turn, according to a schedule, so with some probability you would climb into a new piece of the infrastructure.

Code Style


Over time, squabbles began to appear during reviews, because reviewers had their own styles, and the rotation of reviewers stacked different styles against each other: 2 spaces or 4, camelCase or snake_case. Settling this did not happen right away.

  • The first idea was to recommend using a linter; after all, engineers are all smart. But different editors and OSes made it inconvenient.
  • This evolved into a bot that wrote to Slack about every problematic commit and attached the linter output. But in most cases there were more important things to do, and the code remained uncorrected.

Green Build Master


Time passed, and we came to the conclusion that commits that do not pass certain tests must not be let into master. Voila! We invented the green build master, which has long been practiced in software development:

  • Development happens in a separate branch.
  • Tests run against that branch.
  • If the tests do not pass, the code does not get into master.

Adopting this decision was very painful, because it caused a lot of controversy, but it was worth it: merge requests began coming to review without style disagreements, and over time the number of problem areas began to decrease.
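A minimal sketch of such a gate (paths and commands are illustrative): CI runs it for every branch, and only a zero exit code lets the merge request into master.

#!/usr/bin/env bash
# pre-merge gate: style first, then tests; any failure blocks the merge
set -euo pipefail

ansible-lint playbooks/*.yml                         # Ansible style and common mistakes
find . -name '*.sh' -print0 | xargs -0 shellcheck    # shell scripts
./run_tests.sh                                       # hypothetical test entry point

echo "green: allowed into master"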

IaC Testing


Besides style checking, there are other things you can verify, for example that your infrastructure can actually be deployed, or that changes to the infrastructure will not lead to losing money. Why might this be needed? The question is complex and philosophical; it is best answered with a tale: once there was an auto-scaler in PowerShell that did not check the boundary conditions => more VMs were created than necessary => the client spent more money than planned. Not pleasant, but it would have been quite possible to catch this error at earlier stages.

One might ask: why make a complex infrastructure even more complex? Infrastructure tests, just like code tests, are not about simplification but about knowing how your infrastructure is supposed to work.

IaC Testing Pyramid

(slide: the IaC testing pyramid: static analysis at the base, then unit, integration, and end-to-end tests)

IaC Testing: Static Analysis

If you immediately deploy the entire infrastructure and check that it works, it may turn out to take a lot of time and resources. So the base of the pyramid has to be something fast; there is a lot of it, and it covers many primitive problem spots.

Bash is tricky

Let's look at a trivial example: select all files in the current directory and copy them to another location. The first thing that comes to mind:

for i in * ; do 
    cp $i /some/path/$i.bak
done

What if there is a space in the file name? Well, ok, we are smart, we know how to use quotes:

for i in * ; do cp "$i" "/some/path/$i.bak" ; done

Well done? No! What if there is nothing in the directory and the glob doesn't expand?

find . -type f -exec mv -v {} dst/{}.bak \;

Now, well done? Nope... We forgot that a file name can contain a newline:

touch x
mv x "$(printf "foo\nbar")"   # a file whose name contains a newline

# NUL-separated names survive spaces, empty globs, and newlines:
find . -type f -print0 | xargs -0 mv -t /path/to/target-dir

Static analysis tools

The problem from the previous step could have been caught as soon as we forgot the quotes; there are plenty of tools for that in the wild, Shellcheck for example. In general there are a lot of them, and most likely you can find a linter for your stack for your IDE.

Language    Tool
bash        Shellcheck
Ruby        RuboCop
Python      Pylint
Ansible     Ansible Lint
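For instance, Shellcheck catches exactly the missing quotes from the loop earlier (output abridged; SC2086 is a real Shellcheck code, the file name is made up):

$ shellcheck copy.sh
In copy.sh line 2:
    cp $i /some/path/$i.bak
       ^-- SC2086: Double quote to prevent globbing and word splitting.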

IaC Testing: Unit Tests


As we saw from the previous example, linters are not omnipotent and cannot point out all the problem areas. Further, by analogy with testing in software development, we can recall unit tests: shunit, JUnit, rspec, pytest immediately come to mind. But what to do with Ansible, Chef, SaltStack, and others like them?

At the very beginning we talked about SOLID and the fact that our infrastructure should consist of small bricks. Their time has come.

  1. The infrastructure is divided into small bricks, for example Ansible roles.
  2. Some kind of environment is deployed, be it Docker or a VM.
  3. Our Ansible role is applied to this test environment.
  4. We check that everything worked as we expect (we run the tests).
  5. We decide ok or not ok.
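Spelled out as commands, that loop might look like this (a sketch: the role, playbook, and test file names are hypothetical; Ansible's docker connection plugin and Testinfra's docker backend are used):

# 2. bring up a disposable environment
docker run -d --name test-env centos:7 sleep infinity

# 3. apply the role to it
ansible-playbook -i 'test-env,' -c docker test_playbook.yml

# 4. run the checks
py.test --hosts='docker://test-env' test_myrole.py

# 5. the exit code decides ok / not ok; clean up
docker rm -f test-env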

IaC Testing: Unit Testing tools

The question is: what do tests for a CFM look like? You can simply run the script, or you can use ready-made solutions for this:

CFM         Tool
Ansible     Testinfra
Chef        Inspec
Chef        Serverspec
SaltStack   Goss

An example for Testinfra: check that the users test1 and test2 exist and are in the group sshusers:

def test_default_users(host):
    users = ['test1', 'test2']
    for login in users:
        assert host.user(login).exists
        assert 'sshusers' in host.user(login).groups

What to choose? The question is complex and ambiguous; here is an example of activity in the projects on GitHub for 2018-2019:

(chart: GitHub activity for Testinfra, Inspec, Serverspec, and Goss, 2018-2019)

IaC Testing frameworks

The question arises: how to put it all together and run it? You can do it yourself, given a sufficient number of engineers, or you can take ready-made solutions, although there are not very many of them:

CFM         Tool
Ansible     Molecule
Chef        Test Kitchen
Terraform   Terratest

An example of activity in these projects on GitHub for 2018-2019:

(chart: GitHub activity for Molecule, Test Kitchen, and Terratest, 2018-2019)

Molecule vs. Test Kitchen


Initially we tried using Test Kitchen:

  1. Create VMs in parallel.
  2. Apply Ansible roles.
  3. Run Inspec.

For 25-35 roles this took 40-70 minutes, which was too long.


The next step was the transition to Jenkins/Docker/Ansible/Molecule. Ideologically everything is the same:

  1. Lint playbooks.
  2. Lint roles.
  3. Run a container.
  4. Apply Ansible roles.
  5. Run Testinfra.
  6. Check idempotency.
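With Molecule, most of this sequence hides behind a single command (a sketch assuming the default scenario layout):

cd roles/myrole
molecule test       # lint, create a container, converge, check idempotence, verify, destroy

# individual steps are also available while developing:
molecule converge   # only apply the role
molecule verify     # only run the tests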


Linting for 40 roles and tests for a dozen began to take about 15 minutes.


What to choose depends on many factors: the stack used, the expertise in the team, and so on. Here everyone decides for themselves how to close the question of unit testing.

IaC Testing: Integration Tests


The next step in the infrastructure testing pyramid is integration tests. They are similar to unit tests:

  1. The infrastructure is divided into small bricks, such as Ansible roles.
  2. Some kind of environment is deployed, be it Docker or a VM.
  3. A bunch of Ansible roles is applied to this test environment.
  4. We check that everything worked as we expect (we run the tests).
  5. We decide ok or not ok.

Roughly speaking, we do not check the operation of a single element of the system, as in unit tests; we check how the server is configured as a whole.
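Such a check is written against the assembled server rather than a single role (a hedged sketch; the service, port, and user are made up):

#!/usr/bin/env bash
# the server as a whole: several roles must have cooperated correctly
set -e

systemctl is-active --quiet nginx   # the web role started the service
ss -tln | grep -q ':443 '           # the TLS role exposed the port
id app >/dev/null                   # the users role created the account

echo "server configured as a whole: OK"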

IaC Testing: End to End Tests


At the top of the pyramid we are met by End to End tests. That is, we do not check the operation of a separate server, a separate script, a separate brick of our infrastructure; we check that many servers combined together, our infrastructure, work as we expect. Unfortunately, I have not seen ready-made boxed solutions, probably because infrastructure is often unique and hard to template and build a testing framework for. As a result, everyone creates their own solution: there is demand, but no answer. So I will tell you what I have, to push others toward sensible thoughts, or to have my nose poked into the fact that all of this was invented long before us.


A project with a rich history, used in large organizations; probably each of you has indirectly crossed paths with it. The application supports many databases, integrations, etc. Knowledge of what the infrastructure might look like lives in many docker-compose files, and knowledge of which tests to run in which environment lives in Jenkins.


This scheme worked for quite a long time, until, as part of a research effort, we tried to port it to OpenShift. The containers remained the same, but the launch environment changed (hello DRY again).


The research idea went further, and in OpenShift there turned out to be such a thing as APB (Ansible Playbook Bundle), which lets you pack the knowledge of how to deploy infrastructure into a container. I.e. there is a reproducible, testable point of knowledge on how to deploy the infrastructure.


All this sounded good until we ran into a heterogeneous infrastructure: we also needed Windows for tests. As a result, the knowledge of what, where, and how to deploy and test lives in Jenkins.

Conclusion


Infrastructure as Code is:

  • Code in the repository.
  • Interaction of people.
  • Infrastructure testing.


Source: habr.com
