Starting from the second commit, any code becomes legacy, because the initial ideas begin to diverge from harsh reality. This is neither good nor bad; it is a given that is hard to argue with and has to be lived with. Refactoring is part of this process, and that includes refactoring Infrastructure as Code. So let the story begin of how to refactor Ansible for a year and not go crazy.
The birth of the Legacy
Day #1: Patient Zero
Once upon a time there was a hypothetical project. It had a Dev team of developers and Ops engineers, and both were solving the same problem: how to provision servers and run an application. The trouble was that each team solved this problem in its own way. So the project decided to adopt Ansible to synchronize knowledge between the Dev and Ops teams.
Day #89: The birth of the Legacy
Without noticing it themselves, everyone wanted to do things as well as possible, yet the result turned out to be legacy. How does this happen?
We have an urgent task here, let's do a dirty hack and fix it properly later.
Documentation can be omitted and everything is clear what is happening here.
I know Ansible / Python / Bash / Terraform! Watch what tricks I can pull off!
I copied this from Stack Overflow. I'm a Full Stack Overflow Developer: I don't know how it works, but it looks cool and solves the problem.
As a result, you get code of dubious quality: there is no documentation, it is unclear what it does or whether it is needed at all. The problem is that you still have to develop and extend it, propping it up with crutch after crutch and making the situation even worse.
The originally conceived and implemented IaC model no longer meets the requirements of users, the business, or other teams, and the time needed to make infrastructure changes stops being acceptable. At this moment comes the understanding that it is time to act.
IaC refactoring
Day #139: Do you really need refactoring?
Before you rush to refactor, you must answer a number of important questions:
Why do you need all this?
Do you have time?
Is knowledge enough?
If you don't know how to answer these questions, the refactoring will end before it even starts, or it may only make things worse. In my case, I already had relevant experience (see What I learned from testing 200 lines of infrastructure code), so when the project asked for help fixing the roles and covering them with tests, I took it on.
Day #149: Preparing the refactoring
The first thing is to prepare: decide what we are going to do. To do this we talk to people, find the problem spots, and work out ways to solve them. We then record the agreed concepts somewhere, for example in a Confluence article, so that when questions like "how should this be done?" or "which way is better?" come up, we do not stray from the course. In our case we followed the idea of divide and conquer: split the infrastructure into small pieces, or bricks. This approach lets you take an isolated piece of infrastructure, understand what it does, cover it with tests, and change it without fear of breaking something.
It turns out that infrastructure testing becomes the cornerstone, and here it is worth mentioning the infrastructure testing pyramid: exactly the same idea as in software development, but for infrastructure. We move from cheap, quick tests that check simple things, such as indentation, to expensive full-fledged tests that deploy the entire infrastructure.
Ansible testing attempts
Before describing how Ansible was covered with tests on this project, I will describe the attempts and approaches I happened to use earlier, to give context for the decisions that were made.
Day #-997: SDS provisioning
The first time I tested Ansible was on a project developing an SDS (Software Defined Storage). There is a separate article on this topic, How to break bicycles over crutches when testing your distribution, but in short, we ended up with an inverted testing pyramid: testing one role took 60-90 minutes, which is a long time. The basis was e2e tests, i.e. we deployed a full-fledged installation and then tested it. What made things worse was that we had reinvented the wheel. Still, I must admit this solution worked and allowed for stable releases.
Day #-701: Ansible and test kitchen
The next development of the Ansible testing idea was using ready-made tools, namely Test Kitchen / kitchen-ci and InSpec. The choice was driven by existing Ruby knowledge (for details, see the article on Habr: Do YML programmers dream of testing Ansible?). It worked faster, about 40 minutes for 10 roles. We created a pack of virtual machines and ran tests inside them.
Overall the solution worked, but it left a bad aftertaste because of the heterogeneity of the stack. When the number of tested roles grew to 13 base roles and 2 meta roles combining the smaller ones, the tests suddenly started taking 70 minutes, almost twice as long. It was hard to talk about XP (extreme programming) practices, because no one wants to wait 70 minutes. This was the reason for changing the approach.
Day #-601: Ansible and Molecule
Conceptually this is similar to Test Kitchen, only role testing moved to Docker and the stack changed. As a result, the time dropped to a stable 20-25 minutes for 7 roles.
After increasing the number of tested roles to 17 and linting 45 roles, this ran in 28 minutes on 2 Jenkins slaves.
Day #167: Adding Ansible tests to the project
Refactoring is unlikely to succeed in one fell swoop. The task must be measurable so that you can break it into small pieces and eat the elephant piece by piece, with a teaspoon. There must also be a way to tell whether you are moving in the right direction and how far there is still to go.
It does not really matter how you track this: you can write on a piece of paper, put stickers on the closet, create tasks in Jira, or open Google Docs and record the current status there. The point is that the process is not instantaneous; it will be long and tedious, and nobody wants you to run out of ideas, get tired, and burn out halfway through the refactoring.
Refactoring is simple:
Eat.
Sleep.
Code.
IaC test.
Repeat
and we repeat this until we reach the intended goal.
It may not be possible to start testing everything right away, so our first task was to start with linting and checking the syntax.
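As a concrete first step, linting can start from a minimal yamllint config; the rules and limits below are assumptions to illustrate the idea, not the project's actual settings:

```yaml
# .yamllint - a minimal starting point (hypothetical values)
extends: default
rules:
  # Legacy code often has long lines; relax the limit before tightening it later
  line-length:
    max: 160
  indentation:
    spaces: 2
```

Starting from a relaxed config and ratcheting the rules down over time keeps the first builds green while the team adapts.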
Day #181: Green Build Master
Linting is a small first step towards a Green Build Master. It breaks almost nothing, but it lets you debug the processes and get green builds in Jenkins. The idea is to instill habits in the team:
Red tests are bad.
When I come to fix something, I also leave the code a little better than it was before me.
Day #193: From lint to unit tests
Once the process of getting code into master is established, you can begin step-by-step improvement: replace plain linting with actually running the roles, even without checking idempotency yet. To do this, you need to understand how to apply the roles and how they work.
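With Molecule, such a stage where roles are converged but idempotency is not yet enforced can be sketched by trimming the scenario's test sequence (the exact sequence below is an assumption):

```yaml
# molecule/default/molecule.yml (fragment) - a sketch:
# run converge and verify, but skip the idempotence step for now
scenario:
  test_sequence:
    - lint
    - syntax
    - create
    - converge
    - verify
    - destroy
```

Re-adding the idempotence step to the sequence later turns the same scenario into a stricter unit test without other changes.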
Day #211: From unit to integration tests
When unit tests cover most of the roles and everything is linted, you can move on to adding integration tests, i.e. testing not a single brick of the infrastructure but a combination of them, for example a full-fledged instance configuration.
On jenkins, we generated a lot of stages that linted roles / playbooks in parallel, then unit tests in containers, and finally integration tests.
Jenkins + Docker + Ansible = Tests
Checkout repo and generate build stages.
Run lint playbook stages in parallel.
Run lint role stages in parallel.
Run syntax check role stages in parallel.
Run test role stages in parallel.
Lint role.
Check dependency on other roles.
Check syntax.
Create docker instance.
Run molecule/default/playbook.yml.
Check idempotency.
Run integration tests.
Finish.
Day #271: Bus Factor
At first, refactoring was done by a small group, just a couple of people, who reviewed the code going into master. Over time the team built up knowledge of how to write the code, and code review spread knowledge about the infrastructure and how it works. The highlight was that reviewers were assigned in turn, on a schedule, so with some probability you would end up digging into a new piece of the infrastructure.
And reviewing should be comfortable: it is convenient to review when you can see which task the change was made for and the history of the discussion. For this we integrated Jenkins + Bitbucket + Jira.
But a review as such is not a panacea: somehow, code that gave us flaky tests still made it into master.
Over time there were more tests and builds ran slower, up to an hour in the worst case. At one of the retros someone said something like "it's good that there are tests, but they are slow". As a result, we abandoned integration tests on virtual machines and adapted them to Docker to make things faster. We also replaced testinfra with the Ansible verifier to reduce the number of tools used.
Strictly speaking, there was a set of measures:
Moving to Docker.
Removing role tests duplicated because of dependencies.
Increasing the number of slaves.
Reordering test runs.
The ability to lint EVERYTHING locally with one command.
As a result, the Jenkins pipeline was also unified:
Generate build stages.
Lint all in parallel.
Run test role stages in parallel.
Finish.
Lessons Learned
Avoid global variables
Ansible uses global variables. There is a partial workaround in the form of private_role_vars, but it is not a panacea.
Let me give you an example. Suppose we have role_a and role_b:
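A minimal sketch of the situation (the role layout and variable names here are hypothetical):

```yaml
# role_a/vars/main.yml:
#   shared_var: "from role_a"
# role_b/vars/main.yml:
#   shared_var: "from role_b"
#
# role_a/tasks/main.yml and role_b/tasks/main.yml both contain:
#   - debug:
#       msg: "I see: {{ shared_var }}"

# playbook.yml
- hosts: all
  roles:
    - role_a
    - role_b
# Because role variables are visible to the whole play by default,
# both roles may end up seeing the same value of shared_var, and
# which one wins depends on variable precedence and on the order
# the roles are listed, not on which role "owns" the variable.
```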
The funny thing is that the result of the playbook will depend on things that are not always obvious, such as the order in which the roles are listed. Unfortunately, this is Ansible's nature, and the best you can do is adopt conventions, for example: inside a role, use only variables described in that role.
GOOD: Inside a role, prefix its variables with the role name; looking at the inventory, this makes it much easier to understand what is happening.
We agreed to use variable prefixes, and it does not hurt to check that the variables are defined as we expect and, for example, have not been overridden with an empty value.
GOOD: Check variables.
- name: "Verify that required string variables are defined"
  assert:
    that:
      - vars[ahs_var] is defined
      - vars[ahs_var] | length > 0
      - vars[ahs_var] != None
    fail_msg: "{{ ahs_var }} needs to be set for the role to work"
    success_msg: "Required variable {{ ahs_var }} is defined"
  loop_control:
    loop_var: ahs_var
  with_items:
    - ahs_item1
    - ahs_item2
    - ahs_item3
Avoid hashes/dictionaries, use a flat structure
If a role expects a hash/dictionary as one of its parameters, then to change a single child parameter we have to override the entire hash/dictionary, which increases configuration complexity.
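A small sketch of the difference (the variable names are made up):

```yaml
# BAD (hypothetical): to change only the port, the whole dict must be redefined
app_config:
  host: "127.0.0.1"
  port: 8983
  java_heap: "1g"

# GOOD: flat, prefixed variables can be overridden one at a time
app_config_host: "127.0.0.1"
app_config_port: 8983
app_config_java_heap: "1g"
```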
Create idempotent playbooks & roles
Roles and playbooks must be idempotent, because idempotency reduces configuration drift and the fear of breaking something. Fortunately, if you use Molecule, checking idempotency is the default behavior.
Avoid using command shell modules
Using the shell module leads to an imperative description paradigm instead of the declarative one that is the core of Ansible.
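A typical illustration, with made-up task content: the shell variant is imperative and not idempotent by itself, while the module variant declares the desired end state:

```yaml
# BAD (hypothetical): imperative, re-downloads on every run, ignores check mode
- name: Download artifact
  shell: curl -o /tmp/app.tar.gz https://example.com/app.tar.gz

# GOOD: declarative module, idempotent and check-mode friendly
- name: Download artifact
  get_url:
    url: https://example.com/app.tar.gz
    dest: /tmp/app.tar.gz
    mode: "0644"
```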
Test your roles via molecule
Molecule is a very flexible thing; let's look at a few scenarios.
Molecule Multiple instances
In molecule.yml, the platforms section lets you describe the set of hosts to spin up.
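For example, a multi-instance setup might be described like this (the host names and image are assumptions):

```yaml
# molecule/default/molecule.yml (fragment) - a sketch of multiple instances
driver:
  name: docker
platforms:
  - name: solr1
    image: centos:7
  - name: solr2
    image: centos:7
  - name: zookeeper1
    image: centos:7
provisioner:
  name: ansible
```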
In Molecule you can use Ansible itself to check that the instance was configured correctly; moreover, this has been the default verifier since release 3. It is not as flexible as testinfra/inspec, but we can still check that the contents of a file match our expectations:
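For instance, a verify playbook could read a file from the instance and assert on its contents (the file path and expected string below are assumptions):

```yaml
---
# molecule/default/verify.yml - a sketch of a content check
- name: Verify
  hosts: all
  tasks:
    - name: Read the rendered config
      slurp:
        src: /etc/blah/config.yml
      register: config_file

    - name: Assert it contains the expected line
      assert:
        that:
          - "'solr_home' in (config_file.content | b64decode)"
```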
Or deploy the service, wait for it to be available, and do a smoke test:
---
- name: Verify
  hosts: solr
  tasks:
    - command: /blah/solr/bin/solr start -s /solr_home -p 8983 -force
    - uri:
        url: http://127.0.0.1:8983/solr
        method: GET
        status_code: 200
      register: uri_result
      until: uri_result is not failed
      retries: 12
      delay: 10
    - name: Post documents to solr
      command: /blah/solr/bin/post -c master /exampledocs/books.csv
Put complex logic into modules & plugins
Ansible advocates a declarative approach, so when code branches, transforms data, or shells out, it becomes difficult to read. To keep it simple and understandable, it is worth fighting this complexity by writing your own modules and plugins.
Summarize Tips & Tricks
Avoid global variables.
Prefix role variables.
Use loop control variable.
Check input variables.
Avoid hashes/dictionaries, use a flat structure.
Create idempotent playbooks & roles.
Avoid using command shell modules.
Test your roles via molecule.
Put complex logic into modules & plugins.
Conclusion
You can't just take the infrastructure on a project and refactor it, even if you have IaC. It is a long process that requires patience, time, and knowledge.
UPD1 2020.05.01 20:30 - For initial profiling of playbooks you can use callback_whitelist = profile_tasks to see which tasks take a long time, and then go through the classic Ansible acceleration techniques. You can also try mitogen.
UPD2 2020.05.03 16:34 - English version