Starting from the second commit, any code becomes legacy, because the initial ideas begin to diverge from harsh reality. This is neither good nor bad; it is a given that is hard to argue with and has to be lived with. Refactoring is part of this process, and that includes refactoring Infrastructure as Code. So let the story begin of how to refactor Ansible for a year and not go crazy.
The birth of the Legacy
Day #1: Patient Zero
Once upon a time there was a hypothetical project. It had a Dev team of developers and Ops engineers, and both were solving the same problem: how to provision servers and run an application. The trouble was that each team solved this problem in its own way. So the project decided to adopt Ansible to synchronize knowledge between the Dev and Ops teams.
Day #89: The birth of the Legacy
Without noticing it themselves, everyone wanted to do things as well as possible, yet the result turned out to be legacy. How does this happen?
We have an urgent task here, let's do a dirty hack and fix it properly later.
Documentation can be omitted and everything is clear what is happening here.
I know Ansible / Python / Bash / Terraform! Watch what tricks I can pull off!
I copied this from Stack Overflow. I'm a Full Stack Overflow Developer: I don't know how it works, but it looks cool and solves the problem.
As a result, you get code of dubious quality: there is no documentation, it is unclear what it does or whether it is needed at all. The problem is that you still have to develop and extend it, propping it up with crutch after crutch and making the situation even worse.
The originally conceived and implemented IaC model no longer meets the requirements of users, the business, or other teams, and the time needed to make infrastructure changes stops being acceptable. At this moment comes the understanding that it is time to act.
IaC refactoring
Day #139: Do you really need refactoring?
Before you rush to refactor, you must answer a number of important questions:
Why do you need all this?
Do you have time?
Is knowledge enough?
If you don't know how to answer these questions, the refactoring will end before it even starts, or it may only make things worse. In my case, I already had relevant experience (see What I learned from testing 200 lines of infrastructure code), so when the project asked for help fixing the roles and covering them with tests, I took it on.
Day #149: Preparing the refactoring
The first thing is to prepare: decide what we are going to do. To do this we talk to people, find the problem spots, and work out ways to solve them. We then record the agreed concepts somewhere, for example in a Confluence article, so that when questions like "how should this be done?" or "which way is better?" come up, we do not stray from the course. In our case we followed the idea of divide and conquer: split the infrastructure into small pieces, or bricks. This approach lets you take an isolated piece of infrastructure, understand what it does, cover it with tests, and change it without fear of breaking something.
It turns out that infrastructure testing becomes the cornerstone, and here it is worth mentioning the infrastructure testing pyramid: exactly the same idea as in software development, but for infrastructure. We move from cheap, quick tests that check simple things, such as indentation, to expensive full-fledged tests that deploy the entire infrastructure.
Ansible testing attempts
Before describing how Ansible was covered with tests on this project, I will describe the attempts and approaches I happened to use earlier, to give context for the decisions that were made.
Day #-997: SDS provisioning
The first time I tested Ansible was on a project developing an SDS (Software Defined Storage). There is a separate article on this topic, How to break bicycles over crutches when testing your distribution, but in short, we ended up with an inverted testing pyramid: testing one role took 60-90 minutes, which is a long time. The basis was e2e tests, i.e. we deployed a full-fledged installation and then tested it. What made things worse was that we had reinvented the wheel. Still, I must admit this solution worked and allowed for stable releases.
Day #-701: Ansible and test kitchen
The next development of the Ansible testing idea was using ready-made tools, namely Test Kitchen / kitchen-ci and InSpec. The choice was driven by existing Ruby knowledge (for details, see the article on Habr: Do YML programmers dream of testing Ansible?). It worked faster, about 40 minutes for 10 roles. We created a pack of virtual machines and ran tests inside them.
Overall the solution worked, but it left a bad aftertaste because of the heterogeneity of the stack. When the number of tested roles grew to 13 base roles and 2 meta roles combining the smaller ones, the tests suddenly started taking 70 minutes, almost twice as long. It was hard to talk about XP (extreme programming) practices, because no one wants to wait 70 minutes. This was the reason for changing the approach.
Day #-601: Ansible and Molecule
Conceptually this is similar to Test Kitchen, only role testing moved to Docker and the stack changed. As a result, the time dropped to a stable 20-25 minutes for 7 roles.
After increasing the number of tested roles to 17 and linting 45 roles, this ran in 28 minutes on 2 Jenkins slaves.
Day #167: Adding Ansible tests to the project
Refactoring is unlikely to succeed in one fell swoop. The task must be measurable so that you can break it into small pieces and eat the elephant piece by piece, with a teaspoon. There must also be a way to tell whether you are moving in the right direction and how far there is still to go.
It does not really matter how you track this: you can write on a piece of paper, put stickers on the closet, create tasks in Jira, or open Google Docs and record the current status there. The point is that the process is not instantaneous; it will be long and tedious, and nobody wants you to run out of ideas, get tired, and burn out halfway through the refactoring.
Refactoring is simple:
Eat.
Sleep.
Code.
IaC test.
Repeat
and we repeat this until we reach the intended goal.
It may not be possible to start testing everything right away, so our first task was to start with linting and checking the syntax.
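As a concrete first step, linting can start from a minimal yamllint config; the rules and limits below are assumptions to illustrate the idea, not the project's actual settings:

```yaml
# .yamllint - a minimal starting point (hypothetical values)
extends: default
rules:
  # Legacy code often has long lines; relax the limit before tightening it later
  line-length:
    max: 160
  indentation:
    spaces: 2
```

Starting from a relaxed config and ratcheting the rules down over time keeps the first builds green while the team adapts.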
Day #181: Green Build Master
Linting is a small first step towards a Green Build Master. It breaks almost nothing, but it lets you debug the processes and get green builds in Jenkins. The idea is to instill habits in the team:
Red tests are bad.
When I come to fix something, I also leave the code a little better than it was before me.
Day #193: From lint to unit tests
Once the process of getting code into master is established, you can begin step-by-step improvement: replace plain linting with actually running the roles, even without checking idempotency yet. To do this, you need to understand how to apply the roles and how they work.
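With Molecule, such a stage where roles are converged but idempotency is not yet enforced can be sketched by trimming the scenario's test sequence (the exact sequence below is an assumption):

```yaml
# molecule/default/molecule.yml (fragment) - a sketch:
# run converge and verify, but skip the idempotence step for now
scenario:
  test_sequence:
    - lint
    - syntax
    - create
    - converge
    - verify
    - destroy
```

Re-adding the idempotence step to the sequence later turns the same scenario into a stricter unit test without other changes.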
Day #211: From unit to integration tests
When unit tests cover most of the roles and everything is linted, you can move on to adding integration tests, i.e. testing not a single brick of the infrastructure but a combination of them, for example a full-fledged instance configuration.
On jenkins, we generated a lot of stages that linted roles / playbooks in parallel, then unit tests in containers, and finally integration tests.
Jenkins + Docker + Ansible = Tests
Checkout repo and generate build stages.
Run lint playbook stages in parallel.
Run lint role stages in parallel.
Run syntax check role stages in parallel.
Run test role stages in parallel.
Lint role.
Check dependency on other roles.
Check syntax.
Create docker instance.
Run molecule/default/playbook.yml.
Check idempotency.
Run integration tests.
Finish.
Day #271: Bus Factor
At first, refactoring was done by a small group, just a couple of people, who reviewed the code going into master. Over time the team built up knowledge of how to write the code, and code review spread knowledge about the infrastructure and how it works. The highlight was that reviewers were assigned in turn, on a schedule, so with some probability you would end up digging into a new piece of the infrastructure.
And reviewing should be comfortable: it is convenient to review when you can see which task the change was made for and the history of the discussion. For this we integrated Jenkins + Bitbucket + Jira.
But a review as such is not a panacea: somehow, code that gave us flaky tests still made it into master.
Over time there were more tests and builds ran slower, up to an hour in the worst case. At one of the retros someone said something like "it's good that there are tests, but they are slow". As a result, we abandoned integration tests on virtual machines and adapted them to Docker to make things faster. We also replaced testinfra with the Ansible verifier to reduce the number of tools used.
Strictly speaking, there was a set of measures:
Moving to Docker.
Removing role tests duplicated because of dependencies.
Increasing the number of slaves.
Reordering test runs.
The ability to lint EVERYTHING locally with one command.
As a result, the Jenkins pipeline was also unified:
Generate build stages.
Lint all in parallel.
Run test role stages in parallel.
Finish.
Lessons Learned
Avoid global variables
Ansible uses global variables. There is a partial workaround in the form of private_role_vars, but it is not a panacea.
Let me give you an example. Suppose we have role_a and role_b:
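A minimal sketch of the situation (the role layout and variable names here are hypothetical):

```yaml
# role_a/vars/main.yml:
#   shared_var: "from role_a"
# role_b/vars/main.yml:
#   shared_var: "from role_b"
#
# role_a/tasks/main.yml and role_b/tasks/main.yml both contain:
#   - debug:
#       msg: "I see: {{ shared_var }}"

# playbook.yml
- hosts: all
  roles:
    - role_a
    - role_b
# Because role variables are visible to the whole play by default,
# both roles may end up seeing the same value of shared_var, and
# which one wins depends on variable precedence and on the order
# the roles are listed, not on which role "owns" the variable.
```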
The funny thing is that the result of the playbook will depend on things that are not always obvious, such as the order in which the roles are listed. Unfortunately, this is Ansible's nature, and the best you can do is adopt conventions, for example: inside a role, use only variables described in that role.
GOOD: Inside a role, prefix its variables with the role name; looking at the inventory, this makes it much easier to understand what is happening.
We agreed to use variable prefixes, and it does not hurt to check that the variables are defined as we expect and, for example, have not been overridden with an empty value.
GOOD: Check variables.
- name: "Verify that required string variables are defined"
  assert:
    that:
      - vars[ahs_var] is defined
      - vars[ahs_var] | length > 0
      - vars[ahs_var] != None
    fail_msg: "{{ ahs_var }} needs to be set for the role to work"
    success_msg: "Required variable {{ ahs_var }} is defined"
  loop_control:
    loop_var: ahs_var
  with_items:
    - ahs_item1
    - ahs_item2
    - ahs_item3
Avoid hashes/dictionaries, use a flat structure
If a role expects a hash/dictionary as one of its parameters, then to change a single child parameter we have to override the entire hash/dictionary, which increases configuration complexity.
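A small sketch of the difference (the variable names are made up):

```yaml
# BAD (hypothetical): to change only the port, the whole dict must be redefined
app_config:
  host: "127.0.0.1"
  port: 8983
  java_heap: "1g"

# GOOD: flat, prefixed variables can be overridden one at a time
app_config_host: "127.0.0.1"
app_config_port: 8983
app_config_java_heap: "1g"
```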
Create idempotent playbooks & roles
Roles and playbooks must be idempotent, because idempotency reduces configuration drift and the fear of breaking something. Fortunately, if you use Molecule, checking idempotency is the default behavior.
Avoid using command shell modules
Using the shell module leads to an imperative description paradigm instead of the declarative one that is the core of Ansible.
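A typical illustration, with made-up task content: the shell variant is imperative and not idempotent by itself, while the module variant declares the desired end state:

```yaml
# BAD (hypothetical): imperative, re-downloads on every run, ignores check mode
- name: Download artifact
  shell: curl -o /tmp/app.tar.gz https://example.com/app.tar.gz

# GOOD: declarative module, idempotent and check-mode friendly
- name: Download artifact
  get_url:
    url: https://example.com/app.tar.gz
    dest: /tmp/app.tar.gz
    mode: "0644"
```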
Test your roles via molecule
Molecule is a very flexible thing; let's look at a few scenarios.
Molecule Multiple instances
In molecule.yml, the platforms section lets you describe the set of hosts to spin up.
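For example, a multi-instance setup might be described like this (the host names and image are assumptions):

```yaml
# molecule/default/molecule.yml (fragment) - a sketch of multiple instances
driver:
  name: docker
platforms:
  - name: solr1
    image: centos:7
  - name: solr2
    image: centos:7
  - name: zookeeper1
    image: centos:7
provisioner:
  name: ansible
```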
In Molecule you can use Ansible itself to check that the instance was configured correctly; moreover, this has been the default verifier since release 3. It is not as flexible as testinfra/inspec, but we can still check that the contents of a file match our expectations:
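For instance, a verify playbook could read a file from the instance and assert on its contents (the file path and expected string below are assumptions):

```yaml
---
# molecule/default/verify.yml - a sketch of a content check
- name: Verify
  hosts: all
  tasks:
    - name: Read the rendered config
      slurp:
        src: /etc/blah/config.yml
      register: config_file

    - name: Assert it contains the expected line
      assert:
        that:
          - "'solr_home' in (config_file.content | b64decode)"
```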
Or deploy the service, wait for it to be available, and do a smoke test:
---
- name: Verify
  hosts: solr
  tasks:
    - command: /blah/solr/bin/solr start -s /solr_home -p 8983 -force
    - uri:
        url: http://127.0.0.1:8983/solr
        method: GET
        status_code: 200
      register: uri_result
      until: uri_result is not failed
      retries: 12
      delay: 10
    - name: Post documents to solr
      command: /blah/solr/bin/post -c master /exampledocs/books.csv
Put complex logic into modules & plugins
Ansible advocates a declarative approach, so when code branches, transforms data, or shells out, it becomes difficult to read. To keep it simple and understandable, it is worth fighting this complexity by writing your own modules and plugins.
Summarize Tips & Tricks
Avoid global variables.
Prefix role variables.
Use loop control variable.
Check input variables.
Avoid hashes/dictionaries, use a flat structure.
Create idempotent playbooks & roles.
Avoid using command shell modules.
Test your roles via molecule.
Put complex logic into modules & plugins.
Conclusion
You can't just take the infrastructure on a project and refactor it, even if you have IaC. It is a long process that requires patience, time, and knowledge.
UPD1 2020.05.01 20:30 - For initial profiling of playbooks you can use callback_whitelist = profile_tasks to see which tasks take a long time, and then go through the classic Ansible acceleration techniques. You can also try mitogen.
UPD2 2020.05.03 16:34 - English version