From a “Startup” to Thousands of Servers in a Dozen Data Centers: How We Kept Pace with the Growth of Our Linux Infrastructure

If your IT infrastructure is growing too fast, sooner or later you face a choice: linearly increase the human resources supporting it, or start automating. Up to a certain point we lived in the first paradigm, and then a long journey toward Infrastructure-as-Code began.


Of course, NSPK is not a startup, but that atmosphere reigned in the company during its first years, and those were very interesting years. My name is Dmitry Kornyakov; I have been maintaining Linux infrastructure with high availability requirements for over 10 years. I joined the NSPK team in January 2016 and, unfortunately, did not see the very beginning of the company's existence, but arrived at a stage of great changes.

Broadly speaking, our team delivers two products to the company. The first is the infrastructure itself: mail must flow, DNS must resolve, and domain controllers must let you onto servers that must never go down. The company's IT landscape is huge! These are business- and mission-critical systems, and the availability requirements for some of them reach 99.999%. The second product is the servers themselves, physical and virtual. Existing ones need to be monitored, and new ones regularly delivered to customers from many departments. In this article I want to focus on how we developed the infrastructure responsible for the server life cycle.

Beginning of a journey

At the beginning of the journey, our technology stack looked like this:
  • OS: CentOS 7
  • Domain controllers: FreeIPA
  • Automation: Ansible (+Tower), Cobbler

All of this was spread across 3 domains in several data centers. One data center housed office systems and test environments; the rest hosted PROD.

At some point, creating a server looked like this:

[Diagram: the server creation workflow at the time]

The CentOS VM template was kept minimal: the only thing baked into it was a correct /etc/resolv.conf; everything else was applied via Ansible.

Our CMDB was an Excel spreadsheet.

If the server was physical, then instead of cloning a virtual machine the OS was installed onto it using Cobbler: the MAC address of the target server was added to the Cobbler configuration, the server received an IP address via DHCP, and then the OS was laid down.
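
That registration step can itself be expressed in Ansible. Below is a minimal sketch using the community.general.cobbler_system module, assuming a reachable Cobbler server; the host names, profile, and addresses are invented for the example:

# Sketch: register a physical server in Cobbler so the OS is installed
# on its next PXE boot. All names and addresses are examples.
- name: Add the target server to Cobbler by MAC address
  community.general.cobbler_system:
    host: cobbler.domain.ru              # example Cobbler server
    username: cobbler
    password: "{{ cobbler_password }}"
    name: new-server01
    properties:
      profile: CentOS7-x86_64            # example kickstart profile
      netboot_enabled: true
    interfaces:
      eth0:
        macaddress: "00:11:22:33:44:55"
        ipaddress: 10.0.0.10
    state: present
    sync: true                           # regenerate DHCP/PXE configs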

At first we even tried to do some configuration management in Cobbler. But over time this began to cause problems with the portability of configurations, both to other data centers and to the Ansible code used to prepare VMs.

At the time, many of us perceived Ansible as a convenient Bash extension and did not skimp on constructs using shell and sed. In short: Bashsible. This eventually meant that if a playbook failed on a server for any reason, it was easier to delete the server, fix the playbook, and roll it again. There was effectively no versioning of scripts, and no portability of configurations either.
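
To make the difference concrete, here is a contrived before/after pair: the first task is the Bashsible style, the second describes the desired state and can be re-run safely (the parameter and file names are illustrative):

# "Bashsible": imperative, not idempotent, silently does nothing
# if the line is absent from the file
- name: Disable IP forwarding (old style)
  shell: sed -i 's/^net.ipv4.ip_forward.*/net.ipv4.ip_forward = 0/' /etc/sysctl.conf

# Declarative: describes the target state, safe to re-run
- name: Disable IP forwarding (state-based)
  sysctl:
    name: net.ipv4.ip_forward
    value: '0'
    sysctl_file: /etc/sysctl.d/99-defaults.conf
    state: present
    reload: yes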

For example, we wanted to change some config on all servers:

  1. Change the configuration on the existing servers in a logical segment / data center. Sometimes this takes more than one day: availability requirements and the law of large numbers do not allow applying all changes at once, and some changes are potentially destructive and require restarting anything from a service to the OS itself (see the rolling-update sketch after this list).
  2. Fix it in Ansible.
  3. Fix it in Cobbler.
  4. Repeat N times for each logical segment / data center.
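
Within a segment, Ansible itself can express this gradual application. A hedged sketch using serial batches with zero failure tolerance (the group and role names are illustrative):

# Roll a change out to one data center in growing batches,
# aborting on the first failure. Names are examples.
- hosts: dc1_linux_servers
  become: true
  serial:
    - 1          # canary host first
    - "10%"
    - "100%"
  max_fail_percentage: 0
  roles:
    - logging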

For all these changes to go smoothly, many factors had to be taken into account, and changes happen all the time:

  • Refactoring of the Ansible code and configuration files
  • Changes to internal best practices
  • Changes following incident/accident analysis
  • Changes to security standards, both internal and external; PCI DSS, for example, is updated with new requirements every year

The growth of infrastructure and the beginning of the journey

The number of servers / logical domains / data centers grew, and with it the number of configuration errors. At some point we arrived at three directions in which configuration management had to develop:

  1. Automation. As far as possible, human error in repetitive operations should be avoided.
  2. Repeatability. Infrastructure is much easier to manage when it is predictable. The configuration of servers and the tools for preparing them should be the same everywhere. This also matters for product teams: after testing, an application must be guaranteed to land in a production environment configured the same way as the test one.
  3. Simplicity and transparency of changes in configuration management.

All that remained was to add a couple of tools.

We chose GitLab CE as our code repository, not least for its built-in CI/CD modules.

Secret storage: HashiCorp Vault, not least for its great API.
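
For example, a playbook can fetch credentials at run time via the hashi_vault lookup. In this sketch the secret path and URL are invented, and the Vault token is taken from the VAULT_TOKEN environment variable:

# Sketch: read the vCenter password from Vault instead of storing it
# in the repository. Path and URL are examples.
- name: Fetch a secret from HashiCorp Vault
  set_fact:
    password_vc: "{{ lookup('hashi_vault', 'secret=secret/data/vcenter:password url=https://vault.domain.ru:8200') }}"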

Testing configurations and Ansible roles: Molecule + Testinfra. Tests run much faster if you hook Ansible up to Mitogen. In parallel, we started writing our own CMDB and an orchestrator for automatic deployment (shown above Cobbler in the diagram below), but that is a completely different story, which my colleague, the lead developer of these systems, will tell in the future.
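
For reference, a Molecule scenario config might look roughly like the sketch below; it assumes the Docker driver, and the Mitogen strategy path depends on where that package is installed on the runner:

# molecule/default/molecule.yml: converge a role in a CentOS 7 container
# and verify it with Testinfra. Image and paths are examples.
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: centos7-instance
    image: centos:7
provisioner:
  name: ansible
  config_options:
    defaults:
      # Mitogen speeds task execution up noticeably; adjust the path
      # to wherever ansible_mitogen is installed.
      strategy: mitogen_linear
      strategy_plugins: /usr/lib/python3/site-packages/ansible_mitogen/plugins/strategy
verifier:
  name: testinfra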

Our choice:

  • Molecule + Testinfra
  • Ansible + Tower + AWX
  • World of Servers + DITNET (own development)
  • Cobbler
  • GitLab + GitLab Runner
  • HashiCorp Vault

[Diagram: the resulting tool stack]

By the way, about Ansible roles: at first there was just one; after several refactorings there are 17 of them. I strongly recommend breaking the monolith into idempotent roles that can then be run separately; you can additionally add tags. We divided the roles by functionality: network, logging, packages, hardware, molecule, etc. In general, we adhered to the strategy below. I don't insist that it is the one and only truth, but it worked for us.
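
The resulting playbook is then just a list of roles, each individually addressable by tag. A simplified sketch (the role names are ours, the rest is illustrative):

# site.yml: every role is idempotent and can be run on its own via tags
- hosts: all
  become: true
  roles:
    - { role: network,  tags: ['network'] }
    - { role: logging,  tags: ['logging'] }
    - { role: packages, tags: ['packages'] }
    - { role: hardware, tags: ['hardware'] }

Running ansible-playbook site.yml --tags logging would then apply only the logging configuration.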

  • Copying servers from a golden image is evil! The main disadvantages: you do not know exactly what state the images are in now, and you have to make sure every change reaches every image in every virtualization farm.
  • Keep default configuration files to a minimum and agree with other departments that you are responsible for the key system files, for example (see the sketch after this list):
    1. Leave /etc/sysctl.conf empty; settings should live only in /etc/sysctl.d/: your defaults in one file, application-specific settings in another.
    2. Use override files to edit systemd units.
  • Template all configs and deploy them in their entirety; if possible, no sed or its analogues in playbooks.
  • Refactor the configuration management code:
    1. Break tasks into logical entities and rewrite the monolith into roles
    2. Use linters! ansible-lint, yamllint, etc.
    3. Change your approach! No Bashsible: describe the state of the system
  • For all Ansible roles, write tests in Molecule and generate reports once a day.
  • In our case, after preparing the tests (of which there are more than 100), about 70,000 errors were found. It took several months to fix them.
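
A hedged sketch of how points 1 and 2 above look as Ansible tasks; the file and unit names are illustrative, and the "reload systemd" handler is assumed to exist elsewhere:

# Our defaults live in their own file under /etc/sysctl.d/;
# /etc/sysctl.conf stays empty. File names are examples.
- name: Deploy our sysctl defaults
  template:
    src: 99-defaults.conf.j2
    dest: /etc/sysctl.d/99-defaults.conf
    owner: root
    group: root
    mode: '0644'

# Never edit the vendor unit file; override only the keys we need.
- name: Create an override directory for the unit
  file:
    path: /etc/systemd/system/rsyslog.service.d
    state: directory
    mode: '0755'

- name: Drop in a systemd override
  copy:
    dest: /etc/systemd/system/rsyslog.service.d/override.conf
    content: |
      [Service]
      Restart=always
      RestartSec=5
  notify: reload systemd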

Our implementation

So, the Ansible roles were ready, templated, and checked by linters, and Git repositories were set up everywhere. But the question of reliable code delivery to the different segments remained open. We decided to synchronize with scripts. It looks like this:

[Diagram: code delivery pipeline across segments]

When a change arrives, CI is triggered: a test server is created, the roles are applied and tested with Molecule, and if everything is OK, the code goes to the prod branch. But we do not apply new code to existing servers automatically. This is a kind of stopper that is necessary for the high availability of our systems. And when the infrastructure becomes huge, the law of large numbers comes into play: even if you are sure that a change is harmless, it can lead to sad consequences.
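
In .gitlab-ci.yml terms, the pipeline looks roughly like the sketch below; the job names and the sync script are illustrative rather than our exact configuration, and the manual gate reflects the stopper just described:

stages:
  - lint
  - test
  - deliver

lint:
  stage: lint
  script:
    - yamllint .
    - ansible-lint roles/

molecule:
  stage: test
  script:
    - molecule test        # create a test instance, apply roles, run Testinfra

deliver:
  stage: deliver
  only:
    - master
  when: manual             # no automatic rollout to existing servers
  script:
    - ./sync-to-segments.sh   # hypothetical script syncing code to the segments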

There are also many options for creating the servers themselves. We ended up choosing custom Python scripts; and for CI, Ansible:

# create1.yml
- name: Create a VM from a template
  vmware_guest:
    hostname: "{{ datacenter }}.domain.ru"   # vCenter endpoint
    username: "{{ username_vc }}"
    password: "{{ password_vc }}"
    validate_certs: no
    cluster: "{{ cluster }}"
    datacenter: "{{ datacenter }}"
    name: "{{ name }}"
    state: poweredon
    folder: "/{{ folder }}"
    template: "{{ template }}"
    customization:
      hostname: "{{ name }}"
      domain: domain.ru
      dns_servers:
        - "{{ ipa1_dns }}"
        - "{{ ipa2_dns }}"
    networks:
      - name: "{{ network }}"
        type: static
        ip: "{{ ip }}"
        netmask: "{{ netmask }}"
        gateway: "{{ gateway }}"
        wake_on_lan: True
        start_connected: True
        allow_guest_control: True
    wait_for_ip_address: yes
    disk:
      - size_gb: 1
        type: thin
        datastore: "{{ datastore }}"
      - size_gb: 20
        type: thin
        datastore: "{{ datastore }}"

This is what we have arrived at; the system continues to live and develop.

  • 17 Ansible roles for setting up a server. Each role solves a separate logical task (logging, auditing, user authorization, monitoring, etc.).
  • Role testing: Molecule + Testinfra.
  • Own development: CMDB + orchestrator.
  • Server creation takes ~30 minutes, is automated, and is practically independent of the task queue.
  • The same state/naming of the infrastructure in all segments: playbooks, repositories, virtualization elements.
  • A daily check of server state, with reports generated on discrepancies from the reference standard.

I hope my story is useful to those who are at the beginning of their journey. What automation stack do you use?

Source: habr.com