A thriller about setting up servers: no miracles, just Configuration Management

It was close to the New Year. Children all over the country had already sent their letters to Santa Claus or picked out gifts for themselves, and the main fulfiller of those wishes, one of the largest retailers, was preparing for the apotheosis of its sales. In December the load on its data center grows several times over, so the company decided to modernize the data center and commission several dozen new servers in place of equipment whose service life was coming to an end. This is where the fairy tale, against a backdrop of swirling snowflakes, ends and the thriller begins.

The equipment arrived on site a few months before the sales peak. The operations team, of course, knows how and what to configure on servers to bring them into the production environment. But we needed to automate this and eliminate the human factor. On top of that, the servers were being replaced ahead of the migration of a set of SAP systems critical to the company.

The commissioning of the new servers was strictly tied to a deadline, and moving it would have jeopardized both the shipment of a billion gifts and the migration of the systems. Even a team of Father Frost and Santa Claus combined could not change the date: the SAP warehouse management system can be migrated only once a year. From December 31 to January 1, the retailer's huge warehouses, the size of 20 football fields in total, stop for 15 hours, and this is the only window for the system to move. We had no room for error in commissioning the servers.

Let me clarify up front: this story describes the toolkit and the configuration management process that our team uses.

The configuration management stack consists of several layers. The key component is the CMS (configuration management system). In production operation, the absence of any one of the layers would inevitably lead to unpleasant miracles.

OS installation management

The first layer is a system that manages the installation of operating systems on physical and virtual servers. It creates baseline OS configurations and removes the human factor from the process.

With the help of this system, we got standard server instances with an OS ready for further automation. During provisioning, they received a minimal set of local users and public SSH keys, as well as a consistent OS configuration. We could reliably manage the servers through the CMS and were confident there were no surprises "below", at the OS level.
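
As an illustration, here is a minimal sketch of the kind of baseline task that can run right after the OS is installed; the service account name, group and key file are hypothetical, not taken from the article.

```yaml
---
# Sketch: bring a freshly installed server to the agreed baseline.
# The user name, group and key path are illustrative.
- hosts: new_servers
  become: true
  tasks:
    - name: Create the local automation user
      ansible.builtin.user:
        name: ansible_svc          # hypothetical service account
        groups: wheel
        append: true
        state: present

    - name: Deploy the public SSH key used by the CMS
      ansible.posix.authorized_key:
        user: ansible_svc
        key: "{{ lookup('file', 'files/ansible_svc.pub') }}"
        state: present
```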

The "maximum" goal for a plant management system is to automatically configure servers from the BIOS/Firmware level down to the OS. A lot here depends on the equipment and setup tasks. For heterogeneous equipment, consider REDFISH API. If all the hardware is from one vendor, then it is often more convenient to use ready-made management tools (for example, HP ILO Amplifier, DELL OpenManage, etc.).

To install the OS on physical servers, we used the well-known Cobbler, in which a set of installation profiles agreed with the operations team is defined. When adding a new server to the infrastructure, an engineer bound the server's MAC address to the required profile in Cobbler. On its first network boot, the server received a temporary address and a fresh OS. Then it was moved to the target VLAN and IP addressing and continued working there. Yes, changing the VLAN takes time and requires coordination, but it provides extra protection against accidentally installing a server into the production environment.
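
Registering such a server can also be scripted. A hedged sketch that drives the Cobbler CLI from Ansible; the system name, profile and MAC address are placeholders, and exact flags can differ between Cobbler versions.

```yaml
---
# Sketch: bind a new server's MAC address to an installation profile in Cobbler.
# The system name, profile and MAC address are placeholders.
- hosts: cobbler_server
  become: true
  tasks:
    - name: Register the new system with its installation profile
      ansible.builtin.command: >
        cobbler system add
        --name=srv-new-01
        --profile=rhel8-baseline
        --mac=AA:BB:CC:DD:EE:FF

    - name: Push the changes out to DHCP/TFTP
      ansible.builtin.command: cobbler sync
```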

We created virtual servers from templates prepared with HashiCorp Packer. The reason was the same: to rule out possible human error when installing the OS. Unlike with physical servers, though, Packer lets us do without PXE network boot and VLAN changes, which made creating virtual servers simpler and easier.
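
The article does not say which virtualization platform was used. Purely for illustration, assuming a VMware-based environment, deploying a virtual server from a Packer-built template might look roughly like this (all names, addresses and credentials are placeholders).

```yaml
---
# Sketch: clone a VM from a template previously built by Packer.
# vCenter address, credentials, datacenter, cluster and names are assumptions.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Deploy a virtual server from the golden template
      community.vmware.vmware_guest:
        hostname: "{{ vcenter_host }}"
        username: "{{ vcenter_user }}"
        password: "{{ vcenter_password }}"
        validate_certs: false
        datacenter: DC1
        cluster: Cluster1
        folder: /DC1/vm/new-servers
        name: srv-vm-01
        template: rhel8-packer-template
        state: poweredon
```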

Fig. 1. Managing OS installation.

Secret Management

Any configuration management system contains data that must be hidden from ordinary users but is needed to prepare the systems: passwords of local users and service accounts, certificate keys, various API tokens, and so on. They are usually called "secrets".

If you do not decide from the very beginning where and how to store these secrets, then, depending on how strict the information security requirements are, they are likely to end up:

  • directly in the configuration management code or in files in the repository;
  • in specialized configuration management tools (for example, Ansible Vault);
  • in CI/CD systems (Jenkins / TeamCity / GitLab / etc.) or in configuration management systems (Ansible Tower / Ansible AWX);
  • handed over "manually": for example, placed in an agreed location and then picked up by the configuration management system;
  • in various combinations of the above.

Each method has its drawbacks. The main one is the lack of secret access policies: it is impossible, or very difficult, to determine who may use which secrets. Another is the absence of access auditing and of a full life cycle. How do you quickly replace, for example, a public key that is written into the code and into a number of related systems?

We used HashiCorp Vault as a centralized secret store. It allowed us to:

  • keep secrets safe: they are encrypted, and even someone who gains access to the Vault database (for example, by restoring it from a backup) cannot read the secrets stored there;
  • organize secret access policies: users and applications get access only to the secrets "allocated" to them;
  • audit access to secrets: any action on a secret is recorded in the Vault audit log;
  • organize a full-fledged "life cycle" for secrets: they can be created, revoked, given an expiration date, and so on;
  • integrate easily with other systems that need access to secrets (a sketch of such an Ansible integration follows this list);
  • and also use end-to-end encryption, one-time passwords for the OS and databases, certificates issued by certificate authorities, etc.
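
For example, a playbook can fetch a secret from Vault at run time instead of keeping it in the repository. This is only a sketch: it assumes the community.hashi_vault collection, a KV secret at a hypothetical path, and Vault authentication (token or AppRole) supplied via the environment.

```yaml
---
# Sketch: read a secret from HashiCorp Vault at playbook run time.
# The Vault URL, secret path and field name are illustrative;
# authentication (e.g. VAULT_TOKEN) is assumed to come from the environment.
- hosts: db_servers
  vars:
    db_password: >-
      {{ lookup('community.hashi_vault.hashi_vault',
                'secret=secret/data/app/db:password url=https://vault.example.local:8200') }}
  tasks:
    - name: Use the secret without ever writing it into the repository
      ansible.builtin.debug:
        msg: "Fetched a {{ db_password | length }}-character password from Vault"
```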

Now let's move on to the centralized authentication and authorization system. We could have done without it, but administering users across many related systems is anything but trivial. We set up authentication and authorization through an LDAP service. Otherwise, in Vault alone you would have to continuously issue authentication tokens for users and keep track of them, and deleting or adding users would turn into a quest of "did I create/delete this account everywhere?"

So we add one more layer to the stack, secret management and central authentication/authorization:

Fig. 2. Managing secrets.

Configuration management

We have reached the core: the CMS itself. In our case, it is a bundle of Ansible and Ansible AWX.

Ansible could be replaced with Chef, Puppet, or SaltStack. We chose Ansible based on several criteria.

  • First, its versatility. The set of ready-made management modules is impressive, and if that is not enough, you can find more on GitHub and Ansible Galaxy.
  • Second, there is no need to install and maintain agents on the managed equipment, to prove that they do not interfere with the workload, or to confirm the absence of backdoors.
  • Third, Ansible has a low entry threshold. A competent engineer will write a working playbook literally on the first day of using the product (a minimal sketch follows this list).
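
To illustrate that point about the entry threshold, here is the kind of playbook one writes on day one; the kernel parameter and its value are purely illustrative.

```yaml
---
# Sketch: a typical "first day" playbook that enforces one kernel parameter.
# The parameter and its value are illustrative.
- hosts: app_servers
  become: true
  tasks:
    - name: Keep swapping to a minimum on application servers
      ansible.posix.sysctl:
        name: vm.swappiness
        value: "10"
        sysctl_set: true
        state: present
```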

But Ansible alone was not enough for us in a production environment; otherwise there would be many problems with restricting access and auditing administrators' actions. How do you restrict access, when each division needs to manage (read: run Ansible playbooks against) "its own" set of servers? How do you allow only certain employees to run specific Ansible playbooks? And how do you keep track of who ran a playbook, without creating a pile of local accounts on the servers and on the machines that run Ansible?

The lion's share of these issues is solved by Red Hat Ansible Tower, or its open-source upstream project Ansible AWX. That is what we chose for the customer.

One more touch to the portrait of our CMS: Ansible playbooks should be kept in a source code management system. Ours is GitLab CE.

So, the configurations themselves are managed by the Ansible / Ansible AWX / GitLab bundle (see Fig. 3). Naturally, AWX and GitLab are integrated with the single authentication system, and the Ansible playbooks are integrated with HashiCorp Vault. Configurations reach the production environment only through Ansible AWX, where all the "rules of the game" are defined: who can configure what, where the CMS gets the configuration management code from, and so on.

Fig. 3. Configuration management.

Test management

Our configuration is expressed as code. Therefore, we have to play by the same rules as software developers: we needed to organize development processes, continuous testing, and the delivery and application of the configuration code to the production servers.

If you don't do this from the start, the configuration roles you write will either stop being maintained and updated or will stop being run in production. The cure for this pain is well known, and it proved itself in this project:

  • each role is covered by unit tests;
  • tests run automatically on any change to the configuration management code;
  • changes in the configuration management code get into the production environment only after all tests and code review have been successfully passed.

Developing code and managing configurations became calmer and more predictable. To organize continuous testing, we used the GitLab CI/CD toolkit, and as the framework for organizing the tests we took Ansible Molecule.
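
To give a feel for the setup, here is a hedged sketch of a role's Molecule scenario configuration (molecule/default/molecule.yml). The Docker image name is a placeholder, and exact keys may differ between Molecule versions.

```yaml
---
# Sketch: molecule/default/molecule.yml for one Ansible role.
# The container image is a placeholder pointing at an internal registry.
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: registry.example.local/rhel8-base:latest
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
```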

For any change in the configuration management code, GitLab CI/CD calls Molecule, which:

  • checks the syntax of the code,
  • spins up a Docker container,
  • applies the changed code to that container,
  • checks the role for idempotence and runs the tests for this code (the granularity here is the Ansible role, see Fig. 4; the pipeline definition itself is sketched below).
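
The pipeline definition can stay very small. A hedged sketch of a .gitlab-ci.yml, assuming a runner image that already contains Ansible, Molecule and access to Docker (the image and job names are placeholders):

```yaml
---
# Sketch: .gitlab-ci.yml that runs the Molecule scenario on every push.
# The runner image is a placeholder and must provide Ansible, Molecule and Docker.
stages:
  - test

molecule:
  stage: test
  image: registry.example.local/molecule-runner:latest
  script:
    - molecule test   # syntax check, create, converge, idempotence, verify, destroy
```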

We delivered configurations to the production environment using Ansible AWX. Operations engineers applied configuration changes through predefined job templates. AWX itself "requested" the latest version of the code from the GitLab master branch on every run, so we ruled out the use of untested or outdated code in production. Naturally, code made it into the master branch only after testing, review, and approval.
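
This AWX setup can itself be kept as code. A hedged sketch using the awx.awx collection, where the object names, organization, inventory and GitLab URL are placeholders and the AWX connection credentials are assumed to come from environment variables:

```yaml
---
# Sketch: describe the AWX project and job template as code.
# All names and URLs are placeholders; AWX connection settings
# (controller host, token) are assumed to come from the environment.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Project that pulls the latest code from the GitLab master branch
      awx.awx.project:
        name: cms-configs
        organization: Default
        scm_type: git
        scm_url: https://gitlab.example.local/infra/cms-configs.git
        scm_branch: master
        scm_update_on_launch: true   # refresh the code on every job run
        state: present

    - name: Template that operations engineers launch to apply configurations
      awx.awx.job_template:
        name: apply-baseline
        organization: Default
        project: cms-configs
        inventory: production
        playbook: site.yml
        job_type: run
        state: present
```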

Fig. 4. Automated testing of roles in GitLab CI/CD.

There is another problem related to operating production systems. In real life, it is very hard to make configuration changes only through the CMS code. There are emergencies when an engineer has to change a configuration "here and now", without waiting for code edits, testing, approval, and so on.

As a result, manual changes create discrepancies in the configuration of identical equipment (for example, HA cluster nodes end up with different sysctl settings), or the actual configuration on the equipment diverges from the one defined in the CMS code.

Therefore, in addition to continuous testing, we check production environments for configuration drift. We chose the simplest option: running the CMS configuration code in "dry run" mode, that is, without applying changes but with notification of any discrepancies between the planned and the actual configuration. We implemented this by periodically running all Ansible playbooks with the "--check" option on the production servers. As always, Ansible AWX is responsible for launching the playbooks and keeping them up to date (see Fig. 5):

Fig. 5. Checking for configuration discrepancies in Ansible AWX.
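
A hedged sketch of how such a periodic "dry run" might be expressed, again with the awx.awx collection; the names and the recurrence rule are placeholders.

```yaml
---
# Sketch: a check-mode job template plus a nightly schedule for drift detection.
# Names and the recurrence rule are placeholders.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Same playbook, but launched in check mode only
      awx.awx.job_template:
        name: drift-check-baseline
        organization: Default
        project: cms-configs
        inventory: production
        playbook: site.yml
        job_type: check            # equivalent of ansible-playbook --check
        state: present

    - name: Run the drift check every night
      awx.awx.schedule:
        name: nightly-drift-check
        unified_job_template: drift-check-baseline
        rrule: "DTSTART:20240101T020000Z RRULE:FREQ=DAILY;INTERVAL=1"
        state: present
```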

After the checks, AWX sends a drift report to the administrators. They investigate the problematic configuration and then fix it through adjusted playbooks. This is how we keep the configuration in the production environment and in the CMS up to date and in sync, and avoid unpleasant "miracles" when the CMS code is applied to live production servers.

We now have one more important layer, test management, consisting of Ansible AWX, GitLab, and Molecule (see Fig. 6).

Fig. 6. Test management.

Difficult? I won't argue. But this configuration management stack has become a comprehensive answer to many questions around automating server configuration. Now the retailer always has a strictly defined configuration for its typical servers. Unlike an engineer, the CMS will not forget to add the necessary settings, create users, or perform the dozens or hundreds of other required settings.

There are no "secret knowledge" in the settings of servers and environments today. All the necessary features are reflected in the playbook. No more creativity and vague instructions: "set as normal Oracle, but there you need to set a couple of sysctl settings, and add users with the desired UID. Ask the guys from exploitation, they know».

The ability to detect configuration drift and correct it in advance brings peace of mind. Without a configuration management system, things usually look different: problems accumulate until one day they go off in production. Then a debriefing is held, configurations are checked and corrected, and the cycle repeats.

And, of course, we cut the time to bring a server into operation from several days down to a few hours.

Well, on New Year's Eve itself, while children happily unwrapped their gifts and adults made wishes as the clock chimed, our engineers migrated the SAP system to the new servers. Even Santa Claus would say that the best miracles are well-prepared ones.

P.S. Our team often sees customers who want to solve configuration management as simply as possible, ideally by magic, with a single tool. But in real life things are more complicated (yes, the silver bullets have failed to arrive yet again): you have to build an entire process using the tools that suit the customer's team.

Author: Sergey Artemov, architect, DevOps Solutions department, Jet Infosystems

Source: habr.com
