How to build full-fledged in-house development with DevOps: the VTB experience

DevOps practices work. We saw this for ourselves when we cut release installation time by a factor of ten: in the FIS Profile system we use at VTB, installation now takes 10 minutes instead of 90. Release build time has dropped from two weeks to two days, and the number of persistent implementation defects has fallen almost to zero. To get away from manual labor and eliminate dependence on the vendor, we had to work through crutches and find unexpected solutions. Below is a detailed story of how we built full-fledged internal development.

Prologue: DevOps is a philosophy

Over the past year we have done a great deal of work to set up in-house development and implement DevOps practices at VTB:

  • Built internal development processes for 12 systems;
  • Launched 15 pipelines, four of which went to production;
  • Automated 1445 test scenarios;
  • Successfully shipped a number of releases prepared by in-house teams.

One of the hardest systems to bring in-house and cover with DevSecOps practices was FIS Profile, a retail product processor built on a non-relational DBMS. Nevertheless, we managed to build up development, launch the pipeline, install individual off-release packages in production, and learn to assemble releases. The task was not easy, but interesting, and it came with no obvious implementation constraints: here is the system, build in-house development around it. The only condition was to use CD in front of the production environment.

At first, the implementation algorithm seemed simple and clear:

  • Build up initial development expertise and get the team's code to an acceptable quality level, free of gross defects;
  • Integrate into the existing processes as much as possible;
  • To move code between the obvious stages, lay a pipeline and anchor one end of it in production.

Meanwhile, a development team of the required size builds up its skills and grows its share of the release content to an acceptable level. At that point the task can be considered complete.

It looked like a fairly low-effort path to the desired result: here is DevOps, here are the team's performance metrics, here is the accumulated expertise... But in practice we got yet another confirmation that DevOps is first of all a philosophy, not "a process bolted onto GitLab, Ansible, Nexus, and so on."

Having analyzed the action plan once again, we realized we were building a kind of vendor outsourcing inside our own company. So we added process reengineering to the algorithm above, along with building expertise across the entire development route, in order to take a leading role in the process. Not the easiest option, but the ideologically correct way to develop.
 

How to start in-house development

The system we had to work with was far from friendly. Architecturally, it was one large non-relational DBMS made up of many separate executable objects (scripts, procedures, batches, and so on) that were called as needed, and it worked as a black box: request in, response out. Other difficulties worth noting:

  • An exotic language (MUMPS);
  • A console interface;
  • No integration with popular automation tools and frameworks;
  • Tens of terabytes of data;
  • A load of over 2 million operations per hour;
  • Business-critical significance.

At the same time, there was no source code repository on our side. At all. Documentation was available, but all key knowledge and competencies lived with the external organization.

We started mastering development on the system practically from scratch, taking into account its peculiarities and low prevalence. The work began in October 2018:

  • Studied the documentation and the basics of code generation;
  • Took the short development course provided by the vendor;
  • Mastered initial development skills;
  • Compiled a training manual for new team members;
  • Agreed on including the team in live ("combat") operations;
  • Resolved the question of code quality control;
  • Set up a development environment.

We spent three months building expertise and immersing ourselves in the system, and from the beginning of 2019 in-house development started moving toward a brighter future: sometimes with a creak, but confidently and purposefully.

Repository migration and autotests

The first DevOps task is the repository. Access was granted quickly, but we had to migrate from the existing SVN with its single trunk branch to our target Git, moving to a multi-branch model and developing a Git flow. On top of that, we have two teams with their own infrastructure, plus part of the vendor's team abroad. We had to live with two Gits and keep them synchronized. In that situation it was the lesser of two evils.

The repository migration was postponed repeatedly and was completed only by April, not without help from colleagues on the front line. With Git flow we decided not to get clever at first and settled on the classic scheme with hotfix, develop, and release branches. We decided to drop master (aka prod-like); below we explain why this option turned out optimal for us. As the working repository we used an external one owned by the vendor, shared by the two teams; it synchronized with the internal repository on a schedule. Now, with Git and GitLab, processes could be automated.
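The scheduled sync between the two repositories boils down to comparing branch heads and updating whichever branches diverge. A minimal sketch of that decision logic, with invented branch names and commit ids (the real job would then run the corresponding `git fetch`/`git push` commands):

```python
# Sketch of the scheduled two-repository sync described above.
# Given branch -> head-commit maps for the external (vendor) repository and
# the internal mirror, decide which branches the mirror must update.

def branches_to_sync(external: dict, internal: dict) -> list:
    """Return branches whose head in the internal repo differs from the external one."""
    return sorted(
        branch for branch, head in external.items()
        if internal.get(branch) != head
    )

# Example: develop has moved in the external repo, the release branch has not.
external = {"develop": "a1b2c3", "release/2019.11": "d4e5f6"}
internal = {"develop": "000111", "release/2019.11": "d4e5f6"}
print(branches_to_sync(external, internal))  # ['develop']
```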

The autotest question was solved surprisingly easily: we were given a ready-made framework. Given the system's peculiarities, calling a single operation was both an understandable part of a business process and, at the same time, a unit test. It remained to prepare test data and set the desired order of script calls with evaluation of the results. As the list of scenarios (formed from statistics on operation execution, process criticality, and the existing regression methodology) filled out, automated tests began to appear. Now we could start building the pipeline.
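The autotest idea above can be sketched as a runner that executes operations in a fixed order and checks each result. The scenario structure, operation names, and the `execute` callback standing in for the vendor-provided framework are all illustrative:

```python
# Minimal sketch of the scenario runner: each scenario is an ordered list of
# operations (script calls) with expected results; the runner executes them
# in order and reports the first mismatch.

def run_scenario(operations, execute):
    """operations: list of (script_name, input_data, expected_output).
    execute: callable that invokes a single system operation."""
    for step, (script, data, expected) in enumerate(operations, start=1):
        actual = execute(script, data)
        if actual != expected:
            return f"step {step} ({script}): expected {expected!r}, got {actual!r}"
    return "PASSED"

# Toy executor standing in for the real framework call.
def fake_execute(script, data):
    return {"OPEN-ACCOUNT": "OK", "POST-ENTRY": "OK"}.get(script, "ERROR")

scenario = [("OPEN-ACCOUNT", {"client": 1}, "OK"),
            ("POST-ENTRY", {"amount": 10}, "OK")]
print(run_scenario(scenario, fake_execute))  # PASSED
```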

How it was: the model before automation

The implementation process model we inherited is a story of its own. Each revision was transferred manually as an independent incremental installation package, followed by manual registration in Jira and manual installation on the environments. For individual packages everything looked clear, but preparing a release was more complicated.

Assembly was done at the level of individual deliveries, which were independent objects. Any change meant a new delivery. On top of the 10-15 packages of the main release composition came 60-70 technical versions: versions obtained by adding something to, or excluding something from, the release, reflecting off-release changes in production.

Objects within the deliveries overlapped, especially in executable code, of which less than half was unique. There were many dependencies both on code already delivered and on code only planned for installation.

To end up with the right version of the code, the installation order had to be followed strictly; during it, objects were physically overwritten many times, some 10-12 times.

After installing a batch of packages, configuration parameters had to be initialized manually, following written instructions. The release was assembled and installed by the vendor. The release composition kept changing almost until the moment of implementation, which forced the creation of "decoupling" packages. As a result, a significant share of deliveries migrated from release to release with its tail of "decouplings".

Now it is clear that with this approach (assembling the release puzzle at the package level) a single master branch made no practical sense. Installing on production took one and a half to two hours of manual labor. At least the installer enforced an object-processing order: fields and structures went in before the data for them and the procedures. However, this only worked within a single package.
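The installer's ordering rule mentioned above (structural definitions first, then data, then executable code) can be sketched as a stable sort by installation phase. The object type names and sample objects are invented:

```python
# Sketch of the installer's object-ordering rule: fields and structures are
# installed before the data that depends on them, and procedures come last.

ORDER = {"field": 0, "structure": 0, "data": 1, "procedure": 2}

def install_order(objects):
    """objects: list of (name, type); stable-sort by installation phase."""
    return sorted(objects, key=lambda obj: ORDER[obj[1]])

package = [("CALC-INTEREST", "procedure"), ("ACC-TABLE", "structure"),
           ("RATES", "data"), ("ACC-NO", "field")]
print([name for name, _ in install_order(package)])
# ['ACC-TABLE', 'ACC-NO', 'RATES', 'CALC-INTEREST']
```

Note that, as the text says, this guarantee held only within a single package; across packages the order was a matter of manual discipline.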

The logical result of this approach was inevitable installation defects: wrong object versions, redundant code, missing instructions, and unaccounted mutual influences between objects, all of which were feverishly eliminated after the release.

First Updates: Build by Commit and Delivery

Automation began with pushing the code through a pipe along this route:

  • Pick up the finished delivery from storage;
  • Install it on a dedicated environment;
  • Run the autotests;
  • Evaluate the installation result;
  • Call the next pipeline on the test team's side.
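The five steps above can be sketched as a chain of stages where the first failure stops the run. In reality each stage was a pipeline job; here they are modeled as plain functions with invented names and messages:

```python
# Hedged sketch of the pipeline route listed above: stages run in order,
# and the driver stops and reports on the first failure.

def run_pipeline(delivery, stages):
    """stages: list of (name, callable); each callable returns (ok, detail)."""
    for name, stage in stages:
        ok, detail = stage(delivery)
        if not ok:
            return f"FAILED at {name}: {detail}"
    return "OK: downstream test pipeline triggered"

stages = [
    ("fetch",     lambda d: (True, "package downloaded from storage")),
    ("install",   lambda d: (True, "installed on dedicated environment")),
    ("autotests", lambda d: (True, "scenarios executed")),
    ("evaluate",  lambda d: (True, "installation log clean")),
    ("notify",    lambda d: (True, "readiness mail sent to the test team")),
]
print(run_pipeline({"package": "P-001"}, stages))
# OK: downstream test pipeline triggered
```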

The next pipeline registers the issue in Jira and waits for the command to roll the code out to the selected test circuits, which depend on the issue's implementation timeline. The trigger was an email about delivery readiness sent to a given address. This was, of course, an obvious crutch, but we had to start somewhere. Since May 2019, code transfer with checks on our environments has been running. The process was under way; it remained to bring it into decent shape:

  • Each revision is done on a separate branch that corresponds to the installation package and is merged into the target master branch;
  • The pipeline trigger is the appearance of a new commit in master via a merge request, which is closed by maintainers from the in-house team;
  • The repositories synchronize every five minutes;
  • Assembly of the installation package starts, using the assembler received from the vendor.

After that came the already existing steps for checking and transferring the code, running the pipe, and building on our side.

This option was launched in July. The difficulties of the transition caused some dissatisfaction among the vendor and the front line, but over the next month we managed to smooth out all the rough edges and fine-tune the process within the teams. We now had build-on-commit and delivery.

In August we performed the first production installation of an individual package through our pipeline, and since September all installations of individual out-of-release packages, without exception, have gone through our CD tool. In addition, we achieved a 40% share of in-house tasks in the release composition with a smaller team than the vendor's, which is a definite success. The most serious task remained: building and installing the release.

Final Solution: Cumulative Installation Packages 

We understood perfectly well that scripting the vendor's instructions was mediocre automation; the process itself had to be rethought. The solution lay on the surface: assemble a cumulative delivery from the release branch containing all objects in the required versions.

We started with a proof of concept: we assembled a release package matching the composition of the previous implementation and installed it on our environments. Everything worked; the concept proved viable. Next we scripted the initialization of settings and included it in the commit. We prepared a new package and tried it on the test environments as part of a circuit update. The installation succeeded, albeit with a wide range of comments from the implementation team. But the main thing was that we got the go-ahead to go to production with our build in the November release.

A little over a month remained, and the hand-assembled deliveries hinted transparently that time was running out. We decided to build from a release branch, but what should we branch it from? We had no prod-like branch, and the existing branches were no good: too much extra code. We urgently needed to cut a prod-like branch, which meant over three thousand commits; collecting them by hand was not an option at all. We sketched a script that walks through the production installation log and collects the commits into a branch. On the third run it worked correctly, and after some "filing down" the branch was ready.
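The core of such a script is extracting commit ids from the installation log in installation order while dropping repeats (the same object, and hence the same commit, was installed many times). The log format, field positions, and ids below are assumptions; the real script would then cherry-pick the resulting list onto an empty branch:

```python
# Sketch of reconstructing a prod-like branch from the production
# installation log: collect commit ids in installation order, once each.

def commits_from_install_log(log_lines):
    seen, ordered = set(), []
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "INSTALLED":
            commit = parts[2]
            if commit not in seen:      # objects were overwritten many times
                seen.add(commit)
                ordered.append(commit)
    return ordered

log = [
    "2019-10-01 INSTALLED a1b2c3 package=P-101",
    "2019-10-02 INSTALLED d4e5f6 package=P-102",
    "2019-10-02 INSTALLED a1b2c3 package=P-102",  # re-delivered object
]
print(commits_from_install_log(log))  # ['a1b2c3', 'd4e5f6']
```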

We wrote our own installation-package assembler; it took a week. Then we had to modify the installer from the system's core functionality, which was possible because it is open source. After a series of checks and improvements, the result was deemed successful. Meanwhile the release composition took shape, and for its correct installation the test circuit had to be aligned with the production one; a separate script was written for this.

Naturally, there were many comments on the first installation, but on the whole the code went in. From about the third installation everything started to look good. Control of the composition and object versioning was tracked separately in manual mode, which at this stage was entirely justified.

An additional complication was the large number of off-release changes that had to be taken into account. But with the prod-like branch and rebase, the task became transparent.

Right the first time, quickly and without errors

We approached the release with an optimistic attitude and more than a dozen successful installations on various circuits behind us. But literally a day before the deadline, it turned out that the vendor had not finished preparing the release for installation in the accepted way. If our build failed for any reason, the release would be disrupted, and by our own efforts, which was especially unpleasant. We had no way out, so we considered alternative options, prepared contingency plans, and started the installation.

Surprisingly, the entire release of more than 800 objects went in correctly, on the first try, in just 10 minutes. We then spent an hour combing the logs for errors and found none.

The whole next day there was silence in the release chat: no implementation problems, wrong versions, or "stray" code. It was even a little awkward. A few comments surfaced later, but compared with other systems and previous experience, their number and priority were noticeably lower.

An additional effect of the cumulative approach was improved build and testing quality. Thanks to multiple installations of the full release, build defects and deployment errors were caught in time, and testing on full-release configurations additionally revealed defects in objects' mutual influence that never surfaced during incremental installations. It was definitely a success, especially considering our 57% contribution to the release.

Results and conclusions

In less than a year we managed to:

  • Build full-fledged in-house development on an exotic system;
  • Eliminate critical vendor dependency;
  • Launch CI/CD for a very unfriendly legacy system;
  • Raise implementation processes to a new technical level;
  • Significantly reduce deployment time;
  • Significantly reduce the number of implementation errors;
  • Confidently establish ourselves as a leading development expert.

Of course, much of what we described looks like outright crutch-building, but these are the features of the system and of the process constraints around it. At the moment the changes cover the products and services of FIS Profile (master accounts, plastic cards, savings accounts, escrow, cash loans), but the approach can potentially be applied to any system facing the task of implementing DevOps practices. The cumulative model can safely be replicated for subsequent implementations (including out-of-release ones) assembled from numerous deliveries.

Source: habr.com
