From scripts to our own platform: how we automated development at CIAN

At RIT 2019, our colleague Alexander Korotkov gave a talk about development automation at CIAN: to make life and work easier, we use our own platform, Integro. It tracks the life cycle of tasks, relieves developers of routine operations and significantly reduces the number of bugs in production. In this post we expand on Alexander's talk and describe how we went from simple scripts to tying open source products together through our own platform, and what our dedicated automation team does.
 

Zero level

"There is no level zero; I know of no such thing."
Master Shifu, from the animated film "Kung Fu Panda"

Automation at CIAN began 14 years after the company was founded. At that time there were 35 people in the development team. Hard to believe, right? Some automation existed before then, of course, but continuous integration and code delivery only began to take shape as a separate area in 2015.

At that time, we had a huge monolith written in Python, C# and PHP and deployed on Linux and Windows servers. To deploy this monster, we had a set of scripts that we ran manually. Building the monolith was its own source of pain and suffering: conflicts when merging branches, defect fixes, and rebuilds "with a different set of tasks in the build." Simplified, the process looked like this:

[Diagram: the manual build and deployment process]

We were not happy with this and wanted a repeatable, automated and manageable build and deployment process. For that we needed a CI/CD system, and we chose between the free versions of TeamCity and Jenkins, since we had worked with both and either suited us in terms of features. We picked TeamCity as the more recent product. At that time we did not yet use a microservice architecture and did not expect a large number of tasks and projects.

We arrive at the idea of our own system

Introducing TeamCity removed only part of the manual work: we still had to create Pull Requests, move tasks between statuses in Jira, and select tasks for a release. TeamCity could not handle these. We had to choose a path for further automation, and considered either scripting inside TeamCity or switching to third-party automation systems. In the end we decided that we needed maximum flexibility, which only our own solution could give. This is how the first version of our internal automation system, Integro, appeared.

TeamCity automates at the level of running build and deployment processes, while Integro automates the development process at the top level. We needed to tie work on tasks in Jira to the processing of the associated source code in Bitbucket. At this stage, Integro began to acquire its own workflows for tasks of different types.

As more business processes were automated, the number of projects and runs in TeamCity grew. This created a new problem: one free TeamCity instance (3 agents and 100 projects) was not enough, so we added another instance (3 more agents and 100 more projects), and then another. We ended up with a system of several clusters that was difficult to manage:

[Diagram: the resulting system of several TeamCity clusters]

When the question of a fourth instance came up, we realized we could not go on like this: the total cost of supporting four instances was beyond any reasonable limit. We had to either buy a paid TeamCity license or choose the free Jenkins. We ran the numbers for our instances and automation plans and decided to go with Jenkins. A couple of weeks later we had switched to Jenkins and were rid of some of the headaches of maintaining multiple TeamCity instances. That let us focus on developing Integro and tailoring Jenkins to our needs.

As basic automation grew (automatic creation of Pull Requests, collection and publication of code coverage, and other checks), we strongly wanted to abandon manual releases as far as possible and hand this work to robots. In addition, the company began moving to microservices, which need frequent releases that are independent of each other. So we gradually arrived at automatic releases of our microservices (for now we still release the monolith manually because of the complexity of the process). But, as usually happens, a new complication arose.

We automate testing

Release automation sped up development, partly by skipping some stages of testing, and this led to a temporary loss of quality. It sounds trite, but faster releases meant we had to change our product development methodology. We had to think about automating testing, about instilling in each developer personal responsibility for the released code and its bugs (in the sense of "taking the idea to heart", not monetary penalties), and about how the decision to release or not release a task is made in automatic deployment.

While eliminating quality problems, we arrived at two important decisions: we began to run canary tests, and we introduced automatic monitoring of the background error rate with an automatic response when it is exceeded. The first made it possible to find obvious bugs before the code fully reached production; the second reduced the response time to problems in production. Mistakes still happen, of course, but we spend most of our time and effort on minimizing them rather than on correcting them.

Automation Team

We now have 130 developers and continue to grow. The continuous integration and code delivery team (hereinafter the Deploy and Integration team, or DI team) consists of 7 people and works in two directions: development of the Integro automation platform, and DevOps.

DevOps is responsible for the Dev/Beta environments of the CIAN site and the Integro environments, helps developers solve problems, and develops new approaches to scaling environments. The Integro development direction works on Integro itself and on related services such as plugins for Jenkins, Jira and Confluence, and also develops auxiliary utilities and applications for development teams.

The DI team works closely with the Platform team, which develops the architecture, libraries and development approaches within the company. In addition, any developer at CIAN can contribute to automation, for example by building a small automation for their team's needs or by sharing a good idea on how to make automation even better.

The automation layer cake at CIAN

All systems involved in automation can be divided into several layers:

  1. External systems (Jira, Bitbucket, etc.). Development teams work with them.
  2. Integro platform. Developers rarely work with it directly, but it is what keeps all the automation running.
  3. Delivery, orchestration and discovery services (for example, Jenkins, Consul, Nomad). With their help we deploy code to servers and let services work with each other.
  4. Physical layer (servers, OS, related software). Our code works at this level. It can be either a physical server or a virtual one (LXC, KVM, Docker).

Based on this concept, we divide the areas of responsibility within the DI team. The first two levels belong to the Integro development direction, and the last two to DevOps. This separation lets us focus on our tasks without getting in the way of collaboration, because we sit close to each other and constantly exchange knowledge and experience.

Integro

Let's focus on Integro and start with the technology stack:

  • CentOS 7
  • Docker + Nomad + Consul + Vault
  • Java 11 (the old Integro monolith will remain in Java 8)
  • Spring Boot 2.X + Spring Cloud Config
  • PostgreSQL 11
  • RabbitMQ
  • Apache Ignite
  • Camunda (embedded)
  • Grafana + Graphite + Prometheus + Jaeger + ELK
  • Web UI: React (CSR) + MobX
  • SSO: Keycloak

We follow the principle of microservice development, although we still have a legacy monolith from the early version of Integro. Each microservice runs in its own Docker container, and the services communicate with each other via HTTP requests and RabbitMQ messages. A microservice finds another one through Consul and sends requests to it, authorizing through SSO (Keycloak, OAuth 2/OpenID Connect).
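
To make the discovery step concrete, here is a minimal sketch of turning a Consul health response into an IP:PORT pair. The article does not say how Integro actually queries Consul, so this is an assumption: the JSON shape follows Consul's `/v1/health/service/<name>` endpoint, while the service name `jenkins-integration`, the node name and the addresses are hypothetical.

```python
import json


def pick_endpoint(health_response: str) -> str:
    """Return "IP:PORT" for the first instance in a Consul health response."""
    entries = json.loads(health_response)
    if not entries:
        raise LookupError("no healthy instances registered")
    entry = entries[0]
    service = entry["Service"]
    # Consul leaves Service.Address empty when the service shares the node address.
    host = service.get("Address") or entry["Node"]["Address"]
    return f"{host}:{service['Port']}"


# Hypothetical response for a service registered as "jenkins-integration".
SAMPLE = json.dumps([
    {
        "Node": {"Node": "worker-1", "Address": "10.0.0.5"},
        "Service": {"Service": "jenkins-integration", "Address": "", "Port": 8080},
    }
])

print(pick_endpoint(SAMPLE))  # 10.0.0.5:8080
```

A real client would also need to choose among several healthy instances and attach the OAuth 2 token obtained from Keycloak; both are left out here.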

As a real-world example, consider an interaction with Jenkins, which consists of the following steps:

  1. The workflow management microservice (hereinafter the Flow microservice) wants to run a build in Jenkins. To do this, it finds the IP:PORT of the Jenkins integration microservice (hereinafter the Jenkins microservice) through Consul and sends it an asynchronous request to start the build in Jenkins.
  2. On receiving the request, the Jenkins microservice generates a Job ID and returns it in the response; the result of the work can later be identified by this ID. At the same time, it triggers the build in Jenkins via a REST API call.
  3. Jenkins performs the build and, after it completes, sends a webhook with the results of execution to the Jenkins microservice.
  4. The Jenkins microservice, on receiving the webhook, generates a message about completion of the request and attaches the execution results to it. The message is sent to a RabbitMQ queue.
  5. Through RabbitMQ, the published message reaches the Flow microservice, which learns the result of its task by matching the Job ID from the request against the one in the received message.
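
The Job ID correlation in the steps above can be sketched in a few lines. This is an in-memory illustration, not Integro's code: a `queue.Queue` stands in for the RabbitMQ queue, the call to the Jenkins REST API is replaced by a comment, and all class, method and field names are hypothetical.

```python
import queue
import uuid


class JenkinsMicroservice:
    def __init__(self, results: queue.Queue):
        self.results = results      # stands in for the RabbitMQ queue
        self.build_to_job = {}      # Jenkins build name -> Job ID

    def start_build(self, build_name: str) -> str:
        job_id = str(uuid.uuid4())
        self.build_to_job[build_name] = job_id
        # A real service would trigger the build through the Jenkins REST API here.
        return job_id

    def on_webhook(self, build_name: str, status: str) -> None:
        # Jenkins reports completion; publish the result tagged with the Job ID.
        job_id = self.build_to_job.pop(build_name)
        self.results.put({"job_id": job_id, "status": status})


class FlowMicroservice:
    def __init__(self, jenkins: JenkinsMicroservice, results: queue.Queue):
        self.jenkins = jenkins
        self.results = results
        self.pending = {}           # Job ID -> task key awaiting a result

    def request_build(self, task_key: str, build_name: str) -> str:
        job_id = self.jenkins.start_build(build_name)
        self.pending[job_id] = task_key
        return job_id

    def drain_results(self) -> dict:
        """Match queued messages to pending requests by Job ID."""
        finished = {}
        while not self.results.empty():
            message = self.results.get()
            task_key = self.pending.pop(message["job_id"], None)
            if task_key is not None:
                finished[task_key] = message["status"]
        return finished


results = queue.Queue()
jenkins = JenkinsMicroservice(results)
flow = FlowMicroservice(jenkins, results)
flow.request_build("TASK-1", "build-42")
jenkins.on_webhook("build-42", "SUCCESS")
finished = flow.drain_results()
print(finished)  # {'TASK-1': 'SUCCESS'}
```

The point of the pattern is that the Flow microservice never blocks while Jenkins is building: it only keeps a table of outstanding Job IDs and resolves them as messages arrive.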

Now we have about 30 microservices, which can be divided into several groups:

  1. Configuration management.
  2. Informing and interacting with users (messengers, mail).
  3. Working with source code.
  4. Integration with deployment tools (Jenkins, Nomad, Consul, etc.).
  5. Monitoring (releases, bugs, etc.).
  6. Web utilities (UI for managing test environments, collecting statistics, etc.).
  7. Integration with task trackers and similar systems.
  8. Workflow management for different tasks.

Task workflows

Integro automates activities related to the task life cycle. Put simply, by the task life cycle we mean the workflow of a task in Jira. Our development processes have several workflow variations, depending on the project, the issue type, and the options selected on a particular issue.

Consider the workflow that we use most often:

[Diagram: the most frequently used task workflow]

In the diagram, a gear means that Integro triggers the transition automatically, while a human figure means that a person triggers it manually. Let's look at several paths that a task can take in this workflow.

Fully manual testing on DEV + BETA without canary tests (this is usually how we release the monolith):

[Diagram: task path with fully manual testing on DEV + BETA, without canary tests]

There may be other transition combinations. Sometimes the path that an issue will take can be chosen through options in Jira.
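
One way to picture the gear/person distinction is a transition table that records who is allowed to trigger each move. The sketch below is an illustration only: the statuses are the ones named in the canary walkthrough in the next section, and the actor assigned to each transition is a reading of that walkthrough rather than a copy of the diagram. `Actor`, `TRANSITIONS` and `move` are hypothetical names.

```python
from enum import Enum


class Actor(Enum):
    INTEGRO = "gear"      # transition triggered automatically
    HUMAN = "person"      # transition triggered manually


# (from status, to status) -> who triggers the transition (assumed mapping).
TRANSITIONS = {
    ("In Review", "Test on Dev"): Actor.INTEGRO,
    ("Test on Dev", "Ready For Build"): Actor.HUMAN,
    ("Ready For Build", "Canary"): Actor.INTEGRO,
    ("Canary", "Production"): Actor.HUMAN,
    ("Production", "Closed"): Actor.INTEGRO,
}


def move(current: str, target: str, actor: Actor) -> str:
    """Return the new status, or raise if the move is not allowed for this actor."""
    expected = TRANSITIONS.get((current, target))
    if expected is None:
        raise ValueError(f"no transition {current} -> {target}")
    if expected is not actor:
        raise PermissionError(f"{current} -> {target} is triggered by {expected.name}")
    return target


status = move("In Review", "Test on Dev", Actor.INTEGRO)
status = move(status, "Ready For Build", Actor.HUMAN)
print(status)  # Ready For Build
```

Other paths through the workflow would simply be different transition tables selected by project, issue type or the options set on the issue.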

Task movement

Consider the main steps performed when a task moves through the "Testing on DEV + canary tests" workflow:

1. A developer or PM creates a task.

2. The developer takes the task and, when finished, moves it to the IN REVIEW status.

3. Jira sends a webhook to the Jira microservice (which is responsible for integration with Jira).

4. The Jira microservice sends a request to the Flow service (which is responsible for the internal workflows in which the work is performed) to start the workflow.

5. Inside the Flow service:

  • Reviewers are assigned to the task (Users microservice, which knows everything about users, plus the Jira microservice).
  • Through the Source microservice (it knows about repositories and branches but does not work with the code itself), a search is made for repositories that have a branch for our task (to simplify the search, the branch name matches the task number in Jira). Most often a task has only one branch in one repository; this simplifies managing the deployment queue and reduces coupling between repositories.
  • For each branch found, the following sequence of actions is performed:

    i) Master is merged into the branch (Git microservice, which works with the code).
    ii) The branch is locked against changes by the developer (Bitbucket microservice).
    iii) A Pull Request is created for this branch (Bitbucket microservice).
    iv) A message about the new Pull Request is sent to the developers' chats (Notify microservice, which handles notifications).
    v) Build, testing and deployment of the task to DEV are started (Jenkins microservice, which works with Jenkins).
    vi) If all the previous steps succeeded, Integro puts its Approve on the Pull Request (Bitbucket microservice).

  • Integro waits for Approve on the Pull Request from the assigned reviewers.
  • As soon as all the necessary approvals are received (including passing automated tests), Integro moves the task to the Test on Dev status (Jira microservice).

6. Testers test the task. If there are no problems, the task is moved to the Ready For Build status.

7. Integro "sees" that the task is ready for release and starts its deployment in canary mode (Jenkins microservice). Release readiness is determined by a set of rules: for example, the task is in the required status, there are no locks on other tasks, there are no active deployments for this microservice, and so on.

8. The task is moved to the Canary status (Jira microservice).

9. Jenkins launches the deployment in canary mode via Nomad (usually 1-3 instances) and notifies the release monitoring service (DeployWatch microservice) about the deployment.

10. The DeployWatch microservice collects the background error rate and reacts to it if necessary. If the background error rate is exceeded (the normal level is calculated automatically), developers are notified via the Notify microservice. If the developer has not reacted within 5 minutes (by pressing Revert or Stay), an automatic rollback of the canary instances is started. If the rate is not exceeded, the developer must manually start the deployment of the task to Production (by pressing a button in the UI). If the developer has not started the deployment to Production within 60 minutes, the canary instances are also rolled back, for safety.

11. After the deployment to Production is started:

  • The task is moved to the Production status (Jira microservice).
  • The Jenkins microservice starts the deployment process and notifies the DeployWatch microservice about the deployment.
  • The DeployWatch microservice checks that all containers have been updated on Production (there have been cases when not all of them were).
  • Through the Notify microservice, a notification about the results of the deployment to Production is sent.

12. Developers have 30 minutes to start a rollback of the task from Production if they detect incorrect behavior of the microservice. After this time, the task is automatically merged into master (Git microservice).

13. After a successful merge into master, the task status is changed to Closed (Jira microservice).
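
The timeouts in steps 10 and 12 amount to a small decision table, sketched below. The 5, 60 and 30 minute limits and the Revert/Stay buttons come from the steps above; the function names, the string values and the idea of passing elapsed minutes as arguments are assumptions made for illustration, not DeployWatch's actual interface.

```python
from typing import Optional

REACTION_TIMEOUT_MIN = 5    # time to press Revert or Stay after an alert
PROMOTE_TIMEOUT_MIN = 60    # time to start the Production deployment
ROLLBACK_WINDOW_MIN = 30    # time to roll back after deploying to Production


def canary_action(rate_exceeded: bool, minutes_since_alert: float,
                  minutes_in_canary: float, choice: Optional[str] = None) -> str:
    """Decide what to do with canary instances: wait, keep, rollback or promote."""
    if rate_exceeded:
        if choice == "revert":
            return "rollback"
        if choice == "stay":
            return "keep"
        # No reaction from the developer: roll back once the timeout passes.
        return "rollback" if minutes_since_alert >= REACTION_TIMEOUT_MIN else "wait"
    if choice == "deploy":
        return "promote"
    # Healthy canary that nobody promoted is rolled back for safety.
    return "rollback" if minutes_in_canary >= PROMOTE_TIMEOUT_MIN else "wait"


def production_action(minutes_since_deploy: float, rollback_requested: bool) -> str:
    """After a Production deployment: rollback on request, else merge when the window closes."""
    if rollback_requested and minutes_since_deploy < ROLLBACK_WINDOW_MIN:
        return "rollback"
    if minutes_since_deploy >= ROLLBACK_WINDOW_MIN:
        return "merge"
    return "wait"


print(canary_action(True, 6, 12))        # rollback
print(canary_action(False, 0, 20, "deploy"))  # promote
print(production_action(31, False))      # merge
```

Keeping the decision as a pure function of the observed state makes each branch easy to test separately from the monitoring and notification code around it.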

The scheme does not claim to be fully detailed (in reality there are even more steps), but it lets you assess how deeply automation is integrated into our processes. We do not consider this scheme ideal, and we keep improving the processes of automatic release and deployment support.
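
Returning to step 7, a "set of rules" for release readiness can be modeled as a list of named predicates over a task. This is a sketch under stated assumptions: the three rules mirror the examples given in step 7, with the third read here as "no deployment in progress for this microservice", and the `Task` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    key: str
    status: str
    has_locks_on_other_tasks: bool
    service_deploy_in_progress: bool


Rule = Tuple[str, Callable[[Task], bool]]

# Each rule is a description plus a check that must return True.
RULES: List[Rule] = [
    ("status is Ready For Build", lambda t: t.status == "Ready For Build"),
    ("no locks on other tasks", lambda t: not t.has_locks_on_other_tasks),
    ("no deployment in progress", lambda t: not t.service_deploy_in_progress),
]


def failed_rules(task: Task) -> List[str]:
    """Return descriptions of the rules the task does not satisfy."""
    return [name for name, check in RULES if not check(task)]


def ready_for_release(task: Task) -> bool:
    return not failed_rules(task)


task = Task("TASK-1", "Ready For Build", False, True)
print(ready_for_release(task))  # False
print(failed_rules(task))       # ['no deployment in progress']
```

Returning the names of the failed rules, and not just a boolean, makes it possible to tell a developer why a task is still waiting in the queue.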

What's next

We have big plans for the development of automation, for example, eliminating manual operations during monolith releases, improving monitoring during automatic deployment, and improving interaction with developers.

But let's stop here for now. We covered many topics in this automation overview only superficially and did not touch on some at all, so we will be happy to answer questions. We welcome suggestions on what to cover in more detail; write in the comments.

Source: habr.com
