The problem of "smart" cleaning of container images and its solution in werf


This article deals with the problem of cleaning up images that accumulate in container registries (Docker Registry and its analogues) in the context of modern CI/CD pipelines for cloud-native applications delivered to Kubernetes. We cover the main criteria for image relevance and the resulting difficulties in automating cleanup, saving space, and meeting teams' needs. Finally, using a specific Open Source project as an example, we describe how these difficulties can be overcome.

Introduction

The number of images in a container registry can grow exponentially, taking up more storage space and thus significantly increasing storage costs. To control, limit, or maintain acceptable growth of the space occupied in the registry, it is customary to:

  1. use a fixed number of tags for images;
  2. somehow clean up the images.


The first approach is sometimes viable for small teams. If developers have enough persistent tags (latest, main, test, boris, etc.), the registry will not swell in size, and for a long time you can avoid thinking about cleanup at all. After all, outdated images are simply overwritten, and there is no work left for cleanup (everything is handled by a regular garbage collector).

However, this approach severely limits development and is rarely applicable to modern CI/CD projects. Automation has become an integral part of development, allowing you to test, deploy, and deliver new functionality to users much faster. For example, in all our projects a CI pipeline is automatically created on each commit. In it an image is built, tested, and rolled out to various Kubernetes environments for debugging and the remaining checks, and if everything is fine, the changes reach the end user. This has long ceased to be rocket science and has become commonplace for many - most likely for you as well, since you are reading this article.

Since bug fixes and new feature development are carried out in parallel, and releases can happen several times a day, the development process is obviously accompanied by a significant number of commits - which means a large number of images in the registry. As a result, the issue of organizing effective registry cleanup, i.e. removing irrelevant images, becomes pressing.

But how do you determine whether an image is relevant?

Criteria for the relevance of the image

In the vast majority of cases, the main criteria will be as follows:

1. The first (the most obvious and most critical of all) are the images currently used in Kubernetes. Deleting them can result in significant production downtime costs (for example, the images may be required for replication) or nullify the efforts of a team debugging in one of the environments. (For this reason, we even made a special Prometheus exporter that keeps track of the absence of such images in a Kubernetes cluster.)

2. The second (less obvious, but also very important and again operations-related) are the images required for a rollback if serious problems are found in the current version. For example, in the case of Helm, these are the images used in the saved revisions of the release. (By the way, Helm's default limit is 256 revisions, but hardly anyone really needs to keep that many versions to "roll back" to if necessary.)

3. Third - developer needs: all images related to their current work. For example, if we are looking at a PR, it makes sense to keep the image corresponding to the last commit and, say, the previous one: this way the developer can quickly return to any task and work with the latest changes.

4. Fourth - images that correspond to versions of our application, i.e. the final product: v1.0.0, 20.04.01, sierra, etc.

NB: The criteria defined here were formulated based on experience interacting with dozens of development teams from different companies. However, depending on the specifics of the development processes and the infrastructure used (for example, when Kubernetes is not involved), these criteria may of course differ.

Relevance and Existing Solutions

Popular services with a container registry, as a rule, offer their own image cleanup policies: in them you can define the conditions under which a tag is removed from the registry. However, these conditions are limited by parameters such as names, creation time, and number of tags*.

* Depends on the specific container registry implementation. We considered the capabilities of the following solutions: Azure CR, Docker Hub, ECR, GCR, GitHub Packages, GitLab Container Registry, Harbor Registry, JFrog Artifactory, Quay.io - as of September 2020.

Such a set of parameters is quite enough to satisfy the fourth criterion - that is, to select images that correspond to versions. However, for all the other criteria, one has to choose some kind of compromise (a tougher or, conversely, a more sparing policy) depending on expectations and budget.

For example, the third criterion - the one related to developer needs - can be addressed by organizing processes within teams: specific image naming, special allow lists, and internal agreements. But ultimately it still needs to be automated. And if ready-made solutions are not capable enough, you have to build something of your own.

The situation with the first two criteria is similar: they cannot be satisfied without receiving data from an external system - the one where applications are deployed (in our case, this is Kubernetes).

Git workflow illustration

Let's say you work like this in Git:

(Illustration: an example Git workflow with published images)

The icon with a head in the diagram marks container images that are currently deployed in Kubernetes for any users (end users, testers, managers, etc.) or used by developers for debugging and similar purposes.

What happens if cleanup policies allow you to keep (not delete) images only by given tag names?

(Illustration: only images with the given tag names are kept)

Obviously, such a scenario will not please anyone.

What will change if policies allow keeping images based on a given time interval / number of recent commits?

(Illustration: images kept by time interval / number of recent commits)

The result is much better, but still far from ideal. After all, we still have developers who need images in the registry (or even deployed in K8s) to debug problems...

To summarize the current market situation: the functions available in container registries do not offer enough flexibility in cleanup, and the main reason is their inability to interact with the outside world. It turns out that teams that need this flexibility are forced to implement image removal "outside" themselves, using the Docker Registry API (or the native API of the corresponding implementation).
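To give a sense of what such an "outside" implementation involves, here is a minimal Python sketch of the relevant Docker Registry HTTP API v2 calls. The registry URL and repository name are placeholders, and the "keep the N most recent tags" policy is deliberately as primitive as the built-in ones discussed above:

```python
# Minimal sketch of external cleanup against the Docker Registry HTTP API v2.
# REGISTRY is a placeholder; real code also needs authentication.
import json
import urllib.request

REGISTRY = "https://registry.example.com"  # hypothetical registry
ACCEPT = "application/vnd.docker.distribution.manifest.v2+json"

def list_tags(repo, opener=urllib.request.urlopen):
    # GET /v2/<repo>/tags/list returns {"name": ..., "tags": [...]}
    with opener(f"{REGISTRY}/v2/{repo}/tags/list") as resp:
        return json.load(resp).get("tags") or []

def manifest_digest(repo, tag, opener=urllib.request.urlopen):
    # The digest required for deletion comes back in the
    # Docker-Content-Digest header of a manifest HEAD/GET request.
    req = urllib.request.Request(
        f"{REGISTRY}/v2/{repo}/manifests/{tag}",
        headers={"Accept": ACCEPT}, method="HEAD")
    with opener(req) as resp:
        return resp.headers["Docker-Content-Digest"]

def delete_manifest(repo, digest, opener=urllib.request.urlopen):
    # DELETE /v2/<repo>/manifests/<digest>; the registry answers 202 Accepted
    req = urllib.request.Request(
        f"{REGISTRY}/v2/{repo}/manifests/{digest}", method="DELETE")
    with opener(req) as resp:
        return resp.status == 202

def tags_to_delete(tags, keep_last=10):
    # Primitive "keep the N most recent tags" policy; assumes tag names
    # sort chronologically (e.g. timestamped), which is rarely true in practice
    return sorted(tags)[:-keep_last] if len(tags) > keep_last else []
```

The limitation is visible right in `tags_to_delete`: all the registry gives us is names, digests, and timestamps - nothing about Git or the cluster.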

However, we were looking for a universal solution that would automate image cleanup for different teams using different registries…

Our Path to Universal Image Cleanup

Where does such a need come from? The fact is that we are not a separate group of developers, but a team that serves many of them at once, helping to comprehensively resolve CI/CD issues. And the main technical tool for this is the Open Source utility werf. Its peculiarity is that it does not perform a single function, but accompanies continuous delivery processes at all stages: from build to deployment.

Publishing images to the registry* (immediately after they are built) is an obvious function of such a utility. And since the images are placed there for storage, then - if your storage is not unlimited - you need to be responsible for their subsequent cleaning. How we succeeded in this, satisfying all the given criteria, will be discussed further.

* Although the registries themselves may be different (Docker Registry, GitLab Container Registry, Harbor, etc.), their users face the same problems. The universal solution in our case does not depend on the registry implementation, because it runs outside the registries themselves and offers the same behavior for all of them.

Although we are using werf as an implementation example, we hope that the approaches used will be useful to other teams facing similar difficulties.

So we set about implementing an external mechanism for cleaning images - instead of relying on the features built into container registries. The first step was to use the Docker Registry API to recreate the same primitive policies based on the number of tags and their creation time (mentioned above). To them we added an allow list based on the images used in the deployed infrastructure, i.e. Kubernetes. For the latter, it was enough to walk through all the deployed resources via the Kubernetes API and collect their image values.
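The allow-list step can be illustrated with a short sketch. Here we parse the JSON that a query like `kubectl get pods -A -o json` would produce; the real implementation polls the Kubernetes API for every resource kind that embeds a pod spec:

```python
# Sketch: collect every `image` value referenced by pods in a cluster,
# to build an allow list of images that must not be deleted.
import json

def images_in_use(pods_json: str) -> set:
    doc = json.loads(pods_json)
    images = set()
    for pod in doc.get("items", []):
        spec = pod.get("spec", {})
        # initContainers reference images too and are easy to forget
        for c in spec.get("containers", []) + spec.get("initContainers", []):
            images.add(c["image"])
    return images
```

Any registry tag whose image appears in this set is excluded from deletion, whatever the other policies say.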

Such a trivial solution closed the most critical problem (criterion #1), but it was only the beginning of our journey toward a better cleaning mechanism. The next - and much more interesting - step was the decision to link published images to Git history.

Tagging schemes

To begin with, we chose an approach in which the final image stores the information necessary for cleaning, and built the process around tagging schemes. When publishing an image, the user selected a specific tagging option (git-branch, git-commit, or git-tag) and used the corresponding value. In CI systems, these values were set automatically based on environment variables. In effect, the target image was associated with a specific Git primitive, storing the data necessary for cleaning in labels.

As part of this approach, we obtained a set of policies that allowed Git to be used as the single source of truth:

  • When deleting a branch/tag in Git, the associated images in the registry were automatically deleted as well.
  • The number of images associated with Git tags and commits could be controlled by the number of tags used in the selected schema and the time of the associated commit.

In general, the resulting implementation satisfied our needs, but soon a new challenge awaited us. The fact is that during the use of tagging schemes for Git primitives, we encountered a number of shortcomings. (Since their description is beyond the scope of this article, everyone can read the details here.) Therefore, having made the decision to switch to a more efficient approach to tagging (content-based tagging), we had to revise the implementation of image cleaning as well.

New algorithm

Why? With content-based tagging, each tag can correspond to multiple Git commits. Cleanup can no longer be based solely on the commit at which the new tag appeared in the registry.

For the new cleaning algorithm, we decided to move away from tagging schemes and build the process around meta-images, each of which stores a pair of:

  • the commit on which the publication was performed (it does not matter whether the image was added, changed or remained the same in the container registry);
  • and our internal identifier corresponding to the built image.

In other words, we linked published tags to commits in Git.
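The meta-image record described above can be sketched as a simple pair; the names here are illustrative, not werf's internals:

```python
# Sketch of a meta-image record: each publication ties a Git commit
# to the internal identifier of the built image.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetaImage:
    commit: str    # commit on which the publication was performed
    image_id: str  # internal identifier of the built image

def commits_for_image(meta_images, image_id):
    # With content-based tagging, one image may map to many commits,
    # so the cleanup algorithm queries in this direction.
    return {m.commit for m in meta_images if m.image_id == image_id}
```

This is exactly what makes Git-history scanning possible: given an image, we can find every commit it was published from, and check those commits against the policies.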

Final configuration and general algorithm

When configuring cleaning, users define policies that select the relevant images. Each such policy is defined by:

  • a set of references, i.e. Git tags or Git branches, that are scanned;
  • and the limit on the number of desired images for each reference from the set.

To illustrate, here is what the default policy configuration looks like:

cleanup:
  keepPolicies:
  - references:
      tag: /.*/
      limit:
        last: 10
  - references:
      branch: /.*/
      limit:
        last: 10
        in: 168h
        operator: And
    imagesPerReference:
      last: 2
      in: 168h
      operator: And
  - references:  
      branch: /^(main|staging|production)$/
    imagesPerReference:
      last: 10

This configuration contains three policies that comply with the following rules:

  1. Save the image for the last 10 Git tags (based on the date the tag was created).
  2. Save up to 2 images published in the last week for up to 10 branches with activity in the last week.
  3. Save 10 images for the main, staging and production branches.
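Rule 2 is the most involved, so here is a rough Python sketch of how such a keepPolicies entry can be evaluated (data shapes and names are illustrative, not werf's implementation): pick up to `ref_limit` branches active within the window, then keep the freshest `per_ref_last` images per branch:

```python
# Sketch of evaluating a keepPolicies entry like rule 2 above:
# up to 10 branches with activity in the last week, up to 2 images each.
from datetime import datetime, timedelta

def select_images(branches, now, ref_limit=10, ref_in_days=7,
                  per_ref_last=2, per_ref_in_days=7):
    # branches: {branch_name: [(published_at, image_id), ...]}
    window = now - timedelta(days=ref_in_days)
    # a branch's activity = its most recent publication
    active = [(max(t for t, _ in imgs), name)
              for name, imgs in branches.items() if imgs]
    recent = [name for t, name in sorted(active, reverse=True)
              if t >= window][:ref_limit]
    keep = set()
    img_window = now - timedelta(days=per_ref_in_days)
    for name in recent:
        # keep the newest per_ref_last images published within the window
        fresh = sorted((t, i) for t, i in branches[name] if t >= img_window)
        keep.update(i for _, i in fresh[-per_ref_last:])
    return keep
```

Everything selected this way is protected; the `operator: And` in the config corresponds to requiring both the count limit and the time window, as this sketch does.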

The final algorithm boils down to the following steps:

  • Getting manifests from the container registry.
  • Exclusion of images used in Kubernetes, because we have already pre-selected them by polling the K8s API.
  • Scanning Git history and exclusion of images according to specified policies.
  • Removing the remaining images.
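Boiled down to sets, the four steps above look like this (the function is a toy stand-in for the registry, K8s, and Git-history queries):

```python
# Sketch of the final cleanup pipeline: whatever is neither in use in
# Kubernetes nor protected by a Git-history policy gets deleted.
def cleanup(registry_images, k8s_images, policy_kept):
    # 1. registry_images: all manifests found in the container registry
    # 2. subtract images currently used in Kubernetes
    # 3. subtract images kept by the Git-history scanning policies
    # 4. the remainder is safe to remove
    return set(registry_images) - set(k8s_images) - set(policy_kept)
```

The order matters only for efficiency; correctness comes from the two exclusion sets being computed before anything is deleted.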

Returning to our illustration, this is what happens with werf:

(Illustration: the same Git workflow after cleanup with werf)

However, even if you don't use werf, a similar approach to advanced image cleanup - in one implementation or another (depending on the preferred approach to image tagging) - can be applied in other systems and utilities. To do this, it is enough to be aware of the problems that arise and find the opportunities in your stack that allow you to solve them most smoothly. We hope the path we have traveled will help you look at your particular case with fresh eyes.

Conclusion

  • Sooner or later, most teams face the registry overflow problem.
  • When looking for solutions, first of all, it is necessary to determine the criteria for the relevance of the image.
  • The tools offered by popular container registry services allow you to organize a very simple cleanup that does not take into account the "outside world": the images used in Kubernetes and the specifics of team workflows.
  • A flexible and efficient algorithm should understand CI/CD processes and operate on more than just Docker image data.


Source: habr.com
