4 engineers, 7000 servers and one global pandemic

Hi, Habr! Here is my translation of the article "4 Engineers, 7000 Servers, And One Global Pandemic" by Adib Daw.

If this headline didn't send a slight chill down your spine, skip ahead to the next paragraph, or head straight to our company's careers page: we'd like to talk.

Who we are

We are a team of 4 penguins who love writing code and working with hardware. In our spare time, we are responsible for deploying, maintaining, and operating a fleet of over 7,000 physical Linux servers distributed across 3 data centers in the US.

We also manage to do all of this from roughly 10,000 km away from the sites, from the comfort of our own office, which is a short drive from a Mediterranean beach.

Scale problems

While it makes sense for a startup to begin by hosting its infrastructure in the cloud, thanks to the relatively small initial investment, we at Outbrain chose to run our own servers. We did this because, past a certain scale, the cost of cloud infrastructure far exceeds the cost of operating our own equipment hosted in data centers. Besides, your own servers give you the highest degree of control and the best troubleshooting capabilities.

As you grow, problems are never far away, and they usually come in groups. Server lifecycle management requires constant self-improvement just to keep up with a rapidly increasing number of machines. Programmatic methods of managing groups of servers in data centers become unwieldy very quickly. Detecting, troubleshooting, and repairing failures while still meeting our service-level commitments turns into juggling a wildly diverse mix of hardware, varied workloads, upgrade windows, and other details that nobody really wants to deal with.

Master Your Domains

To address many of these issues, we broke the Outbrain server lifecycle down into its major components and called them domains. For example, one domain covers the demand for hardware, another the logistics of the inventory lifecycle, and a third the communication with on-site personnel. There is one more for hardware observability, but we won't describe every detail here. Our goal was to study and define each domain so that it could be abstracted in code. Once a working abstraction was developed, it was translated into a manual process that was deployed, tested, and refined. Finally, each domain was set up to integrate with the other domains via an API, forming a cohesive, dynamic, ever-evolving hardware lifecycle system that can be deployed, tested, and observed, just like all of our other production systems.

Adopting this approach has allowed us to solve many of these problems properly, by building tools and automation.

Demand domain

While email and spreadsheets were an acceptable way to handle demand in the early days, they stopped being a good solution once the number of servers and the volume of incoming requests reached a certain level. To better organize and prioritize incoming requests in a rapidly expanding environment, we needed a ticketing system that could offer:

  • A view customized to show only the relevant fields (simple)
  • Open APIs (extensible)
  • Familiarity to our team (understandable)
  • Integration with our existing workflows (unified)

Since we use Jira to manage our sprints and internal tasks, we decided to create another project to let our internal clients submit tickets and track their progress. Using Jira both for incoming requests and for managing internal tasks allowed us to build a single Kanban board showing all of the processes at once. Our internal "customers" saw only their hardware requests, without having to dig into the less significant details of our side tasks (such as tool improvements and bug fixes).

[Image: Kanban board in Jira]

As a bonus, the fact that the queues and priorities were now visible to everyone made it possible to see exactly where in the queue a particular request was and what came before it. It also allowed the owners to reprioritize their own requests without having to contact us: drag and drop, and that's it. Finally, it let us monitor and evaluate our SLAs by request type, based on the metrics generated in Jira.
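The article doesn't show what the Jira integration looks like in code, but since Jira exposes a public REST API, a minimal sketch in Python might look like the following. The base URL, the HW project key, and the credential handling are assumptions for illustration, not Outbrain's actual setup.

```python
import os
import requests

JIRA_URL = "https://example.atlassian.net"     # assumption: your Jira base URL
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])

def create_hardware_request(summary: str, description: str, priority: str = "Medium") -> str:
    """Open a hardware request ticket in a hypothetical 'HW' Jira project."""
    payload = {
        "fields": {
            "project": {"key": "HW"},          # assumption: project key for hardware requests
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": description,
            "priority": {"name": priority},
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["key"]                  # e.g. "HW-123", used for tracking on the board

if __name__ == "__main__":
    key = create_hardware_request(
        "2 extra servers in the NY data center",
        "Needed for the next capacity increase; details in the capacity plan.",
    )
    print(f"Created request {key}")
```

Once requests arrive this way, the SLA metrics mentioned above fall out of Jira's own created/resolved timestamps per request type.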

Hardware Lifecycle Domain

Try to imagine the complexity of managing the hardware installed in every server rack. It gets worse: many hardware components (RAM, ROM) can be moved from the warehouse to the server room and back. They also fail, get decommissioned and replaced, or are sent back to the vendor for replacement or repair. All of this has to be communicated to the colocation staff who perform the physical maintenance of the equipment. To solve these problems, we built an internal tool called Floppy. Its job is to:

  • Manage the communication with on-site personnel and aggregate all the information;
  • Update the "warehouse" data after each piece of maintenance work is performed and verified.

The warehouse, in turn, is visualized with Grafana, which we already use to plot all of our metrics, so the same tool serves both warehouse visualization and the rest of our production needs.
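How the warehouse data actually reaches Grafana isn't described, so here is one common pattern as an illustrative sketch: expose the stock counts as labelled Prometheus gauges that a Grafana dashboard can query. The metric name, port, sites, and part types are all made up.

```python
import time
from prometheus_client import Gauge, start_http_server

# Labelled gauge: one time series per (site, part_type) pair.
warehouse_stock = Gauge(
    "warehouse_parts_available",
    "Spare parts currently available in the on-site warehouse",
    ["site", "part_type"],
)

def read_inventory() -> dict:
    """Placeholder for the real inventory source (e.g. Floppy's database)."""
    return {
        ("ny1", "dimm_32gb"): 42,
        ("ny1", "ssd_960gb"): 17,
        ("chi1", "psu_750w"): 9,
    }

if __name__ == "__main__":
    start_http_server(9200)        # Prometheus scrapes this port
    while True:
        for (site, part), count in read_inventory().items():
            warehouse_stock.labels(site=site, part_type=part).set(count)
        time.sleep(60)
```

A Grafana panel then only needs a query such as warehouse_parts_available{site="ny1"} to produce a dashboard like the one below.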

[Image: Warehouse equipment control panel in Grafana]

For server hardware that is still under warranty, we use another tool, which we call the Dispatcher. It:

  • Collects system logs;
  • Generates a report in the format required by the vendor;
  • Opens a request with the vendor through its API;
  • Receives and stores the request ID for tracking its further progress.

Once our claim is accepted (usually within a business day), the spare part is shipped to the appropriate data center and received by the on-site staff.
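The vendor and its API aren't named in the article, so the endpoint, payload, and response fields below are purely hypothetical; the sketch only shows the general shape of the Dispatcher pipeline: collect logs, build the report, open the claim, and keep the claim ID for follow-up.

```python
import json
import subprocess
import requests

VENDOR_API = "https://support.example-vendor.com/api/claims"   # hypothetical endpoint
API_KEY = "..."                                                # vendor-issued credential

def collect_system_logs(host: str) -> str:
    """Grab a hardware-related log dump from the affected server (simplified to dmesg here)."""
    return subprocess.run(
        ["ssh", host, "dmesg", "--level=err,warn"],
        capture_output=True, text=True, check=True,
    ).stdout

def open_warranty_claim(host: str, serial: str, fault: str) -> str:
    report = {
        "serial_number": serial,
        "fault_description": fault,
        "logs": collect_system_logs(host),
    }
    resp = requests.post(
        VENDOR_API,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=report,
        timeout=60,
    )
    resp.raise_for_status()
    claim_id = resp.json()["claim_id"]         # hypothetical response field
    # Persist the ID so a follow-up job can poll the claim's status later.
    with open(f"/var/lib/dispatcher/{serial}.json", "w") as f:
        json.dump({"claim_id": claim_id, "host": host}, f)
    return claim_id
```

In our case the screenshot below hints at where such a job runs: the Dispatcher is driven from Jenkins.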

[Image: Jenkins console output]

Communication domain

To keep pace with the rapid growth of our business and its ever-increasing need for capacity, we had to adapt the way we work with the technical staff of the local data centers. While at first scaling meant buying new servers, after the consolidation project (built around the move to Kubernetes) it became something quite different: we evolved from "adding racks" to "repurposing servers".

The new approach also required new tools for more comfortable interaction with data center personnel. These tools had to provide:

  • Simplicity;
  • Autonomy;
  • Efficiency;
  • Reliability.

We had to take ourselves out of the loop and structure the work so that technicians could work directly with the server hardware, without our intervention and without constantly raising questions about workload, working hours, equipment availability, and so on.

To achieve this, we installed an iPad in every data center. When it is connected to a server, the following happens (a rough sketch follows the list):

  • The device confirms that the server really requires the work;
  • Applications running on the server are shut down (if necessary);
  • A set of work instructions is posted to a Slack channel, explaining the required steps;
  • When the work is done, the device checks that the server ended up in the correct final state;
  • Applications are restarted if necessary.
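The article doesn't detail how these checks are implemented, so here is a minimal sketch of the sequence, assuming Slack's Python SDK for the notifications; the helper functions (needs_maintenance, drain, verify_state, restore) are stand-ins for whatever the real tooling does.

```python
from slack_sdk import WebClient

slack = WebClient(token="xoxb-...")            # bot token, assumed to be configured
CHANNEL = "#dc-maintenance"                    # hypothetical channel name

# The helpers below are stubs standing in for the real tooling.
def needs_maintenance(server: str) -> bool:    # e.g. look the server up in the ticket queue
    return True

def drain(server: str) -> None:                # e.g. stop or reschedule its workloads
    pass

def verify_state(server: str) -> bool:         # e.g. health, inventory, and firmware checks
    return True

def restore(server: str) -> None:              # e.g. bring the applications back up
    pass

def run_maintenance_flow(server: str, instructions: list[str]) -> None:
    """Rough shape of what happens after the iPad is connected to a server."""
    if not needs_maintenance(server):
        slack.chat_postMessage(channel=CHANNEL, text=f"{server}: no work scheduled, stopping.")
        return
    drain(server)
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(instructions, 1))
    slack.chat_postMessage(channel=CHANNEL, text=f"Work instructions for {server}:\n{steps}")
    # ... the technician performs the physical work and confirms in Slack ...
    if verify_state(server):
        restore(server)
        slack.chat_postMessage(channel=CHANNEL, text=f"{server}: verified, back in service.")
    else:
        slack.chat_postMessage(channel=CHANNEL, text=f"{server}: final state check failed.")
```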

On top of that, we built a Slack bot to help the technicians. Thanks to a wide (and constantly expanding) set of features, the bot made their work easier, and it made life much easier for us too. This is how we streamlined most of the server repurposing and maintenance process by removing ourselves from the workflow.
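The bot's actual feature set isn't listed, but to illustrate the kind of helper it might be, here is a minimal sketch using Bolt for Python; the /server-status command and the lookup behind it are invented for the example.

```python
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

def lookup_server(name: str) -> dict:
    """Stand-in for the real inventory lookup."""
    return {"state": "drained", "pending_work": "replace DIMM A3"}

# Hypothetical command: a technician types "/server-status ny1-rack12-u07"
# and gets back what the fleet database knows about that machine.
@app.command("/server-status")
def server_status(ack, respond, command):
    ack()                                      # Slack expects an ack within 3 seconds
    server = command["text"].strip()
    info = lookup_server(server)
    respond(
        f"*{server}*: state={info['state']}, "
        f"pending work: {info['pending_work'] or 'none'}"
    )

if __name__ == "__main__":
    app.start(port=3000)                       # HTTP mode; Socket Mode is also an option
```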

[Image: An iPad in one of our data centers]

Hardware observability domain

Reliable scaling of our data center infrastructure requires good visibility of each component, for example:

  • Hardware failure detection
  • Server states (active, hosted, zombie, etc.)
  • Power consumption
  • Firmware versions
  • Analytics across all of the above

Our tooling lets us make decisions about how, where, and when to purchase equipment, sometimes even before the need actually arises. Also, by measuring the load on different pieces of equipment, we were able to improve the distribution of resources, in particular power consumption. Now we can make informed decisions about where a server should be placed before it is racked and connected to power, and throughout its lifecycle, right up to its eventual retirement.
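Neither the collection mechanism nor the placement logic is described in the article, so the sketch below is illustrative only: it assumes BMCs that expose the standard Redfish Power resource (common, but vendor-dependent) and a simple power-headroom rule for choosing a rack.

```python
import requests

def power_draw_watts(bmc_host: str, auth: tuple[str, str]) -> float:
    """Read current power consumption from a Redfish-capable BMC."""
    url = f"https://{bmc_host}/redfish/v1/Chassis/1/Power"   # chassis ID varies by vendor
    resp = requests.get(url, auth=auth, verify=False, timeout=10)  # BMCs often use self-signed certs
    resp.raise_for_status()
    return resp.json()["PowerControl"][0]["PowerConsumedWatts"]

def pick_rack(racks: dict[str, dict], server_estimate_w: float) -> str | None:
    """Choose the first rack whose measured draw leaves enough headroom for a new server."""
    for name, rack in racks.items():
        headroom = rack["budget_w"] - sum(rack["measured_w"])
        if headroom >= server_estimate_w:
            return name
    return None

if __name__ == "__main__":
    # Made-up numbers, purely to show the decision rule.
    racks = {
        "ny1-r12": {"budget_w": 8000, "measured_w": [320, 310, 305, 340] * 5},
        "ny1-r13": {"budget_w": 8000, "measured_w": [280, 290] * 6},
    }
    print(pick_rack(racks, server_estimate_w=450))
```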

[Image: Energy dashboard in Grafana]

And then COVID-19 came along...

Our team builds technology that empowers media companies and online publishers, helping visitors discover relevant content, products, and services that may interest them. Our infrastructure is designed to absorb the traffic generated whenever a major news story breaks.

The heavy media coverage of COVID-19, coupled with the resulting surge in traffic, meant that we urgently needed to learn how to handle that kind of pressure. And all of it had to be done during a global crisis, with supply chains disrupted and most of the staff at home.

But, as we said, our model already assumes that:

  • The equipment in our data centers is, for the most part, not physically accessible to us;
  • We do almost all physical work remotely;
  • The work is performed asynchronously, autonomously, and at scale;
  • We meet the demand for hardware by "building from parts" instead of buying new equipment;
  • We keep a stock that allows us to build something new, not just carry out routine repairs.

So the global restrictions that kept many companies from getting physical access to their data centers had little effect on us. As for parts and servers: yes, we did work to keep the equipment supply stable, but we did it to prevent potential incidents where some piece of hardware suddenly turns out to be unavailable. We were topping up our reserves rather than trying to satisfy current demand.

In conclusion, I would like to say that our approach to working in the data center industry shows that the principles of good code design can be applied to the physical management of a data center as well. And you might find that interesting.

Original: "4 Engineers, 7000 Servers, And One Global Pandemic" by Adib Daw

Source: habr.com
