How to withstand increased load on the system: our large-scale preparation for Black Friday

Hey Habr!

During Black Friday 2017, load increased almost one and a half times, and our servers were running at the limit of their capabilities. Over the following year the number of clients grew significantly, and it became clear that without careful preparation the platform might simply not withstand the load of 2018.

We set the most ambitious goal possible: to be fully prepared for any burst of activity, even the most powerful one, so we began rolling out new capacity in advance, over the course of the year.

Our CTO Andrey Chizh (chizh_andrey) explains how we prepared for Black Friday 2018, what measures we took to avoid outages, and, of course, what this thorough preparation achieved.

Today I want to talk about our preparation for Black Friday 2018. Why now, when most of the major sales are over? We started preparing about a year before the large-scale campaigns and found the best solution by trial and error. We recommend that you also prepare for the hot seasons in advance and head off failures that can surface at the most inopportune moment.
This material will be useful to everyone who wants to squeeze the maximum profit out of such promotions, because the technical side matters here no less than the marketing side.

Features of traffic on big sales

Contrary to popular belief, Black Friday is not just one day a year but almost a whole week: the first discount offers appear 7-8 days before the sale. Site traffic rises smoothly throughout the week, peaks on Friday and drops quite sharply on Saturday to the store's regular levels.

This is important to keep in mind: during this week online stores become especially sensitive to any "slowdowns" in the system. In addition, our email marketing product also saw a significant increase in the number of sends.

It is strategically important for us to get through Black Friday without outages, because the most important functionality of store websites and newsletters depends on the operation of the platform, namely:

  • Tracking and serving product recommendations,
  • Serving related assets (for example, design images for recommendation blocks, such as arrows, logos, icons and other visual elements),
  • Serving product images of the required size (for this we have "ImageResizer", a subsystem that downloads an image from the store's server, scales it to the desired size, and serves it through caching servers for each product in each recommendation block).

In fact, during Black Friday 2018, the load on the service increased by 40%, i.e. the number of events that the Retail Rocket system tracks and processes on online store sites grew from 5 to 8 thousand requests per second. Because we had prepared for heavier loads, we got through this surge easily.

General preparation

Black Friday is a hot time for all of retail, and for ecommerce in particular. The number of users and their activity grow several-fold during this period, so, as always, we prepared thoroughly. Add to this the fact that many of the online stores connected to us are not only in Russia but also in Europe, where the excitement is much higher, and you get more drama than a Brazilian soap opera. What needs to be done to be fully prepared for increased load?

Working with servers

To begin with, we had to find out exactly how much server capacity we were missing. We started ordering new servers specifically for Black Friday as early as August and added 10 machines in total. By November they were all fully in service.

At the same time, some of the build machines were reinstalled for use as Application servers. We prepared them for both functions at once, issuing recommendations and running the ImageResizer service, so that each of them could take on either role depending on the type of load. In normal mode, the Application and ImageResizer servers have clearly separated functions: the former issue recommendations, the latter supply images for emails and for recommendation blocks on the online store's website. In preparation for Black Friday, we decided to make all of these servers dual-purpose so that traffic could be balanced between them depending on the type of load.
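
The dual-purpose idea can be illustrated with a minimal Python sketch. It is an assumption about how such a split might be computed, not a description of the actual balancing mechanism: the function name `assign_roles`, the server names and the request rates are all hypothetical.

```python
def assign_roles(servers, recommendation_rps, image_rps):
    """Split a pool of dual-purpose servers between two roles
    in proportion to the current request rate of each role.

    Hypothetical helper: the article does not say how roles were chosen.
    """
    if len(servers) < 2:
        raise ValueError("need at least two servers to cover both roles")
    total = recommendation_rps + image_rps
    # With no traffic at all, fall back to an even split.
    share = recommendation_rps / total if total > 0 else 0.5
    # Keep at least one server in each role.
    n_rec = min(max(round(share * len(servers)), 1), len(servers) - 1)
    return {
        "recommendations": servers[:n_rec],
        "image_resizer": servers[n_rec:],
    }


pool = [f"app{i}" for i in range(1, 11)]   # hypothetical server names
roles = assign_roles(pool, recommendation_rps=8000, image_rps=2000)
```

With these made-up rates, eight of the ten machines serve recommendations and two serve images; if the image load grows, calling the function again shifts machines to the other role.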

Then we added two large servers for Kafka (Apache Kafka) and ended up with a cluster of 5 powerful machines. Unfortunately, not everything went as smoothly as we would have liked: during data synchronization the two new machines saturated the entire network link, and we had to figure out urgently how to add them quickly and safely for the whole infrastructure. To resolve this, our administrators valiantly sacrificed their days off.
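
The article does not say how the synchronization problem was solved. One common approach is to cap replication bandwidth (Kafka's partition reassignment tool accepts a `--throttle` value in bytes per second), and the trade-off can be estimated with a small Python sketch. All numbers and function names below are hypothetical.

```python
def throttle_bytes_per_s(link_gbit_per_s, replication_share):
    """Bytes per second that replication may use on the link."""
    if not 0 < replication_share <= 1:
        raise ValueError("replication_share must be in (0, 1]")
    return int(link_gbit_per_s * 1e9 / 8 * replication_share)


def sync_duration_hours(data_gb, link_gbit_per_s, replication_share):
    """Estimate how long a new broker needs to copy `data_gb` gigabytes
    when replication may use only `replication_share` of the link."""
    if not 0 < replication_share <= 1:
        raise ValueError("replication_share must be in (0, 1]")
    link_gb_per_s = link_gbit_per_s / 8            # gigabits -> gigabytes
    throttled_gb_per_s = link_gb_per_s * replication_share
    return data_gb / throttled_gb_per_s / 3600


# Assumed figures: 4500 GB to copy over a 1 Gbit/s link,
# with half of the link reserved for production traffic.
limit = throttle_bytes_per_s(1, 0.5)
hours = sync_duration_hours(4500, 1, 0.5)
```

Under these assumptions the throttle is 62,500,000 bytes per second and the copy takes about 20 hours, which shows why adding brokers is better scheduled well before the peak.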

Working with data

In addition to the servers, we decided to lighten the load by optimizing how files are served, and migrating static files was a big step for us. All static files that were previously hosted on our own servers were moved to S3 + CloudFront. We had wanted to do this for a long time, since the load on the servers was close to the limit, and now we had a great reason.

A week before Black Friday, the image caching time was increased to 3 days, so that if ImageResizer crashed, previously cached images would still be served from the CDN. This also reduced the load on our servers, because the longer an image stays cached, the less often we need to spend resources on resizing.

Last but not least, 5 days before Black Friday we announced a moratorium on deploying any new functionality and on any infrastructure work, so that all attention could go to coping with the increased load.
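
How the caching time was configured is not stated here. If it is controlled by a response header from the origin, which is an assumption, a 3-day lifetime might look like this minimal Python sketch (the function name `image_cache_headers` is hypothetical):

```python
from datetime import timedelta

# Three days, as described above; expressed in seconds for the header.
CACHE_LIFETIME = timedelta(days=3)


def image_cache_headers(lifetime=CACHE_LIFETIME):
    """Build response headers that let a CDN keep a resized image
    for `lifetime` before asking the origin again."""
    seconds = int(lifetime.total_seconds())
    return {"Cache-Control": f"public, max-age={seconds}"}


headers = image_cache_headers()
# headers == {"Cache-Control": "public, max-age=259200"}
```

A CDN may also apply its own minimum and maximum lifetimes on top of this header, so the effective value depends on the distribution settings as well.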

Response Plans for Difficult Situations

No matter how good the preparation is, failures are always possible. So we developed 3 response plans for possible critical situations:

  • load reduction,
  • disabling some services,
  • complete shutdown of the service.

Plan A: Reduce load. This plan was to be activated if a surge in load pushed our servers beyond the allowable response times. For this case we prepared mechanisms for gradually reducing the load by switching part of the traffic to Amazon servers, which would simply answer every request with "200 OK" and an empty body. We understood that this degrades the quality of the service, but the choice between a service that does not work at all and one that shows no recommendations for about 10% of traffic is obvious.
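
A minimal Python sketch of the Plan A idea follows. The article does not say how the traffic share was switched, so the hash-based bucketing, the request key and all function names here are assumptions made for illustration.

```python
import hashlib


def goes_to_stub(request_key, stub_percent):
    """Return True if this request should be answered by the stub.

    The key is hashed into a stable bucket 0..99, so raising
    `stub_percent` step by step only ever adds requests to the stub.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < stub_percent


def stub_response():
    """The degraded answer: status 200 with an empty body."""
    return 200, b""


def handle(request_key, stub_percent, real_backend):
    """Route one request either to the stub or to the real backend."""
    if goes_to_stub(request_key, stub_percent):
        return stub_response()
    return real_backend(request_key)


# Hypothetical usage: shed roughly 10% of traffic.
status, body = handle("visitor-42", 10, lambda key: (200, b"[recommendations]"))
```

Because the bucket depends only on the key, a given visitor gets consistent behavior while the percentage stays the same, and the load can be reduced gradually by increasing a single number.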

Plan B: Disable services. This meant a partial degradation of the service, for example, slowing down the calculation of personal recommendations in order to unload some of the databases and communication channels. In normal mode, recommendations are calculated in real time, creating a version of the online store for each visitor, but under increased load, reducing that speed allows the other core services to keep working.
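
One possible reading of "slowing down the calculation" is to reuse a visitor's previous recommendations for a while instead of recomputing them on every request. The Python sketch below illustrates that reading only; the class `RecommendationCache`, its parameters and the 300-second interval are hypothetical.

```python
import time


class RecommendationCache:
    """Recompute recommendations at most once per `min_refresh_seconds`.

    With the default of 0 every request triggers a fresh calculation
    (normal, real-time mode); a larger value is the degraded mode.
    """

    def __init__(self, compute, min_refresh_seconds=0.0, clock=time.monotonic):
        self._compute = compute
        self.min_refresh_seconds = min_refresh_seconds
        self._clock = clock
        self._cache = {}   # visitor_id -> (computed_at, recommendations)

    def get(self, visitor_id):
        now = self._clock()
        cached = self._cache.get(visitor_id)
        if cached is not None and now - cached[0] < self.min_refresh_seconds:
            return cached[1]            # reuse the earlier result
        result = self._compute(visitor_id)
        self._cache[visitor_id] = (now, result)
        return result


def expensive_recommendations(visitor_id):
    # Stand-in for the real calculation that loads the databases.
    return [f"item-for-{visitor_id}"]


cache = RecommendationCache(expensive_recommendations)
cache.min_refresh_seconds = 300   # switch to degraded mode under load
```

The switch is a single attribute, so the service can return to real-time mode as soon as the load drops.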

Plan C: In case of Armageddon. For a total system failure, we prepared a plan to disconnect ourselves from customers safely. Store visitors would simply stop seeing recommendations, and the performance of the online store would not suffer in any way. To do this, we would zero out our integration file so that new users stop interacting with the service. That is, we would disable our main tracking code, the service would stop collecting data and calculating recommendations, and the user would simply see the page without recommendation blocks. For everyone who had already received the integration file, we provided the option of switching the DNS record to Amazon and the 200 OK stub.

Results

We coped with the entire load without even needing the additional build machines. Thanks to the advance preparation, we did not need any of the response plans we had developed. Still, all the work done is invaluable experience that will help us cope with the most unexpected and largest influxes of traffic.
As in 2017, the load on the service increased by 40% during Black Friday, and the number of users in online stores increased by 60%. All the difficulties and mistakes occurred during the preparation period, which spared us and our clients from unforeseen situations.

How do you get through Black Friday? How do you prepare for critical loads?

Source: habr.com
