What helped us to quickly adapt to online trading in the new conditions

Hi!

My name is Mikhail, I'm Deputy IT Director at Sportmaster. I want to share a story about how we coped with the difficulties that arose during the pandemic.

In the early days of the new reality, Sportmaster's usual offline trading format came to a halt, and the load on our online channel, primarily on delivery to the customer's address, increased tenfold. Within a few weeks we transformed a gigantic offline business into an online one and adapted the service to the needs of our customers.

In short, what had essentially been a side operation became our core business. The importance of every online order grew enormously. We had to save every ruble the customer brought to the company.


To respond to customer requests quickly, we opened an additional contact center at the company's main office, and now we can handle about 285 calls per week. At the same time, we moved 270 stores to a new contactless and safe work format, which allowed customers to receive orders and employees to keep their jobs.

In the process of transformation, we ran into two main problems. First, the load on our online resources increased noticeably (Sergey will tell you how we dealt with this). Second, the flow of operations that had been rare before COVID grew many times over, which in turn required a large amount of rapid automation. To solve this problem, we had to quickly move resources away from the areas that used to be the main ones. Elena will tell how we coped with that.

Operation of online services

Sergey Kolesnikov, responsible for the operation of the online store and microservices

From the moment our retail stores began to close to visitors, we started recording growth in metrics such as the number of users, the number of orders placed in our application, and the number of requests to our applications.

Number of orders from March 18 to 31
Number of requests to online payment microservices
Number of orders placed on the site

The first graph shows an increase of about 14 times, the second of about 4 times. But the metric we consider most indicative is the response time of our applications.

Response time of fronts and applications

This graph shows the response times of the fronts and the applications, and here we saw essentially no growth at all.

This is primarily because we started preparatory work at the end of 2019. Now our services are redundant, with fault tolerance provided at the level of physical servers, virtualization systems, Docker containers, and the services inside them. At the same time, the capacity of our server resources lets us survive many times the normal load.

The main tool that helped us through this whole story was our monitoring system. True, until quite recently we did not have a single system that would let us collect metrics across all layers, from physical equipment and hardware up to business metrics.

Formally, monitoring existed in the company, but as a rule it was scattered and sat within the areas of responsibility of specific departments. In practice, when an incident occurred, we almost never had a shared understanding of what exactly had happened; there was no information, and that often sent us running in circles to localize the problem before we could fix it.

At some point we decided we had had enough of that: we needed a single system to see the whole picture. The main technologies in our stack are Zabbix as an alerting center and metric store, Prometheus for collecting and storing application metrics, the ELK stack for logging and for storing the data of the entire monitoring system, plus Grafana for visualization, Swagger, Docker, and other useful things you know.
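
To give a flavor of what "application metrics for Prometheus" means in practice, here is a minimal sketch of a Java service exposing a counter for Prometheus to scrape, using the official Java simpleclient. The metric name and port are hypothetical; the article does not describe our actual exporters.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class MetricsExample {
    // hypothetical business metric, for illustration only
    static final Counter ORDERS = Counter.build()
            .name("orders_placed_total")
            .help("Total number of orders placed")
            .register();

    public static void main(String[] args) throws Exception {
        // expose /metrics for Prometheus to scrape (port is arbitrary)
        HTTPServer server = new HTTPServer(9091);
        ORDERS.inc(); // call this wherever an order is actually placed
    }
}
```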

At the same time, we use not only technologies available on the market, but also develop things ourselves. For example, we build services to integrate systems with each other, essentially an API for collecting metrics. We are also working on our own monitoring at the level of business metrics, where we use UI tests, plus a Telegram bot to notify teams.

We are also trying to make the monitoring system accessible to teams so that they can store their own metrics and work with them, including setting up alerts on narrow metrics that are not widely applicable.

Throughout the system, we strive for proactivity and the fastest possible localization of incidents. Moreover, the number of our microservices and systems has grown significantly in recent years, and the number of integrations has grown with it. To optimize incident diagnostics at the integration level, we are developing a system that runs cross-system checks and displays the results, which helps us find the main problems with data imports and with how systems interact with each other.

Of course, we still have room to grow and develop in terms of operating our systems, and we are actively working on it. You can read more about our monitoring system here.

Technical trials

Sergey Orlov, Head of the Competence Center for Web and Mobile Development

Since the physical stores began to close, we have had to face a variety of challenges on the development side. First of all, the load spike itself. Clearly, if appropriate measures are not taken, then under high load a system can turn into a pumpkin with a sad pop: badly degrade in performance or lose it altogether.

The second aspect, slightly less obvious, is that a system under high load had to be changed very quickly to adapt to changes in business processes, sometimes several times a day. Many companies have a rule that during large marketing campaigns no changes may be made to the system. None at all: it works, so don't touch it.

We, in essence, had an endless Black Friday during which we had to keep changing the system. And any mistake, problem, or failure in the system would have been very expensive for the business.

Looking ahead, I'll say that we managed to cope with these trials: all systems withstood the load and scaled easily, and we had no global technical failures.

There are four pillars on which a system's ability to withstand sudden load spikes rests. The first is monitoring, which you read about above. Without a well-built monitoring system, it is almost impossible to find a system's bottlenecks. A good monitoring system is like home clothes: comfortable and tailored to you.

The second pillar is testing. We take this very seriously: for each system we write classic unit tests, integration tests, load tests, and many others. We also write a testing strategy, and we are trying to bring the level of testing to the point where we no longer need manual checks.
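
For illustration, the kind of classic unit test meant here might look like this in JUnit 5; the Cart class is hypothetical and defined inline purely to keep the example self-contained.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

class CartTest {
    // hypothetical domain class, inlined so the example compiles on its own
    static class Cart {
        private final List<Integer> lines = new ArrayList<>();
        void add(String sku, int qty, int price) { lines.add(qty * price); }
        int total() { return lines.stream().mapToInt(Integer::intValue).sum(); }
    }

    @Test
    void totalIsTheSumOfLineItems() {
        Cart cart = new Cart();
        cart.add("sku-1", 2, 100); // 2 items at 100 each
        cart.add("sku-2", 1, 50);
        assertEquals(250, cart.total());
    }
}
```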

The third pillar is the CI/CD pipeline. The processes of building, testing, and deploying an application should be automated as much as possible; there should be no manual intervention here. The topic of CI/CD pipelines is quite deep, and I will only touch on it in passing. It is worth mentioning that we have a CI/CD pipeline checklist, which every product team goes through with the help of the competence centers.

And here is the checklist

This achieves many goals: API versioning, feature toggles to avoid a release train, test coverage high enough that testing is fully automated, seamless deployments, and so on.
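
A feature toggle in its simplest form might look like the sketch below. This is an illustrative assumption: the article does not name our actual toggle mechanism, and the flag name and flow are hypothetical.

```java
import java.util.Map;

// Minimal feature-toggle sketch: flags come from configuration and gate new code
// paths, so a new flow can ship dark and be enabled without a release train.
public class CheckoutRouter {
    private final Map<String, Boolean> flags; // in reality: a config service, env vars, etc.

    public CheckoutRouter(Map<String, Boolean> flags) {
        this.flags = flags;
    }

    public String checkout(String orderId) {
        if (flags.getOrDefault("new-checkout-flow", false)) {
            return "processed by new flow: " + orderId;
        }
        return "processed by legacy flow: " + orderId;
    }
}
```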

The fourth pillar is architectural principles and technical decisions. One can talk about architecture for a long time, but I want to emphasize a couple of principles worth focusing on.

First, you need to choose specialized tools for specific tasks. Yes, it sounds obvious, and it is clear that nails should be driven with a hammer and watches disassembled with special screwdrivers. But in our age many tools strive for universality to cover the widest possible segment of users: databases, caches, frameworks, and the rest. For example, MongoDB now handles multi-document transactions, and Oracle handles JSON. It would seem that anything can be used for everything. But if we are fighting for performance, we need a clear grasp of each tool's strengths and weaknesses and must use the right ones for our class of tasks.

Second, when designing systems, every increase in complexity must be justified. We must keep this in mind constantly; the principle of low coupling is familiar to everyone. I believe it should be applied at the level of an individual service, at the level of the entire system, and at the level of the architectural landscape. It is also important to be able to horizontally scale every component of the system along the load path. If you have that ability, scaling will not be difficult.

Speaking of technical solutions, we asked the product teams to put together a fresh set of recommendations, ideas, and solutions they implemented while preparing for the next wave of load.

Caches

Choose between local and distributed caches consciously. Sometimes it makes sense to use both within the same system. For example, we have systems in which part of the data is essentially a storefront cache: the source of updates sits behind the system, and the system itself does not change this data. For this approach we use a local Caffeine cache.
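
A minimal sketch of such a local read-through cache with Caffeine follows; the size bound, TTL, and upstream loader are illustrative assumptions, not our production values.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class StorefrontCache {
    // Read-through cache for data the system itself never mutates:
    // entries load from the upstream source and are refreshed by expiry.
    private final LoadingCache<String, String> productCards = Caffeine.newBuilder()
            .maximumSize(100_000)                     // illustrative bound
            .expireAfterWrite(Duration.ofMinutes(10)) // illustrative TTL
            .build(StorefrontCache::fetchFromUpstream);

    public String productCard(String sku) {
        return productCards.get(sku);
    }

    // hypothetical upstream call, stands in for the real source of updates
    private static String fetchFromUpstream(String sku) {
        return "card-for-" + sku;
    }
}
```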

And there is data that the system actively changes during operation; there we already use a distributed cache with Hazelcast. This approach lets us use the benefits of a distributed cache where we really need them, and minimize the overhead of circulating data around a Hazelcast cluster where we can do without it. We wrote a lot about caches here and here.

In addition, switching the Hazelcast serializer to Kryo gave us a nice boost, and moving from ReplicatedMap to IMap + Near Cache in Hazelcast let us minimize the movement of data across the cluster.
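
A sketch of enabling a Near Cache on an IMap is below, assuming the Hazelcast 4.x API; the map name, values, and settings are illustrative, not our production configuration.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class NearCacheExample {
    public static void main(String[] args) {
        // A Near Cache keeps hot entries on the local member, so repeated
        // reads stop crossing the cluster for every get().
        NearCacheConfig nearCache = new NearCacheConfig("prices")
                .setInMemoryFormat(InMemoryFormat.OBJECT)
                .setInvalidateOnChange(true); // drop local copies when an entry changes

        Config config = new Config();
        config.getMapConfig("prices").setNearCacheConfig(nearCache);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, Long> prices = hz.getMap("prices");
        prices.put("sku-1", 4990L);
        prices.get("sku-1"); // subsequent reads are served from the Near Cache
    }
}
```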

A word of advice: in the case of mass cache invalidation, the tactic of warming up a second cache and then switching over to it is sometimes applicable. You would expect double memory consumption with this approach, but in practice, in the systems where we used it, memory consumption actually decreased.
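
One way to express that warm-then-switch tactic is an atomic reference that readers always go through, sketched here with a plain ConcurrentHashMap standing in for the cache; the article does not describe our actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

public class SwappableCache<K, V> {
    private final Function<K, V> loader;
    private final AtomicReference<Map<K, V>> active =
            new AtomicReference<>(new ConcurrentHashMap<>());

    public SwappableCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        return active.get().computeIfAbsent(key, loader);
    }

    // Mass invalidation: warm a fresh cache off to the side, then switch
    // readers to it atomically; the old map becomes garbage and is collected.
    public void refresh(List<K> hotKeys) {
        Map<K, V> fresh = new ConcurrentHashMap<>();
        hotKeys.forEach(k -> fresh.put(k, loader.apply(k)));
        active.set(fresh);
    }
}
```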

Reactive stack

We already use the reactive stack in a fairly large number of systems; in our case, that is WebFlux or Kotlin with coroutines. The reactive stack is especially good where slow input-output operations are expected: calls to slow services, work with the file system or with storage systems.
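
For instance, a non-blocking call to a slow downstream service with WebFlux's WebClient might look like this; the host, path, timeout, and types are illustrative assumptions.

```java
import java.time.Duration;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class AvailabilityClient {
    // hypothetical internal host, for illustration only
    private final WebClient client = WebClient.create("https://slow-service.internal");

    public Mono<String> availability(String sku) {
        // The calling thread is released immediately; the result arrives asynchronously.
        return client.get()
                .uri("/availability/{sku}", sku)
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofSeconds(2)); // fail fast instead of tying up resources
    }
}
```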

The most important principle is to avoid blocking calls. Under the hood, reactive frameworks run a small number of live service threads. If we carelessly allow ourselves a direct blocking call, such as a call to a JDBC driver, the system will simply grind to a halt.
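
When a blocking call (JDBC, a legacy SDK) is unavoidable, Reactor's standard remedy is to push it onto a scheduler meant for blocking work, as in this sketch; the repository lookup is hypothetical.

```java
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class OrderLookup {
    // Wrap the blocking call and move it off the event-loop threads,
    // so the handful of service threads never stall on it.
    public Mono<String> findOrder(long id) {
        return Mono.fromCallable(() -> blockingJdbcLookup(id)) // runs lazily, on subscribe
                .subscribeOn(Schedulers.boundedElastic());     // pool intended for blocking work
    }

    // stands in for a real JDBC query
    private String blockingJdbcLookup(long id) {
        return "order-" + id;
    }
}
```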

Try to convert errors into runtime exceptions of your own. The actual flow of program execution moves into the reactive framework, and code execution becomes non-linear, so it is very difficult to diagnose problems from stack traces. The solution is to create a clear, domain-specific runtime exception for every error.
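
In Reactor that can look like the sketch below: low-level failures are mapped into a domain exception at the boundary. The exception class and the wrapped call are hypothetical.

```java
import reactor.core.publisher.Mono;

public class StockService {
    // Domain-specific exception that names the operation that failed.
    static class StockLookupException extends RuntimeException {
        StockLookupException(String sku, Throwable cause) {
            super("Stock lookup failed for sku=" + sku, cause);
        }
    }

    // Whatever the transport layer throws deep inside the reactive chain,
    // callers and logs see one meaningful, searchable error.
    public Mono<Integer> stockFor(String sku, Mono<Integer> rawCall) {
        return rawCall.onErrorMap(e -> new StockLookupException(sku, e));
    }
}
```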

Elasticsearch

When using Elasticsearch, do not select data you will not use. This, in principle, is also very simple advice, but it is the one most often forgotten. And if you need to select more than 10 thousand records at a time, use Scroll. By analogy, it is a bit like a cursor in a relational database.
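
With the Java high-level REST client, a Scroll read looks roughly like this; the index name and page size are illustrative, and the TimeValue import varies by Elasticsearch 7.x minor version.

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ScrollExample {
    // Reads a large result set page by page instead of one huge from+size query.
    static void scrollAll(RestHighLevelClient client) throws Exception {
        SearchRequest request = new SearchRequest("orders"); // hypothetical index
        request.scroll(TimeValue.timeValueMinutes(1));
        request.source(new SearchSourceBuilder().size(1000)); // page size

        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        String scrollId = response.getScrollId();
        while (response.getHits().getHits().length > 0) {
            // ... process the current page of hits here ...
            SearchScrollRequest next = new SearchScrollRequest(scrollId);
            next.scroll(TimeValue.timeValueMinutes(1));
            response = client.scroll(next, RequestOptions.DEFAULT);
            scrollId = response.getScrollId();
        }
    }
}
```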

Don't use post_filter unnecessarily: with a large main result set, this operation puts a very heavy load on the cluster.

Use bulk operations where applicable.
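
For example, documents can be indexed in batches with the bulk API rather than one request per document, as in this sketch; the index name and payloads are illustrative, and the XContentType import assumes an Elasticsearch 7.x client.

```java
import java.util.List;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkExample {
    // One round-trip for the whole batch instead of N single-document requests.
    static void indexBatch(RestHighLevelClient client, List<String> jsonDocs) throws Exception {
        BulkRequest bulk = new BulkRequest();
        for (String doc : jsonDocs) {
            bulk.add(new IndexRequest("orders").source(doc, XContentType.JSON)); // hypothetical index
        }
        client.bulk(bulk, RequestOptions.DEFAULT);
    }
}
```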

API

When designing an API, build in requirements to minimize the amount of data transmitted. This is especially true at the junction with the front end: it is there that we leave the channels of our data centers and start working over the channel that connects us to the client. If that channel has the slightest problem, excess traffic turns into a negative user experience.

And finally, don't dump the whole pile of data at once; keep the contract between consumers and providers clear.

Organizational transformation

Elena Eroshkina, Deputy IT Director

When quarantine struck and we needed to dramatically increase the pace of online development and roll out omnichannel services, we were already in the middle of an organizational transformation.

Part of our structure had been moved over to the principles and practices of the product approach. Teams had been formed that are now responsible for the operation and development of each product. Employees on such teams are involved 100% and organize their work around Scrum or Kanban, whichever they prefer, set up the deployment pipeline, and adopt technical practices, quality assurance practices, and much more.

Luckily, most of our product teams worked in the field of online and omnichannel services. This let us switch to remote work in the shortest possible time (seriously, in literally two days) without losing efficiency. The well-tuned process made it possible to adapt quickly to the new working conditions and keep up a fairly high rate of delivering new functionality.

In addition, we needed to strengthen the teams on the front line of the online business. At that moment it became clear we could only do it with internal resources, and in two weeks about 50 people changed the area they had worked in before and integrated into work on a product that was new to them.

This did not require any special management effort, because along with organizing their own process, technically improving the product, and practicing quality assurance, we teach our teams to self-organize: to manage their own production process without administrative resources.

We were able to focus the management resource exactly where it was needed at that moment: on coordinating with the business on what matters to our customer right now, what functionality to implement first, and what to do to increase our capacity to deliver and process orders. All of this, plus a clear role model, made it possible during this period to load our production value streams with what is truly important and necessary.

Clearly, with remote work and a high pace of change, when business results depend on everyone's participation, you cannot rely only on gut feelings of the "Are we doing well? Yes, it seems fine" variety. You need objective metrics of the production process. We have them, and they are available to anyone interested in a product team's metrics: first of all the team itself, then the business, contractors, and management.

Once every two weeks we hold a status meeting with each team: for 10 minutes we analyze the metrics, identify bottlenecks in the production process, and jointly decide what can be done to eliminate them. Right there you can ask management for help if an identified problem lies outside the teams' sphere of influence, or ask for the expertise of colleagues who may have already faced a similar problem.

Nevertheless, we understand that to accelerate many times over (and that is exactly the goal we set ourselves), we still have a lot to learn and to build into our daily work. Right now we are continuing to scale the product approach to other teams and new products. To do this, we had to master a new format: an online school for methodologists.

Methodologists, people who help teams build their process, improve communication, and raise work efficiency, are essentially agents of change. Right now, the alumni of our first cohort are working with teams and helping them become successful.

I think the current situation opens up opportunities and prospects for us that perhaps we ourselves are not yet fully aware of. But the experience and practice we are gaining right now confirm that we have chosen the right development path, that we will not miss these new opportunities in the future, and that we will be able to respond just as effectively to the challenges Sportmaster faces.

Conclusions

During this difficult time, we formulated the main principles on which software development rests, and I think they will be relevant for every company that does it.

People. This is what everything rests on. Employees must enjoy their work, understand the goals of the company and the goals of the products they work on. And, of course, they must be able to grow professionally.

Technology. The company needs a mature approach to working with its technology stack, building competencies where they are really needed. It sounds very simple and obvious. And it is very often ignored.

Processes. It is important to organize the work of product teams and competence centers properly, and to set up interaction with the business so that you work with it as a partner.

In general, that is how we survived. The main thesis of our time was confirmed once again, with a loud flick to the forehead:

Even if you are a huge offline business with many stores and a presence in a heap of cities, develop your online business. It is not just an extra distribution channel or a pretty app through which you can also buy something (and which exists because your competitors have a pretty one too). It is not a spare tire for a rainy day that will help you weather the storm.

It is an absolute necessity, one that not only your technical capacity and infrastructure must be ready for, but also your people and processes. You can buy more memory and storage and spin up new instances in a couple of hours, but people and processes have to be prepared in advance.

Source: habr.com
