"Walking in my shoes" - stop, are they marked?

Since 2019, a mandatory labeling law has been in force in Russia. It does not cover all product groups at once, and the dates when mandatory labeling takes effect differ by group: tobacco, footwear, and medicines were the first to fall under mandatory labeling, and other goods, such as perfumes, textiles, and milk, will be added later. This legislative change prompted the development of new IT solutions that can track the entire life of a product, from the moment of production to purchase by the end consumer, for all participants in the process: the state itself and every organization selling goods subject to mandatory labeling.

At X5, the system that tracks labeled goods and exchanges data with the state and with suppliers is called "Markus". We will tell you, step by step, how and by whom it was developed, what its technology stack looks like, and why we have something to be proud of.


Real High Load

"Markus" solves many problems, the main one being integration between X5's information systems and the state information system for labeled products (GIS MP) to track the movement of labeled goods. The platform also stores all the labeling codes we receive and the entire history of their movement between objects, which helps eliminate mis-sorting of labeled products. Take tobacco, one of the first labeled product groups: a single truckload of cigarettes contains about 600,000 packs, each with its own unique code. The task of our system is to track and verify the legality of every movement of each such pack between warehouses and stores, and ultimately to verify that its sale to the final buyer is admissible. We record a large number of cash transactions every hour, and for each sale we also need to record how that particular pack got into the store. Taking into account all the movements between objects, we expect tens of billions of records per year.

Team M

Although Markus is considered a project within X5, it is being implemented with a product approach, and the team works according to Scrum. The project started in the summer of last year, but the first results came only in October, once our own team was fully assembled, the system architecture was developed, and equipment was purchased. The team now has 16 people: six work on backend and frontend development, three on systems analysis, and six more on manual, load, and automated testing and on product maintenance. In addition, we have an SRE specialist.

Code in our team is written not only by the developers: almost everyone can program and writes autotests, load scripts, and automation scripts. We pay special attention to this, since even product support requires a high level of automation. We always try to advise and help colleagues who have not programmed before, giving them small tasks to start with.

With the coronavirus pandemic, we moved the entire team to remote work. The availability of all our development management tools and the workflow built on Jira and GitLab made it easy to get through this stage. The months spent remotely showed that the team's productivity did not suffer; for many, the comfort of work even increased. The only thing missing was live communication.

Team meeting before remote


Meetings while away


Solution technology stack

The standard repository and CI/CD tool at X5 is GitLab. We use it for code storage, continuous testing, and deployment to test and production servers. We also practice code review: at least two colleagues must approve the changes a developer makes to the code. The SonarQube static code analyzer and the JaCoCo coverage tool help us keep the code clean and maintain the required level of unit test coverage. All changes in the code must pass these checks, and all test scripts that are run manually are subsequently automated.
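A pipeline like the one described above might be sketched in `.gitlab-ci.yml` roughly as follows; the job names, image, and quality-gate setting are illustrative assumptions, not the actual X5 configuration:

```yaml
# Hypothetical sketch of a GitLab CI pipeline with tests, coverage, and a
# SonarQube quality gate; names and versions are assumptions for illustration.
stages:
  - build
  - quality

build:
  stage: build
  image: maven:3-openjdk-11
  script:
    # JaCoCo is bound to the Maven test phase, producing the coverage report
    - mvn verify

sonarqube:
  stage: quality
  image: maven:3-openjdk-11
  script:
    # Publishes analysis and coverage to SonarQube; waiting on the quality
    # gate fails the pipeline if coverage drops below the required level
    - mvn sonar:sonar -Dsonar.qualitygate.wait=true
```

With the quality gate wired into the pipeline, a merge request cannot proceed until both code review and the automated checks pass.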

For the successful implementation of Markus's business processes, we had to solve a number of technological tasks; let's take each in order.

Task 1. The need for horizontal scalability of the system

To solve this problem, we chose a microservice architecture. It was very important to understand the areas of responsibility of the services, so we tried to divide them by business operation, taking into account the specifics of the processes. For example, acceptance at a warehouse is an infrequent but very voluminous operation: we must obtain information from the state regulator as quickly as possible about the received units of goods, whose number in a single delivery reaches 600,000, check whether this product may be received at the warehouse, and pass all the necessary information to the warehouse automation system. Shipment from warehouses, by contrast, is far more frequent but operates on small amounts of data.

We implement all services according to the stateless principle and even try to divide internal operations into steps using what we call Kafka self-topics: the microservice sends a message to itself, which lets us balance the load on more resource-intensive operations and simplifies product maintenance, but more on that later.
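The self-topic idea above can be sketched as a small router that knows which topic each step of an operation consumes from and which self-topic to publish to next. This is a minimal illustration, not the real Markus code; the service and step names (`markus.acceptance`, `validate`, etc.) are made up.

```java
import java.util.List;

// Hypothetical sketch of "self-topics": a service splits one long operation
// into steps and hands each step to itself through its own Kafka topic.
public class SelfTopicRouter {
    private final String service;
    private final List<String> steps;

    public SelfTopicRouter(String service, List<String> steps) {
        this.service = service;
        this.steps = steps;
    }

    /** Topic the given step consumes from, e.g. "markus.acceptance.validate". */
    public String topicFor(String step) {
        return service + "." + step;
    }

    /** Next self-topic to publish to, or null when the chain is finished. */
    public String nextTopicAfter(String step) {
        int i = steps.indexOf(step);
        if (i < 0 || i == steps.size() - 1) return null;
        return topicFor(steps.get(i + 1));
    }
}
```

Because each step is a separate consumer on its own topic, a resource-intensive step can be scaled (or restarted) independently of the others.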

We decided to separate the modules that interact with external systems into dedicated services. This solved the problem of the frequently changing APIs of external systems with little or no impact on the services carrying business functionality.


All microservices are deployed in an OpenShift cluster, which solves both the problem of scaling each microservice and allows us not to use third-party Service Discovery tools.

Task 2. The need to sustain a high load and very intensive data exchange between platform services: at the launch phase alone, about 600 operations per second are performed. We expect this to grow to 5,000 ops/sec as retail objects are connected to the platform.

We solved this task by deploying a Kafka cluster and almost completely abandoning synchronous interaction between platform microservices. This requires very careful analysis of the system requirements, since not all operations can be asynchronous. We not only send events through the broker but also put all the required business information into the message, so a message can reach several hundred kilobytes. Kafka's message size limit forces us to predict message sizes accurately and split messages when necessary, but the split is a logical one, tied to business operations: for example, goods that arrived by truck are divided into boxes. Separate microservices are allocated for synchronous operations, and thorough load testing is carried out. Using Kafka presented another challenge: verifying a service's behavior together with its Kafka integration makes all our unit tests asynchronous. We solved this by writing our own utility methods on top of the Embedded Kafka Broker. This does not remove the need for unit tests of individual methods, but we prefer to test complex cases through Kafka.
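The "logical split" mentioned above might look like the following sketch: a truck-level delivery is cut into one message per box, each carrying enough context to be processed on its own. The class and field names are assumptions for illustration, not the real Markus data model.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of splitting an oversized business message along a
// business boundary (truck -> boxes) so each message stays under the
// broker's size limit.
public class DeliverySplitter {
    public record Box(String boxId, List<String> markingCodes) {}
    public record Delivery(String deliveryId, List<Box> boxes) {}
    // Each part keeps the delivery id and part numbering for reassembly.
    public record BoxMessage(String deliveryId, int part, int totalParts, Box box) {}

    /** One Kafka message per box, each carrying enough context to correlate. */
    public static List<BoxMessage> split(Delivery d) {
        List<BoxMessage> out = new ArrayList<>();
        int total = d.boxes().size();
        for (int i = 0; i < total; i++) {
            out.add(new BoxMessage(d.deliveryId(), i + 1, total, d.boxes().get(i)));
        }
        return out;
    }
}
```

Splitting along a business boundary, rather than by raw byte count, means each resulting message is still a complete, independently processable unit.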

We paid a lot of attention to log tracing, so that the TraceId is not lost when exceptions occur during service operation or when working with Kafka batches. While the first case posed no particular issues, in the second we are forced to log all the TraceIds the batch arrived with and select one to continue the trace. Then, searching by the initial TraceId, the user can easily find how the trace continued.
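The batch-tracing rule above can be sketched as two small helpers: collect every TraceId the batch arrived with (to log them all), then pick one, here simply the first, to continue the trace. The record type is a stand-in for a consumed Kafka record with a tracing header, not the real implementation.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of TraceId handling for a Kafka batch: log all incoming ids,
// continue the trace under exactly one of them.
public class BatchTracing {
    public record TracedRecord(String traceId, String payload) {}

    /** All distinct TraceIds in arrival order, for the "log them all" step. */
    public static Set<String> allTraceIds(List<TracedRecord> batch) {
        Set<String> ids = new LinkedHashSet<>();
        for (TracedRecord r : batch) ids.add(r.traceId());
        return ids;
    }

    /** The single TraceId chosen to continue tracing the batch. */
    public static String continuationTraceId(List<TracedRecord> batch) {
        return allTraceIds(batch).iterator().next();
    }
}
```

Since every original TraceId is logged next to the chosen one, a search by any initial TraceId still leads to the continuation of the trace.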

Task 3. The need to store a large amount of data: more than 1 billion labels per year arrive at X5 for tobacco alone, and they require constant, fast access. In total, the system must process about 10 billion records on the movement history of these labeled goods.

To solve the third task, we chose the MongoDB NoSQL database. We built a cluster of 5 shards, each backed by a 3-server Replica Set. This lets us scale the system horizontally by adding new servers to the cluster and ensures fault tolerance. Here we faced another problem: ensuring transactionality in the Mongo cluster while using horizontally scaled microservices. For example, one of our system's tasks is to detect attempts to resell products with the same marking codes; such cases arise from erroneous scans or erroneous cashier operations. We found that duplicates can occur both within one processed Kafka batch and within two batches processed in parallel, so checking for duplicates by querying the database gave nothing. We solved the problem separately for each microservice, based on its business logic: for receipts, for example, we added a check inside the batch and separate handling of duplicates on insert.
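The in-batch check mentioned above might look like the following sketch: before any database insert, flag marking codes that occur more than once within the batch itself. The method and type names are illustrative, not the real Markus code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the in-batch duplicate check: a database query cannot see
// duplicates that sit inside the same (or a concurrently processed) batch,
// so codes already seen within the batch are flagged before insertion.
public class DuplicateCheck {
    /** Marking codes that occur more than once within a single batch. */
    public static List<String> duplicatesWithin(List<String> markingCodes) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String code : markingCodes) {
            // add() returns false when the code was already seen in this batch
            if (!seen.add(code) && !dups.contains(code)) dups.add(code);
        }
        return dups;
    }
}
```

Duplicates across two parallel batches still need a second line of defense at insert time, which is why the per-service handling described above is also required.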

So that users' work with the operation history does not affect the most important thing, the functioning of our business processes, we moved all historical data into a separate service with its own database, which also receives information through Kafka. Users thus work with an isolated service without affecting the services that process current operations.

Task 4. Queue reprocessing and monitoring:

Problems and errors with the availability of databases, queues, and external data sources inevitably arise in distributed systems. In the case of Markus, the source of such errors is integration with external systems. We needed a solution that would retry requests that received erroneous responses, with a given timeout, yet without stopping the processing of successful requests in the main queue. For this, we chose the so-called "topic based retry" concept: for each main topic, one or more retry topics are created to which erroneous messages are sent, without delaying the processing of messages from the main topic. Interaction scheme:
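The routing in this scheme can be sketched as a single decision: after a failure, a message moves from the main topic to retry topic 0, then 1, and so on, and finally to the DLT. The naming convention (suffixes `.retry-N` and `.DLT`) is an assumption for illustration.

```java
// Sketch of "topic based retry" routing: each failure moves the message one
// step along main -> retry-0 -> ... -> retry-(maxRetries-1) -> DLT, so the
// main topic is never blocked by a failing message.
public class RetryRouting {
    /** Where a message that failed on {@code currentTopic} should go next. */
    public static String nextTopic(String mainTopic, String currentTopic, int maxRetries) {
        if (currentTopic.equals(mainTopic)) {
            return maxRetries > 0 ? mainTopic + ".retry-0" : mainTopic + ".DLT";
        }
        // currentTopic looks like "<mainTopic>.retry-<attempt>"
        int prefix = (mainTopic + ".retry-").length();
        int attempt = Integer.parseInt(currentTopic.substring(prefix));
        return attempt + 1 < maxRetries ? mainTopic + ".retry-" + (attempt + 1)
                                        : mainTopic + ".DLT";
    }
}
```

Each retry topic can have its own consumer with its own delay, which is what keeps slow retries from holding up fresh messages on the main topic.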


To implement this scheme, we needed to integrate the solution with Spring and avoid code duplication. On the web we came across a similar solution based on Spring's BeanPostProcessor, but we found it unnecessarily cumbersome. Our team built a simpler solution that fits into Spring's consumer creation cycle and additionally adds Retry Consumers. We offered a prototype of our solution to the Spring team; you can see it here. The number of Retry Consumers and the number of attempts for each consumer are configured through parameters depending on the needs of the business process, and for everything to work, all that remains is to add the org.springframework.kafka.annotation.KafkaListener annotation familiar to all Spring developers.

If a message could not be processed after all retry attempts, it goes to the DLT (dead letter topic) via Spring's DeadLetterPublishingRecoverer. At the request of support, we extended this functionality and created a separate service that lets you view the messages that ended up in the DLT, along with their stackTrace, traceId, and other useful information. In addition, monitoring and alerts were added to all DLT topics, so the appearance of a message in a DLT topic is now a reason to analyze it and file a defect. This is very convenient: from the topic name, we immediately understand at which step of the process the problem arose, which greatly speeds up finding its root cause.


More recently, we implemented an interface that lets our support team resend messages after eliminating their causes (for example, after an external system is restored) and, of course, file a corresponding defect for analysis. This is where our self-topics came in handy: instead of restarting a long processing chain, it can be restarted from the desired step.


Platform operation

The platform is already in production operation: every day we process deliveries and shipments and connect new distribution centers and stores. As part of the pilot, the system works with the Tobacco and Footwear product groups.

Our entire team participates in the pilots, analyzes emerging problems, and makes suggestions for improving the product, from better logs to process changes.

So as not to repeat our mistakes, every case found during the pilot is reflected in automated tests. The large number of autotests and unit tests lets us run regression testing and ship a hotfix in just a few hours.

Now we continue to develop and improve our platform, and constantly face new challenges. If you are interested, we will talk about our solutions in the following articles.

Source: habr.com
