The evolution of the architecture of the trading and clearing system of the Moscow Exchange. Part 2

This is a continuation of a long story about our thorny path to creating a powerful, highly loaded system that ensures the operation of the Exchange. The first part is here: habr.com/en/post/444300

Mysterious error

After numerous tests, the updated trading and clearing system was put into operation, and we ran into a bug that deserves its own detective-mystery story.

Shortly after the launch on the main server, one of the transactions was processed with an error, while everything was in order on the backup server. It turned out that a simple mathematical operation, computing the exponential function of a real argument, produced a negative result on the main server! We continued the investigation and found a difference of one bit in the SSE2 register responsible for rounding when working with floating-point numbers.

We wrote a simple test utility that computes the exponential with the rounding bit set. It turned out that the version of Red Hat Linux we were using had a bug in that mathematical function when the unfortunate bit was set. We reported this to Red Hat, received a patch from them after a while, and rolled it out. The error no longer occurred, but it was still unclear where this bit had come from in the first place. It is controlled by the fesetround function from the C standard library. We carefully analyzed our code in search of the suspected error: we checked all possible situations, reviewed every function that used rounding, tried to reproduce the failed session, used different compilers with different options, and applied static and dynamic analysis.
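To illustrate the mechanism (not the Red Hat libm bug itself): the rounding mode controlled by fesetround can change the last bits of results returned by math functions, which is exactly why a stray bit in the SSE2 control register was so dangerous. A minimal, hedged demonstration:

    #include <cfenv>
    #include <cmath>
    #include <cstdio>

    int main() {
        std::fesetround(FE_TONEAREST);       // the default rounding mode
        std::printf("to nearest: %.20f\n", std::exp(1.0));

        std::fesetround(FE_UPWARD);          // the "unfortunate" rounding bit set
        std::printf("upward:     %.20f\n", std::exp(1.0));  // may differ in the last digits
    }

Whether the two printed values actually differ depends on the libm implementation, but the point stands: a single control bit can silently change the answer of an otherwise identical computation.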

The cause of the error could not be found.

Then we started checking the hardware: we carried out load testing of the processors, checked the RAM, and even ran tests for the very unlikely scenario of a multi-bit error in a single cell. To no avail.

In the end, we settled on a theory from the world of high-energy physics: some high-energy particle flew into our data center, pierced the case wall, hit the processor and caused the trigger latch to stick in that very bit. We dubbed this absurd theory the "neutrino" theory. If you are far from elementary particle physics: neutrinos hardly interact with the outside world at all, and are certainly not able to affect the operation of a processor.

Since the cause of the failure could not be found, the "guilty" server was excluded from operation, just in case.

After some time, we began to improve the hot-standby system: we introduced so-called warm reserves (warms), asynchronous replicas. They received the transaction stream and could be located in different data centers, but the warms did not actively interact with other servers.

What was this for? If the standby server fails, the warm attached to the main server becomes the new standby. That is, after a failure the system is not left with a single main server until the end of the trading session.

And when the new version of the system was tested and put into operation, the error with the rounding bit occurred again. Moreover, as the number of warm servers grew, the error began to appear more often. At the same time, we had nothing to show the vendor, since there was no concrete evidence.

During the next analysis of the situation, a theory arose that the problem might be related to the OS. We wrote a simple program that calls fesetround in an infinite loop, remembers the current rounding state and checks it after a sleep, and does this in many competing threads. Having tuned the sleep parameters and the number of threads, we began to reproduce the bit failure consistently after about 5 minutes of running the utility. However, Red Hat support was unable to reproduce it. Testing our other servers showed that only servers with certain processors were affected. At the same time, switching to a new kernel solved the problem. In the end, we simply replaced the OS, and the true cause of the bug remained unexplained.
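Our exact utility is not public, but a sketch of an equivalent reproducer might look like this; the thread count and sleep duration are assumptions to be tuned:

    #include <cfenv>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main() {
        const int num_threads = 16;                       // assumption: tune per machine
        const auto nap = std::chrono::microseconds(500);  // assumption: tune the sleep

        std::vector<std::thread> threads;
        for (int i = 0; i < num_threads; ++i) {
            threads.emplace_back([nap] {
                for (;;) {
                    std::fesetround(FE_DOWNWARD);           // set a non-default rounding mode
                    std::this_thread::sleep_for(nap);       // let the scheduler interfere
                    if (std::fegetround() != FE_DOWNWARD) { // the bit was silently corrupted
                        std::fprintf(stderr, "rounding mode lost!\n");
                        std::abort();
                    }
                    std::fesetround(FE_TONEAREST);          // restore the default
                }
            });
        }
        for (auto& t : threads) t.join();
    }

The floating-point environment is per-thread state, so on a healthy kernel and CPU the check should never fire; when it does, something below the application has flipped the bit.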

And suddenly, last year, an article was published on Habré, "How I found a bug in Intel Skylake processors". The situation described in it was very similar to ours, but the author went further in his investigation and put forward the theory that the error was in the microcode. And when Linux kernels are updated, manufacturers also update the microcode.

Further development of the system

Although we got rid of the error, this story forced us to reconsider the system architecture once again. After all, we were not protected against a recurrence of such bugs.

The next improvements to the redundancy system were based on the following principles:

  • You can't trust anyone. Servers may not work correctly.
  • Majority-based redundancy.
  • Ensuring consensus, as a logical complement to majority-based redundancy.
  • Double failures are possible.
  • Liveness. The new hot-standby scheme should be no worse than the previous one. Trading should continue uninterrupted down to the last server.
  • Only a slight increase in latency. Any downtime entails huge financial losses.
  • Minimal network interaction, to keep latency as low as possible.
  • Selection of a new master server within seconds.

None of the solutions available on the market suited us, and the Raft protocol was still in its infancy, so we created our own solution.

Networking

In addition to the redundancy system, we worked on modernizing network interaction. The I/O subsystem consisted of many processes, which had the worst possible effect on jitter and latency. With hundreds of processes handling TCP connections, we had to constantly switch between them, and on a microsecond scale this is quite a long operation. But worst of all, when a process received a packet for processing, it sent it to one SystemV queue and then waited for an event from another SystemV queue. With a large number of nodes, the arrival of a new TCP packet in one process and the arrival of data in the queue of another are two competing events for the OS. If no physical processors are available for both tasks, one will be processed and the second will be queued. The consequences are impossible to predict.

In such situations dynamic process priority control can be applied, but this requires resource-intensive system calls. As a result, we switched to a single thread using the classic epoll, which greatly increased throughput and reduced transaction processing time. We also got rid of the separate network interaction processes and of interaction through SystemV, significantly reduced the number of system calls, and began to control the priorities of operations. On the I/O subsystem alone we saved about 8-17 microseconds, depending on the scenario. This single-threaded scheme has been used unchanged ever since; a single epoll thread is enough, with room to spare, to service all connections.
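Below is a minimal sketch of such a single-threaded epoll loop, under the assumption that all client sockets are already connected; handle_packet is a hypothetical placeholder for handing the data to the transaction pipeline:

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <vector>

    // Hypothetical placeholder for the real handler: here it just drains the socket.
    void handle_packet(int fd) {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof buf);
        (void)n;  // in the real system: parse and push into the processing pipeline
    }

    void event_loop(const std::vector<int>& client_fds) {
        int epfd = epoll_create1(0);
        for (int fd : client_fds) {               // register every TCP connection once
            epoll_event ev{};
            ev.events = EPOLLIN;
            ev.data.fd = fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        }

        epoll_event events[1024];
        for (;;) {
            // One thread waits on all connections: no SystemV queues and no
            // context switches between separate I/O processes.
            int n = epoll_wait(epfd, events, 1024, -1);
            for (int i = 0; i < n; ++i)
                handle_packet(events[i].data.fd);
        }
    }

With one thread owning all the sockets, packet arrival and packet processing are no longer two competing events for the scheduler.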

Transaction processing

The growing load on our system required the modernization of almost all of its components. But, unfortunately, the stagnation of processor clock speeds in recent years no longer allowed us to scale processes head-on. We therefore decided to split the Engine process into three levels, the most loaded of which is the risk check system, which evaluates the availability of funds in accounts and creates the transactions themselves. But money can be held in different currencies, and we had to figure out how to partition the processing of requests.

The obvious solution is to divide by currency: one server trades in dollars, another in pounds, and a third in euros. But if two transactions buying different currencies are sent under such a scheme, the problem of wallet desynchronization arises, and synchronization is difficult and expensive. The right approach is therefore to shard separately by wallets and separately by instruments. Incidentally, for most Western exchanges the risk-checking task is not as acute as it is for us, so it is most often done offline. We needed to implement online verification.

Let's explain with an example. A trader wants to buy $30, and the request goes to transaction validation: we check whether this trader is allowed to use this trading mode and whether he has the necessary rights. If everything is in order, the request goes to the risk check system, i.e. a check that there are sufficient funds to conclude the transaction. There, a note is made that the required amount is currently blocked. The request is then forwarded to the trading system, which approves or rejects the transaction. Suppose the transaction is approved: the risk check system then marks the money as unblocked, and the rubles turn into dollars.

In general, the risk check system contains complex algorithms and performs a large amount of very resource-intensive computation; it does not simply check the "account balance", as it might seem at first glance.
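Purely to illustrate the order of operations in the $30 example, here is a hedged sketch; every name and stub body in it is hypothetical, and the real risk check is far richer than these stubs suggest:

    // All names and stub bodies are hypothetical, for illustration only.
    struct Order { int trader_id; double amount; };

    static bool validate(const Order& o)         { return o.trader_id > 0; }  // trading mode and rights
    static bool risk_block_funds(const Order& o) { return o.amount > 0;    }  // block the required amount
    static bool match(const Order&)              { return true;            }  // the trading system decides
    static void risk_unblock_funds(const Order&, bool /*filled*/) {}          // unblock; convert on success

    bool process_order(const Order& o) {
        if (!validate(o))         return false;   // validation step
        if (!risk_block_funds(o)) return false;   // risk check: are the funds sufficient?
        bool filled = match(o);                   // approved or rejected by matching
        risk_unblock_funds(o, filled);            // funds unblocked; rubles become dollars on success
        return filled;
    }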

Having started splitting the Engine process into levels, we ran into a problem: the code we had at that time actively used the same data array at the validation and verification stages, which would have required rewriting the entire code base. As a result, we borrowed an instruction processing technique from modern processors: each instruction is divided into small stages, and several actions are performed in parallel within one cycle.

After slightly adapting the code, we created a parallel transaction processing pipeline in which each transaction is divided into 4 pipeline stages: networking, validation, execution, and publication of the result.

Consider an example. We have two processing systems, serial and parallel. The first transaction arrives, and in both systems it is sent for validation. The second transaction arrives immediately afterwards: in the parallel system it is taken into work right away, while in the serial system it is queued, waiting for the first transaction to pass the current processing stage. That is, the main advantage of pipeline processing is that we process the transaction queue faster.

So we got the ASTS+ system.

True, not everything is so smooth with pipelines either. Suppose we have a transaction that affects data arrays used by a neighboring transaction; this is a typical situation for an exchange. Such a transaction cannot be executed in the pipeline because it may affect others. This situation is called a data hazard, and such transactions are simply processed separately: when the "fast" transactions in the queue run out, the pipeline stops, the system processes the "slow" transaction, and then restarts the pipeline. Fortunately, the share of such transactions in the total flow is very small, so the pipeline stops so rarely that it does not affect overall performance.

Then we set about solving the problem of synchronizing the three threads of execution. The result was a system based on a ring buffer with fixed-size cells. In this system, everything is subordinated to processing speed, and data is not copied.

  • All incoming network packets go to the allocation stage.
  • We place them in an array and mark them as available for stage #1.
  • A second transaction arrives and is likewise made available for stage #1.
  • The first processing thread sees the available transactions, processes them, and moves them on to the next stage, owned by the second processing thread.
  • The second thread then processes the first transaction and marks the corresponding cell with a "deleted" flag - the cell is now available for reuse.

Thus, the entire queue is processed. A minimal sketch of this scheme follows below.
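The following is a hedged sketch of such a fixed-cell ring buffer with one cursor per pipeline stage; the sizes, names and the semaphore fallback are assumptions, not the production code:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kRingSize = 4096;   // must be a power of two
    constexpr std::size_t kCellSize = 256;    // fixed-size cell, processed in place

    struct Cell {
        std::atomic<int> stage{0};            // which pipeline stage owns this cell
        char payload[kCellSize];              // the transaction itself; never copied
    };

    Cell ring[kRingSize];

    // Each processing thread owns exactly one stage and advances its own cursor.
    void run_stage(int my_stage, void (*process)(char* payload)) {
        std::uint64_t cursor = 0;
        for (;;) {
            Cell& c = ring[cursor & (kRingSize - 1)];
            // Spin until the previous stage hands the cell over to us
            // (the real system falls back to a semaphore, as described below).
            while (c.stage.load(std::memory_order_acquire) != my_stage) {}
            process(c.payload);                                       // work directly on the cell
            c.stage.store(my_stage + 1, std::memory_order_release);   // hand over to the next stage
            ++cursor;
        }
    }

In such a scheme the last stage would reset the cell's marker instead of incrementing it, making the cell available for reuse.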

Processing each stage takes units or tens of microseconds, and if we used standard OS synchronization schemes, we would lose more time on the synchronization itself. So we started using a spinlock. However, this is very bad form in a real-time system, and RedHat strongly discourages doing so, so we spin for 100 ms and then switch to semaphore mode to eliminate the possibility of deadlock.
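A sketch of such a hybrid wait, spinning for a bounded time and then falling back to a semaphore, might look like this; the 100 ms budget mirrors the text, everything else is an assumption:

    #include <atomic>
    #include <chrono>
    #include <semaphore.h>

    void wait_for_stage(std::atomic<int>& stage, int expected, sem_t& fallback) {
        using clock = std::chrono::steady_clock;
        auto deadline = clock::now() + std::chrono::milliseconds(100);

        while (stage.load(std::memory_order_acquire) != expected) {
            if (clock::now() > deadline) {   // spun too long: block on the semaphore
                sem_wait(&fallback);         // the producer posts it when handing over a cell
                deadline = clock::now() + std::chrono::milliseconds(100);
            }
        }
    }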

As a result, we achieved a performance of about 8 million transactions per second. And just a couple of months later we saw a description of a scheme with the same functionality in an article about LMAX Disruptor.

Now several threads of execution could work on a single stage. All transactions were still processed in turn, in the order they were received. As a result, peak performance increased from 18 thousand to 50 thousand transactions per second.

Exchange risk management system

There is no limit to perfection, and soon we undertook another modernization: as part of ASTS+, we began turning the risk management and settlement systems into autonomous components. We developed a flexible modern architecture and a new hierarchical risk model, and tried to use a fixed_point class instead of double.
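As an illustration of the idea (not the exchange's actual implementation), a minimal fixed_point class might look like the sketch below: amounts are stored as scaled integers, so addition and comparison are exact and do not depend on the FPU rounding mode that caused so much trouble earlier.

    #include <cstdint>
    #include <iostream>

    class fixed_point {
        // Hypothetical scale of 1e8: amounts are stored as integer numbers
        // of 1e-8 units, so arithmetic on them is exact.
        static constexpr std::int64_t kScale = 100000000;
        std::int64_t raw_;                     // scaled integer value

        explicit fixed_point(std::int64_t raw) : raw_(raw) {}
    public:
        static fixed_point from_double(double v) {
            return fixed_point(static_cast<std::int64_t>(v * kScale + (v >= 0 ? 0.5 : -0.5)));
        }
        fixed_point operator+(fixed_point o) const { return fixed_point(raw_ + o.raw_); }
        fixed_point operator-(fixed_point o) const { return fixed_point(raw_ - o.raw_); }
        bool operator<(fixed_point o)  const { return raw_ < o.raw_; }
        double to_double() const { return static_cast<double>(raw_) / kScale; }
    };

    int main() {
        // 0.1 + 0.2 is represented exactly here, unlike with plain double.
        fixed_point a = fixed_point::from_double(0.1);
        fixed_point b = fixed_point::from_double(0.2);
        std::cout << (a + b).to_double() << "\n";  // prints 0.3
    }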

But a problem immediately arose: how do we synchronize all the business logic that has been working for many years and transfer it to the new system? As a result, the first version of the prototype of the new system had to be abandoned. The second version, which is currently in production, is based on the same code working in both the trading part and the risk part. During development, the hardest part was doing a git merge between the two versions. Our colleague Yevgeny Mazurenok performed this operation every week and cursed at length every single time.

When the new system was split out, the problem of interaction had to be solved right away. When choosing a data bus, we had to ensure stable jitter and minimal delay. The InfiniBand RDMA network fit best: its average processing time is 4 times lower than in 10 G Ethernet networks. But what really won us over was the difference in the 99th and 99.9th percentiles.

Of course, InfiniBand has its challenges. First, a different API: ibverbs instead of sockets. Second, there are almost no widely available open-source messaging solutions. We tried to build our own prototype, but it turned out to be very difficult, so we chose a commercial solution, Confinity Low Latency Messaging (formerly IBM MQ LLM).

Then the problem of correctly dividing the risk system arose. If you simply extract the Risk Engine without adding an intermediate node, transactions from two sources can get mixed up.

So-called Ultra Low Latency solutions have a reordering mode: transactions from two sources can be lined up in the right order on arrival; this is implemented using a separate channel for exchanging ordering information. But we do not use this mode yet: it complicates the whole process, and a number of solutions do not support it at all. In addition, each transaction would have to be assigned appropriate timestamps, and in our scheme this mechanism is very difficult to implement correctly. Therefore, we used the classic scheme with a message broker, that is, with a dispatcher that distributes messages between the Risk Engines.

The second problem was related to client access: if there are several Risk Gateways, the client needs to connect to each of them, which would require changes to the client layer. We wanted to avoid that at this stage, so in the current scheme the Risk Gateway processes the entire data flow. This severely limits the maximum throughput, but greatly simplifies system integration.

Duplication

Our system should not have a single point of failure, that is, all components must be duplicated, including the message broker. We solved this problem using the CLLM system: it contains an RCMS cluster in which two dispatchers can work in master-slave mode, and when one fails, the system automatically switches to the other.

Working with a backup data center

InfiniBand is optimized to work as a local area network, that is, to connect rack-mounted equipment, and an InfiniBand network cannot be laid between two geographically distributed data centers. Therefore, we implemented a bridge/dispatcher that connects to the message store over ordinary Ethernet networks and relays all transactions to the second IB network. When we need to migrate out of a data center, we can choose which data center to work with at the moment.

Results

All of the above was not done at once; it took several iterations of developing the new architecture. We created the prototype in a month, but bringing it to a working state took more than two years. We tried to achieve the best compromise between the increase in transaction processing time and the increase in system reliability.

Since the system was heavily updated, we implemented data recovery from two independent sources. If the message store malfunctions for some reason, the transaction log can be taken from a second source, the Risk Engine. This principle is followed throughout the system.

Among other things, we managed to preserve the client API, so that neither brokers nor anyone else would need significant rework for the new architecture. Some interfaces had to be changed, but there was no need to make significant changes to the working model.

We named the current version of our platform Rebus, short for two of the most notable innovations in the architecture, Risk Engine and BUS.

Initially, we wanted to separate only the clearing part, but the result was a huge distributed system. Clients can now interact with either the Trading Gateway, the Clearing Gateway, or both.

What we finally achieved:

Reduced latency. With a small volume of transactions, the system works the same as the previous version, but withstands a much higher load.

Peak performance increased from 50 thousand to 180 thousand transactions per second. A further increase is hampered by the single order-matching stream.

There are two ways to improve further: parallelizing matching and changing the way the Gateways work. Right now all Gateways work according to a replication scheme that ceases to function normally under such a load.

Finally, I can give some advice to those who are finalizing enterprise systems:

  • Always be prepared for the worst. Problems always come unexpectedly.
  • As a rule, it is impossible to quickly remake the architecture, especially if you need to achieve maximum reliability across a variety of metrics. The more nodes, the more resources are needed for support.
  • All custom and proprietary solutions will require additional resources for research, support and maintenance.
  • Do not put off resolving issues of reliability and recovery of the system after failures, take them into account at the initial design stage.

Source: habr.com
