The evolution of the architecture of the trading and clearing system of the Moscow Exchange. Part 1

Hi all! My name is Sergey Kostanbaev; at the Exchange I develop the core of the trading system.

When Hollywood movies show the New York Stock Exchange, it always looks the same: crowds of people, everyone yelling, waving papers, complete chaos. It has never been like that at the Moscow Exchange, because trading has been electronic from the very beginning and runs on two main platforms: Spectra (the derivatives market) and ASTS (the FX, equity, and money markets). Today I want to talk about the evolution of the architecture of the ASTS trading and clearing system, and about various solutions and findings along the way. The story is long, so I had to split it into two parts.

We are one of the few exchanges in the world that trade assets of all classes and provide a full range of exchange services. For example, last year we ranked second in the world in bond trading volume, 25th among all stock exchanges, and 13th in capitalization among public exchanges.

For professional trading participants, parameters such as response time, the stability of the latency distribution (jitter), and the reliability of the whole complex are critical. We currently process tens of millions of transactions per day, and processing a single transaction in the system core takes tens of microseconds. Of course, mobile operators on New Year's Eve or search engines face a higher raw load than ours, but when it comes to load combined with the characteristics above, few can compare with us, it seems to me. At the same time, it is important that the system never slows down even for a second, works absolutely stably, and treats all users equally.

A little history

In 1994, the Australian ASTS system was launched at the Moscow Interbank Currency Exchange (MICEX), and that is where the history of Russian electronic trading begins. In 1998, the exchange architecture was modernized to introduce online trading. Since then, the pace of rolling out new solutions and architectural changes across all systems and subsystems has only been picking up.

In those years, the exchange system ran on high-end hardware: ultra-reliable HP Superdome 9000 servers (built on the PA-RISC architecture), in which absolutely everything was duplicated: the I/O subsystems, the network, RAM (effectively a RAID array of RAM), and the processors (hot swap was supported). Any server component could be replaced without stopping the machine. We relied on these devices and considered them virtually fail-safe. The operating system was HP-UX, a Unix-like system.

But around 2010 a phenomenon called high-frequency trading (HFT) emerged; simply put, exchange robots. In just 2.5 years the load on our servers increased 140-fold.

It was impossible to withstand such a load with the old architecture and equipment. We had to adapt somehow.

The beginning

Requests to the exchange system can be divided into two types:

  • Transactions. If you want to buy dollars, shares, or anything else, you send a transaction to the trading system and receive a success response.
  • Information requests. If you want to know the current price, see the order book, or look at indices, you send an information request.

Schematically, the core of the system can be divided into three levels:

  • The client level, where brokers and their clients work. All of them interact with the access servers.
  • Access servers (Gateways) are caching servers that handle all information requests locally. Want to know the price at which Sberbank shares are currently trading? The request goes to an access server.
  • But if you want to buy shares, the request goes to the central server (Trade Engine). There is one such server per market type; they play the key role, and it is for their sake that we built this system.

The core of the trading system is a clever in-memory database in which the transactions are exchange transactions. The database was written in C; its only external dependency was the libc library, and there was no dynamic memory allocation at all. To reduce processing time, the system starts with a static set of arrays and with static data placement: first, all data for the current day is loaded into memory, after which no disk access is performed, and all work is done only in memory. When the system starts, all reference data is already sorted, so lookups are very efficient and take little time at runtime. All tables are built on intrusive lists and trees for dynamic data structures, so they require no memory allocation at runtime.
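
To make the idea concrete, here is a minimal sketch (not the actual exchange code) of the static allocation pattern: a fixed pool of records with an intrusive free list, so that "allocating" an order at runtime never touches malloc. All structure and function names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical order record; the intrusive link lives inside the record. */
struct order {
    long   id;
    long   price;
    long   quantity;
    struct order *next;
};

#define MAX_ORDERS 1000000

/* Everything is allocated statically at start-up; no malloc() at runtime. */
static struct order  order_pool[MAX_ORDERS];
static struct order *free_list;

/* Build the free list once, when the day's data is loaded into memory. */
static void pool_init(void)
{
    for (size_t i = 0; i + 1 < MAX_ORDERS; i++)
        order_pool[i].next = &order_pool[i + 1];
    order_pool[MAX_ORDERS - 1].next = NULL;
    free_list = &order_pool[0];
}

/* "Allocation" is just unlinking from the free list: O(1), no system calls. */
static struct order *order_alloc(void)
{
    struct order *o = free_list;
    if (o)
        free_list = o->next;
    return o;
}

static void order_free(struct order *o)
{
    o->next = free_list;
    free_list = o;
}
```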

Let's briefly go over the history of our trading and clearing system's development.
The first version of the architecture was built on so-called Unix interprocess communication: shared memory, semaphores, and queues were used, and each process consisted of a single thread. This approach was widespread in the early 1990s.

The first version of the system had two levels: the Gateways and the central server of the trading system. It worked as follows:

  • The client sends a request, which arrives at the Gateway. The Gateway checks the validity of the format (but not the data itself) and rejects invalid transactions.
  • If it is an information request, it is executed locally; if it is a transaction, it is redirected to the central server.
  • The Trade Engine processes the transaction, modifies its local memory, sends a response to the transaction, and sends the transaction itself into replication via a separate replication mechanism.
  • The Gateway receives the response from the central node and forwards it to the client.
  • After a while, the Gateway receives the same transaction through the replication mechanism and this time executes it locally, updating its data structures so that subsequent information requests reflect the current state.

In effect, this describes a replication model in which the Gateway completely repeated the actions performed in the trading system. A separate replication channel guaranteed the same order of transaction execution across multiple access nodes.
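
As a rough illustration of this flow, a Gateway's decision logic could look something like the sketch below; every type and helper function here is hypothetical and stands in for much larger subsystems.

```c
#include <stdbool.h>

enum request_type { REQ_INFO, REQ_TRANSACTION };

struct request {
    enum request_type type;
    /* ... serialized payload ... */
};

/* Hypothetical helpers standing in for the real subsystems. */
extern bool format_is_valid(const struct request *r);
extern void reject(struct request *r);
extern void answer_from_local_cache(struct request *r);
extern void forward_to_trade_engine(struct request *r);
extern void apply_to_local_structures(const struct request *r);

void gateway_handle(struct request *req)
{
    if (!format_is_valid(req)) {       /* format check only, not the data */
        reject(req);
        return;
    }
    if (req->type == REQ_INFO)
        answer_from_local_cache(req);  /* executed locally on the Gateway */
    else
        forward_to_trade_engine(req);  /* reply and replicated copy come later */
}

/* Called when the transaction arrives over the replication channel: the
 * Gateway re-executes it locally so later information requests see fresh data. */
void gateway_apply_replicated(struct request *req)
{
    apply_to_local_structures(req);
}
```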

Since the code was single-threaded, the classic scheme with process forks was used to serve many clients. However, forking the entire database was very expensive, so lightweight service processes were used that collected packets from TCP sessions and put them into a single queue (a System V message queue). The Gateway and the Trade Engine worked only with this queue, taking transactions from it for execution. A response could no longer be sent back through it, because it was unclear which service process should read it, so we resorted to a trick: each forked process created a response queue for itself, and when a request entered the incoming queue, a tag pointing to the response queue was added to it.
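
Here is a hedged sketch of how such a tag might be carried, using the standard System V message queue calls (msgget, msgsnd, msgrcv); the message layout and field names are assumptions for illustration, not the exchange's actual format.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message layout: the request carries the id of the response
 * queue created by the forked service process that owns the client session. */
struct trade_msg {
    long mtype;            /* required first field for System V queues */
    int  reply_qid;        /* "tag": where the engine should send the answer */
    char payload[256];     /* serialized transaction or information request */
};

#define PAYLOAD_SIZE (sizeof(struct trade_msg) - sizeof(long))

/* Service process side: create a private response queue once and attach its
 * id to every request before putting it into the common incoming queue. */
void send_request(int incoming_qid, struct trade_msg *msg)
{
    static int reply_qid = -1;
    if (reply_qid == -1)
        reply_qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    msg->mtype     = 1;
    msg->reply_qid = reply_qid;
    msgsnd(incoming_qid, msg, PAYLOAD_SIZE, 0);

    /* Wait for the engine's answer on our own queue. */
    msgrcv(reply_qid, msg, PAYLOAD_SIZE, 0, 0);
}

/* Engine side: take the next request and reply to whatever queue it names. */
void serve_one(int incoming_qid)
{
    struct trade_msg msg;
    msgrcv(incoming_qid, &msg, PAYLOAD_SIZE, 0, 0);
    /* ... process the transaction ... */
    msgsnd(msg.reply_qid, &msg, PAYLOAD_SIZE, 0);
}
```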

The constant copying of large amounts of data from queue to queue created problems, especially for information requests. So we used another trick: in addition to the response queue, each process also created a shared memory segment (System V shared memory). The packets themselves were placed there, and only a tag was stored in the queue, making it possible to find the original packet. This also helped keep the data in the processor cache.
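
A minimal sketch of that combination, assuming a simple fixed-size slot layout (the slot scheme and all names are illustrative): the packet body goes into System V shared memory, and only a small tag travels through the queue.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/msg.h>
#include <string.h>

#define SLOT_SIZE  1024
#define SLOT_COUNT 4096

/* Hypothetical layout: the full packet lives in a shared-memory slot, and
 * only a small tag identifying the slot goes through the message queue. */
struct slot_tag {
    long mtype;
    int  slot;          /* index of the slot holding the real packet */
    int  reply_qid;
};

static char *slots;     /* base of the shared memory segment */

void ipc_init(void)
{
    /* In reality the segment would be shared with the service processes,
     * e.g. inherited across fork(); error handling is omitted here. */
    int shmid = shmget(IPC_PRIVATE, (size_t)SLOT_SIZE * SLOT_COUNT,
                       IPC_CREAT | 0600);
    slots = shmat(shmid, NULL, 0);
}

/* Put the packet into a slot and enqueue only its tag. */
void submit(int incoming_qid, int reply_qid, int slot,
            const void *packet, size_t len)
{
    memcpy(slots + (size_t)slot * SLOT_SIZE, packet, len);

    struct slot_tag tag = { .mtype = 1, .slot = slot, .reply_qid = reply_qid };
    msgsnd(incoming_qid, &tag, sizeof(tag) - sizeof(long), 0);
}
```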

System V IPC comes with utilities (such as ipcs) for inspecting the state of queue, memory, and semaphore objects. We used them actively to understand what was happening in the system at any given moment: where packets were accumulating, what was blocked, and so on.

First upgrades

First of all, we got rid of the single-process Gateway. Its major drawback was that it could handle either one replication transaction or one information request from a client at a time. As the load grows, such a Gateway takes longer and longer to process requests and can no longer keep up with the replication flow. Besides, if a client sends a transaction, you only need to check its validity and forward it on. So we replaced the single Gateway process with many components working in parallel: multi-threaded information and transaction processes operating independently on a shared memory area under a read-write lock. At the same time we introduced dispatching and replication processes.
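
A minimal sketch of this pattern with POSIX read-write locks (the real Gateway is of course far more elaborate, and the data layout here is invented): information threads take the lock for reading, the replication thread takes it for writing.

```c
#include <pthread.h>

#define INSTRUMENTS 65536

/* Hypothetical shared cache guarded by a read-write lock: many information
 * threads may read concurrently, the replication thread writes exclusively. */
struct quote { long bid; long ask; };

static struct quote       cache[INSTRUMENTS];
static pthread_rwlock_t   cache_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Information thread: answer "what is the current price?" requests. */
struct quote read_quote(int instrument)
{
    struct quote q;
    pthread_rwlock_rdlock(&cache_lock);
    q = cache[instrument];
    pthread_rwlock_unlock(&cache_lock);
    return q;
}

/* Replication thread: apply a transaction received from the Trade Engine. */
void apply_replicated_quote(int instrument, struct quote q)
{
    pthread_rwlock_wrlock(&cache_lock);
    cache[instrument] = q;
    pthread_rwlock_unlock(&cache_lock);
}
```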

Impact of High Frequency Trading

The architecture described above lasted until 2010. Meanwhile, we were no longer satisfied with the performance of the HP Superdome servers. On top of that, the PA-RISC architecture was effectively dead; the vendor was not offering any significant updates. As a result, we began moving from HP-UX/PA-RISC to Linux/x86, starting with the adaptation of the access servers.

Why did we have to change the architecture again? The fact is that high-frequency trading has significantly changed the load profile on the core of the system.

Suppose a transaction causes a significant price change: someone bought half a billion dollars. After a couple of milliseconds, all market participants notice this and start to react. Naturally, requests line up in a huge queue, which takes the system a long time to clear.

[Figure: transaction rate over time around the price change]

Over this 50 ms interval, the average rate is about 16 thousand transactions per second. If we shrink the window to 20 ms, we get an average rate of 90 thousand transactions per second, with peaks of 200 thousand transactions. In other words, the load is unsteady, with sharp bursts, and the request queue must always be processed quickly.

But why is there a queue at all? In our example, many users noticed the price change and sent corresponding transactions. These arrive at the Gateways; a Gateway serializes them, establishes a certain order, and sends them to the network. Routers shuffle the packets and forward them on; whoever's packet arrives first "wins". As a result, exchange clients noticed that if the same transaction is sent through several Gateways, the chances of it being processed quickly increase. Soon the exchange robots began bombarding the Gateways with requests, and an avalanche of transactions arose.

A new round of evolution

After a long period of testing and research, we switched to a real-time operating system kernel. We chose Red Hat Enterprise MRG Linux, where MRG stands for Messaging Real-time Grid. The advantage of the real-time patches is that they optimize the system for the fastest possible execution: all processes are lined up in a FIFO queue, cores can be isolated, there are no unexpected preemptions, and all transactions are processed in strict sequence.

Figure: red shows queue handling under the standard kernel, green under the real-time kernel.

But achieving low latency on normal servers is not easy:

  • The SMI mode, which underlies the handling of important peripherals in the x86 architecture, gets badly in the way. The processing of all kinds of hardware events and the management of components and devices is performed by the firmware in so-called transparent SMI mode, in which the operating system does not see what the firmware is doing at all. As a rule, all major vendors offer special server firmware extensions that reduce the amount of SMI processing.
  • There must be no dynamic control of the processor frequency, as it leads to additional stalls.
  • When the file system journal is flushed, processes occur in the kernel that cause unpredictable delays.
  • You need to pay attention to things like CPU affinity, interrupt affinity, and NUMA.

I must say, the topic of tuning hardware and the Linux kernel for real-time processing deserves a separate article. We spent a lot of time experimenting and researching before we achieved a good result.
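
Just to give a flavor of the kind of tuning involved, here is a small sketch using standard Linux calls; the core number and priority are arbitrary example values, and a production setup would also handle errors, interrupt affinity, NUMA placement, and so on.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/* Illustrative only: pin the calling process to an isolated core, switch it
 * to the FIFO real-time scheduling class, and lock its memory so it is never
 * paged out. */
int make_realtime(int core, int priority)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = priority;               /* e.g. 80 */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -1;

    /* Avoid page faults at runtime: everything must already be in RAM. */
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```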

When moving from the PA-RISC servers to x86, we hardly had to change the system code; we only adapted and reconfigured it. Along the way, several bugs were fixed. For example, the consequences of PA-RISC being big-endian and x86 being little-endian quickly surfaced: data was read incorrectly in places. A trickier bug was that PA-RISC uses sequentially consistent memory access, while x86 can reorder reads with respect to earlier writes, so code that was perfectly valid on one platform broke on the other.
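
Both classes of bugs are easy to illustrate; the snippet below is a generic sketch, not the exchange's code. Multi-byte fields must be converted from a fixed byte order explicitly, and cross-thread publication needs explicit acquire/release semantics instead of relying on the source platform's stronger memory model.

```c
#include <arpa/inet.h>   /* htonl / ntohl */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Byte order: never read multi-byte fields from the wire directly; always
 * convert from a fixed (network, big-endian) representation. */
uint32_t read_price(const unsigned char *wire)
{
    uint32_t be;
    memcpy(&be, wire, sizeof(be));
    return ntohl(be);    /* correct on both big- and little-endian hosts */
}

/* Memory ordering: publishing data to another thread with a plain flag may
 * pass unnoticed under a sequentially consistent platform, but in portable C
 * the flag must carry release/acquire semantics. */
static long       shared_value;
static atomic_int ready;

void producer(long v)
{
    shared_value = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

long consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                /* spin until the value is published */
    return shared_value;
}
```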

After switching to x86, performance increased almost threefold, and the average transaction processing time dropped to 60 µs.

Let's now take a closer look at what key changes were made to the system architecture.

Hot Standby Epic

When switching to commodity servers, we were aware that they were less reliable. Therefore, when creating the new architecture, we assumed a priori that one or more nodes could fail, so we needed a hot standby system capable of switching to backup machines very quickly.

In addition, there were other requirements:

  • Processed transactions must never be lost under any circumstances.
  • The system must be absolutely transparent to our infrastructure.
  • Clients must not see dropped connections.
  • Redundancy must not introduce significant delay, because that is a critical factor for the exchange.

When creating the hot standby system, we did not consider scenarios such as double failures (for example, the network on one server dying while the main server hangs); we did not consider the possibility of errors in the software, because they are caught during testing; and we did not consider hardware misbehaving.

As a result, we came up with the following scheme:

  • The master server interacted directly with the Gateway servers.
  • All transactions arriving at the main server were instantly replicated to the backup server over a separate channel. An arbiter (the Governor) coordinated the switchover if any problems occurred.

  • The main server processed each transaction and waited for confirmation from the backup server. To keep latency to a minimum, we did not wait for the transaction to finish executing on the standby server. Since the time the transaction spent traveling over the network was comparable to its execution time, this added no extra latency.
  • We could only check the processing status of the main and backup servers for the previous transaction; the processing status of the current transaction was unknown. Since we were still using single-threaded processes, waiting for a response from the backup would have stalled the whole processing flow, so we made a reasonable compromise: we checked the result of the previous transaction (see the sketch after this list).
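
A rough sketch of this compromise (all function names are hypothetical): the reply to transaction N is gated only on the backup's confirmation of transaction N-1, so the network round-trip overlaps with processing instead of adding to the latency.

```c
#include <stdbool.h>
#include <stdint.h>

struct txn;                                    /* opaque transaction record */
extern void send_to_backup(struct txn *t, uint64_t seq);
extern void apply_to_memory(struct txn *t);
extern void poll_backup_acks(uint64_t *confirmed);
extern void send_reply_to_gateway(struct txn *t);

static uint64_t last_confirmed_seq;            /* updated from backup's acks */

/* Sequence numbers start at 1. */
bool process_transaction(struct txn *t, uint64_t seq)
{
    send_to_backup(t, seq);                    /* asynchronous replication channel */
    apply_to_memory(t);                        /* execute in the in-memory database */

    /* The safety check lags one transaction behind: block only if the
     * backup has not yet confirmed the *previous* transaction. */
    while (last_confirmed_seq < seq - 1)
        poll_backup_acks(&last_confirmed_seq);

    send_reply_to_gateway(t);
    return true;
}
```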

The scheme worked as follows.

Suppose the main server stops responding, while the Gateways keep communicating. A timeout fires on the backup server, which turns to the Governor; the Governor assigns it the role of the main server, and all Gateways switch to the new main server.

If the former main server comes back online, an internal timeout fires on it as well, because no Gateway has contacted it for a certain amount of time. It then also turns to the Governor, which excludes it from the scheme. As a result, the exchange runs with a single server until the end of the trading session. Since the probability of a server failure is quite low, this scheme was considered acceptable: it contained no complex logic and was easy to test.
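
Very schematically, and with all names invented for illustration, the backup's side of this logic boils down to a timeout loop like the following.

```c
#include <stdbool.h>
#include <time.h>

enum role { ROLE_MAIN, ROLE_BACKUP, ROLE_EXCLUDED };

/* Hypothetical helpers standing in for the replication and Governor links. */
extern bool replication_traffic_seen(void);
extern enum role ask_governor(enum role current);

enum role backup_loop(enum role current, int timeout_s)
{
    time_t last = time(NULL);
    for (;;) {
        if (replication_traffic_seen())
            last = time(NULL);
        /* No traffic from the main server for too long: ask the Governor,
         * which may promote us to the main role. */
        if (time(NULL) - last > timeout_s)
            return ask_governor(current);
    }
}
```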

To be continued.

Source: habr.com
