Rewrite the VKontakte message base from scratch and survive

Our users write messages to each other tirelessly.
That adds up to quite a lot. If you set out to read all the messages of all users, it would take more than 150 thousand years - provided you are a fairly seasoned reader and spend no more than a second on each message.

With such a volume of data, it is critical that the logic for storing and accessing it be built optimally. Otherwise, at one far-from-great moment, it may turn out that everything is about to fall apart.

For us, that moment came a year and a half ago. Here is how we got to that point and what came of it, told in order.

Π˜ΡΡ‚ΠΎΡ€ΠΈΡ вопроса

In the very first implementation, VKontakte messages ran on a combination of a PHP backend and MySQL - a perfectly normal solution for a small student site. But that site grew uncontrollably and began demanding that the data structures be optimized for it.

At the end of 2009 the first text-engine storage was written, and in 2010 messages were migrated to it.

In text-engine, messages were stored in lists - a kind of "mailbox". Each list is identified by a uid, the user who owns all of its messages. A message has a set of attributes: interlocutor ID, text, attachments, and so on. The message ID inside a "box" is the local_id; it never changes and is assigned sequentially to new messages. The "boxes" are independent and are not synchronized with each other inside the engine; coordination between them happens at the PHP level. You can look at the data structure and capabilities of text-engine from the inside here.
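To make the "mailbox" model concrete, here is a minimal sketch in Python. The field names are illustrative assumptions; the real text-engine keeps its own on-disk format and is not written in Python.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    local_id: int              # assigned sequentially within the mailbox, never changes
    peer_id: int               # the interlocutor
    text: str
    attachments: list = field(default_factory=list)

@dataclass
class Mailbox:
    uid: int                   # the user who owns every message in this list
    messages: list = field(default_factory=list)

    def append(self, peer_id: int, text: str) -> int:
        local_id = len(self.messages) + 1
        self.messages.append(Message(local_id, peer_id, text))
        return local_id
```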
This was quite enough for correspondence between two users. Guess what happened next?

In May 2011, conversations with several participants - multichats - appeared on VKontakte. To work with them, we raised two new clusters: member-chats and chat-members. The first stores data about chats keyed by user, the second about users keyed by chat. Besides the lists themselves, this includes, for example, the user who sent the invitation and the time a member was added to the chat.

"PHP, let's send a chat message," the user says.
"Come on, {username}," says PHP.
This scheme has its downsides. Synchronization is still the job of PHP. Large chats and users simultaneously sending messages to them are a dangerous combination: since the text-engine instance depends on the uid, chat participants could receive the same message with a time lag. One could live with this if progress stood still. But it wouldn't.

At the end of 2015 we launched community messages, and at the beginning of 2016 an API for them. With the arrival of large chatbots in communities, any hope of an even load distribution could be forgotten.

A good bot generates several million messages a day - even the most talkative users cannot boast of that. And that means the text-engine instances hosting such bots started taking the full brunt of the load.

The message engines in 2016 were 100 instances each of chat-members and member-chats, plus 8,000 text-engines, hosted on a thousand servers with 64 GB of memory each. As a first emergency measure, we added another 32 GB of memory per server and made projections: without fundamental changes, this would last about a year. We had either to get hold of more hardware or to optimize the databases themselves.

Due to the peculiarities of the architecture, hardware can only sensibly be added in multiples - that is, at least doubling the number of machines, which is obviously a rather expensive path. So we optimize.

New concept

The central entity of the new approach is the chat. A chat has a list of messages that belong to it, and a user has a list of chats.

The required minimum is two new databases; a small sketch of the resulting data model comes right after the list:

  • chat-engine. This is a storage of chat vectors. Each chat has a vector of the messages that belong to it. Each message has a text and a unique message identifier within the chat, chat_local_id.
  • user-engine. This is a storage of user vectors, i.e. references to users. Each user has a vector of peer_ids (interlocutors: other users, multichats, or communities) and a vector of messages. Each peer_id has a vector of the messages that relate to it. Each message has a chat_local_id and a user_local_id, a message ID unique within that user.

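Here is a minimal sketch of this data model (apart from chat_local_id, user_local_id and peer_id, all class and field names are illustrative; the real engines keep these vectors in their own binary format):

```python
from dataclasses import dataclass, field

@dataclass
class ChatMessage:
    chat_local_id: int                              # unique within the chat
    text: str

@dataclass
class Chat:                                         # stored in chat-engine
    chat_id: int
    messages: list = field(default_factory=list)    # vector of ChatMessage

@dataclass
class UserMessage:
    user_local_id: int                              # unique within the user
    chat_local_id: int                              # reference into the chat's vector
    peer_id: int

@dataclass
class User:                                         # stored in user-engine
    uid: int
    peers: list = field(default_factory=list)       # vector of peer_id
    messages: list = field(default_factory=list)    # vector of UserMessage
```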
The new clusters communicate with each other over TCP - this ensures that the order of requests does not change. The requests themselves and the confirmations for them are written to disk, so we can restore the state of the queue at any time after a crash or a restart of the engine. Since user-engine and chat-engine each consist of 4 thousand shards, the request queue between the clusters is distributed evenly (and in practice there is no queue at all - everything works very fast).

Disk work in our databases is in most cases based on a combination of a binary change log (binlog), static snapshots, and a partial image in memory. Changes made during the day are written to the binlog, and a snapshot of the current state is created periodically. A snapshot is a collection of data structures optimized for our purposes. It consists of a header (the snapshot's meta-index) and a set of metafiles. The header is kept permanently in RAM and indicates where to look for data in the snapshot. Each metafile includes data that is likely to be needed at close points in time - for example, data pertaining to the same user. When the database is queried, the required metafile is read using the snapshot header, and then the changes in the binlog that occurred after the snapshot was created are applied on top. You can read more about the benefits of this approach here.
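A toy illustration of this read path, with in-memory dicts standing in for the on-disk metafiles and binlog (the real formats are not described in the article):

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    header: dict          # uid -> metafile key; kept permanently in RAM
    metafiles: dict       # metafile key -> data for "nearby" users (on disk in reality)
    position: int         # binlog position at the moment the snapshot was taken

@dataclass
class Binlog:
    events: list = field(default_factory=list)     # (position, uid, change)

def read_state(uid, snap, binlog):
    """Data from the snapshot plus all binlog changes written after it."""
    state = list(snap.metafiles.get(snap.header.get(uid), []))
    for position, event_uid, change in binlog.events:
        if position > snap.position and event_uid == uid:
            state.append(change)                   # replay changes made after the snapshot
    return state
```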

At the same time, the data on disk itself changes only once a day - late at night Moscow time, when the load is minimal. Thanks to this (knowing that the on-disk structure is constant during the day), we can afford to replace vectors with fixed-size arrays - and win back memory as a result.

Sending a message in the new scheme looks like this (a schematic sketch of the flow follows the list):

  1. The PHP backend calls the user-engine with a request to send a message.
  2. user-engine proxies the request to the appropriate chat-engine instance, which returns to user-engine a chat_local_id - a unique identifier for the new message within this chat. chat-engine then broadcasts the message to all recipients in the chat.
  3. user-engine takes the chat_local_id from chat-engine and returns to PHP a user_local_id - a unique message identifier for this user. This identifier is later used, for example, to work with messages via the API.

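A schematic sketch of this flow; the classes below are toy stand-ins for the real services, which are separate processes talking over TCP rather than Python objects:

```python
class ChatEngine:
    def __init__(self):
        self.chats = {}                       # chat_id -> list of message texts

    def add_message(self, chat_id, text):
        messages = self.chats.setdefault(chat_id, [])
        messages.append(text)
        return len(messages)                  # chat_local_id, unique within the chat

class UserEngine:
    def __init__(self, chat_engine):
        self.chat_engine = chat_engine
        self.user_messages = {}               # uid -> list of (chat_id, chat_local_id)

    def send_message(self, uid, chat_id, text):
        chat_local_id = self.chat_engine.add_message(chat_id, text)   # step 2: proxy
        messages = self.user_messages.setdefault(uid, [])
        messages.append((chat_id, chat_local_id))
        return len(messages)                  # step 3: user_local_id, returned to PHP

# Step 1: the backend calls user-engine, which proxies to chat-engine.
engines = UserEngine(ChatEngine())
print(engines.send_message(uid=1, chat_id=42, text="hello"))          # -> 1
```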
But in addition to actually sending messages, you need to implement a few more important things:

  • Sublists - for example, the most recent messages you see when you open the list of conversations, unread messages, and messages with tags ("Important", "Spam", and so on).
  • Compression of messages in chat-engine.
  • Caching of messages in user-engine.
  • Search (across all dialogs and within a specific one).
  • Real-time updates (long polling).
  • Saving history to support caching on mobile clients.

All sublists are rapidly changing structures. To work with them we use Splay trees. This choice is explained by the fact that a tree vertex sometimes stores a whole segment of messages from the snapshot - for example, after the nightly reindexing the tree consists of a single vertex containing all the messages of the sublist. A splay tree makes it easy to insert into the middle of such a vertex without worrying about balancing. In addition, Splay stores no extra data, which saves us memory.
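A much-simplified illustration of the "segment vertex" idea: a node holds a whole sorted run of message IDs taken from the snapshot, and an insertion into the middle splits that run. The actual splay rotations and balancing behaviour are omitted; everything here is a sketch, not the production structure.

```python
import bisect
from dataclasses import dataclass

@dataclass
class SegmentNode:
    ids: list                          # a sorted run of message IDs from the snapshot
    left: "SegmentNode" = None
    right: "SegmentNode" = None

def insert(node, msg_id):
    """Insert msg_id, splitting the segment it falls into instead of rebalancing."""
    if node is None:
        return SegmentNode([msg_id])
    if msg_id < node.ids[0]:
        node.left = insert(node.left, msg_id)
    elif msg_id > node.ids[-1]:
        node.right = insert(node.right, msg_id)
    else:
        # the new ID lands inside this segment: cut the run in two around it
        cut = bisect.bisect_left(node.ids, msg_id)
        node.right = SegmentNode(node.ids[cut:], right=node.right)
        node.ids = node.ids[:cut] + [msg_id]
    return node

# After the nightly reindex the tree is a single vertex holding the whole sublist:
root = SegmentNode(list(range(0, 100, 2)))
insert(root, 31)                       # falls mid-segment and splits the single vertex
```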

Messages mean a large amount of data, mostly text, which it is useful to be able to compress. It is important that we can accurately decompress even a single message on its own. To compress messages we use the Huffman algorithm with our own heuristics - for example, we know that words in messages alternate with "non-words" (spaces, punctuation marks), and we also take into account some peculiarities of character usage in Russian.
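A bare-bones Huffman coder, just to show the base algorithm; the engine's own heuristics (word/non-word alternation, Russian character statistics) are not reproduced here.

```python
import heapq
from collections import Counter

def build_codes(sample_text):
    """Character -> bit-string codes built from frequencies in sample_text."""
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(sample_text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(message, codes):
    return "".join(codes[ch] for ch in message)

def decode(bits, codes):
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:                # codes are prefix-free, so a match is a char
            out.append(reverse[current])
            current = ""
    return "".join(out)

codes = build_codes("how messages might look in one mailbox")
assert decode(encode("message", codes), codes) == "message"
```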

Since there are far fewer users than chats, we cache messages in user-engine to save chat-engine from random-access disk reads.

Message search is implemented as a diagonal query from user-engine to all the chat-engine instances that contain the user's chats. The results are merged in user-engine itself.
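A toy sketch of such a fan-out: user-engine asks only the chat-engine shards that own this user's chats and merges the hits itself. The shard layout and helper names are assumptions for illustration.

```python
def search(uid, query, user_chats, chat_shards):
    """user_chats: uid -> list of chat_ids; chat_shards: list of {chat_id: [message texts]}."""
    hits = []
    for chat_id in user_chats.get(uid, []):
        shard = chat_shards[chat_id % len(chat_shards)]   # the shard that owns this chat
        for message in shard.get(chat_id, []):
            if query in message:
                hits.append((chat_id, message))
    return hits                                           # merged on the user-engine side

shards = [{2: ["meeting at noon"]}, {1: ["hello world"]}]
print(search(uid=7, query="noon", user_chats={7: [1, 2]}, chat_shards=shards))
```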

Well, all the details are accounted for; all that remains is to switch to the new scheme - preferably without users noticing anything.

Data migration

So, we have text-engine, which stores messages by user, and two clusters, chat-members and member-chats, which store data about multichats and the users in them. How do we get from this to the new user-engine and chat-engine?

member-chats in the old scheme was mainly used for optimization. We rather quickly transferred the necessary data from it to chat-members, and then it no longer participated in the migration process.

Next in line is chat-members. It consists of 100 instances, while chat-engine has 4 thousand. To transfer the data, they need to be brought into alignment: chat-members was split into the same 4 thousand instances, and then chat-engine was switched to reading the chat-members binlog.
Now chat-engine knows about multichats from chat-members, but it does not yet know anything about dialogs between two interlocutors. Such dialogs live in text-engine, bound to users. Here we took the data head-on: each chat-engine instance asked every text-engine instance whether it had the dialog it needed.

Great - chat-engine knows what multichats are, and knows what dialogues are.
Now the messages in multichats need to be merged so that in the end each chat has a single list of its messages. First, chat-engine fetches from text-engine all of the users' messages from a given chat. In some cases there are quite a lot of them (up to hundreds of millions), but with very rare exceptions a chat fits entirely in RAM. We end up with unordered messages, each in several copies - after all, they are pulled from the different text-engine instances corresponding to different users. The task is to sort the messages and get rid of the duplicates that take up extra space.

Each message has a timestamp with its send time and a text. We sort by time: we place pointers at the oldest messages of each multichat participant and compare the text hashes of the presumed copies, moving in the direction of increasing timestamps. Logically, copies should have the same hash and timestamp, but in practice this is not always the case. As you remember, synchronization in the old scheme was done by PHP, and in rare cases the send time of the same message differed between users. In those cases we allowed ourselves to edit the timestamp, usually within one second. The second problem is a different message order for different recipients. In such cases we allowed an extra copy to be created, with different ordering variants for different users.
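A simplified sketch of that merge step: per-user message streams are merged by timestamp, and presumed copies are collapsed by text hash, allowing a small timestamp drift. The real process also edits timestamps and keeps extra copies for order conflicts, which is omitted here.

```python
import hashlib
import heapq

def merge_chat(per_user_messages, max_drift=1):
    """per_user_messages: one list of (timestamp, text) per participant, each sorted by time."""
    merged = []
    seen = {}                                    # text hash -> timestamp of the kept copy
    heap = [(msgs[0][0], user, 0) for user, msgs in enumerate(per_user_messages) if msgs]
    heapq.heapify(heap)
    while heap:
        ts, user, idx = heapq.heappop(heap)
        text = per_user_messages[user][idx][1]
        digest = hashlib.sha1(text.encode()).hexdigest()
        if digest not in seen or abs(ts - seen[digest]) > max_drift:
            merged.append((ts, text))            # first, or genuinely different, occurrence
            seen[digest] = ts
        if idx + 1 < len(per_user_messages[user]):
            heapq.heappush(heap, (per_user_messages[user][idx + 1][0], user, idx + 1))
    return merged
```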

After that, the data about messages in multichats is sent to user-engine. And here comes an unpleasant property of imported messages: in normal operation, messages arrive at the engine strictly in ascending order of user_local_id, but messages imported from the old engine into user-engine lose this useful property. At the same time, for testing convenience, we need to be able to access them quickly, search in them, and add new ones.

We use a special data structure to store imported messages.

It is a vector of size 2^a_0 + 2^a_1 + ... + 2^a_k, where all a_i are distinct and ordered in descending order, with a special order of elements: within each segment with indices [2^a_0 + ... + 2^a_(i-1); 2^a_0 + ... + 2^a_i) the elements are sorted. Searching for an element in such a structure takes O(log^2 n) time via log n binary searches. Adding an element takes amortized O(log n).
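A sketch of such a structure in the spirit of the Bentley-Saxe transformation (this is an interpretation of the description above, not the engine's actual code): elements live in sorted blocks whose sizes are distinct powers of two, search runs a binary search in each block, and insertion merges equal-sized blocks the way binary addition carries.

```python
import bisect

class BlockedSortedVector:
    def __init__(self):
        self.blocks = []                      # sorted lists with distinct power-of-two sizes

    def contains(self, x):
        # O(log^2 n): at most log n blocks, one binary search in each
        for block in self.blocks:
            i = bisect.bisect_left(block, x)
            if i < len(block) and block[i] == x:
                return True
        return False

    def add(self, x):
        # amortized O(log n): cascade-merge equal-sized blocks, like incrementing a binary number
        carry = [x]
        remaining = []
        for block in sorted(self.blocks, key=len):      # smallest block first
            if len(block) == len(carry):
                carry = self._merge(block, carry)
            else:
                remaining.append(block)
        remaining.append(carry)
        remaining.sort(key=len, reverse=True)           # keep sizes in descending order
        self.blocks = remaining

    @staticmethod
    def _merge(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i])
                i += 1
            else:
                out.append(b[j])
                j += 1
        return out + a[i:] + b[j:]

v = BlockedSortedVector()
for msg_id in [7, 3, 11, 5, 2]:
    v.add(msg_id)
assert v.contains(5) and not v.contains(4)
```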

So, we figured out how to move the data from the old engines to the new ones. But this process takes several days, and it is unlikely that our users would give up the habit of writing to each other for those few days. To avoid losing messages during this time, we switch to a work scheme that involves both the old and the new clusters.

Data is written to chat-members and user-engine (and not to text-engine, as in normal operation under the old scheme). user-engine proxies the request to chat-engine, and here the behavior depends on whether this chat has already been merged. If the chat has not yet been merged, chat-engine does not write the message to itself, and it is processed only in text-engine. If the chat has already been merged into chat-engine, it returns the chat_local_id to user-engine and broadcasts the message to all recipients. user-engine proxies all the data to text-engine, so that if anything goes wrong we can always roll back, having all the up-to-date data in the old engine. text-engine returns the user_local_id, which user-engine stores and returns to the backend.
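A schematic sketch of this dual-write path; the dicts below are toy stand-ins for the real engines.

```python
def send_during_migration(uid, chat_id, text, state):
    """Write to both clusters; 'state' holds toy dict stand-ins for the real services."""
    if chat_id in state["merged"]:
        # the chat is already merged into chat-engine: assign chat_local_id and fan out
        chat = state["chat_engine"].setdefault(chat_id, [])
        chat.append(text)
        chat_local_id = len(chat)
        state["user_engine"].setdefault(uid, []).append((chat_id, chat_local_id))
    # always proxy to text-engine, so a rollback to the old scheme stays possible
    mailbox = state["text_engine"].setdefault(uid, [])
    mailbox.append((chat_id, text))
    return len(mailbox)                      # user_local_id handed back to the PHP backend

state = {"merged": {42}, "chat_engine": {}, "user_engine": {}, "text_engine": {}}
send_during_migration(1, 42, "hi", state)    # written to both the new and the old clusters
```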
As a result, the transition process looks like this: we bring up empty user-engine and chat-engine clusters; chat-engine reads the entire chat-members binlog, and then the proxying described above is switched on. We pour in the old data and get two synchronized clusters (old and new). All that remains is to switch reads from text-engine to user-engine and turn off the proxying.

The results

Thanks to the new approach, all of the engines' performance metrics improved and the data consistency problems were resolved. Now we can quickly implement new messaging features (and have already begun to do so: we raised the maximum number of chat participants, implemented search over forwarded messages, launched pinned messages, and raised the limit on the total number of messages per user).

The changes in logic are truly sweeping. And I want to note that this does not always mean years of development by a huge team and myriad lines of code. chat-engine and user-engine, together with all the extra pieces like Huffman message compression, Splay trees, and the structure for imported messages, come to less than 20 thousand lines of code. They were written by 3 developers in just 10 months (though it should be kept in mind that all three developers are world champions in competitive programming).

Moreover, instead of doubling the number of servers we ended up halving it: user-engine and chat-engine now live on 500 physical machines, and the new scheme still has a large load margin. We saved a lot of money on equipment - about $5 million, plus $750 thousand per year in operating costs.

We strive to find the best solutions for the most complex and large-scale tasks. We have plenty of them, which is why we are looking for talented developers for the database department. If you love and know how to solve such problems and have an excellent command of algorithms and data structures, we invite you to join the team. Contact our HR for details.

Even if this story is not about you, please note that we value recommendations. Tell a friend about our developer vacancies, and if they successfully pass the probation period, you will receive a bonus of 100 thousand rubles.

Source: habr.com
