New generation billing architecture: transformation with the transition to Tarantool

Why does a corporation like MegaFon need Tarantool in billing? From the outside it may seem that a vendor simply arrives, brings a big box, plugs it into an outlet, and there is your billing. That was once true, but it is archaic now, and such dinosaurs have died out or are dying out. Originally, billing was a system for issuing invoices: a counting machine or calculator. In today's telecom it is a system that automates the entire lifecycle of interaction with a subscriber, from signing the contract to termination, including real-time billing, payment acceptance and much more. Billing in a telecom company is like a combat robot: large, powerful and hung with weapons.

And what does Tarantool have to do with it? Oleg Ivlev and Andrey Knyazev will explain. Oleg is the chief architect of MegaFon, with extensive experience in foreign companies; Andrey is the director of business systems. From the transcript of their talk at Tarantool Conference 2018 you will learn why corporations need R&D, what Tarantool is, how the dead end of vertical scaling and globalization became the prerequisites for this database to appear in the company, what the technological challenges were, how the architecture was transformed, and how MegaFon's technology stack resembles those of Netflix, Google and Amazon.

Project "Unified Billing"

The project under discussion is called "Unified Billing". It was in this project that Tarantool showed its best qualities.

The performance growth of Hi-End equipment did not keep pace with the growth of the subscriber base and the number of services. Further growth in subscribers and services was expected because of M2M and IoT, and the peculiarities of individual branches were hurting time-to-market. The company decided to replace its 8 different billing systems with a single business system built on a unique, world-class modular architecture.

MegaFon is eight companies in one. In 2009 the reorganization was completed: branches across Russia merged into a single company, MegaFon OJSC (now PJSC). As a result, the company had 8 billing systems, each with its own custom solutions, branch-specific features, and different organizational structures, IT and marketing.

Everything was fine until we had to launch one common federal product. That is where the difficulties began: one branch rates calls with rounding up, another with rounding down, and a third by the arithmetic mean. There are thousands of such details.
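To see why such details matter, here is a minimal sketch of how the same call can cost different amounts under different rounding policies. The tariff, the policy names and the "nearest" policy (used here as a stand-in for the third variant) are illustrative assumptions, not MegaFon's actual rules.

```python
import math

PRICE_PER_MINUTE_KOPECKS = 150  # hypothetical tariff: 1.50 rubles per minute


def billed_minutes(seconds: int, policy: str) -> int:
    """Convert a call duration into billable minutes under a rounding policy."""
    minutes = seconds / 60
    if policy == "up":
        return math.ceil(minutes)
    if policy == "down":
        return math.floor(minutes)
    if policy == "nearest":
        # Round half up; an assumed interpretation of the third branch variant.
        return math.floor(minutes + 0.5)
    raise ValueError(f"unknown policy: {policy}")


def charge(seconds: int, policy: str) -> int:
    """Price of a call in kopecks under the given rounding policy."""
    return billed_minutes(seconds, policy) * PRICE_PER_MINUTE_KOPECKS


if __name__ == "__main__":
    # The same 100-second call, rated by three differently configured branches.
    for policy in ("up", "down", "nearest"):
        print(policy, charge(100, policy))
```

A federal product has to pick one of these behaviours for everyone, which is why diverged settings are expensive to merge.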

Although it was one version of the billing system from one supplier, the settings had diverged so far that gluing them back together would take a long time. We tried to reduce the number of systems and stumbled upon a second problem that is familiar to many corporations.

Vertical scaling. Even the most powerful hardware of the time did not meet our needs. We used Hewlett-Packard equipment from the Superdome Hi-End line, but it could not cover the needs of even two branches. We wanted horizontal scaling without high operating costs and capital investments.

Expected growth in the number of subscribers and services. Consultants have long been bringing stories about IoT and M2M to the telecom world: the time will come when every phone and every clothes iron has a SIM card, and the refrigerator has two. Today we have one number of subscribers, and in the near future there will be an order of magnitude more.

Technological challenges

These four reasons pushed us toward major changes. The choice was between upgrading the existing system and designing from scratch. We deliberated for a long time, made serious decisions and ran tenders. In the end we decided to design from scratch and took on some interesting technological challenges.

Scalability

If before there were, roughly speaking, 8 billing systems for 15 million subscribers each, now there has to be one system for 100 million subscribers and more, so the load is much higher.

We have become comparable in scale to major Internet players like Mail.ru or Netflix.

But the further growth in load and in the subscriber base set serious tasks for us.

The geography of our vast country

Kaliningrad and Vladivostok are 7,500 km and 10 time zones apart. The speed of light is finite, and at such distances the delays are already significant. 150 ms over the best modern optical links is a bit too much for real-time billing, especially of the kind telecom in Russia has now. In addition, updates have to be rolled out within one business day, and with different time zones that is a problem.
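A rough back-of-the-envelope check of the physical lower bound can be written as follows. The 7,500 km figure comes from the text; the refractive index of 1.47 is an assumed typical value for silica fiber, and the calculation ignores cable routing and equipment delays, which is why real links are slower than this bound.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47      # assumption: typical value for silica fiber
DISTANCE_KM = 7_500                # Kaliningrad to Vladivostok, from the text


def one_way_delay_ms(distance_km: float,
                     refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Minimum one-way propagation delay over fiber, in milliseconds."""
    speed_in_fiber_km_s = SPEED_OF_LIGHT_KM_S / refractive_index
    return distance_km / speed_in_fiber_km_s * 1000


def round_trip_ms(distance_km: float) -> float:
    """Minimum request/response delay: there and back again."""
    return 2 * one_way_delay_ms(distance_km)


if __name__ == "__main__":
    print(f"one way:    {one_way_delay_ms(DISTANCE_KM):.1f} ms")
    print(f"round trip: {round_trip_ms(DISTANCE_KM):.1f} ms")
```

Even this idealized bound is tens of milliseconds per round trip, so a real-time decision that needs several cross-country round trips quickly becomes noticeable to the subscriber.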

We do not just provide services for a monthly fee: we have complex tariffs, packages and various modifiers. We need not only to allow or forbid a subscriber to talk, but to grant a certain quota, rating calls and other actions in real time so that the subscriber does not notice.
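The quota idea can be illustrated with a minimal sketch: instead of a yes/no answer, the charging system reserves money for a slice of the call and settles once usage is reported. The class, the function names and the flat per-second price are hypothetical simplifications; real tariffs with packages and modifiers are far more complex.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: int        # money on the account, in kopecks
    reserved: int = 0   # money held for sessions that are still in progress


def grant_quota(account: Account, price_per_second: int, requested_seconds: int) -> int:
    """Reserve money for up to requested_seconds and return the granted seconds."""
    available = account.balance - account.reserved
    affordable = available // price_per_second
    granted = max(0, min(requested_seconds, affordable))
    account.reserved += granted * price_per_second
    return granted


def report_usage(account: Account, price_per_second: int,
                 granted_seconds: int, used_seconds: int) -> None:
    """Charge for the seconds actually used and release the rest of the reservation."""
    used = min(used_seconds, granted_seconds)
    account.reserved -= granted_seconds * price_per_second
    account.balance -= used * price_per_second


if __name__ == "__main__":
    acc = Account(balance=100)
    granted = grant_quota(acc, price_per_second=2, requested_seconds=60)
    print("granted seconds:", granted)
    report_usage(acc, price_per_second=2, granted_seconds=granted, used_seconds=30)
    print("balance after call:", acc.balance)
```

Because the network only comes back when a quota slice runs out, the subscriber keeps talking while the balance is protected from overspending.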

Fault tolerance

This is the other side of centralization.

If we gather all subscribers in one system, any emergency or disaster becomes devastating for the business. Therefore we design the system so that an accident cannot affect the entire subscriber base.

This is, again, a consequence of abandoning vertical scaling. When we moved to horizontal scaling, the number of servers grew from hundreds to thousands. They have to be manageable and interchangeable, the IT infrastructure has to be backed up automatically, and the distributed system has to be restorable.

These were the interesting challenges before us. We designed the system, and at that point we looked for the world's best practices to check how closely we were following advanced technologies.

World experience

Surprisingly, we did not find a single reference implementation in telecom anywhere in the world.

Europe dropped out because of the number of subscribers and the scale, the United States because of its flat tariffs. We looked at a few things in China, found something in India, and brought in specialists from Vodafone India.

To analyze the architecture, we assembled a Dream Team led by IBM: architects from different fields. These people could properly assess what we were doing and bring specific knowledge into our architecture.

Scale

A few numbers to illustrate.

We are designing the system for 80 million subscribers with headroom for over a billion. This is how we remove future thresholds. It is not because we are going to take over China, but because of the pressure of IoT and M2M.

300 million documents are processed in real time. Although we have 80 million subscribers, we also work with potential customers and with those who have left us, if receivables have to be collected. So the real volumes are much larger.

2 billion transactions change balances every day: payments, accruals, calls and other events. 200 TB of data changes actively and another 8 PB changes a little more slowly, and this is not an archive but live data in the single billing system. Data center scale: 5 thousand servers at 14 sites.
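To put the daily figure in perspective, it can be converted into an average rate. This is plain arithmetic on the numbers quoted above; it says nothing about peak load, which is normally well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

TRANSACTIONS_PER_DAY = 2_000_000_000  # balance changes per day, from the text
SUBSCRIBERS = 80_000_000              # subscriber base, from the text


def average_rate_per_second(events_per_day: int) -> float:
    """Average number of events per second over a full day."""
    return events_per_day / SECONDS_PER_DAY


def events_per_subscriber(events_per_day: int, subscribers: int) -> float:
    """Average number of daily events per subscriber."""
    return events_per_day / subscribers


if __name__ == "__main__":
    print(f"{average_rate_per_second(TRANSACTIONS_PER_DAY):,.0f} transactions per second on average")
    print(f"{events_per_subscriber(TRANSACTIONS_PER_DAY, SUBSCRIBERS):.0f} balance changes per subscriber per day")
```

On average this is a little over 23 thousand balance changes every second, or 25 per subscriber per day.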

Technology stack

When we planned the architecture and set about assembling the system, we adopted the most interesting and advanced technologies. The result was a technology stack familiar to any Internet player and to corporations that build high-load systems.

The stack is similar to the stacks of other big players: Netflix, Twitter, Viber. It consists of 6 components, but we want to reduce and unify it.

Flexibility is good, but in a large corporation there is no way without unification.

We are not going to replace, say, Oracle with Tarantool. In the reality of large companies this is a utopia, or a crusade lasting 5-10 years with an unclear outcome. But Cassandra and Couchbase can be completely replaced by Tarantool, and that is what we are aiming for.

Why Tarantool?

There are 4 simple criteria by which we chose this database.

Speed. We ran load tests on MegaFon production systems. Tarantool won: it showed the best performance.

It cannot be said that other systems do not meet MegaFon's needs. Current in-memory solutions are so performant that they give the company more than enough headroom. But we prefer to deal with the leader rather than with someone trailing behind, including in load tests.

Tarantool meets the needs of the company even in the long run.

Total cost of ownership (TCO). Couchbase support at MegaFon volumes costs astronomical sums, while with Tarantool the situation is much more pleasant, and the two are close in functionality.

Another nice feature that slightly influenced our choice is that Tarantool uses memory better than other databases. It shows maximum efficiency.

Reliability. MegaFon invests in reliability like no other. Therefore, when we looked at Tarantool, we realized that we need to make sure that it meets our requirements.

We invested our time and money, and together with Mail.ru we created an enterprise version, which is now used by several other companies.

Tarantool-enterprise completely satisfied us in terms of security, reliability, and logging.

Sponsors

The most important thing for me is direct contact with the developer. This is exactly how the Tarantool team won us over.

If you come to a vendor, especially one that works with an anchor client, and say that you need the database to do this, this and that, the usual answer is:

"All right, put the requirements at the bottom of that pile, we'll probably get to them someday."

Many vendors have a roadmap for the next 2-3 years that is almost impossible to get into, while the Tarantool developers win people over with their openness, not only toward MegaFon, and adapt their system to the customer. That is great, and we really like it.

Where we applied Tarantool

We use Tarantool in several places. The first is the pilot we built on the address catalog system. At one point I wanted it to be a system similar to Yandex.Maps and Google Maps, but it turned out a little differently.

Take the address catalog in the sales interface. On Oracle, finding the right address takes 12-13 seconds, which is uncomfortable. When we switch to Tarantool, swapping Oracle for the other database in the console, and run the same search, we get a 200x speedup! The city pops up after the third letter. We are now adapting the interface so that this happens after the first one. The response time is in a different league: milliseconds instead of seconds.
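Why an in-memory, ordered index answers "type three letters, get the city" so quickly can be shown with a small sketch. This is not Tarantool's API: it is a plain Python illustration that uses a sorted list and binary search, and the class name and sample cities are made up for the example.

```python
import bisect


class PrefixIndex:
    """In-memory prefix search over names, backed by a sorted list."""

    def __init__(self, names):
        # Sort by a case-insensitive key, keeping the original spelling alongside.
        pairs = sorted((name.casefold(), name) for name in names)
        self._keys = [key for key, _ in pairs]
        self._names = [name for _, name in pairs]

    def search(self, prefix: str, limit: int = 10):
        """Return up to `limit` names that start with `prefix`."""
        key = prefix.casefold()
        # Binary search finds the first candidate in O(log n) steps.
        start = bisect.bisect_left(self._keys, key)
        results = []
        for i in range(start, len(self._keys)):
            if len(results) >= limit or not self._keys[i].startswith(key):
                break
            results.append(self._names[i])
        return results


if __name__ == "__main__":
    index = PrefixIndex(["Moscow", "Mozhaysk", "Murmansk", "Samara", "Saratov"])
    print(index.search("mo"))
    print(index.search("sar"))
```

The lookup touches only the matching entries and never leaves memory, which is the kind of access pattern that makes millisecond responses realistic for an address catalog.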

The second application is a trendy topic called two-speed IT. It is trendy because consultants on every corner say that corporations should go there.

There is an infrastructure layer, and above it are the domains, for example the billing system as the telecom part, corporate systems and corporate reporting. This is the core, which should not be touched. Or rather, it can be touched, but only with paranoid attention to quality, because it earns the corporation its money.

Next comes the microservices layer, the part that differentiates an operator or any other player. Microservices can be built quickly on top of caches, pulling data from different domains into them. This is a field for experiments: if something did not work out, you close one microservice and open another. This really improves time-to-market and increases the reliability and speed of the company.

Microservices are perhaps the main role of Tarantool at MegaFon.

Where do we plan to use Tarantool

If we compare our successful billing project with the transformation programs at Deutsche Telekom, Svyazcom and Vodafone India, it is amazingly dynamic and creative. In the course of this project not only were MegaFon and its structure transformed, but Tarantool-enterprise also appeared at Mail.ru, and our vendor Nexign (formerly Peter-Service) got BSS Box, a boxed billing solution.

This is, in a sense, a historic project for the Russian market. It can be compared with what Frederick Brooks describes in the book "The Mythical Man-Month". Back then, in the 60s, IBM hired 5,000 people to develop the new OS/360 operating system for its mainframes. We have fewer - 000, but ours are a tough crew, and with open source and new approaches we work more productively.

Below are the domains of billing or, more broadly, of business systems. Enterprise people know CRM very well. The other systems should be familiar to everyone by now: Open API, API Gateway.

Open API

Let's look at the numbers again and at how the Open API works now. Its load is 10,000 transactions per second. Since we plan to actively develop the microservices layer and build MegaFon's public API, we expect the most growth in this particular part. 100,000 transactions per second will definitely be reached.

I don't know whether our SSO can be compared with Mail.ru's: they seem to have 1,000,000 transactions per second. We are extremely interested in their solution and plan to learn from their experience, for example by building a functional standby for SSO on Tarantool. The developers from Mail.ru are now doing this for us.

CRM

CRM is the same 80 million subscribers that we want to take to a billion, because there are already 300 million documents that include a three-year history. We are really looking forward to new services, and here the growth point is connected services. This is a snowball that will keep growing, because there will be more and more services. Accordingly, the history will be needed, and we do not want to stumble over it.

Billing itself, in the sense of invoicing and working with customer receivables, has been moved into a separate domain. To improve performance, the Domain Architecture pattern was applied.

The system is divided into domains, the load is distributed and fault tolerance is provided. Additionally, we worked with a distributed architecture.

Everything else is enterprise-level solutions. Call storage: 2 billion per day, 60 billion per month. Sometimes they have to be recalculated for a whole month, and preferably quickly. Financial monitoring is the very same 300 million, which keeps growing: subscribers often move between operators, which increases this part.

The most telecom-specific component of mobile communications is online billing. These are the systems that decide in real time whether a call is allowed or not. The load here is 30,000 transactions per second, but given the growth in data traffic we are planning for 250,000, which is why we are very interested in Tarantool.

The previous picture shows the domains where we are going to apply Tarantool. CRM itself is, of course, broader, and we are going to apply Tarantool in its core.

Our design target of 100 million subscribers worries me as an architect: what if there are 101 million? Do we redo everything? To prevent this, we use caches, which at the same time increases availability.

In general, there are two approaches to using Tarantool. The first is to build all caches at the microservice level. As far as I understand, VimpelCom is following this path, creating a customer cache.

We are less dependent on vendors, and we are changing the BSS core, so we get a single customer file out of the box. But we want to extend it. Therefore we take a slightly different approach and build caches inside the systems.

This way there is less desynchronization: one system is responsible for both the cache and the master source.
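The idea of one system owning both the cache and the master source can be sketched as a write-through cache. The class below is a hypothetical illustration with plain dictionaries standing in for the master database and for the in-memory cache; it is not how the BSS core is actually implemented.

```python
class CustomerCardFile:
    """One component owns both the master store and its cache (write-through)."""

    def __init__(self):
        self._master = {}      # stand-in for the system's master database
        self._cache = {}       # stand-in for the in-memory cache
        self.master_reads = 0  # counts how often a read had to go to the master

    def update(self, customer_id: int, record: dict) -> None:
        """Write to the master first, then refresh the cache in the same call."""
        self._master[customer_id] = dict(record)
        self._cache[customer_id] = dict(record)

    def get(self, customer_id: int):
        """Serve from the cache; fall back to the master only on a miss."""
        if customer_id in self._cache:
            return self._cache[customer_id]
        self.master_reads += 1
        record = self._master.get(customer_id)
        if record is not None:
            self._cache[customer_id] = dict(record)
        return record


if __name__ == "__main__":
    cards = CustomerCardFile()
    cards.update(1, {"name": "Ivan", "tariff": "basic"})
    cards.update(1, {"name": "Ivan", "tariff": "premium"})
    print(cards.get(1), "master reads:", cards.master_reads)
```

Because every write passes through the same owner, readers never see a cache entry that the master does not know about, which is the desynchronization a separate microservice-level cache has to work harder to avoid.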

The method fits well with Tarantool's approach of a transactional skeleton, where only the parts that concern updates, that is, data changes, are kept in it. Everything else can be stored elsewhere. There is no huge data lake and no unmanaged global cache. Caches are designed for a system, for products, for customers, or to make life easier for support. When a subscriber upset about quality calls in, we want to serve them well.

RTO and RPO

There are two terms in IT: RTO and RPO.

Recovery time objective is the time it takes to restore the service after a failure. RTO = 0 means that even if something has gone down, the service keeps working.

Recovery point objective is the point to which data is restored: how much data, measured as a period of time, we can afford to lose. RPO = 0 means that we do not lose data.

Tarantool task

Let's try to solve a task for Tarantool.

Given: a cart of orders that everyone understands, as at Amazon or anywhere else. Required: the cart must work 24 hours a day, 7 days a week, or 99.99% of the time. Orders that come to us must keep their sequence, because we cannot switch a subscriber's service on or off at random: everything must be strictly sequential. The previous subscription affects the next one, so the data matters and nothing may be lost.
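What 99.99% means in practice is easy to work out. The sketch below is plain arithmetic; the only assumption is a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget, in minutes, for a given availability over a period."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

At 99.99% the whole budget for a year is under an hour, which leaves no room for a manual recovery procedure after a data center failure.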

Solution. You can try to solve it head-on and ask the database developers, but mathematically the problem has no solution. You can recall theorems, conservation laws and quantum physics, but there is no point: it cannot be solved at the database level.

The good old architectural approach works here: you need to know the subject area well and solve the puzzle by relying on that knowledge.

Our solution: a distributed Tarantool order registry, a geo-distributed cluster. In the diagram these are three different data centers, two west of the Urals and one beyond the Urals, and we distribute all orders across them.

Netflix, now considered one of the leaders in IT, had only one data center until 2012. On December 24, the eve of Catholic Christmas, that data center went down. Users in Canada and the United States were left without their favorite films, were very upset and wrote about it on social networks. Netflix now has three data centers on the west and east coasts and one in western Europe.

We are building a geo-distributed solution from the start, because fault tolerance is important to us.

So we have a cluster, but what about RPO = 0 and RTO = 0? The solution is simple, and it depends on the subject area.

What matters in orders? There are two parts: filling the cart BEFORE the purchase decision, and what happens AFTER. In telecom the BEFORE part is usually called order capturing or order negotiation. It can be much harder than in an online store, because the customer has to be served and offered 5 options, and all of this takes some time while the cart is being filled. A failure is possible at this point, but it is not critical, because the process is interactive and supervised by a person.

If the Moscow data center suddenly fails, we switch automatically to another data center and keep working. In theory one item in the cart may be lost, but you can see that, add it to the cart again and carry on. In this case RTO = 0.

The second part is different: once we have clicked "submit", we want no data to be lost. From this moment automation takes over, and this is RPO = 0. Two different patterns are used here: in one case it can be simply a geo-distributed cluster with one switchable master, in the other some form of quorum write. The patterns may vary, but the problem gets solved.
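The quorum-write pattern mentioned above can be illustrated with a toy simulation. This is a sketch of the general technique, not of Tarantool's replication: the replica class, the data center names and the in-memory logs are all hypothetical, and a real system would also have to roll back or repair a write that failed to reach a quorum.

```python
class Replica:
    """A toy replica that keeps an append-only log of submitted orders."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.log = []

    def append(self, order: str) -> bool:
        """Store the order and acknowledge, unless the replica is down."""
        if not self.available:
            return False
        self.log.append(order)
        return True


def quorum_write(replicas, order: str) -> bool:
    """Confirm the order only if a majority of replicas acknowledged it."""
    quorum = len(replicas) // 2 + 1
    acks = sum(1 for replica in replicas if replica.append(order))
    return acks >= quorum


if __name__ == "__main__":
    # Hypothetical data centers: two west of the Urals, one beyond.
    cluster = [Replica("dc-west-1"), Replica("dc-west-2", available=False), Replica("dc-east-1")]
    print("confirmed:", quorum_write(cluster, "order-42"))
```

With three data centers the order is confirmed as long as any two of them have stored it, so losing one site after "submit" does not lose the order.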

Further, with a distributed order registry we can also scale all of this out, with many dispatchers and executors accessing the registry.

Cassandra and Tarantool Together

There is another case, the "balance showcase". It is an interesting example of using Cassandra and Tarantool together.

We use Cassandra because 2 billion calls per day is not the limit, and there will be more. Marketers like to color traffic by source, and more and more detail appears, for social networks for example. All of it adds to the history.

Cassandra allows you to scale horizontally to any volume.

We are comfortable with Cassandra, but it has one problem: it is not good at reads. Writes are fine, 30,000 per second is not a problem; the problem is reads.

That is how the question of a cache came up, and at the same time we solved another problem. There is the old traditional case where files from switching equipment and from online billing arrive and we load them into Cassandra. We struggled with reliable loading of these files and, on IBM's advice, even used managed file transfer: there are solutions that handle file transfer efficiently using, for example, UDP rather than TCP. That is good, but it still takes minutes, and until everything is loaded the call center operator cannot tell the customer what happened to their balance; they have to wait.

To prevent this, we run a parallel functional standby. We send each event through Kafka to Tarantool and recalculate aggregates in real time, for example for the current day. The result is a balance cache that can serve balances at any speed, for example 100 thousand transactions per second, within those same 2 seconds.
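The real-time recalculation can be sketched as a small in-memory aggregator that is updated on every event. In this illustration a plain Python list stands in for the Kafka stream and dictionaries stand in for Tarantool; the class name and the event fields (`subscriber`, `amount`, `reason`) are assumptions made for the example.

```python
from collections import defaultdict


class BalanceShowcase:
    """Keeps per-subscriber aggregates up to date as events arrive."""

    def __init__(self):
        self._balance = defaultdict(int)      # current balance, in kopecks
        self._spent_today = defaultdict(int)  # total charged during the current day
        self._history = defaultdict(list)     # reasons for recent balance changes

    def apply(self, event: dict) -> None:
        """Update all aggregates for one event; negative amounts are charges."""
        subscriber = event["subscriber"]
        amount = event["amount"]
        self._balance[subscriber] += amount
        if amount < 0:
            self._spent_today[subscriber] += -amount
        self._history[subscriber].append((amount, event["reason"]))

    def view(self, subscriber: str) -> dict:
        """What a call center operator or a personal account page would read."""
        return {
            "balance": self._balance[subscriber],
            "spent_today": self._spent_today[subscriber],
            "last_change": self._history[subscriber][-1] if self._history[subscriber] else None,
        }


if __name__ == "__main__":
    showcase = BalanceShowcase()
    stream = [  # stand-in for events consumed from a message queue
        {"subscriber": "A", "amount": 50_000, "reason": "payment"},
        {"subscriber": "A", "amount": -300, "reason": "call, 2 min"},
    ]
    for event in stream:
        showcase.apply(event)
    print(showcase.view("A"))
```

Reads never touch the bulk call storage: the answer is already aggregated when the question arrives, which is what makes a read rate far above the write rate feasible.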

The goal is that 2 seconds after a call, your personal account shows not only the changed balance but also the reason why it changed.

Conclusion

These were examples of using Tarantool. We really liked the openness of Mail.ru, their willingness to consider different cases.

It is already difficult for consultants from BCG or McKinsey, Accenture or IBM to surprise us with something new - much of what they offer, we either already do, or have done, or plan to do. I think that Tarantool will take its rightful place in our technology stack and replace many existing technologies. We are in the active phase of the development of this project.

The talk by Oleg and Andrey was one of the best at last year's Tarantool Conference, and on June 17 Oleg Ivlev will speak at T+ Conference 2019 with the talk "Why Tarantool in Enterprise". Alexander Deulin from MegaFon will also give a talk, "Tarantool Caches and Oracle Replication". Find out what has changed and which plans have been implemented. Join us: the conference is free, you only need to register. All talks have been accepted and the program is set: new cases, new experience of using Tarantool, architecture, enterprise, tutorials and microservices.

Source: habr.com
