From Skype to WebRTC: how we organized video communication over the web

Video communication is the main way a teacher and a student interact on the Vimbox platform. We abandoned Skype long ago, tried several third-party solutions, and eventually settled on a combination of WebRTC and Janus Gateway. For a while everything suited us, but problems kept surfacing. As a result, we created a separate team dedicated to video.

I asked Kirill Rogovoy, the head of the new team, to talk about how video communication evolved at Skyeng, the problems we found, and the solutions and workarounds we ended up applying. We hope the article will be useful for companies that also build video on their own inside a web application.

A bit of history

In the summer of 2017, Sergey Safonov, head of development at Skyeng, gave a talk at Backend Conf about how we “abandoned Skype and implemented WebRTC”. Those interested can watch the recording of the talk at the link (~45 min); here I will briefly outline the gist.

At Skyeng School, video has always been the primary way teachers and students communicate. At first we used Skype, but it fell short for a number of reasons, primarily the lack of logs and the impossibility of integrating it directly into the web application. So we ran all sorts of experiments.

Our requirements for video communication were:
- stability;
- low price per lesson;
- lesson recording;
- tracking who speaks how much (it matters to us that students speak more than the teacher during lessons);
- linear scaling;
- the ability to use both UDP and TCP.

In 2013 we first tried Tokbox. Everything worked well, but it turned out to be very expensive, at 113 rubles per lesson, and ate into the profit.

Then in 2015 we integrated Voximplant. It had the feature we needed for tracking who speaks how much, and it was much cheaper: with audio-only recording it came out to 20 rubles per lesson. However, it worked only over UDP, with no way to switch to TCP. Even so, about 40% of students ended up using it.

A year later, we started getting corporate clients with their own specific requirements. For example, everything has to work in a browser, and the company network has only HTTP and HTTPS open, which rules out Skype and UDP. Corporate clients mean money, so we went back to Tokbox, but the price problem did not go away.

Solution - WebRTC and Janus

We decided to use WebRTC, the browser platform for peer-to-peer video communication. It takes care of establishing the connection, encoding and decoding streams, synchronizing tracks, and quality control, including handling network glitches. On our side, we have to read the streams from the camera and microphone, render the video, manage the connection, set up the WebRTC connection and feed the streams into it, and pass signaling messages between the clients to establish the connection (WebRTC itself describes only the format of this data, not the mechanism for transmitting it). If the clients are behind NAT, WebRTC uses STUN servers; if that does not help, it falls back to TURN servers.

A plain p2p connection is not enough for us, because we want to record lessons for later analysis in case of complaints. So we send the WebRTC streams through a relay, Janus Gateway by Meetecho. As a result, the clients do not know each other's addresses and see only the address of the Janus server; it also acts as the signaling server. Janus has many of the features we need: it automatically switches to TCP if the client has UDP blocked; it can record both UDP and TCP streams; it scales; it even has a built-in plugin for echo tests. When necessary, STUN and TURN servers from Twilio are connected automatically.

In the summer of 2017, we had two Janus servers running, plus an additional server for processing the recorded raw audio and video files, so as not to load the CPUs of the main ones. On connection, a Janus server was picked on an even-odd basis (by connection number). At the time this was enough: by our estimate it gave roughly a fourfold safety margin, and the rollout had reached about 80%. The price dropped to ~2 rubles per lesson, plus development and support.
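
The even-odd selection described above can be sketched in a few lines. This is only an illustration: the host names are hypothetical, and which parity maps to which server is an assumption, since the article does not say.

```python
# Hypothetical host names; the article does not name the two servers.
JANUS_SERVERS = ["janus-0.example.com", "janus-1.example.com"]


def pick_server_by_parity(connection_number: int) -> str:
    """Send even connection numbers to the first server and odd ones to the second.

    The mapping of parity to a particular server is an assumption.
    """
    return JANUS_SERVERS[connection_number % 2]
```

With consecutive connection numbers this splits the load roughly in half, but it knows nothing about server load or about where the participants are located.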

Return to the topic of video communication

We constantly monitor feedback from students and teachers in order to spot and fix problems in time. By the summer of 2018, call quality had firmly taken first place among complaints. On the one hand, this meant we had dealt successfully with other shortcomings. On the other hand, we urgently had to do something: if a lesson is disrupted, we risk losing its cost, sometimes along with the purchase of the next package, and if an introductory lesson is disrupted, we may lose a potential client altogether.

At that time, our video communication was still in MVP mode. Simply put, we launched it, it worked, we scaled it once and understood how to do that, and that was good enough. If it works, don't fix it. No one was deliberately working on call quality. By August it became clear that this could not continue, and we set up a separate team to figure out what was wrong with WebRTC and Janus.

At the start, this team inherited an MVP solution with no metrics, no goals, and no improvement processes, while 7% of teachers were complaining about call quality (there was no data on students at all).

The new team gets to work

The team looks roughly like this:

  • The team lead, who is also the main developer.
  • QA engineers help test changes, look for new ways to create unstable network conditions, and report problems from the front line.
  • The analyst constantly looks for correlations in the technical data, improves the analysis of user feedback, and checks the results of experiments.
  • The product manager helps with the overall direction and with allocating resources for experiments.
  • A second developer often helps with the programming itself and with related tasks.

To begin with, we set up a relatively reliable metric that tracked changes in the rating of call quality (averaged over days, weeks, and months). At first these were ratings from teachers; ratings from students were added later. Then we started forming hypotheses about what was going wrong, fixing it, and watching how the trend changed. We went for the low-hanging fruit first: for example, we replaced the VP8 codec with VP9, and the numbers improved. We tried tweaking the Janus settings and ran other experiments; in most cases they led nowhere.
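
As a rough illustration of such a metric, the sketch below averages call-quality ratings by day, ISO week, or month. The function name, the `(date, score)` input shape, and the numeric rating scale are all assumptions; the article does not describe how the ratings are stored.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def average_rating_by_period(ratings, period="week"):
    """Average (date, score) pairs per day, ISO week, or month.

    The input shape and the numeric scale are assumptions for illustration.
    """
    buckets = defaultdict(list)
    for day, score in ratings:
        if period == "day":
            key = day.isoformat()
        elif period == "week":
            iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        elif period == "month":
            key = f"{day.year}-{day.month:02d}"
        else:
            raise ValueError(f"unknown period: {period}")
        buckets[key].append(score)
    # Sort by key so the result reads chronologically.
    return {key: mean(scores) for key, scores in sorted(buckets.items())}


# Hypothetical ratings, not real data.
sample = [(date(2018, 9, 3), 5), (date(2018, 9, 4), 3), (date(2018, 9, 10), 4)]
weekly = average_rating_by_period(sample, "week")
```

Comparing the same period before and after a change is what lets an experiment be judged by the trend rather than by individual complaints.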

At the second stage, a hypothesis emerged: WebRTC is a peer-to-peer solution, yet we use a server in the middle. Could the problem lie there? We started digging and found what has been the most significant improvement so far.

At that time, a server was chosen from the pool by a rather crude algorithm: each server had its own “weight”, depending on its channel and capacity, and we tried to send the user to the one with the greatest weight, ignoring where the user was located geographically. As a result, a teacher from St. Petersburg could talk to a student from Siberia through Moscow rather than through our Janus server in St. Petersburg.

The algorithm was changed: now, when a user opens our platform, we use Ajax to collect pings from them to all servers. When establishing a connection, we choose the server whose pair of pings (teacher-server and student-server) has the smallest sum. A lower ping means a shorter network distance to the server; a shorter distance means a lower probability of losing packets; and packet loss is the biggest negative factor in video communication. The share of negative feedback halved in three months (to be fair, other experiments were running at the same time, but this one almost certainly had the biggest impact).
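
A minimal sketch of this selection rule, assuming each participant's measurements arrive as a mapping from server name to round-trip time in milliseconds. The function name, the data shape, and the sample values are hypothetical.

```python
def pick_server_by_ping(teacher_pings, student_pings):
    """Return the server with the smallest teacher ping + student ping.

    Each argument maps a server name to a measured round-trip time in ms.
    Only servers measured by both participants are considered.
    """
    common = teacher_pings.keys() & student_pings.keys()
    if not common:
        raise ValueError("no server was measured by both participants")
    # Ties are broken by server name so the result is deterministic.
    return min(common, key=lambda s: (teacher_pings[s] + student_pings[s], s))


# Hypothetical measurements: a teacher near "spb" and a student far from both.
teacher = {"spb": 5, "msk": 15}
student = {"spb": 60, "msk": 65}
chosen = pick_server_by_ping(teacher, student)
```

Here the sums are 65 ms for "spb" and 80 ms for "msk", so "spb" is chosen, which matches the St. Petersburg example from the previous paragraph.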

We recently discovered another non-obvious but apparently important thing: instead of one powerful Janus server on a fat channel, it is better to have two simpler ones with less bandwidth. We found this out after buying powerful machines in the hope of packing as many rooms (communication sessions) onto each as possible. Servers have a bandwidth limit, which we can translate accurately into a number of rooms: we know how many can be open at, say, 300 Mbps. As soon as too many rooms are open on a server, we stop choosing it for new lessons until the load drops. The idea was that with a powerful machine we would load its channel to the maximum, so that in the end we would be limited by CPU and memory rather than by bandwidth. But it turned out that beyond a certain number of open rooms (420), even though CPU, memory and disk load are still very far from their limits, complaints start arriving at technical support. Apparently something degrades inside Janus; perhaps it has limits of its own. We experimented and lowered the bandwidth limit from 300 to 200 Mbps, and the problems went away. We have now bought three new servers at once, with lower limits and specs, and we expect this to bring a lasting improvement in call quality. Admittedly, we did not dig into what exactly was going wrong; workarounds are our everything. In our defense, at that moment we had to solve an urgent problem as fast as possible rather than elegantly; besides, Janus is a black box to us, written in C, and poking around in it is very expensive.
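
The admission rule described above (translate the bandwidth limit into a room cap and stop picking a full server) might look roughly like this. The class, the function, and especially the per-room bandwidth figure are assumptions made for illustration; the article gives only the limits (300 and 200 Mbps), not the per-room consumption.

```python
from dataclasses import dataclass


@dataclass
class JanusServer:
    name: str
    bandwidth_limit_mbps: float
    open_rooms: int = 0

    def max_rooms(self, per_room_mbps: float) -> int:
        """Translate the bandwidth limit into a maximum number of rooms."""
        return int(self.bandwidth_limit_mbps // per_room_mbps)

    def accepts_new_rooms(self, per_room_mbps: float) -> bool:
        return self.open_rooms < self.max_rooms(per_room_mbps)


def available_servers(servers, per_room_mbps):
    """Keep only the servers that are still below their room cap."""
    return [s for s in servers if s.accepts_new_rooms(per_room_mbps)]


# 0.5 Mbps per room is a placeholder value, not a figure from the article.
PER_ROOM_MBPS = 0.5
pool = [
    JanusServer("full", bandwidth_limit_mbps=200, open_rooms=400),
    JanusServer("free", bandwidth_limit_mbps=200, open_rooms=120),
]
candidates = available_servers(pool, PER_ROOM_MBPS)
```

Under this scheme, lowering the limit from 300 to 200 Mbps simply lowers the room cap, which is what keeps a server below the point where quality started to degrade.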

Along the way, we also:

  • updated every dependency that could be updated, both on the server and on the client (these were experiments too, and we monitored the results);
  • fixed all the bugs we found in specific cases, for example when the connection dropped and was not restored automatically;
  • held many meetings with companies that work with video and know our kinds of problems (game streaming, webinar hosting), and tried everything that seemed useful;
  • carried out a technical review of the hardware and connection quality of the teachers who generated the most complaints.

The experiments and the changes that followed reduced dissatisfaction with call quality among teachers from 7.1% in January 2018 to 2.5% in January 2019.

What's next

Stabilizing our Vimbox platform is one of the company's main projects for 2019. We very much hope to keep up the momentum and no longer see video calls among the top complaints. We understand that a significant share of these complaints is caused by lag in users' computers and internet connections, but we need to identify that share and deal with the rest. Everything else is a technical problem, and we should be able to handle it.

The main difficulty is that we do not know to what level it is actually possible to improve the quality. Finding out this ceiling is the main task. Therefore, two experiments were planned:

  1. compare video through Janus with regular p2p in production. This experiment has already been carried out, and no statistically significant difference was found between our solution and p2p;
  2. try (expensive) services from companies that make their living exclusively from video communication solutions, and compare the amount of negative feedback they produce with what we have now.

These two experiments will allow us to identify an achievable goal and focus on it.
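
One common way to check whether two complaint rates differ significantly is a two-proportion z-test. The article does not say which test was actually used, so the sketch below is only an assumption about how such a comparison could be done, and the sample counts are made up.

```python
from math import erfc, sqrt


def two_proportion_z_test(neg_a, total_a, neg_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). neg_* are counts of negative ratings and
    total_* are counts of all rated lessons in each group.
    """
    p_a = neg_a / total_a
    p_b = neg_b / total_b
    pooled = (neg_a + neg_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # both groups are all-negative or all-positive
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts, not data from the experiment.
z, p = two_proportion_z_test(25, 1000, 25, 1000)
```

A p-value above the chosen threshold (0.05 is a typical choice) would be read as "no statistically significant difference" between the two groups.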

In addition, there are a number of tasks we are working through as part of routine work:

  • we are creating a technical metric of call quality to replace subjective feedback;
  • we are making session logs more detailed, so that we can analyze failures more accurately and understand exactly when and where they occurred and which seemingly unrelated events were happening at that moment;
  • we are preparing an automatic connection quality test before the lesson, and we will also let the client test the connection manually, to reduce the negative feedback caused by their own hardware and channel;
  • we are developing and running more video load tests under poor conditions, with variable packet loss, etc.;
  • we are changing how the servers behave when problems occur, to increase fault tolerance;
  • we will warn the user when something is wrong with their connection, as Skype does, so that they understand the problem is on their side.

Since April, the video team has become a full-fledged separate project within Skyeng, working on its own product rather than just a part of Vimbox. This means we are starting to look for people to work on video full-time. And, as always, we are looking for lots of good people.

And, of course, we continue to talk actively with people and companies working with video communication. If you want to exchange experience with us, we will be glad! Leave a comment or get in touch; we will answer everyone.

Source: habr.com