Megapacket: How Factorio Solved the 200-Player Multiplayer Problem

In May of this year, I took part as a player in one of KatherineOfSky's MMO events. I noticed that once the number of players reached a certain point, some of them would get dropped from the game every few minutes. Luckily for you (but not for me), I was one of those players every time, even with a good connection. I took it as a personal challenge and began looking for the cause of the problem. After three weeks of debugging, testing, and fixing, the bug is finally fixed, but the journey was not an easy one.

Problems in multiplayer games are very hard to track down. They usually occur only under very specific network conditions and very specific game states (in this case, more than 200 players). And even when a problem can be reproduced, it can't be properly debugged, because setting breakpoints stops the game, messes up the timers, and usually causes the connection to time out. But thanks to perseverance and a wonderful tool called clumsy, I was able to figure out what was going on.

In short, due to a bug and an incomplete implementation of the latency state simulation, a client would sometimes end up having to send, in a single tick, a network packet containing the input actions for selecting roughly 400 game entities (we call it a "megapacket"). The server then has to not only receive all of these input actions correctly, but also forward them to every other client. With 200 clients, that quickly becomes a problem: the server's channel gets clogged, packets are lost, and a cascade of re-requested packets follows. Postponing the input actions then causes even more clients to start sending megapackets, and the avalanche grows stronger. The lucky clients manage to recover; the rest get dropped.

The problem turned out to be quite fundamental, and it took me two weeks to fix. It's pretty technical, so the juicy details are explained below. But first, you should know that since version 0.17.54, released on June 4, multiplayer is more stable in the face of temporary connection problems, and latency hiding is much less glitchy (less stuttering and teleporting). I also changed the way latency hiding is done during combat, which will hopefully make fighting feel a bit smoother.

Multiplayer Megapacket: Technical Details

To put it simply, multiplayer in the game works like this: every client simulates the game state while receiving and sending only player input (called "input actions"). The main job of the server is to relay the input actions and make sure that all clients execute the same actions in the same tick. You can read more about this in FFF-149.
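
To make the idea concrete, here is a minimal sketch of that lockstep relay. All the names (InputAction, TickClosure, Server, buildTickClosure) are made up for illustration and this is not Factorio's actual code; it only shows the principle that the server groups whatever input arrived and broadcasts the same group, stamped with a tick number, to everyone.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One unit of player input, already serialized by the client.
struct InputAction {
    std::uint16_t playerId;
    std::string data;
};

// Everything the server decided should happen in one game tick.
struct TickClosure {
    std::uint32_t tick;
    std::vector<InputAction> actions;
};

class Server {
public:
    // Called whenever a packet with input actions arrives from a client.
    void receiveFromClient(InputAction action) {
        auto& queue = pending[action.playerId];
        queue.push_back(std::move(action));
    }

    // Called once per tick: collect what has arrived, stamp it with the tick
    // number and send the same closure to every client, so that all of them
    // simulate identical actions in identical ticks.
    TickClosure buildTickClosure() {
        TickClosure closure{nextTick++, {}};
        for (auto& entry : pending) {
            for (InputAction& a : entry.second)
                closure.actions.push_back(std::move(a));
            entry.second.clear();
        }
        return closure; // broadcast(closure) would go out over the network here
    }

private:
    std::uint32_t nextTick = 0;
    std::map<std::uint16_t, std::vector<InputAction>> pending;
};
```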

Since the server decides which actions are executed in which tick, each player action travels the following path: player action -> game client -> network -> server -> network -> game client. This means that every action is performed only after it has made a full round trip over the network. That would make the game feel terribly sluggish, so almost immediately after multiplayer appeared, a latency hiding mechanism was introduced. Latency hiding simulates the player's own input without taking into account the actions of other players or the decisions of the server.

Factorio has the game state: the complete state of the map, the players, the entities, and everything else. It is simulated deterministically on every client based on the actions received from the server. The game state is sacred, and if it ever starts to differ from the server's or any other client's, a desync occurs.

On top of the game state we have the latency state. It contains a small subset of the main state. The latency state is not sacred; it merely represents what the game state is expected to look like in the future, based on the player's own input actions.

To do this, we keep a copy of the input actions the client generates in a latency queue.

That is, at the end of the process, the picture on the client side looks something like this:

  1. Apply the input actions of all players to the game state, as they are received from the server.
  2. Remove from the latency queue all the input actions that, according to the server, have already been applied to the game state.
  3. Delete the latency state and reset it so that it looks exactly like the game state.
  4. Apply all the actions from the latency queue to the latency state.
  5. Render the game to the player based on both the game state and the latency state.

All of this is repeated every tick. A rough sketch of this loop is shown below.
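
The sketch below is a compilable illustration of that per-tick loop, with invented names (latencyQueue, apply, update) and a trivial stand-in for the simulation; it is not Factorio's implementation, only the structure described in the list above: the game state advances only with server-confirmed actions, while the latency state is thrown away and rebuilt from the latency queue every tick.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// One player input, tagged with the tick the client generated it in.
struct InputAction {
    std::uint32_t tick;
    int payload;
};

// Stand-in for the full map/entity/player state.
struct GameState {
    int value = 0;
};

struct Client {
    GameState gameState;                  // sacred, deterministic
    GameState latencyState;               // disposable local prediction
    std::deque<InputAction> latencyQueue; // our own not-yet-confirmed actions

    static void apply(GameState& state, const InputAction& a) {
        state.value += a.payload;         // stand-in for real simulation
    }

    void update(const std::vector<InputAction>& confirmedByServer,
                std::uint32_t lastTickAppliedByServer) {
        // 1. Apply everyone's confirmed input actions to the game state.
        for (const InputAction& a : confirmedByServer)
            apply(gameState, a);

        // 2. Drop from the latency queue the actions the server already applied.
        while (!latencyQueue.empty() &&
               latencyQueue.front().tick <= lastTickAppliedByServer)
            latencyQueue.pop_front();

        // 3. Throw the latency state away and reset it to the game state.
        latencyState = gameState;

        // 4. Re-apply the still-pending local actions on top of it.
        for (const InputAction& a : latencyQueue)
            apply(latencyState, a);

        // 5. Rendering then reads from both gameState and latencyState.
    }
};
```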

Too complicated? Don't relax, that's not all. To compensate for unreliable Internet connections, we created two mechanisms:

  • Skipped ticks: when the server decides which input actions will be executed in a given game tick and it has not received the input actions of some player (for example, because of increased latency), it does not wait; instead it tells that client "I did not include your input actions, I will try to include them in the next tick." This is done so that one player's connection (or computer) problems don't slow down the map update for everyone else. Note that the input actions are not ignored, only postponed (a rough sketch of this decision follows the list).
  • Full round-trip latency: the server tries to estimate the round-trip latency between itself and each client. Every 5 seconds it negotiates a new latency with the client if needed (depending on how the connection has behaved in the past), and increases or decreases the expected round-trip latency accordingly.
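
Here is a minimal sketch of the skipped-ticks decision, again with hypothetical names (closeTick, TickPlan); it only shows the idea that a late client's actions are postponed rather than waited for or dropped, not how the server actually tracks them.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct InputAction {
    std::uint16_t playerId;
    int payload;
};

// What the server decides when it closes a tick.
struct TickPlan {
    std::uint32_t tick;
    std::vector<InputAction> included;    // broadcast to everyone
    std::vector<std::uint16_t> postponed; // clients told "next tick instead"
};

TickPlan closeTick(std::uint32_t tick,
                   std::map<std::uint16_t, std::optional<InputAction>>& arrived) {
    TickPlan plan{tick, {}, {}};
    for (auto& entry : arrived) {
        if (entry.second) {
            plan.included.push_back(*entry.second);
            entry.second.reset();
        } else {
            // Don't stall the whole map update because of one slow client;
            // its actions are not dropped, only rescheduled into a later tick.
            plan.postponed.push_back(entry.first);
        }
    }
    return plan;
}
```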

On their own, these mechanisms are quite simple, but when they act together (which often happens when there are connection problems), the code logic becomes hard to follow, with a lot of edge cases. On top of that, when these mechanisms kick in, the server and the latency queue have to correctly handle a special input action called StopMovementInTheNextTick. Thanks to it, your character won't keep running on its own (for example, under a train) when connection problems occur.

Now I need to explain how entity selection works. One of the input action types is the change of an entity selection: it tells everyone which entity the player is hovering over with the mouse. As you can imagine, it's one of the most frequent input actions clients send, so to save bandwidth we optimized it to take up as little space as possible. It is implemented like this: instead of storing the absolute, high-precision map coordinates of each selected entity, the game stores a low-precision relative offset from the previous selection. This works well because each new selection is usually very close to the previous one. It also creates two important requirements: input actions must never be skipped, and they must be executed in the correct order. These requirements are met for the game state. But since the job of the latency state is only to "look good enough" to the player, they are not met for the latency state: it does not account for many of the edge cases related to skipped ticks and changing round-trip latency.
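
As a toy sketch of this kind of delta encoding, assume a made-up 1/16-tile precision and invented types (MapPosition, SelectionDelta); the real wire format is not described in the post.

```cpp
#include <cstdint>

struct MapPosition {
    double x, y;            // absolute, high-precision map coordinates
};

// Low-precision relative offset, here 1/16 of a tile per unit (assumed scale).
struct SelectionDelta {
    std::int16_t dx, dy;
};

SelectionDelta encode(const MapPosition& previous, const MapPosition& current) {
    return {static_cast<std::int16_t>((current.x - previous.x) * 16.0),
            static_cast<std::int16_t>((current.y - previous.y) * 16.0)};
}

MapPosition decode(const MapPosition& previous, const SelectionDelta& d) {
    // The decoder's idea of "previous" must match the encoder's exactly:
    // skip one delta, or apply them out of order, and every later selection
    // resolves to the wrong position.
    return {previous.x + d.dx / 16.0, previous.y + d.dy / 16.0};
}
```

In this sketch a delta takes 4 bytes instead of the 16 bytes two doubles would need, which is exactly why losing or reordering even one of these actions corrupts every selection that follows.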

You can probably guess where this is going. We are finally getting to the cause of the megapacket problem. The root of it is that the entity selection logic relies on the latency state, and that state does not always contain correct information. So a megapacket is generated like this (a toy illustration follows the list):

  1. The player experiences connection problems.
  2. The skipped-ticks and round-trip latency mechanisms come into play.
  3. The latency queue does not account for these mechanisms. This causes some actions to be removed too early or executed in the wrong order, resulting in an incorrect latency state.
  4. The connection problems end, and the client simulates up to 400 ticks in one go to catch up with the server.
  5. In each of those ticks, a new entity selection change action is generated and queued to be sent to the server.
  6. The client sends the server a megapacket with 400+ entity selection changes (other actions, such as the shooting state and the walking state, suffered from the same problem).
  7. The server receives 400 input actions. Since it is not allowed to skip a single one of them, it tells all clients to execute these actions and sends them over the network.
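
Purely as an illustration of steps 4 through 6, here is a toy catch-up loop (all names invented, nothing taken from the real code) showing how simulating 400 ticks in one go can queue 400 selection-change actions into a single outgoing packet.

```cpp
#include <cstdint>
#include <vector>

// A selection change expressed as a small relative offset (see above).
struct SelectionChangedAction {
    std::uint32_t tick;
    std::int16_t dx, dy;
};

// While catching up `ticksBehind` ticks after a hiccup, the buggy path queued
// one selection-change action per simulated tick, even though the mouse had
// not really moved, because the latency state it compared against was wrong.
std::vector<SelectionChangedAction> catchUp(std::uint32_t firstTick,
                                            std::uint32_t ticksBehind) {
    std::vector<SelectionChangedAction> outgoing;
    for (std::uint32_t i = 0; i < ticksBehind; ++i)
        outgoing.push_back({firstTick + i, 0, 0});
    return outgoing; // with ticksBehind == 400 this is the megapacket payload
}
```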

The irony is that a mechanism designed to conserve bandwidth resulted in huge network packets.

We resolved the issue by fixing the latency queue so that it correctly handles all of these update edge cases. Although it took quite some time, it was worth getting it right rather than relying on quick hacks.

Source: habr.com
