RabbitMQ vs. Kafka: fault tolerance and high availability

In the last article we looked at RabbitMQ clustering for fault tolerance and high availability. Now let's dig deep into Apache Kafka.

Here, the unit of replication is the partition. Each topic has one or more partitions. Each partition has a leader, with or without followers. When creating a topic, you specify the number of partitions and the replication factor. The usual value is 3, which means three replicas: one leader and two followers.
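
For concreteness, here is a minimal sketch of creating such a topic programmatically with the Kafka Java AdminClient. The broker address and topic name are assumptions for illustration; the same thing can be done with the kafka-topics.sh script.

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Hypothetical bootstrap address; replace with your own brokers.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 4 partitions, replication factor 3: one leader and two followers per partition.
                NewTopic topic = new NewTopic("example-topic", 4, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }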

Fig. 1. Four partitions distributed across three brokers

All read and write requests go to the leader. Followers periodically send fetch requests to the leader to receive the latest messages. Consumers never read from followers; followers exist only for redundancy and fault tolerance.

Partition failure

When a broker goes down, it often takes the leaders of several partitions with it. In each of those partitions, a follower on another node becomes the leader. In reality it is not always that simple, because synchronization also matters: are there in-sync followers, and if not, is failover to an out-of-sync replica allowed? But let's not complicate things just yet.

Broker 3 leaves the network, and a new leader is elected for partition 2 on broker 2.

Fig. 2. Broker 3 dies and its follower on broker 2 is elected as the new leader of partition 2

Then broker 1 leaves and partition 1 also loses its leader, whose role passes to broker 2.

Fig. 3. Only one broker left. All leaders are on the same broker with zero redundancy

When broker 1 comes back online, it adds four followers, providing some redundancy to each partition. But all the leaders remain on broker 2.

Fig. 4. Leaders stay on broker 2

When broker 3 comes up, we are back to three replicas per partition. But all leaders are still on broker 2.

Fig. 5. Unbalanced placement of leaders after brokers 1 and 3 are restored

Kafka has better tooling for rebalancing leaders than RabbitMQ. In RabbitMQ you had to use a third-party plugin or script that migrated the master by changing policies, which reduced redundancy during the migration; and for large queues you had to put up with unavailability during synchronization.

Kafka has the concept of a "preferred replica" for the leader role. When a topic's partitions are created, Kafka tries to distribute the leaders evenly across the nodes and marks those first leaders as preferred. Over time, due to server reboots, crashes, and outages, leaders can end up on other nodes, as in the extreme case above.

To fix this, Kafka offers two options:

  • The auto.leader.rebalance.enable=true setting allows the controller node to automatically reassign leaders back to the preferred replicas and thereby restore the even distribution.
  • An administrator can run the kafka-preferred-replica-election.sh script to trigger the reassignment manually (a programmatic sketch follows below).
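
In newer Kafka client versions the same manual trigger is also exposed programmatically through the AdminClient electLeaders call. The sketch below is an illustration under assumptions, not the canonical tool: the topic name and broker address are placeholders, and the exact result accessors may vary between versions.

    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.ElectionType;
    import org.apache.kafka.common.TopicPartition;

    public class PreferredLeaderElection {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Ask the controller to hand leadership back to the preferred replicas
                // of partitions 0 and 1 of a hypothetical topic.
                Set<TopicPartition> partitions = Set.of(
                        new TopicPartition("example-topic", 0),
                        new TopicPartition("example-topic", 1));
                admin.electLeaders(ElectionType.PREFERRED, partitions).partitions().get();
            }
        }
    }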

Fig. 6. Replicas after rebalancing

This was a simplified view of failure; reality is more involved, though not overly complicated. It all comes down to in-sync replicas (ISR).

In-Sync Replicas (ISR)

The ISR is the set of a partition's replicas that are considered "in-sync". It always contains the leader, but it may contain no followers. A follower is considered in-sync if it has made exact copies of all the leader's messages within the replica.lag.time.max.ms interval.

A follower is removed from the ISR set if it:

  • has not made a fetch request within the replica.lag.time.max.ms interval (presumed dead)
  • has not caught up with the leader within the replica.lag.time.max.ms interval (considered slow)

Followers issue fetch requests at the replica.fetch.wait.max.ms interval, which defaults to 500 ms.

To clearly explain the purpose of the ISR, we need to look at producer acknowledgments and some failure scenarios. Producers can choose when the broker sends an acknowledgment:

  • acks=0, no acknowledgment is sent
  • acks=1, acknowledgment is sent after the leader has written the message to its local log
  • acks=all, acknowledgment is sent after all replicas in the ISR have written the message to the local logs

In Kafka terminology, once every replica in the ISR has stored a message, the message is "committed". acks=all is the safest option, but it adds extra latency. Let's look at two failure examples and see how the different acks settings interact with the ISR concept.
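
To make the options concrete, here is a minimal producer sketch showing where acks is set. It is a hedged illustration: the broker addresses and topic name are assumptions, and error handling is reduced to a simple callback.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class AcksExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // "0" = fire and forget, "1" = leader only, "all" = every replica in the ISR.
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                System.err.println("Send failed: " + exception);
                            } else {
                                System.out.println("Committed at offset " + metadata.offset());
                            }
                        });
                producer.flush();
            }
        }
    }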

Acks=1 and ISR

In this example we will see that if the leader does not wait for every message to be stored on all followers, then data loss is possible when the leader fails. Failover to an out-of-sync follower can be enabled or disabled with the unclean.leader.election.enable setting.
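
This is a topic-level (or broker-wide) configuration. As a hedged sketch, it can be changed for a single topic through the AdminClient incrementalAlterConfigs API available in newer client versions; the topic name and broker address below are assumptions.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class UncleanElectionConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "example-topic");
                // Explicitly forbid electing an out-of-sync replica as leader for this topic.
                AlterConfigOp op = new AlterConfigOp(
                        new ConfigEntry("unclean.leader.election.enable", "false"),
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
            }
        }
    }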

Here the producer uses acks=1. The partition is spread across all three brokers. Broker 3 is lagging: it last synced with the leader eight seconds ago and is now 7456 messages behind. Broker 1 is only one second behind. Our producer sends a message and quickly gets an ack back, with no overhead from slow or dead followers, since the leader does not wait for them.

Fig. 7. ISR with three replicas

Broker 2 goes down and the producer gets a connection error. After leadership passes to broker 1, we lose 123 messages. The follower on broker 1 was in the ISR but not fully caught up with the leader when it went down.

Fig. 8. Messages are lost when broker 2 crashes

The producer has several brokers listed in its bootstrap.servers configuration and can ask another broker who the new partition leader is. It then connects to broker 1 and continues sending messages.

Fig. 9. Sending messages resumes after a short break

Broker 3 is even further behind. It makes fetch requests but cannot catch up. This may be due to a slow network link between brokers, a storage issue, and so on. It is removed from the ISR. Now the ISR consists of a single replica - the leader! The producer keeps sending messages and receiving acknowledgments.

Fig. 10. The follower on broker 3 is removed from the ISR

Broker 1 goes down and broker 3 becomes leader, losing 15286 messages! The producer gets a connection error. Failover to a leader outside the ISR was only possible because of unclean.leader.election.enable=true. If it were set to false, the failover would not happen, and all read and write requests would be rejected. In that case we would wait for broker 1 to come back with its intact replica data and take the lead again.

Fig. 11. Broker 1 goes down; many messages are lost in the failover

The producer connects to the last remaining broker and sees that it is now the partition leader. It starts sending messages to broker 3.

Fig. 12. After a short break, messages are sent to partition 0 again

We have seen that, apart from brief interruptions to establish new connections and find a new leader, the producer kept sending messages constantly. This configuration provides availability at the expense of consistency (data safety). Kafka lost thousands of messages but continued to accept new records.

Acks=all and ISR

Let's repeat this scenario, but with acks=all. Broker 3 lags by four seconds on average. The producer sends a message with acks=all and now does not get a quick response. The leader waits for the message to be stored by all replicas in the ISR.

Fig. 13. ISR with three replicas. One is slow, causing write delays

After four seconds of additional delay, Broker 2 sends an ack. All replicas are now fully updated.

Fig. 14. All replicas save the message and the ack is sent

Broker 3 is now further behind and removed from the ISR. Latency is greatly reduced as there are no slow replicas left in the ISR. Broker 2 is now only waiting for Broker 1, which has an average lag of 500ms.

Fig. 15. The replica on broker 3 is removed from the ISR

Broker 2 then goes down, and leadership passes to broker 1 without message loss.

Fig. 16. Broker 2 crashes

The producer finds the new leader and starts sending it messages. Latency drops further, because the ISR now consists of a single replica! So even with acks=all there is no added redundancy.

Fig. 17. The replica on broker 1 takes the lead without losing messages

Then broker 1 crashes and broker 3 becomes leader, losing 14238 messages!

Fig. 18. Broker 1 dies; the unclean leadership transition results in massive data loss

This failover could not have happened unless unclean.leader.election.enable was set to true; it defaults to false. Setting acks=all with unclean.leader.election.enable=true provides availability with some additional data safety. But as you can see, we can still lose messages.

But what if we want more data safety? We can set unclean.leader.election.enable=false, but that will not necessarily protect us from data loss. If the leader crashed hard and took its data with it, those messages are still lost, and on top of that availability is lost until an administrator restores the situation.

It is better to guarantee that every message is redundantly stored, and to refuse the write otherwise. Then, at least from the broker's point of view, data loss is possible only if two or more replicas fail at roughly the same time.

acks=all, min.insync.replicas and ISR

With the min.insync.replicas topic configuration we raise the level of data safety. Let's go through the last part of the previous scenario again, this time with min.insync.replicas=2.
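
min.insync.replicas is set per topic (or as a broker default). Here is a minimal sketch of creating a topic with this setting via the Java AdminClient; the topic name and broker address are placeholder assumptions.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class MinInsyncReplicasTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Replication factor 3, and at least 2 in-sync replicas required
                // before an acks=all write is acknowledged.
                NewTopic topic = new NewTopic("safe-topic", 1, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }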

So broker 2 holds the leader replica, and the follower on broker 3 has been removed from the ISR.

Fig. 19. ISR of two replicas

Broker 2 goes down and leadership passes to broker 1 without message loss. But now the ISR consists of a single replica. That is below the minimum required to accept writes, so the broker responds to write attempts with a NotEnoughReplicas error.

Fig. 20. The ISR count is one less than min.insync.replicas

This configuration sacrifices availability for consistency. Before a message is acknowledged, we guarantee that it is written to at least two replicas. This gives the producer much more confidence: a message can now only be lost if two replicas fail almost simultaneously, within the short window before the message is replicated to an additional follower, which is unlikely. If you are really paranoid, you can set the replication factor to 5 and min.insync.replicas to 3. Then three brokers have to fail at the same time for a record to be lost! Of course, you pay for this reliability with extra latency.
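
On the producer side the rejection described above surfaces as a NotEnoughReplicasException. Below is a hedged sketch of spotting it in the send callback so the application can buffer and retry later; the topic and broker names are assumptions, and note that the exception is retriable, so the client retries internally before the callback ever sees it.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.NotEnoughReplicasException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class NotEnoughReplicasHandling {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("safe-topic", "key", "value"), (metadata, exception) -> {
                    if (exception instanceof NotEnoughReplicasException) {
                        // The ISR fell below min.insync.replicas: the write was refused,
                        // so the caller can park the message and try again later.
                        System.err.println("Not enough in-sync replicas, write rejected");
                    } else if (exception != null) {
                        System.err.println("Send failed: " + exception);
                    }
                });
                producer.flush();
            }
        }
    }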

When availability is necessary for data safety

As with RabbitMQ, sometimes availability is necessary for data safety. Here is what you need to consider:

  • Can the publisher simply return an error and have the upstream service or user try again later?
  • Can the publisher save the message locally or in the database to try again later?

If the answer is no, then optimizing for availability improves data safety: you will lose less data by accepting writes than by refusing them. So it all comes down to finding a balance, and the decision depends on your specific situation.

Meaning of ISR

The ISR mechanism lets you balance data safety against latency. For example, it keeps the partition available even if most replicas fail, while minimizing the latency impact of dead or slow replicas.

Choose the replica.lag.time.max.ms value according to your needs. In essence, this parameter defines how much delay you are willing to tolerate with acks=all. The default is ten seconds. If that is too long, you can reduce it, but then the ISR will change more often, since followers will be removed and re-added more frequently.

A RabbitMQ queue is simply a set of mirrors that all have to replicate every message. Slow mirrors add latency, and a dead mirror can stall responses until the net tick (the keepalive that checks each node's availability) times out. The ISR is an interesting way to avoid these latency problems, but it risks losing redundancy, since the ISR can shrink down to just the leader. To limit that risk, use min.insync.replicas.

Client connectivity guarantees

In the producer's and consumer's bootstrap.servers setting you can list several brokers for clients to connect to. The idea is that if one node goes down, several spare nodes remain through which the client can open a connection. They do not have to be partition leaders; they are just a springboard for bootstrapping. The client can ask them which node currently hosts the leader of the partition it wants to read from or write to.

In RabbitMQ, clients can connect to any node, and internal routing forwards the request to the right place, so you can put a load balancer in front of RabbitMQ. Kafka, by contrast, requires clients to connect to the node hosting the relevant partition leader, so a load balancer is not appropriate there. The bootstrap.servers list is critical so that clients can reach the right nodes and find them again after a failure.
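
As an illustration of the bootstrap list, here is a minimal consumer sketch with several brokers configured: any reachable broker in the list can tell the client where each partition leader lives. The host names, group id, and topic are assumptions.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class BootstrapServersConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Several bootstrap brokers: if one is down, the client bootstraps through another
            // and discovers the current partition leaders from the cluster metadata.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("example-topic"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }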

Kafka consensus architecture

So far we have not looked at how the cluster learns that a broker is down and how a new leader is chosen. To understand how Kafka handles network partitions, you first need to understand its consensus architecture.

Each Kafka cluster is deployed along with a Zookeeper cluster, which is a distributed consensus service that allows the system to reach consensus on some given state, prioritizing consistency over availability. Approval of read and write operations requires the consent of a majority of Zookeeper nodes.

Zookeeper stores the state of the cluster:

  • The list of topics, partitions, configuration, current leader replicas, and preferred replicas.
  • Cluster membership. Each broker pings the Zookeeper cluster; if Zookeeper does not receive a ping within the configured period, it records the broker as unavailable.
  • The selection of the main and standby nodes for the controller role.

The controller node is one of the Kafka brokers that is responsible for electing replica leaders. Zookeeper sends notifications of cluster membership and topic changes to the controller, and the controller must act on these changes.

For example, let's take a new topic with ten partitions and a replication factor of 3. The controller must choose the leader of each partition, trying to optimally distribute the leaders among the brokers.

For each partition, the controller:

  • updates the ISR and leader information in Zookeeper;
  • sends a LeaderAndISRCommand to each broker hosting a replica of the partition, informing the brokers of the ISR and the leader.

When a broker with a leader goes down, Zookeeper sends a notification to the controller, and the controller chooses a new leader. Again, the controller first updates the Zookeeper and then sends a command to each broker notifying them of the leadership change.

Each leader is responsible for maintaining its ISR. The replica.lag.time.max.ms setting determines who belongs to it. Whenever the ISR changes, the leader writes the new information to Zookeeper.

Zookeeper is always informed of any changes so that in the event of a failure, leadership will smoothly transition to a new leader.

Fig. 21. Kafka consensus

Replication Protocol

Understanding the details of replication helps you better understand potential data loss scenarios.

Fetch requests, Log End Offset (LEO) and High Watermark (HW)

We noted that followers periodically send fetch requests to the leader; the default interval is 500 ms. This differs from RabbitMQ, where replication is initiated not by the queue mirror but by the master: the master pushes changes to its mirrors.

The leader and every follower maintain a Log End Offset (LEO) and a High Watermark (HW). The LEO is the offset of the last message in the local replica, and the HW is the offset of the last committed message. Remember that to be committed, a message must be persisted by every replica in the ISR, which means the LEO usually runs slightly ahead of the HW.

When the leader receives a message, it stores it locally. A follower makes a fetch request, passing along its LEO. The leader sends back a batch of messages starting at that LEO, together with the current HW. When the leader learns that all ISR replicas have stored the message at a given offset, it advances the HW. Only the leader can move the HW, and followers learn the current value in the responses to their fetch requests. This means followers may lag the leader both in messages and in their knowledge of the HW. Consumers only receive messages up to the current HW.
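
The bookkeeping is easy to picture with a toy model: the HW is simply the lowest LEO across the ISR. The class below illustrates only that idea and is not Kafka's actual implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Toy model of the LEO / high watermark bookkeeping described above.
    public class HighWaterMarkModel {
        private final Map<Integer, Long> leoByReplica = new HashMap<>(); // replicaId -> Log End Offset
        private long highWaterMark = 0L;

        // The leader learns a follower's LEO from its fetch request; its own LEO
        // advances as it appends messages locally.
        void updateLeo(int replicaId, long leo) {
            leoByReplica.put(replicaId, leo);
            // The committed offset is the lowest LEO among the in-sync replicas:
            // everything below it is stored on every member of the ISR.
            highWaterMark = leoByReplica.values().stream().min(Long::compare).orElse(0L);
        }

        long highWaterMark() {
            return highWaterMark;
        }

        public static void main(String[] args) {
            HighWaterMarkModel isr = new HighWaterMarkModel();
            isr.updateLeo(1, 100); // leader has appended up to offset 100
            isr.updateLeo(2, 97);  // follower 2 has fetched up to offset 97
            isr.updateLeo(3, 95);  // follower 3 has fetched up to offset 95
            // Consumers would only see messages up to offset 95 here.
            System.out.println("HW = " + isr.highWaterMark());
        }
    }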

Note that "persisted" means written to memory, not to disk. For performance, Kafka synchronizes to disk at a specific interval. RabbitMQ also has such an interval, but it will only send an acknowledgment to the publisher after the master and all mirrors have written the message to disk. The developers of Kafka, for performance reasons, decided to send an ack as soon as the message is written to memory. Kafka is betting that redundancy compensates for the risk of short-term storage of acknowledged messages only in memory.

Leader failure

When a leader fails, Zookeeper notifies the controller, which elects a new leader replica. The new leader sets its HW to its own LEO. The followers then learn about the new leader and, depending on the Kafka version, do one of two things:

  1. Truncate their local log to the last known HW and start fetching messages from the new leader after that point.
  2. Ask the new leader for its HW at the moment it was elected, truncate their log to that offset, and then start periodic fetch requests from that offset.

A follower may need to truncate the log for the following reasons:

  • When a leader fails, the first follower in the ISR set registered with Zookeeper wins the election and becomes the leader. All ISR followers, although considered "in-sync", may not have received copies of every message from the former leader; the elected follower may not hold the most up-to-date copy. Kafka guarantees that replicas never diverge, so to avoid divergence each follower must truncate its log to the new leader's HW at the moment of its election. This is another reason why acks=all is so important for consistency.
  • Messages are flushed to disk only periodically. If all cluster nodes fail at the same time, the disks will hold replicas with different offsets. When the brokers come back online, it is quite possible that the newly elected leader is behind its followers, because it flushed to disk earlier than the others.

Rejoining the cluster

When rejoining the cluster, replicas do the same as on leader failure: they check the leader's replica and truncate their log to its HW (at the time of election). By comparison, RabbitMQ treats rejoining nodes as brand new: the rejoining mirror discards any existing state. If automatic synchronization is enabled, the master then has to replicate the entire current contents to the new mirror in a "stop the world" fashion: during this operation the master accepts no reads or writes. This approach causes problems for large queues.

Kafka is a distributed log and generally stores far more messages than a RabbitMQ queue, where data is removed once it has been consumed and active queues should stay relatively small. Kafka, by contrast, is a log with its own retention policy, which can keep data for days or weeks. A blocking, full-synchronization approach is completely unacceptable for a distributed log. Instead, Kafka followers simply truncate their log to the leader's HW (at the time of its election) if their copy is ahead of the leader. In the much more likely case where the follower is behind, it simply starts issuing fetch requests from its current LEO.

New or rejoining followers start outside the ISR and do not participate in commits. They simply run alongside the group, fetching messages as fast as they can until they catch up with the leader and join the ISR. There is no blocking and no need to throw away all the data.

Connectivity Disruption

Kafka has more components than RabbitMQ, so it has a more complex set of behaviors when connectivity is broken in a cluster. But Kafka was originally designed for clusters, so the solutions are very well thought out.

The following are a few scenarios for connectivity failure:

  • Scenario 1. The follower does not see the leader, but still sees the Zookeeper.
  • Scenario 2. The leader does not see any followers, but still sees the Zookeeper.
  • Scenario 3. The follower sees the leader, but does not see the Zookeeper.
  • Scenario 4. The leader sees the followers, but does not see the Zookeeper.
  • Scenario 5: The follower is completely separate from both other Kafka nodes and Zookeeper.
  • Scenario 6: The leader is completely decoupled from both other Kafka nodes and Zookeeper.
  • Scenario 7: The Kafka controller node cannot see another Kafka node.
  • Scenario 8: Kafka controller does not see Zookeeper.

Each scenario has its own behavior.

Scenario 1. The follower does not see the leader, but still sees the Zookeeper

Fig. 22. Scenario 1. ISR of three replicas

The network fault separates broker 3 from brokers 1 and 2, but not from Zookeeper. Broker 3 can no longer send fetch requests. After replica.lag.time.max.ms elapses, it is removed from the ISR and no longer participates in message commits. Once connectivity is restored, it resumes fetch requests and rejoins the ISR when it catches up with the leader. Zookeeper keeps receiving its pings and assumes the broker is alive and well.

Fig. 23. Scenario 1. A broker is removed from the ISR if no fetch request is received from it within the replica.lag.time.max.ms interval

There is no split-brain or node suspension as in RabbitMQ; redundancy is simply reduced.

Scenario 2. The leader does not see any followers, but still sees the Zookeeper

Fig. 24. Scenario 2. Leader and two followers

The network failure separates the leader from its followers, but the broker can still see Zookeeper. As in the first scenario, the ISR shrinks, this time down to just the leader, because all the followers stop sending fetch requests. Again there is no logical split; instead, redundancy is lost for new messages until connectivity is restored. Zookeeper continues to receive pings and believes the broker is alive and well.

Fig. 25. Scenario 2. The ISR shrinks to only the leader

Scenario 3. The follower sees the leader, but does not see the Zookeeper

The follower is cut off from Zookeeper, but not from the broker hosting the leader. It therefore continues to make fetch requests and remains a member of the ISR. Zookeeper no longer receives its pings and registers the broker as down, but since it is only a follower, there are no real consequences after recovery.

Fig. 26. Scenario 3. The follower keeps sending fetch requests to the leader

Scenario 4. The leader sees the followers, but does not see the Zookeeper

Fig. 27. Scenario 4. Leader and two followers

The leader is separated from Zookeeper, but not from brokers with followers.

Fig. 28. Scenario 4. Leader isolated from Zookeeper

After a while, Zookeeper registers the broker as down and notifies the controller, which elects a new leader from among the followers. However, the original leader continues to believe it is the leader and keeps accepting acks=1 writes. Its followers no longer send it fetch requests, so it considers them dead and tries to shrink the ISR down to itself. But without a connection to Zookeeper it cannot do so, and at that point it refuses to accept any further writes.

acks=all messages receive no acknowledgment, because initially the ISR still includes all the replicas and the messages never reach them. When the original leader tries to remove them from the ISR, it cannot, and it stops accepting messages altogether.

Clients soon notice the leader change and start sending records to the new server. Once the network is restored, the original leader sees that it is no longer the leader and truncates its log to the HW the new leader had at the time of its election, to avoid log divergence. It then starts sending fetch requests to the new leader. Any records held by the original leader that were not replicated to the new leader are lost; that is, the messages acknowledged by the original leader during the few seconds when two leaders were active are lost.

Fig. 29. Scenario 4. The leader on broker 1 becomes a follower after the network is restored

Scenario 5: The follower is completely separate from both other Kafka nodes and Zookeeper

The follower is completely isolated from both other Kafka nodes and Zookeeper. It is simply removed from the ISR until the network is restored, and then catches up with the rest.

Fig. 30. Scenario 5. The isolated follower is removed from the ISR

Scenario 6: Leader is completely decoupled from both other Kafka nodes and Zookeeper

Fig. 31. Scenario 6. Leader and two followers

The leader is completely isolated from its followers, the controller, and Zookeeper. For a short period it continues to accept acks=1 writes.

Fig. 32. Scenario 6. The leader is isolated from the other Kafka nodes and Zookeeper

Having received no fetch requests within replica.lag.time.max.ms, it tries to shrink the ISR down to itself, but cannot because it has no connection to Zookeeper, and so it stops accepting writes.

Meanwhile, Zookeeper will mark the isolated broker as dead and the controller will elect a new leader.

Fig. 33. Scenario 6. Two leaders

The original leader may accept writes for a few seconds but then stops receiving any messages. Clients refresh their metadata every 60 seconds; they learn about the leader change and start sending records to the new leader.

Fig. 34. Scenario 6. Producers switch to the new leader

All acknowledged records made by the original leader since the loss of connectivity will be lost. Once the network is restored, the original leader will discover through Zookeeper that it is no longer the leader. It will then truncate its log to the HW of the new leader at the time of the election and start sending requests as a follower.

Fig. 35. Scenario 6. The original leader becomes a follower after network connectivity is restored

In this situation a split-brain can be observed for a short period, but only with acks=1 and min.insync.replicas=1. It ends automatically either when the network is restored and the original leader realizes it is no longer the leader, or when all clients realize the leader has changed and start writing to the new leader - whichever happens first. Either way some messages will be lost, but only with acks=1.

There is another variant of this scenario: just before the network split, the followers fall behind and the leader shrinks the ISR down to itself. Then it is isolated by the loss of connectivity. A new leader is elected, but the original leader keeps accepting writes, even with acks=all, because the ISR contains no one but itself. Those records are lost once the network is restored. The only way to avoid this variant is min.insync.replicas=2.

Scenario 7: Kafka controller node cannot see another Kafka node

In general, once the controller loses its connection to a Kafka node, it can no longer send that node any information about leadership changes. In the worst case this leads to a short-lived split-brain, as in scenario 6. More often, the unreachable broker simply will not be a candidate for leadership if the current leader fails.

Scenario 8: Kafka controller does not see Zookeeper

Zookeeper stops receiving pings from the disconnected controller and selects a new Kafka node as controller. The original controller may keep presenting itself as such, but it no longer receives notifications from Zookeeper, so it has no tasks to perform. Once the network is restored, it realizes that it is no longer the controller and becomes a regular Kafka node.

Takeaways from the scenarios

We see that a follower losing connectivity does not by itself lose messages; it just temporarily reduces redundancy until the network recovers. That reduced redundancy can, of course, turn into data loss if one or more additional nodes are then lost.

If the leader is cut off from Zookeeper, acks=1 messages can be lost: losing contact with Zookeeper causes a brief split-brain with two leaders. acks=all solves this problem.

Setting min.insync.replicas to two or more provides the additional guarantee that such short-lived scenarios do not lead to message loss, as they can in scenario 6.

Message Loss Summary

Here are all the ways you can lose data in Kafka:

  • Any leader failure if messages were acknowledged with acks=1
  • Any unclean transition of leadership, i.e. to a follower outside of the ISR, even with acks=all
  • Isolation of the leader from Zookeeper, if messages were acknowledged with acks=1
  • Complete isolation of a leader that has already shrunk the ISR down to itself. All messages will be lost, even with acks=all. This only holds if min.insync.replicas=1.
  • Simultaneous failure of all nodes hosting the partition. Because messages are acknowledged once they are in memory, some may not yet have been written to disk. After the servers restart, some messages may be missing.

Unclean leadership transitions can be avoided either by disallowing them or by guaranteeing redundancy of at least two. The strongest configuration is acks=all combined with min.insync.replicas greater than 1.

Direct Comparison of RabbitMQ and Kafka Reliability

Both platforms implement primary-secondary replication for reliability and high availability. However, RabbitMQ has an Achilles heel: when rejoining after a failure, nodes discard their data and synchronization is blocking. This double whammy calls into question the longevity of large queues in RabbitMQ - you have to accept either reduced redundancy or long blocking. Reduced redundancy increases the risk of massive data loss, but if the queues are small, short periods of unavailability (a few seconds) for the sake of redundancy can be handled with connection retries.

Kafka has no such problem. It discards data only from the point where the leader and follower diverge; everything they share is preserved. Moreover, replication does not block the system. The leader keeps accepting records while a new follower catches up, so joining or rejoining the cluster becomes a trivial task for operators. Of course there are still issues, such as network throughput during replication: adding several followers at once can run into bandwidth limits.

RabbitMQ is more reliable than Kafka when multiple servers in a cluster fail at the same time. As we said, RabbitMQ sends a confirmation to the publisher only after the message has been written to disk on the master and on all mirrors. But this adds extra latency for two reasons:

  • fsync every few hundred milliseconds
  • A mirror's failure may only be noticed once the net tick (the keepalive that checks each node's availability) times out. If a mirror slows down or goes down, this adds delay.

Kafka is betting that once a message is stored on multiple nodes, it can be acknowledged as soon as it is in memory. Because of this, there is a risk of losing messages of any kind (even with acks=all and min.insync.replicas=2) if those nodes fail simultaneously.

Overall, Kafka shows better software performance and was designed for clustering from the start. If reliability demands it, the number of followers can be increased, to 11 if need be. A replication factor of 5 with min.insync.replicas=3 makes message loss a very rare event. If your infrastructure can support that replication factor and level of redundancy, you can choose this option.

RabbitMQ clustering is good for small queues. But even small queues can quickly grow with heavy traffic. Once the queues get large, you will have to make a hard choice between availability and reliability. RabbitMQ clustering is best suited for less common situations where the benefits of RabbitMQ's flexibility outweigh any disadvantages of its clustering.

One antidote to RabbitMQ's large-queue vulnerability is to split a queue into many smaller ones. If you don't need total ordering across the whole queue, only ordering of related messages (for example, messages from a specific client), or no ordering at all, this is an acceptable option: see my Rebalancer project for splitting queues (it is still at an early stage).

Finally, don't forget about a number of bugs in the clustering and replication mechanisms of both RabbitMQ and Kafka. Over time, systems have become more mature and stable, but no message is ever 100% secure against loss! In addition, large-scale accidents happen in data centers!

If I missed something, made a mistake, or you disagree with any of the theses, feel free to write a comment or contact me.

I am often asked: “Which should I choose, Kafka or RabbitMQ?”, “Which platform is better?”. The truth is that it really depends on your situation, current experience, etc. I hesitate to give my opinion as it would be too simplistic to recommend one platform for all use cases and possible limitations. I wrote this series of articles so that you can form your own opinion.

I want to say that both systems are leaders in this area. Perhaps I'm a little biased, because in my experience with my projects, I tend to value things like guaranteed message ordering and reliability more.

When I look at other technologies that lack this reliability and guaranteed ordering, and then at RabbitMQ and Kafka, I appreciate the incredible value of both of these systems.

Source: habr.com
