Mini-interview with Oleg Anastasyev: fault tolerance in Apache Cassandra

Odnoklassniki is the largest user of Apache Cassandra in the Russian internet (Runet) and one of the largest in the world. We started using Cassandra in 2010 to store photo ratings; today Cassandra manages petabytes of data across thousands of nodes, and we have even developed our own NewSQL transactional database.
On September 12, in our St. Petersburg office, we will hold our second meetup dedicated to Apache Cassandra. The main speaker will be Odnoklassniki's lead engineer Oleg Anastasyev. Oleg is an expert in distributed and fault-tolerant systems; he has been working with Cassandra for over 10 years and has repeatedly spoken at conferences about the operational specifics of this product.

On the eve of the meetup, we talked with Oleg about the fault tolerance of distributed systems built on Cassandra, and asked what he will talk about at the meetup and why the event is worth attending.

Oleg started his career as a programmer back in 1995, developing software for banking, telecom, and transport. Since 2007 he has been working as a lead developer on the platform team at Odnoklassniki. His responsibilities include designing architectures and solutions for high-load systems and large data warehouses, and solving performance and reliability problems across the portal. He also trains developers within the company.

- Oleg, hello! The first meetup dedicated to Apache Cassandra took place in May, and participants say the discussions went on until late at night. What are your impressions of the first meetup?

Developers with different backgrounds came from various companies, bringing their pain points, unexpected solutions, and amazing stories. We managed to hold most of the meetup as a discussion, but there were so many discussions that we only touched on a third of the planned topics. We also paid a lot of attention to how and what we monitor, using our real production services as examples.

It was interesting, and I really enjoyed it.

- According to the announcement, the second meetup will be devoted entirely to fault tolerance. Why did you choose this particular topic?

Cassandra is a typical highly loaded distributed system with a huge amount of functionality beyond directly serving user requests: gossip, failure detection, schema change propagation, cluster expansion and shrinking, anti-entropy repair, backups and restores, and so on. As in any distributed system, the probability of failures grows with the amount of hardware, so operating production Cassandra clusters requires a deep understanding of its internals in order to predict its behavior under failures and operator actions. Over many years of using Cassandra we have accumulated significant expertise, which we are ready to share, and we also want to discuss how colleagues in the industry solve typical problems.

- When it comes to Cassandra, what do you understand by fault tolerance?

First of all, of course, the system's ability to survive typical hardware failures: the loss of machines, disks, or network connectivity to nodes or entire data centers. But the topic is much broader; it also covers recovery from failures, including ones people are rarely prepared for, such as operator errors.
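As a back-of-the-envelope illustration of the hardware-failure tolerance described here (not part of the interview): Cassandra's standard quorum arithmetic determines how many replica losses a keyspace can absorb. A QUORUM read or write must reach a majority of the replicas for a token range, so the tolerable number of down replicas follows directly from the replication factor. A minimal sketch:

```python
# Quorum arithmetic for a Cassandra keyspace with a given replication
# factor (RF). A QUORUM operation must reach floor(RF/2) + 1 replicas,
# so it tolerates RF - quorum down replicas for any given token range.

def quorum_size(replication_factor: int) -> int:
    """Number of replicas a QUORUM operation must reach."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM still succeeds."""
    return replication_factor - quorum_size(replication_factor)

if __name__ == "__main__":
    for rf in (3, 5):
        print(f"RF={rf}: quorum={quorum_size(rf)}, "
              f"survives {tolerated_failures(rf)} down replica(s)")
```

This is why RF=3 with QUORUM, a common production setup, survives one down replica per token range without client-visible impact.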

- Can you give an example of your largest and most heavily loaded data cluster?

One of our largest clusters is the gifts cluster, with over 200 nodes and hundreds of terabytes of data. But it is not the most loaded, because it sits behind a distributed cache. Our busiest clusters handle tens of thousands of write RPS and thousands of read RPS.

- Wow! How often does something break?

Yes, constantly! In total we have more than 6,000 servers, and every week a couple of servers and several dozen disks are replaced (not counting parallel upgrades and expansion of the fleet). For each type of failure there is a clear runbook describing what to do and in what order; everything is automated as much as possible, so failures are routine and in 99% of cases go unnoticed by users.

- How do you deal with these failures?

From the very beginning of operating Cassandra and the first incidents, we have developed backup and recovery mechanisms and built deployment procedures that take the state of Cassandra clusters into account and, for example, do not allow nodes to be restarted if data loss is possible. We plan to talk about all of this at the meetup.

- As you said, absolutely reliable systems do not exist. What types of failures are you prepared for and able to handle?

Speaking of our Cassandra cluster installations: users will not notice anything if we lose several machines in one DC or even an entire DC (this has happened). As the number of DCs grows, we are thinking about how to remain operational even if two DCs fail.

- What do you think Cassandra lacks in terms of fault tolerance?

Cassandra, like many other early NoSQL stores, requires a deep understanding of its internal structure and of the dynamic processes going on inside it. I would say it lacks simplicity, predictability, and observability. But it will be interesting to hear the opinions of other meetup participants!

Oleg, thank you very much for taking the time to answer our questions!

We look forward to seeing everyone who wants to talk with experts in Apache Cassandra operations at the meetup on September 12 in our St. Petersburg office.

Come, it will be interesting!

Register for the event.

Source: habr.com
