Mini-interview with Oleg Anastasiev: fault tolerance in Apache Cassandra





Odnoklassniki is the largest user of Apache Cassandra in the RuNet and one of the largest in the world. We started using Cassandra in 2010 to store photo ratings, and today Cassandra manages petabytes of data on thousands of nodes; we have even developed our own NewSQL transactional database on top of it.

On September 12, we will hold the second meetup dedicated to Apache Cassandra in our St. Petersburg office. The main speaker will be Odnoklassniki's principal engineer Oleg Anastasiev. Oleg is an expert in distributed and fault-tolerant systems, has been working with Cassandra for over 10 years, and has spoken about this product at conferences many times.



On the eve of the meetup, we talked with Oleg about the fault tolerance of distributed systems built on Cassandra, and asked what he will talk about at the meetup and why it is worth attending.



Oleg began his career as a programmer back in 1995. He developed software in banking, telecom, and transport. He has been a lead developer at Odnoklassniki on the platform team since 2007. His responsibilities include designing architectures and solutions for high-load systems and large data stores, and solving performance and reliability problems across the portal. He also trains developers within the company.



- Oleg, hello! In May, the first meetup dedicated to Apache Cassandra took place, and participants say the discussions went on until late at night. Tell us, what are your impressions of that first meeting?



Developers with different backgrounds from various companies came with their pain points, unexpected solutions to problems, and amazing stories. We managed to run most of the meetup in a discussion format, but there was so much debate that we only touched on a third of the topics we had outlined. We paid a lot of attention to how and what we monitor, using our real production services as examples.



It was interesting, and I really enjoyed it.



- Judging by the announcement, the second meetup will be entirely devoted to fault tolerance. Why did you choose this topic?



Cassandra is a typical high-load distributed system with a huge amount of functionality beyond directly serving user requests: gossip, failure detection, propagation of schema changes, expanding and shrinking the cluster, anti-entropy, backups and recovery, and so on. As in any distributed system, the probability of failures grows with the amount of hardware, so operating production Cassandra clusters requires a deep understanding of its internals in order to predict its behavior under failures and operator actions. Over many years of using Cassandra we have accumulated significant expertise, which we are ready to share, and we also want to discuss how our colleagues solve typical problems.



- When it comes to Cassandra, what do you mean by fault tolerance?



First of all, of course, the ability of the system to survive typical hardware failures: loss of machines, disks, or network connectivity to nodes or data centers. But the topic is much broader and in particular includes recovery from failures, including the failures people are rarely prepared for, such as operator errors.



- Can you give an example of the most loaded and largest data cluster?



One of our largest clusters is the gifts cluster: over 200 nodes and hundreds of TB of data. But it is not the most loaded, because it sits behind a distributed cache. Our busiest clusters sustain tens of thousands of write RPS and thousands of read RPS.



- Wow! How often does something break?



Yes, constantly! In total, we have more than 6 thousand servers, and every week a couple of servers and several dozen disks are replaced (not counting parallel upgrades and expansion of the server fleet). For each type of failure there is a clear runbook describing what to do and in what order, and everything is automated as far as possible, so failures are routine and in 99% of cases go unnoticed by users.



- How do you cope with such failures?



From the very beginning of operating Cassandra, and after the first incidents, we developed backup and recovery mechanisms and built deployment procedures that take the state of the Cassandra clusters into account and, for example, prevent nodes from restarting if data loss is possible. We plan to talk about all this at the meetup.



- As you said, absolutely reliable systems do not exist. What types of failures are you preparing for and able to survive?



If we talk about our Cassandra cluster installations, users will not notice anything if we lose several machines in one DC or even a whole DC (this has happened). As the number of DCs grows, we are considering ensuring operability even in the event of a failure of two DCs.
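For context, surviving the loss of a whole DC in Cassandra is typically achieved by replicating each row to several nodes in every data center and using DC-local consistency levels. A minimal sketch (the keyspace name and DC names here are illustrative, not Odnoklassniki's actual configuration):

```sql
-- Hypothetical keyspace: each row is replicated to 3 nodes
-- in each of two data centers, DC1 and DC2.
CREATE KEYSPACE gifts_demo
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 3
  };
```

With such a layout, clients reading and writing at consistency level LOCAL_QUORUM keep working from the surviving data center if the other one goes down entirely, which matches the behavior described above.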



- What do you think Cassandra lacks in terms of fault tolerance?



Cassandra, like many other early NoSQL stores, requires a deep understanding of its internal structure and the dynamic processes running within it. I would say it lacks simplicity, predictability, and observability. But it will be interesting to hear the opinions of the other meetup participants!



Oleg, thank you very much for taking the time to answer our questions!



We look forward to seeing everyone who wants to talk with experts in operating Apache Cassandra at the September 12 meetup in our St. Petersburg office.



Come, it will be interesting!



Register for the event.


