Apache Cassandra is one of the most popular open-source distributed NoSQL database management systems, and “Cassandra: The Definitive Guide – Distributed Data at Web Scale” is one of the best introductions to many aspects of this powerful distributed database.
The book gives a thorough description of the fundamental concepts of Apache Cassandra, starting with its history and what differentiates a distributed, masterless NoSQL system such as Cassandra from traditional RDBMSs such as Oracle, MS SQL Server, PostgreSQL and MySQL. The authors don’t shy away from explaining what the famous CAP theorem says about distributed systems, and which trade-offs and decisions underlie the Cassandra architecture, leading to high availability, partition tolerance, write performance and eventual consistency.
The chapters on CQL (Cassandra Query Language) and Data Modelling are particularly important for big data architects, as well as data engineers: without fully grasping the fine points and pitfalls of data modelling in Cassandra, you are very likely to fall back on the patterns you learned in the RDBMS world. And without a correct data model as a starting point, it is pointless to discuss other issues you might encounter later, such as performance or complexity. These two chapters teach the reader how to do data modelling correctly, using a formal methodology that employs Chebotko diagrams (see “A Big Data Modeling Methodology for Apache Cassandra” and “World’s Best Data Modeling Tool for Apache Cassandra” for more details).
Once you are well-versed in Big Data Modelling for Cassandra, the book lays out the architecture of Cassandra and introduces you to the main concepts, components and processes that make it up, such as:
- gossip protocol,
- failure detection,
- rings, tokens, virtual nodes, partitioners,
- replication strategies,
- consistency levels,
- commit log,
- memtable & SSTables,
- hinted hand-off,
- lightweight transactions,
- Paxos for consensus,
- tombstones & compaction,
- Bloom Filters,
- repair mechanisms and Merkle trees.
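To make a few of these terms concrete, here is a minimal Python sketch of a token ring with virtual nodes, clockwise replica placement and a quorum calculation. It is a toy under stated assumptions, not Cassandra's code: the `TokenRing` class, the node names and the use of MD5 for tokens are all made up for illustration (Cassandra's default partitioner uses Murmur3), and the placement only loosely resembles a simple, single-data-center replication strategy.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a 64-bit token. MD5 is a stand-in chosen for this sketch."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    """Hypothetical toy ring: every node owns several tokens (virtual nodes)."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = sorted(
            (token_for(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [token for token, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key, replication_factor=3):
        """Walk clockwise from the key's token and collect distinct nodes."""
        wanted = min(replication_factor, self._node_count)
        # First ring position whose token is >= the key's token (wraps around).
        start = bisect.bisect_left(self._tokens, token_for(key))
        found = []
        for offset in range(len(self._ring)):
            _, node = self._ring[(start + offset) % len(self._ring)]
            if node not in found:
                found.append(node)
                if len(found) == wanted:
                    break
        return found


def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", replication_factor=3)
```

With a replication factor of 3, `quorum(3)` is 2, so a quorum write and a quorum read always share at least one replica (2 + 2 > 3). That overlap is what lets you trade some latency for stronger consistency through the consistency level.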
After that, you learn how to configure Cassandra based on your data-center considerations and the various configuration options. This chapter covers the basic options, but you’ll probably need more than that in a real-life setting.
The chapter on clients, drivers and basic programming against Cassandra is brief and not very detailed. Nevertheless, the code examples provide a fine starting point.
The book dedicates almost 30 pages to describing the Read and Write Paths of Cassandra, and it is a delight to read for those curious readers interested in the internals of a complex data storage engine. Seeing the step-by-step journey of a read and a write query, and understanding the phases each goes through, helps fill in the gaps in your understanding of how Cassandra works. It is also complementary to your data modelling skills, answering some of the “why” questions: by knowing how the read and write paths work, you realize the reasoning behind the data modelling recommendations.
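If you want a feel for that journey before reading the chapter, the following is a deliberately simplified, single-node Python sketch of a write going through a commit log and memtable, being flushed to an immutable SSTable, and a read merging all of them by timestamp. `ToyStorageEngine` and its parameters are hypothetical names invented for this illustration; it leaves out Bloom filters, caches, compaction, commit log recycling and everything to do with replication.

```python
import itertools


class ToyStorageEngine:
    """Hypothetical toy model: commit log -> memtable -> immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record, used for crash recovery
        self.memtable = {}     # in-memory writes: key -> (timestamp, value)
        self.sstables = []     # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit
        self._clock = itertools.count(1)  # stand-in for write timestamps

    def write(self, key, value):
        timestamp = next(self._clock)
        self.commit_log.append((timestamp, key, value))  # 1. durability first
        self.memtable[key] = (timestamp, value)          # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:    # 3. flush when full
            self.flush()

    def delete(self, key):
        # A delete is just another write: a tombstone marker (None here).
        self.write(key, None)

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Gather every version of the key, then let the newest timestamp win.
        candidates = [table[key] for table in self.sstables if key in table]
        if key in self.memtable:
            candidates.append(self.memtable[key])
        if not candidates:
            return None
        _, value = max(candidates, key=lambda versioned: versioned[0])
        return value  # None means "never written" or "covered by a tombstone"


engine = ToyStorageEngine(memtable_limit=2)
engine.write("a", 1)
engine.write("b", 2)   # memtable is full, so it is flushed to an SSTable
engine.write("a", 3)   # newer version of "a" lives in the memtable
engine.delete("b")     # tombstone for "b"; triggers a second flush
```

Even this toy shows why reads can be more expensive than writes: a write appends and returns, while a read may have to consult several SSTables and reconcile versions, including tombstones that have not yet been compacted away.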
Among the remaining chapters, “Monitoring”, “Maintenance”, “Performance Tuning”, and “Security” contain adequate information as an introduction, though you will still need to watch out for pitfalls, e.g. “hidden” tombstones caused by writing multi-value data types (sets, lists and maps). After all, the devil is in the details!
The final chapter, “Deploying and Integrating”, is a little lightweight: you’ll definitely need more information than the book provides, so you should consider this chapter only a small starting point, and nothing more.
A very nice point is that the authors also provide links to the relevant Cassandra JIRA issues when they describe the fine details of a feature or problem. This is very much aligned with the open source nature of Cassandra, being an Apache Software Foundation project. It also lets the curious reader learn many more details first-hand. The authors also provide extra explanation of, and pointers to, interesting aspects of Cassandra such as the “ϕ Accrual Failure Detector” and the “Paxos protocol”, and why and how they are used in Cassandra. After all, we are talking about a distributed, masterless database system that’s known to scale to 75,000 nodes (e.g. in Apple’s case), and these fundamental algorithms play an important role.
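As a taste of what a ϕ accrual failure detector does, here is a small Python sketch. It assumes heartbeat gaps are exponentially distributed, which is a common simplification and, as far as I know, close to the approximation Cassandra itself uses, whereas the original paper models the gaps differently. The `PhiAccrualDetector` class and its method names are invented for this example; the default threshold of 8 mirrors Cassandra's `phi_convict_threshold` setting.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Hypothetical sketch of a phi accrual detector for a single peer."""

    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (in seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(silence lasts at least `elapsed`) = exp(-elapsed / mean)
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now, threshold=8.0):
        return self.phi(now) > threshold


detector = PhiAccrualDetector()
for second in range(11):          # one heartbeat per second, t = 0 .. 10
    detector.heartbeat(float(second))
```

The point of the accrual approach is that the detector outputs a continuous suspicion value instead of a yes/no answer, so the threshold can be tuned to the network: a peer that normally gossips every second is only convicted after a silence many times longer than its usual gap.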
In the interesting discussion about Cassandra’s SEDA (Staged Event-Driven Architecture), the authors note that some shortcomings have been discovered in recent years. The curious reader can consult the Cassandra JIRA issue tracker, particularly the following issues: “Move away from SEDA to TPC”, “Move away from SEDA to TPC, stage 1”, and “Make read and write requests paths fully non-blocking, eliminate related stages”.
Being an open source project, Apache Cassandra is a moving target, accepting contributions from software engineers worldwide, but the fundamental concepts, techniques and hardware organization don’t change every day. Therefore, as TM Data ICT Solutions, we consider this book a good reference and guide for big data engineers and architects.