Review of Database Internals: A Deep Dive Into How Distributed Data Systems Work

The most recent addition to our company library, “Database Internals: A Deep Dive Into How Distributed Data Systems Work” by Alex Petrov, belongs to a very special category of O’Reilly books such as “Designing Data-Intensive Applications” and “Cassandra: The Definitive Guide“, in the sense that it is a serious deep dive into the most fundamental and challenging aspects of big and distributed data systems that we rely on daily basis.

Today there’s an unprecedented proliferation of distributed database technologies, combined with an evergrowing multitude of cloud computing services offering them, as well as rapid advances in physical storage systems such as NVMe SSDs that force considering different trade-offs and algorithms. A working software developer, system engineer, solution architect, or a CTO can easily be overwhelmed with so many distributed NoSQL, newSQL, time-series, graph, document, key-value, embedded databases in addition to typical, traditional, enterprise RDBMS variants. Luckily, all of these fancy distributed database technologies are built on a limited number of concepts, techniques, and algorithms which are concisely introduced and surveyed in “Database Internals” book.

Database Internals
Database Internals

Thanks to herculean task undertaken by the author, even if you’re not a PhD-level researcher specialized in distributed big data systems, or a direct contributor to one of the popular projects in this domain, you will have a good overview of all the concepts, algorithms, techniques, as well as challenges related to distributed database technologies. This, in turn, will be useful when you need to make decisions which data technologies to choose among myriad alternatives, or identify potential issues when operating and debugging such systems.

Some aspects of the book might be a little surprising, because unlike other O’Reilly books that are more practitioner oriented for a single technology or a single programming language, this book is for experienced engineers and senior technical managers, as well as architects: You won’t find step by step instructions, source code, pseudo-code or very detailed algorithm expositions, rather you will see that algorithms are detailed in a very high level overview. That is of course understandable, because otherwise such a book could easily become a few 1000 pages! But as the author starts with the most fundamental aspects of database organization, file formats, physical storage, indexing, querying, and transaction processing you will always find references to only the most important and directly relevant academic papers that defined the fundamental terms and algorithms. Therefore, in this regard “Database Internals” is similar to the famous Red Book (Readings in Database Systems, 5th Edition.)

On the other hand, it would be a big mistake to think that “Database Internals” is only a survey of academic literature; on the contrary you will find references to very practical, and concrete implementations. For example on page 63 there’s a link to a part of the famous SQLite database source code (balance_deeper function). Thanks to that, you can look under the hood of a database that you’re already using (you’re using a web browser to read this page, most probably Chrome or Firefox, both of which use SQLite as a database). Another example, among many, is the link to the breadcrumb stack implementation in PostgreSQL (BTStack). Of course, adding links to discussions around some of these implementations (e.g. on Jira or mail list) would even make them better by providing more context, similar to what way it’s been done in Cassandra book that had links to Jira that helped discover discussion, context, proposal, etc. Another problem: the links aren’t direct references, but redirections from the book’s website, but what if the book site at databass.dev goes down?

Having said that, it is also important to note that the flow of the the text, especially in the first half of the book makes it a bit difficult to follow, in the sense that it’s not easy at all to see how one section title connects to the other: is it in logical order, based on the steps of an algorithm etc.? This exposition is not clear enough. For example, in Chapter 4, “Implementing B-Trees”, should the subsections be read as “this is how you implement a B-Tree in a step-by-step manner, overall description” or as “these are the important bits and pieces, concepts and things to be careful in general when implementing B-Trees”? For example, subsection “Binary Search” follows “Overflow Pages“; what does that mean in terms of the logical flow of the text and algorithm implementation? This, unfortunately, not clear at all, hinting at the lack of enough editing for an otherwise excellent book.

There’s an interesting discussion on Right-Only Appends and how auto-incremented monotonically increasing values as primary index keys is used as an optimization opportunity on page 71. Unfortunately, there are minor problems, distracting the reader of the book, too: for example the in the Index “vacuum” points at page 76, but Vacuum concept starts only in the subsection title at page 74. The dedicated reader should always keep an eye on the Errata for Database Internals. Moreover, some paragraphs don’t flow very naturally, maybe judicious use of bullet point style lists would make it more clear. Also, some paragraphs are artificially broken, jumping from one concept to another without enough level detail.

Similar to previous examples, it’s good to see that the author doesn’t refrain from elaborating on outstanding and relatively new database systems such as LMDB, e.g. when he discusses B-tree Variants in Chapter 6. In the same chapter, WireTiger (MongoDB) is also an interesting example, under the category of Lazy B-trees. Another striking example is FD-Tree (Flash Disk Tree), exemplifying how you have to make change to your algorithm because the underlying storage hardware is completely different: SSDs and Flash Disks are very different from rotating hard disks, forcing new trade-offs and algorithms designs that take into account those differences.

When discussing Bw-Tree, another new variant, it is surprising because there’s no mention of Microsoft CosmosDB, one of the most famous examples that use Bw-Trees (announced in 2018: https://dbdb.io/db/cosmos-dbhttps://www.microsoft.com/en-us/research/publication/the-bw-tree-a-b-tree-for-new-hardware/). Following these, the exposition of LSM (log-structured merge) trees is very good as expected, the author being an experienced Apache Cassandra contributor, and Apache Cassandra being a heavy user of this data structure (for even more detailed discussions see Cassandra Book). There’s also a very important discussion regarding the challenges of Read & Write Amplification, as well as the dispute on whether B-Trees or LSM Trees have lower write amplification, which, in turn, ties very nicely to the famous RUM Conjecture.

Similar to previous discussions about how advances in physical storage systems drive innovations in algorithm design, the author introduces on page 161 the recent innovations in Open-Channel SSDs (Chapter 7: Log-Structured Storage, in LLAMA and Mindful Stacking Section). The curious reader will also enjoy: “Open-Channel SSD (What is it Good For)“. Once again, it is very easy to agree with the author that it’s good to be aware of the theory and history because there’s nothing entirely new, and progress is almost always incremental.

After having laid the groundwork for the fundamental aspects of modern data storage and retrieval, the second half of the book, dedicated to the primary aspects of modern distributed systems, starts with Chapter 8. It’s a good and concise refresher, particularly with respect to distributed database challenges, and striking concrete examples such as bugs from Apache Cassandra; how making some assumptions that turn out to be “not always true” can hurt (page 178):

“It is better to think about the possible problems in advance, even if a complete solution is costly to implement. By understanding and handling these cases, you can embed safeguards or change the design in a way that makes the solution more natural.”

Following these, there’s a good summary of important techniques and concepts such as Circuit Breaker, Backoff (but exponential version isn’t mentioned unfortunately), Jitter, etc .As the author states, being aware of various failure modes of distributed systems can indeed help engineers to identify and find root causes of malfunctioning systems. In other words, being able to ask the right question is a huge time saver! There’s also a brief but valuable rehashing of the “infamousCAP theorem which is followed by PACELC conjecture, and how the terms such as Consistency and Availability in CAP differ from the ones in ACID (pages 217 and 218).

The introduction of the Witness Replicas (page 237), a technique used in Cassandra and Google Spanner, is enriched by the link to relevant Cassandra Jira ticket, which, in turn, links to Voting with Witnesses: A Consistency Scheme for Replicated Files an paper that hails from 1986! Having said that, it would be even better to have more material on Strong Eventual Consistency, and how CRDTs (Conflict-free Replicated Data Types) actually help systems with strong eventual consistency, and of course see a discussion regarding trade-offs, e.g. when CRDTs wouldn’t be suitable, for which use cases, etc.

The definition of entropy in “Chapter 12. Anti-Entropy and Dissemination” is a little fuzzy. In a sense, it isn’t difficult to see why the author preferred to keep the explanation/definition at that level, but still… In relation to that, definition of anti-entropy is also a little weird: is it a technique, or simply a concept, a component, an algorithm, or family/class of algorithms? Also, on page 244, in Figure 12.1, we see “gossip” which is only mentioned and explained many pages later, on page 250. This is where a good editor should have stepped in! The sub-sections about Gossip protocols (pages 250-255) provide probably the most interesting aspects in terms of algorithmic thinking and probabilistic computation in the context of distributed systems.

The last pages of the book, “Chapter 14: Consensus” is a tour de force of one of the the most challenging, complex, and confusing family of algorithms and techniques in the realm of distributed databases, dubbed as quantum theory by one of the famous programmers:

More than 30 pages are dedicated to various variants of the famous Paxos family of consensus algorithms, in addition to relatively newer and simpler ones such as Raft and the failure scenarios they handle, including but not limited to Byzantine failures.

As closing remarks, it’s pretty safe to say that if you’re a technology practitioner working on or with distributed big data systems, or if you are a solution architect designing systems that rely on distributed databases, on-premise or offered by various cloud computing providers such as AWS, Azure or GCP, “Database Internals” book is a very good resource to guide your architectural and system design decisions.

search previous next tag category expand menu location phone mail time cart zoom edit close