Designing Data-Intensive Applications is almost like an encyclopedia of modern data engineering. Like a specialized encyclopedia, it covers a broad field in considerable detail. But it is not a practical guide or a cookbook for a particular Big Data, NoSQL or NewSQL product. Instead, the author lays out the principles behind current distributed big data systems, and he does a very fine job of it.
If you are after the obscure details of a particular product, or some tutorials and “how-to”s, go elsewhere. But if you want to understand the main principles, issues, and challenges of data-intensive and distributed systems, you’ve come to the right place.
Martin Kleppmann starts by giving the reader a solid conceptual framework in the first chapter: what does reliability mean? How is it defined? What is the difference between “fault” and “failure”? How do you describe load on a data-intensive system? How do you talk about performance and scalability in a meaningful way? What does it mean to have a “maintainable” system?
The second chapter gives a brief overview of different data models and shows their suitability for different use cases, using modern challenges that companies such as Twitter faced. This chapter is a solid foundation for understanding the differences between the relational, document, and graph data models, as well as the languages used for processing data stored using these models.
The third chapter goes into a lot of detail regarding the building blocks of different types of database systems. It describes the data structures and algorithms behind the systems shown in the previous chapter: you get to know hash indexes, SSTables (Sorted String Tables), Log-Structured Merge trees (LSM-trees), B-trees, and other data structures. The chapter then introduces column-oriented databases, and the underlying principles and structures behind them.
Next, the book describes the methods of data encoding, starting from the venerable XML and JSON, and going into the details of formats such as Avro, Thrift and Protocol Buffers, showing the trade-offs between these choices.
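To give a flavor of the simplest of these structures, here is a minimal sketch of a hash index over an append-only log. It is only an illustration, not code from the book: the class name `HashIndexedLog`, the comma-separated record format, and the use of an in-memory buffer instead of a file on disk are all assumptions made for brevity.

```python
import io


class HashIndexedLog:
    """Append-only log plus an in-memory hash index of byte offsets.

    Hypothetical illustration: keys must not contain commas or newlines,
    and values must not contain newlines.
    """

    def __init__(self):
        self.log = io.BytesIO()   # stands in for an append-only file on disk
        self.index = {}           # key -> byte offset of the latest record

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        offset = self.log.seek(0, io.SEEK_END)  # always append at the end
        self.log.write(record)
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)     # one seek, then read a single record
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value


store = HashIndexedLog()
store.put("a", "1")
store.put("b", "2")
store.put("a", "3")  # overwrites by appending; the index now points here
```

Writes never modify earlier records, so the log keeps growing; a real system would need some form of compaction to reclaim the space taken by overwritten values.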
Following the building blocks and foundations comes “Part II”, and this is where things start to get really interesting, because now the reader starts to learn about the challenging topic of distributed systems: how to use the basic building blocks in a setting where anything can go wrong in the most unexpected ways. Part II is the most complex part of the book: you learn about how to replicate your data, what happens when replication lags behind, how you provide a consistent picture to the end-user or the end-programmer, what algorithms are used for leader election in consensus systems, and how leaderless replication works.
One of the primary purposes of using a distributed system is to have an advantage over a single, central system, and that advantage is to provide better service, meaning a more resilient service with an acceptable level of responsiveness. This means you need to distribute the load and your data, and there are a lot of schemes for partitioning your data. Chapter 6 of Part II provides a lot of detail on partitioning, keys, indexes, secondary indexes and how to handle data queries when your data is partitioned using various methods.
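As a rough illustration of one such scheme, the sketch below assigns keys to partitions by hashing them. The function names `partition_for` and `partition_keys`, the choice of MD5, and the plain modulo step are assumptions made for this example, not the book's prescription.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number via a stable hash (hypothetical helper)."""
    # Python's built-in hash() is randomized per process for strings,
    # so a fixed hash function is used to keep the mapping stable.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_keys(keys, num_partitions):
    """Group keys by the partition they hash to."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions


users = [f"user-{i}" for i in range(1000)]
layout = partition_keys(users, 4)
```

Hashing spreads keys fairly evenly but gives up efficient range queries, and with a plain modulo most keys change partition whenever the number of partitions changes, which is one reason practical systems tend to use more elaborate rebalancing schemes.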
No data systems book can be complete without touching on the topic of transactions, and this book is no exception. You learn about the fuzziness surrounding the definition of ACID, isolation levels, and serializability.
The remaining two chapters of Part II, Chapters 8 and 9, are probably the most interesting part of the book. You are now ready to learn the gory details of how to deal with network faults and other kinds of faults to keep your data system in a usable and consistent state. The topics include the problems with the CAP theorem, why version vectors are not the same thing as vector clocks, Byzantine faults, how to have a sense of causality and ordering in a distributed system, why algorithms such as Paxos, Raft, and ZAB (used in ZooKeeper) exist, distributed transactions, and many more.
The rest of the book, that is Part III, is dedicated to batch and stream processing. The author describes the famous MapReduce batch processing model in detail, and briefly touches upon modern frameworks for distributed data processing such as Apache Spark. The final chapter discusses event streams and messaging systems and the challenges that arise when trying to process this “data in motion”. You might not be in the business of building the next generation streaming system, but you’ll definitely need to have a handle on these topics because you’ll encounter the described issues in the practical stream processing systems that you deal with daily as a data engineer.
As I said in the opening of this review, consider this a mini-encyclopedia for the modern data engineer, and don’t be surprised if you see more than 100 references at the end of some chapters; if the author had tried to include most of them in the text itself, the book would run well beyond 2,000 pages!
We recommend this book to senior data engineers, especially the ones working on distributed big data systems and dealing with NoSQL and NewSQL databases, document stores, column-oriented data stores, and streaming and messaging systems.