Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Aleksandr Shitik

I write my own posts and books, and review movies and books. Expert in cosmology and astronomy, IT, productivity, and planning.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Martin Kleppmann

Genres: Programming

Year of publication: 2018

Year of reading: 2020

My rating: Highest

Number of reads: 1

Total pages: 640

Summary (pages): 28

Original language of publication: English

Translations to other languages: Russian, Spanish, Chinese

General Description

The book consists of 12 chapters divided into 3 parts. There are almost no images, and it is challenging and slow to read: the pages are solid text set in a fairly small font. Each chapter ends with a brief summary spanning a few pages, followed by an extensive bibliography.

Summary

The first chapter explains terms like maintainability, scalability, and reliability. The author discusses what they are and why they matter. By the way, I remember from college and university that these are far from the only characteristics software should have.

The second chapter describes data models: the classic relational model, the document-oriented model, and graph models. Nothing too concrete yet, just features, advantages, and drawbacks. Query languages are also covered here (besides SQL, the author even gives CSS as an example of a declarative language).

Chapter three. Here’s where things get a bit more complex and less familiar for the day-to-day work of the average developer: B-trees, SSTables, LSM-trees, hash indexes. It's not overly complicated, but you don’t work with these directly every day, so the information quickly slips your mind. At best, out of 50 pages, only a couple of thoughts stick. What is mentioned here that I encounter almost daily, though, is working with indexes and database storage subsystems (using InnoDB and MyISAM as examples).
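To make the hash index idea easier to remember, here is a minimal sketch of how I understand it: an append-only log plus an in-memory map from each key to the byte offset of its latest record. This is my own illustration, not code from the book. The class name `HashIndexedLog`, the comma-separated record format, and the use of an in-memory buffer instead of a real file are all assumptions made to keep the example short.

```python
import io


class HashIndexedLog:
    """Append-only log with an in-memory hash index (key -> byte offset).

    Hypothetical sketch: keys must not contain commas and values must not
    contain newlines, because records are stored as "key,value\n".
    """

    def __init__(self):
        self.log = io.BytesIO()  # stands in for an append-only file on disk
        self.index = {}          # key -> offset of the latest record for that key

    def set(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode("utf-8")
        offset = self.log.seek(0, io.SEEK_END)  # always write at the end
        self.log.write(record)
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)  # one seek, then read a single record
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value


db = HashIndexedLog()
db.set("user:1", "Alice")
db.set("user:1", "Alicia")  # the old record stays in the log, the index moves on
print(db.get("user:1"))
```

Writes never modify existing data, so stale records pile up in the log. A real storage engine would need compaction to reclaim that space, which this sketch leaves out.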

The fourth chapter focuses on the formats used to encode data for transmission. Credit where it's due: the author doesn't stop at JSON and XML but also covers Thrift, Protocol Buffers, and Avro. The chapter also describes approaches to data transmission such as REST and RPC. This chapter concludes the first part, on the fundamentals of information systems; the next part is about distributed systems.

The next chapter, which opens the second part, is about replication. In my opinion, the information here is quite good. The chapter covers the main replication topologies, leader and follower nodes, leaderless replication, consistency, replication lag, and quorums.
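The quorum idea from this chapter fits in a few lines. With n replicas, a write acknowledged by w nodes, and a read that queries r nodes, the read set and the write set are guaranteed to overlap when w + r > n, so at least one response carries the latest write. The sketch below is my own illustration; the names `Versioned`, `is_quorum`, and `read_latest` are hypothetical, and versioning by a plain integer is a simplifying assumption.

```python
from typing import Iterable, NamedTuple


class Versioned(NamedTuple):
    version: int  # simplifying assumption: a single increasing version number
    value: str


def is_quorum(n: int, w: int, r: int) -> bool:
    """True when every read set of size r overlaps every write set of size w."""
    return w + r > n


def read_latest(responses: Iterable[Versioned]) -> Versioned:
    """Pick the newest value among the replica responses."""
    return max(responses, key=lambda item: item.version)


# Three replicas, writes wait for 2 acknowledgements, reads query 2 replicas.
print(is_quorum(n=3, w=2, r=2))

# One of the two queried replicas is stale; the newer version wins.
responses = [Versioned(1, "old"), Versioned(2, "new")]
print(read_latest(responses).value)
```

With n = 3 and w = r = 1, the condition fails, and a read may hit only a replica that missed the write.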

The next chapter is about partitioning, and it is also well covered. The main topics are the difference between partitioning and replication, range-based partitioning, hash partitioning, query routing, and load balancing. Partitioning is described in general terms, not tied to any specific technology or database, so the ideas carry over between systems and only the terminology differs: in MongoDB, Elasticsearch, and SolrCloud a partition is called a "shard"; in HBase, a "region"; in Bigtable, a "tablet"; in Cassandra and Riak, a "vnode" (virtual node).
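The two partitioning schemes are easy to contrast in code. The sketch below is my own illustration, not taken from the book: `range_partition` assigns a key by comparing it against sorted boundary keys, while `hash_partition` hashes the key first so that adjacent keys scatter across partitions. Both function names and the boundary values are hypothetical, and the plain modulo is used only for brevity; it makes rebalancing painful when the number of partitions changes.

```python
import bisect
import hashlib


def range_partition(key: str, boundaries: list[str]) -> int:
    """Return the partition index for key, given sorted boundary keys.

    With boundaries ["h", "p"]: keys below "h" go to 0, keys from "h" up to
    (but excluding) "p" go to 1, everything else goes to 2.
    """
    return bisect.bisect_right(boundaries, key)


def hash_partition(key: str, num_partitions: int) -> int:
    """Return the partition index for key based on a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


boundaries = ["h", "p"]
for key in ["apple", "kiwi", "zebra"]:
    print(key, range_partition(key, boundaries), hash_partition(key, 3))
```

Range partitioning keeps neighbouring keys together, which helps range scans but can create hot spots. Hashing spreads the load more evenly at the cost of efficient range queries.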

Next is a chapter on transactions. It’s also well-explained. Most of it revolves around ACID, and there’s also information about error handling and transaction aborts.
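Atomicity and transaction aborts are easiest to see on a money transfer. Here is a minimal sketch using Python's built-in `sqlite3` module; it is my own example, not the book's. The `accounts` table, the helper names `open_accounts`, `balances`, and `transfer`, and the starting balances are all made up for illustration.

```python
import sqlite3


def open_accounts() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
    )
    conn.commit()
    return conn


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move money between accounts: either both updates apply or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # aborts the transaction
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


conn = open_accounts()
transfer(conn, "alice", "bob", 30)
try:
    transfer(conn, "alice", "bob", 500)  # fails and is rolled back
except ValueError:
    pass
print(balances(conn))
```

The failed transfer has already debited the source account by the time the check fails, yet the rollback means no partial result is ever visible afterwards.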

The final two chapters of this part focus on what can go wrong in distributed systems (for example, unreliable networks and unsynchronized clocks across machines, the "unreliable clocks" problem), as well as on consistency and consensus (including linearizability) and distributed transactions.

The last part has three chapters. Setting aside the overly theoretical final one, about the future of information systems, the two worth highlighting are about batch and stream processing (that's literally what they're called). They cover Hadoop, MapReduce, and message brokers. However, all this information is presented somewhat superficially.

My Opinion

This is a theoretical programming book that delves deeply into database types, transactions, replication, partitioning, high-load and availability challenges in distributed systems, and stream and batch data processing. Most importantly, it shows which tools and technologies exist to solve the problems in each of these areas. Due to its length, reading it feels tedious and slow. But overall, if learning programming is like assembling a puzzle in your head, where each piece is a technology, an approach, or just a piece of advice, then this book undoubtedly reveals many such pieces. So even though the book was long and dry, and my summary of it turned out extremely lengthy, I still gave it a fairly high rating.