This was an okay-ish book, and despite the title it has little to do with designing any particular application. Instead, it gives a very detailed tour of how distributed systems work and the fault-tolerance techniques behind them.
For each chapter, I'll list the points I found important.
Chapter 1: Reliable, Scalable, and Maintainable Applications
Reliability: Systems work correctly even when faults occur
Scalability: Performance remains acceptable even when load increases
Maintainability: Easy to operate, easy to understand, easy to evolve
Chapter 2: Data Models and Query Languages
Relational database vs Document database vs Graph database.
Document databases and graph databases typically don't enforce a schema on write (schema-on-read), while relational databases enforce a schema on write.
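To make the schema distinction concrete, here's a minimal Python sketch, using sqlite3 to stand in for a relational database and plain JSON documents for a document store (the table and field names are made up):

```python
import sqlite3
import json

# Relational: schema enforced on write (schema-on-write).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
rejected = False
try:
    conn.execute("INSERT INTO users VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    rejected = True

# Document store (simulated with JSON blobs): no schema check on write,
# so documents with different shapes coexist; the reader must cope.
docs = [json.dumps({"id": 1, "name": "Ada"}),
        json.dumps({"id": 2, "nickname": "Bob"})]  # different fields, accepted
names = [json.loads(d).get("name", "<unknown>") for d in docs]
print(rejected, names)
```

The relational insert is rejected at write time, while the mismatched document is only noticed when the application reads it.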
Chapter 3: Storage and Retrieval
Storage engines are broadly optimized for either OLTP or data warehousing (OLAP)
OLTP systems are typically user-facing, handle huge numbers of requests, and change data frequently. Each query is usually small
Data Warehouses are optimized for analytics, where queries are typically huge (Give me the average sales for the last 5 years).
Column-oriented storage is an optimization for OLAP-type queries
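A tiny sketch of what column orientation means (pure Python, with a hypothetical sales table):

```python
# Row-oriented vs column-oriented layouts of the same table.
rows = [
    {"date": "2024-01-01", "product": "A", "sales": 100},
    {"date": "2024-01-02", "product": "B", "sales": 150},
    {"date": "2024-01-03", "product": "A", "sales": 120},
]

# Column store: one contiguous list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytic query ("total sales") touches only one column,
# instead of scanning every field of every row.
total = sum(columns["sales"])
print(total)  # 370
```

On disk, storing each column contiguously also compresses far better, since values within a column tend to be similar.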
The performance of in-memory databases comes not from avoiding disk reads, but from avoiding the overhead of encoding data into a format that can be written to disk
Chapter 4: Encoding and Evolution
How to encode data structures into an efficient format for storage and transmission.
The more common ones: JSON, XML, CSV.
The less common ones: binary schema-based formats (Thrift, Protocol Buffers, Avro)
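A rough illustration of why schema-based binary encodings are smaller, using Python's json and struct modules (the record and the struct layout are made-up stand-ins for a real Thrift/Protobuf/Avro schema):

```python
import json
import struct

record = {"user_id": 12345, "score": 98.5}

# Text encoding: self-describing, but field names repeat in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding with an agreed schema (here: one int32 + one float64).
# Field names live in the schema, not in the payload, which is the core
# idea behind Thrift, Protocol Buffers, and Avro.
binary_bytes = struct.pack("<id", record["user_id"], record["score"])

print(len(json_bytes), len(binary_bytes))
```

The binary payload is 12 bytes here, versus roughly three times that for the JSON, and the gap widens as records get more fields.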
Database upgrades should be done in a rolling fashion, where a few nodes are upgraded at a time rather than all nodes at once; this is why encodings need to be both backward and forward compatible.
Chapter 5: Replication
When talking about distributed databases, replication of the data is a huge topic. It ensures:
- High Availability: Keeping the system running when one node goes down
- Disconnected Operation: Allowing the application to keep working through network interruptions
- Latency: Placing data geographically close to users to lower latency
- Scalability: Scaling read throughput across multiple replicas
Replication can be done with:
- Single-Leader Replication
- Clients send all writes to a single leader node
- Reads can occur on any node, but may be stale
- The single leader is a write bottleneck
- Multi-Leader Replication
- Clients send writes to any of several leader nodes
- More robust, as there is no single-leader bottleneck
- Conflicts may occur, and conflict resolution is needed
- Leaderless Replication
- Clients send writes to several nodes in parallel
- Read Repair: reads query several nodes in parallel, detect stale data, and write the newer value back
- Anti-Entropy Process – Background process to patch missing data between nodes
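A minimal sketch of the quorum condition w + r > n behind leaderless replication (in-memory Python dicts stand in for replicas; real Dynamo-style stores add vector clocks and read repair on top):

```python
# With n replicas, w write acks, and r read responses, w + r > n
# guarantees the read set overlaps the write set on at least one node.
n, w, r = 3, 2, 2
assert w + r > n

replicas = [{} for _ in range(n)]  # hypothetical in-memory replicas

def write(key, value, version):
    acks = 0
    for rep in replicas[:w]:       # pretend only w replicas are reachable
        rep[key] = (value, version)
        acks += 1
    return acks >= w

def read(key):
    # Query r replicas and keep the newest version (read repair would
    # then push that version back to the stale replicas).
    responses = [rep.get(key) for rep in replicas[-r:]]
    responses = [x for x in responses if x is not None]
    return max(responses, key=lambda v: v[1])[0]

write("x", "new", version=2)
print(read("x"))  # the overlapping replica guarantees "new" is seen
```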
Replication Lag happens when some replicas have not yet caught up with the latest writes. An application must provide the following guarantees even under replication lag:
- Read-After-Write consistency
- A user must be able to read their own writes (if you write to one node and read from another, you may not see your own write)
- Monotonic Reads
- After a user sees the data at one point in time, they must not later see it from an earlier point in time
- Consistent Prefix Reads
- The data must preserve causality (questions and their answers must appear in the right order)
Chapter 6: Partitioning
- Key-range partitioning
- Hash partitioning
- Compound partitioning (key + hash)
Skewed Workloads and Hot Spots happen when a disproportionate amount of traffic goes to a single key (e.g. some Twitter users have millions of followers). These hot keys need to be split further, for example by appending a suffix to spread them over several partitions.
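A sketch of hash partitioning plus the hot-key splitting idea (the suffix scheme is a common technique, not any specific product's API):

```python
import hashlib

def partition(key: str, n_partitions: int) -> int:
    # Hash partitioning: a stable hash spreads keys evenly across partitions.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_partitions

def partition_hot_key(key: str, n_partitions: int, n_suffixes: int = 8) -> set:
    # Hot-key mitigation: append a small suffix so one celebrity key's
    # writes fan out over several partitions instead of one.
    return {partition(f"{key}#{i}", n_partitions) for i in range(n_suffixes)}

print(partition("user:42", 4))
print(partition_hot_key("celebrity", 4))
```

The trade-off: reads of a split key now have to query all suffixes and merge the results.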
Chapter 7: Transactions
Transactions simplify error handling: on failure, the entire transaction is aborted and can be safely retried.
- Dirty Reads
- One client reads another client’s writes before they have been committed
- Read committed isolation prevents this
- Dirty Writes
- One client overwrites data that another client has written but not yet committed
- Prevented by locking
- Read Skew
- Client sees different parts of the database at different points in time.
- This can be addressed with Snapshot Isolation
- Implemented using Multi Version Concurrency Control (MVCC)
- Lost Updates
- Two clients concurrently perform a read-modify-write cycle, and one overwrites the other’s changes.
- Write Skew
- A transaction reads something and makes a decision on what it saw, and then makes a write. But before the write, the data could have been changed, making the decision invalid.
- Serializable Isolation prevents this
- Phantom Reads
- A transaction reads objects matching a search condition, and another client makes a write that changes the result of that search
- Snapshot isolation prevents straightforward phantom reads, but phantoms that feed into writes (write skew) need serializable isolation
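The lost-update anomaly above can be reproduced in a few lines; here a Python threading.Lock stands in for the database mechanisms that prevent it (atomic updates, compare-and-set, or explicit locking):

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment():
    # Read-modify-write without coordination: two clients can both read
    # the same value, and one of the two updates is silently lost.
    global counter
    v = counter
    counter = v + 1

def safe_increment():
    # Holding a lock makes the whole cycle atomic. Databases offer the
    # same guarantee via atomic UPDATE ... SET x = x + 1 or compare-and-set.
    global counter
    with lock:
        counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 100: no updates lost
```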
Implementing serializable transactions (to prevent race conditions):
- Literally executing each transaction serially (Slow)
- Two-Phase Locking
- Serializable Snapshot Isolation (SSI)
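A toy illustration of two-phase locking (Python threading locks; real databases use shared/exclusive lock modes and deadlock detection rather than a fixed lock order):

```python
import threading

# Two-phase locking sketch: a transaction acquires every lock it needs
# (growing phase) and releases them only at commit (shrinking phase),
# so no other transaction can observe its intermediate state.
locks = {"a": threading.Lock(), "b": threading.Lock()}
data = {"a": 1, "b": 2}

def transfer():
    held = []
    # Growing phase: lock both objects, in a fixed order to avoid deadlock.
    for key in sorted(["a", "b"]):
        locks[key].acquire()
        held.append(key)
    try:
        data["a"] -= 1
        data["b"] += 1
    finally:
        # Shrinking phase: release everything only at commit.
        for key in held:
            locks[key].release()

transfer()
print(data)  # {'a': 0, 'b': 3}
```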
Chapter 8: The Trouble with Distributed Systems
- When sending a packet from one node to another and getting no reply, you can't tell whether it was a network issue or whether the other node is down
- A node's clock may be out of sync with other nodes' clocks, which causes problems when conflicts are resolved using timestamps
- A process may be paused temporarily, and be falsely declared dead by other nodes
Most systems rely on timeouts to decide whether another node is dead, but a timeout cannot distinguish a crashed node from a slow or unreliable network.
Once a fault has been detected, fixing it is also difficult, because no node has a reliable shared view of the system's state.
If possible, try to run things on a single machine, instead of a distributed one.
Chapter 9: Consistency and Consensus
Ways to achieve consistency in distributed systems:
- Linearizability (Strong Consistency)
- Causality (Weak Consistency)
- Total Order Broadcast
- Lamport Timestamps
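A minimal Lamport timestamp sketch (the Node class is hypothetical):

```python
# Lamport timestamps: each node keeps a counter; every event increments
# it, and every received message fast-forwards it past the sender's
# counter. The pair (counter, node_id) gives a total order that is
# consistent with causality.
class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        # On receipt, jump past the sender's counter before stamping.
        self.counter = max(self.counter, timestamp[0]) + 1
        return (self.counter, self.node_id)

a, b = Node(1), Node(2)
t1 = a.tick()          # (1, 1)
t2 = b.receive(t1)     # (2, 2), causally after t1
print(t1 < t2)         # tuple comparison gives the total order: True
```

Note the limitation the book points out: the total order is only known after the fact, which is why total order broadcast is needed for decisions like uniqueness constraints.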
If a leader node fails, you can either:
- Wait for the leader to recover (Too long)
- Manually pick a new leader (Human intervention)
- Use a consensus algorithm
Ways to achieve consensus in distributed systems (How distributed systems agree):
- Atomic Commit
- Two-Phase Commit
- Three-Phase Commit
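A toy two-phase commit, sketched in Python (the Participant class and coordinator function are made up; a real implementation must also persist every decision so a crashed node can recover):

```python
# Two-phase commit: a coordinator first asks every participant to
# prepare; only if all vote "yes" does it tell them to commit,
# otherwise everyone aborts.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "initial"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit: bool):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision. A coordinator crash right here
    # leaves participants blocked "in doubt", which is 2PC's weakness.
    for p in participants:
        p.finish(decision)
    return decision

ok = two_phase_commit([Participant(), Participant()])
bad = two_phase_commit([Participant(), Participant(can_commit=False)])
print(ok, bad)  # True False
```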
A consensus protocol must satisfy the following conditions:
- Uniform agreement (No two nodes decide differently)
- Integrity (No node decides more than once)
- Validity (If a node decides on value v, then v was proposed by some node)
- Termination (Every node that does not crash eventually decides some value)
Chapter 10: Batch Processing
Batch Processing Steps (MapReduce):
- Read in a set of records
- Mapper function extracts key-value pairs from the records
- Sort the records by the keys
- Reducer function operates on the sorted key-value pairs
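The four steps above, as a single-process word-count sketch:

```python
from itertools import groupby

records = ["the quick fox", "the lazy dog", "the fox"]

def mapper(record):
    # Step 2: extract key-value pairs from each record.
    for word in record.split():
        yield (word, 1)

# Step 1 + 2: read records and map them.
pairs = [kv for rec in records for kv in mapper(rec)]
# Step 3: sort by key (this is the "shuffle" in a real MapReduce).
pairs.sort(key=lambda kv: kv[0])

def reducer(key, values):
    # Step 4: operate on all values for one key.
    return (key, sum(values))

counts = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=lambda kv: kv[0]))
print(counts["the"])  # 3
```

In a real cluster, the mapper and reducer run on many machines, and the sort/shuffle step moves each key's pairs to the same reducer.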
Two main problems distributed batch processing needs to solve:
- Partitioning (Skews and Hot Keys)
- Fault tolerance (writing task outputs durably allows rerunning individual failed tasks)
MapReduce jobs are designed to tolerate frequent task failure not mainly because of unreliable hardware, but because it gives the scheduler the freedom to preempt lower-priority tasks, improving resource utilization.
Chapter 11: Stream Processing
There are two main styles of message broker for streams:
- Message-style processing (MQTT)
- Log-based message brokers (Kafka)
Log-based message brokers do not delete messages once delivered; they retain them like a database. Message-style brokers delete each message once it has been delivered and acknowledged.
Types of windows to partition streams:
- Tumbling Window
- Hopping Window
- Sliding Window
- Session Window
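A sketch of the simplest of these, the tumbling window (fixed-length, non-overlapping buckets; the event timestamps are made up):

```python
from collections import defaultdict

# Tumbling window: each event falls into exactly one fixed-length,
# non-overlapping window, determined by its timestamp.
events = [(1, "a"), (4, "b"), (5, "c"), (9, "d"), (11, "e")]  # (ts_seconds, payload)
WINDOW = 5  # window length in seconds

windows = defaultdict(list)
for ts, payload in events:
    window_start = (ts // WINDOW) * WINDOW
    windows[window_start].append(payload)

print(dict(windows))  # {0: ['a', 'b'], 5: ['c', 'd'], 10: ['e']}
```

Hopping and sliding windows differ in that an event can belong to several overlapping windows; session windows are bounded by gaps of inactivity instead of fixed lengths.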
Change Data Capture (CDC) captures all changes to a database and streams them to other systems (replicas, caches, search indexes) to keep them consistent
Fault tolerance in Stream Processing can be done by:
- Micro-batching and checkpoints of the stream
- Atomic Commits
- Maintaining snapshots of operator state (recovery continues from the latest snapshot instead of rerunning the entire stream)