Reading Notes: "Large-scale Distributed Storage Systems: Principles and Architecture in Action": II

This article was last updated on: February 7, 2024 pm

🔖 Books:

“Large-scale Distributed Storage System: Principles Analysis and Architecture Practice”

By Yang Chuanhui

2 Stand-alone storage system


  1. Atomicity
  2. consistency
  3. Isolation
  4. Durability

Storage Engine:

  1. Hash storage engine
  2. B-tree storage engine
  3. LSM Tree (Log Structure Mereg Tree) storage engine

2.1 Hardware Basics

CPU: Symmetric Multiprocessing Structure (SMP). SMP has limited scalability.
In order to improve scalability, the current mainstream server architecture is generally NUMA (Non-Uniform Memory Access) architecture. It has multiple NUMA nodes, each NUMA node is an SMP structure, generally composed of multiple CPUs (such as 4), and has independent local memory, io slots, etc.

NUMA nodes can access local memory directly and quickly, or they can access the memory of other NUMA nodes through NUMA interconnection modules, which can access local memory much faster than remote access.

2.1.2 IO bus

The Interl x48 motherboard is a typical north-south bridge architecture. The Northbridge chip is connected to the CPU through the Front Side Bus (FSB), and the memory modules and PIC-E devices (such as the high-end SSD device Fusion-IO) are attached to the Northbridge. The North Bridge and the South Bridge are connected by DMI.

2.1.3 Network Topology

Three-tier topology.

Access layer (Edge), aggregation layer (Aggregation layer), core layer (Core).

Problem: There may be many switches at the access layer connected to the aggregation layer, and many switches at the aggregation layer to the core layer. If the bandwidth between servers in the same access layer is 1 Gb, the bandwidth between servers under different access layer switches is less than 1 Gb. Since servers at the same access layer are often deployed in a single rack, the design of the system needs to consider whether the server is in one rack, reducing the copying of large amounts of data across racks. For example, Hadoop HDFS defaults 3 copies, where two replicas are placed in the same rack.

In 2008, Google transformed the network into a flat topology, a three-tier CLOS network, supporting up to 20,480 servers in the same cluster, and any two of them have 1Gb bandwidth. CLOS networks require additional investment in more switches. The advantage is that the system does not need to be designed with the underlying network topology in mind, making it easy to make the entire cluster into a resource pool.

🤔 How to calculate fiber latency?

For example, Beijing and Hangzhou have a straight-line distance of 1300 kilometers, and the light walks a broken line in the light, assuming that the broken line distance is 1.5 times the straight-line distance, then the theoretical value of the optical transmission network round trip delay is 1300*1.5*2/300 000 = 13ms, the actual test value is about 40ms.

2.1.4 Performance Parameters

Category Time elapsed
Visit L1 Cache 0.5 ns
Branch prediction failed 5 ns
Visit L2 Cache 7 ns
Mutex locked, unlocked 100 ns
Memory Access 100 ns
Gigabit networks send 1 MB of data 10 ms
Read 1 MB of data sequentially from memory 0.25 ms
Network back and forth in the computer room 0.5 ms
Network back and forth between remote computer rooms 30-100 ms
SATA Disk Seek 10 ms
Sequential reading of 1 MB of data from SATA disks 20 ms
SSD SSD access latency 0.1-0.2 ms

The performance bottleneck of the storage system is mainly the diskRandom read and write

SSD features: low random read latency.

Category IOPS Price per GB (RMB) Random read Random write
Memory Tens of millions 150 Friendly Friendly
SSD disk 35000 20 Friendly Write amplification issues
SAS Disks 180 3 Disk seek Disk seek
SATA disk 90 0.5 Disk seek Disk seek

Storage system performance mainly:Throughput and access latency, with guaranteed access latency, achieve the highest possible throughput at the lowest cost. Disks are suitable for large-block sequential access, and SSDs are suitable for critical systems with high random access or sensitivity to latency. The two are often combined, with hot data stored on SSDs and cold data stored on disks.

2.2 Stand-alone storage engine

  • Hash: Sequential scanning is not supported
  • B-tree: Suitable for relational databases
  • LSM tree: Batch dump technology is used to avoid disk random write problems.

2.3 Data Model

2.3.1 File Model

POSIX (Portable Operating System Interface), which requires the atomicity of operations when reading and writing are concurrent, that is, the read operation either reads all the results or nothing can be read; In addition, read operations are required to be able to read the results of all previous write operations. Suitable for stand-alone file systems. Distributed file systems do not adhere strictly to this standard.

NFS allows clients to cache file data, and inconsistencies can occur when multiple clients modify the same file concurrently.

The object model is similar to the file model.Weakens the concept of directory treesrequestObjects are written to the system once, and only the entire object can be deleted, and no part of it is allowed to be modified.

2.3.2 Relational Model

Each relationship is a table consisting of multiple tuples (rows), each of which in turn contains multiple attributes (columns).

The relationship name, attribute name, and attribute type are called the schema of the relationship. For example, the pattern of the Movie relationship isMovie(title, year, length), title year length 为属性,类型分别为字符串 整数 整数

SELECT 	<属性表 >
FROM < 关系表 >
WHERE < 条件 >
GROUP BY < 属性表 >
HAVING < 条件 >
ORDER BY < 属性表>

SQL features:

  1. Allow in WHRER FROM HAVING A subquery is used in words, and the subquery is completeselect-from-where statement
  2. Index (to reduce the amount of data scanned during SQL execution and improve read performance)
  3. Transactions (ACID characteristics when multiple operations are guaranteed to execute concurrently)

2.3.3 Key-Value Model

Key-value model:

  • Put
  • Get
  • Delete

Tabular model:

Google Bigtable, HBase, in addition to supporting simple primary key-based operations, also supportRange scanning, in addition, also supportedColumn-based operations。 The main operations are as follows:

  • Insert
  • Delete
  • Update
  • Get
  • Scan: Scan a range of data, determine the scope of scanning according to the primary key, support scanning some columns, and support filtering, sorting, grouping, etc. by column.

Multi-table association operations are not supported, secondary indexes are generally not supported, and transaction operation support is relatively weak, and there is no unified standard. In addition, tabular models are supportedNo mode(schema-less – a feature that does not require pre-definition of which columns to include in each row and the type of each column, allowing different columns to be included between multiple rows.)

2.3.4 SQL vs. NoSQL

Non-relational database (NoSQL, Not Only SQL), good scalability, weakened consistency requirements, to a certain extent to solve the problem of massive data and high concurrency.

Relational databases have problems in scenarios with massive data:

  1. Affairs. In a distributed system, if multiple operations belong to different servers, a two-phase commit protocol is required to ensure their atomicity, which has low performance and cannot tolerate server failures, making it difficult to apply in scenarios with massive data.
  2. Conjunctions. In the scenario of massive data, in order to avoid database multi-table association operations, data redundancy and other means that violate the database paradigm are often used. But the benefits outweigh the costs.
  3. Performance.

Issues with NoSQL:

  1. Lack of uniform standards
  2. Complex use and O&M.

2.4 Transaction Concurrency Control

For performance considerations, multiple isolation levels are defined, and the concurrency control of transactions is often achieved through a lock mechanism, which can have different granularity, can lock rows, can lock data blocks or even entire tables.

Internet services, read more than write, in order to improve performance, can be adopted Copy on write(Copy-On-Write, COW) or Multi-version concurrency control(Multi-Version Concurrency Control, MVCC) to avoid write transactions blocking read transactions.

2.4.1 Transactions

  • Atomicity
  • consistency
  • Isolation
  • Persistence


The transaction modifies the data either in its entirety or at all. For example: A transfers a sum of money x to B, the result must be that A deducts x, and B adds x.


Data correctness, completeness, consistency. For example, the data type must be correct, and the data value must be within the specified range.


Guarantees that each transaction is invisible to other transactions until all its modifications have been completed. Or the above example, during the transfer process, the query operation cannot see the status that account A has been debited x, but account B has not debited x.


After the transaction is complete, the impact on the database is permanent.

For performance reasons, consumers are allowed to choose to sacrifice isolation properties in exchange for concurrency.

4 isolation levels:

  1. Read Uncommitted (RU): Read uncommitted data, minimum isolation level.
  2. Read Committed (RC): Read the submitted data
  3. Repeatable Read (RR): Repeatable read.
  4. Serializable(S): Serializable. Highest isolation level

A decrease in the isolation level can result in dirty data and transaction execution exceptions, such as:

  1. Lost Update (LU): Missing Update
  2. Dirty Reads (DR): Dirty Reads. One transaction reads the data updated by another transaction without committing it.
  3. Non-Repeatable Reads (NRR): Multiple reads of the same data give different results.
  4. Second Lost Updates problem (SLU): The second type of lost updates. If two concurrent transactions read and modify the same data at the same time, the later modification may invalidate the previous modification.
  5. Phantom Reads (PR): During transaction execution, because another transaction inserts data during the previous query and the following query, the later query results appear data that did not appear in the previous query results.

2.4.2 Concurrency Control

Database locks

Read lock and write lock. Multiple read locks are allowed on the same row element, but only one write lock is allowed. The elements here can be a row, a block of data, or even a table.

Ideas for resolving deadlocks:

  1. The transaction sets the timeout period, and the timeout is automatically rolled back.
  2. Deadlock detection.

Copy on write

Read operations are not locked.

Multi-version concurrency control

Maintain multiple versions of each row of data.

InnoDB maintains two implied columns for each row.

2.5 Failback

Operational log (commit log) technology for failure recovery.

Operation logs are divided into: rollback log (UNDO Log), redo log (REDO Log) and UNDO/REDO log.

Record the state before the transaction is modified, which is the rollback log; Record the modified state of the transaction, then redo the log.

2.5.1 Operation log

By recording each database operation sequentially and performing these operations in memory, the data in memory is periodically flushed to disk, realizing the transformation of random write requests into sequential write requests.


Transaction T performs an addition of 10 operations to X in the table, initial X = 5, updated X = 15, then

The UNDO log is recorded as <T, X, 5>, and the REDO log is recorded as <T, X, 15>

The UNDO/REDO log is recorded as <T, X, 5, 15>

2.5.2 Redo logs

2.5.3 Means of optimization

  1. Group Commit
  2. Checkpoint: Periodically dumps data in memory to disk.

2.6 Data Compression

Lossy and lossless compression.

Lossless: LZ series compression algorithm. Such as: GZip

Compression algorithm core: Find duplicate data. Columnar storage greatly improves the compression ratio of data.

2.6.1 Compression algorithms

  1. Huffman coded
  2. LZ series compression algorithm: A dictionary-based compression algorithm
  3. BMDiff vs Zippy (compared to traditional LZ, the advantage is very fast processing speed)

2.6.2 Columnar storage

OLAP: A query accesses millions or even billions of rows of data, and the query tends to only care about a few columns. Such as: Check the top 20 products with the highest sales this year. Focus only on 3 columns: Time, Merchandise and Sales.

Column group: The values of multiple data columns that are frequently accessed together are stored together.

Due to the high degree of duplication of the same column data, there are great advantages when compressing columnar databases.

Index-specific optimization can be done for columnar storage.

Reading Notes: "Large-scale Distributed Storage Systems: Principles and Architecture in Action": II
Posted on
August 13, 2021
Licensed under