Reading Notes: "Large-scale Distributed Storage Systems: Principles and Architecture in Action": III

This article was last updated on: February 7, 2024

🔖 Books:

“Large-scale Distributed Storage System: Principles Analysis and Architecture Practice”

By Yang Chuanhui

3 Distributed systems

3.1 Basic Concepts

3.1.1 Exceptions

1. Exception type

  1. Server downtime: in-memory data is lost, so the system must be able to rebuild memory state by reading persisted data.
  2. Network anomalies: the guiding principle is that the network is never reliable. A message can be confirmed as delivered only when a reply is received from the other side. The system should always be designed on the assumption that the network will misbehave, and take measures accordingly.
  3. Disk failure: store the data on multiple servers.

2. Timeout

RPC has 3 states:

  • succeed
  • fail
  • Timeout (unknown state): resolve it by repeatedly reading the status of the previous operation to verify whether it succeeded, or by designing operations to be "idempotent".
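This retry-plus-idempotence pattern can be sketched as follows. The `KVServer`/`put` names and the request-ID scheme are illustrative, not from the book:

```python
# Handling the "unknown" timeout state by making the operation idempotent:
# the client attaches a request ID, and the server remembers which IDs it
# has already applied, so a retried RPC takes effect at most once.

class KVServer:
    def __init__(self):
        self.data = {}
        self.applied = set()  # request IDs already executed

    def put(self, request_id, key, value):
        if request_id in self.applied:   # duplicate retry after a timeout
            return "ok"                  # safe to acknowledge again
        self.data[key] = value
        self.applied.add(request_id)
        return "ok"

server = KVServer()
# The first attempt times out on the network, so the client retries with
# the same request ID; the final state is as if it ran exactly once.
server.put("req-1", "k", "v1")
server.put("req-1", "k", "v1")   # retry: no double effect
assert server.data == {"k": "v1"}
```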

3.1.2 Consistency

Replication is the only means of fault tolerance in distributed storage systems, and keeping replicas consistent is the theoretical core of distributed systems. Consistency can be viewed from two angles:

  1. Client: whether read and write operations satisfy certain guarantees.
  2. Storage system: whether the multiple replicas are consistent with each other.

From the client’s perspective, there are three types of consistency:

  1. Strong consistency
  2. Weak consistency
  3. Eventual consistency: a special case of weak consistency, with the notion of an "inconsistency window".

Other common variants:

  1. Read-your-writes consistency
  2. Session consistency
  3. Monotonic Read consistency
  4. Monotonic Write consistency

From the perspective of the storage system, consistency includes:

  • Replica consistency
  • Update order consistency

3.1.3 Metrics

  1. Performance: throughput and latency are generally at odds.
    1. Read operations per second, QPS (Queries Per Second)
    2. Write operations per second, TPS (Transactions Per Second)
    3. Average latency, or 99.9th-percentile latency
  2. Availability: measured in "nines"
  3. Consistency: strong consistency is recommended within a single data center, where it has little impact on performance and availability.
  4. Scalability: ideally, the system scales linearly.
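As a quick illustration of availability measured in "nines", the allowed downtime per year follows directly from the unavailability fraction:

```python
# Translate "several 9s" of availability into allowed downtime per year.
def downtime_per_year(nines):
    unavailability = 10 ** (-nines)        # e.g. 3 nines -> 0.1% of the time
    return unavailability * 365 * 24 * 60  # minutes of downtime per year

print(round(downtime_per_year(3)))  # 99.9%  -> ~526 minutes (~8.8 hours)
print(round(downtime_per_year(4)))  # 99.99% -> ~53 minutes
```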

3.2 Analysis

3.3 Data Distribution

A basic requirement of distributed storage systems is transparency, including data distribution transparency, data migration transparency, data replication transparency, and fault handling transparency.

3.3.1 Hash distribution

Compute a hash of the key and take it modulo the number of servers.

Possible problems:

  1. Uneven distribution, for example when a single customer's data volume is very large.
  2. When the number of servers changes, almost all data must be redistributed, causing massive data migration. Two mitigations:
    1. Treat the mapping from hash values to servers as metadata managed by a dedicated metadata server. When scaling out the cluster, part of the hash space is assigned to the newly added machine and the corresponding data is migrated.
    2. Distributed Hash Table (DHT): each node in the system is assigned a random token, and the tokens form a hash ring. To store a piece of data, first compute the hash of its key, then store it on the first node found clockwise whose token is greater than or equal to that hash value. When nodes are added or removed, only the neighboring nodes on the ring are affected; other nodes are untouched.
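A minimal sketch of such a DHT hash ring; the hash function and the use of node names as token seeds are illustrative assumptions:

```python
import bisect
import hashlib

# Consistent-hashing (DHT) ring sketch: each node gets a token on a hash
# ring; a key is stored on the first node clockwise whose token is >=
# hash(key). Adding or removing a node only moves keys between neighbors.

def h(s):
    # Illustrative 32-bit hash derived from MD5.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes):
        self.tokens = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        token_values = [t for t, _ in self.tokens]
        # First token >= hash(key), wrapping around past the largest token.
        i = bisect.bisect_left(token_values, h(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")   # deterministic placement on the ring
# After adding "node-d", only keys between node-d's token and its ring
# predecessor change owner; all other keys stay where they were.
```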

3.3.2 Sequential distribution

This is common in distributed table systems: a large table is divided sequentially into contiguous ranges, each of which forms a subtable.

3.3.3 Load balancing

3.4 Replication

3.4.1 Overview of replication

The primary replica propagates write requests to the other, standby replicas, typically by shipping the commit (operation) log. The primary first sends the operation log to the standbys; each standby replays the log and notifies the primary when done. The primary then applies the change locally, and notifies the client that the write succeeded only after all of these steps have completed.

NWR Replication Protocol:

  • N: Number of replicas
  • W: Number of replicas a write must reach
  • R: Number of replicas a read must consult

In the NWR protocol, replicas are no longer divided into primary and standby: the client writes data to W replicas and reads from R replicas according to some policy. As long as W + R > N, at least one of the replicas read is guaranteed to contain the latest update. The problem: different replicas may apply operations in different orders, so reads across multiple replicas may observe conflicts.
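The W + R > N condition is just the pigeonhole principle: any read set of R replicas must overlap any write set of W replicas in at least W + R − N members. A tiny sanity check:

```python
# Quorum condition W + R > N: any read quorum intersects any write quorum,
# so at least one replica read holds the latest write.

def quorum_intersects(n, w, r):
    # Worst case, the read set avoids the write set as much as possible;
    # the intersection still has at least w + r - n replicas.
    return w + r - n >= 1

assert quorum_intersects(3, 2, 2)      # classic N=3, W=2, R=2: safe
assert not quorum_intersects(3, 1, 2)  # W=1, R=2: a read may miss the write
assert quorum_intersects(5, 3, 3)      # majority quorums always intersect
```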

3.4.2 Consistency and Availability

CAP theorem: Consistency, Availability, and Partition tolerance cannot all be satisfied at the same time.

  • Partition tolerance: consistency and availability can still be provided under failures such as machine crashes, network faults, and data-center power outages

Distributed systems require automatic fault tolerance, that is, partition tolerance must always be satisfied; therefore consistency and (write) availability cannot both be guaranteed.

  • Strong synchronous replication guarantees consistency, but when a failure occurs between the primary and standby replicas, writes block and availability is lost.
  • Asynchronous replication guarantees availability, but cannot achieve consistency.

There is a trade-off between consistency and availability: some scenarios cannot tolerate data loss, while in others some data loss is acceptable and availability matters more.

For example, Oracle’s DataGuard replication component contains 3 modes:

  • Maximum protection mode: strong synchronous replication.
  • Maximum performance mode: asynchronous replication.
  • Maximum availability mode: a compromise between the two. Under normal conditions it behaves like maximum protection mode; if the network between primary and standby fails, it degrades to maximum performance mode (that is, strong synchronization that can degrade).

3.5 Fault Tolerance

Fault tolerance is usually implemented with a lease protocol.

3.5.1 Common failures

Failures observed in a Google data center during its first year of operation:

| Frequency (per year) | Fault type | Scope of impact |
| --- | --- | --- |
| 0.5 | Data center overheating | Most machines powered off within 5 minutes, restored in 1 to 2 days |
| 1 | Power Distribution Unit (PDU) failure | 500-1000 machines instantly offline, ~6 hours to recover |
| 1 | Rack adjustment | Many alarms; 500-1000 machines powered off, ~6 hours to recover |
| 1 | Network recabling | About 5% of machines offline for more than 2 days |
| 20 | Rack failure | 40-80 machines instantly offline, 1-6 hours to recover |
| 5 | Rack instability | 50% packet loss on 40-80 machines |
| 12 | Router reboot | DNS and external virtual IP services fail for a few minutes |
| 3 | Router failure | Traffic must be switched immediately, lasting about 1 hour |
| Dozens | DNS failure | Lasts about 30 seconds |
| 1000 | Single-machine failure | Machine cannot provide service |
| Thousands | Hard disk failure | Data on the disk is lost |

System Design:

  1. Tolerate single-server failures.
  2. Racks: avoid placing all replicas of a piece of data within the same rack.

3.5.2 Fault Detection

Heartbeats. A sends heartbeats to B; if A receives no response, either:

  1. B has failed, or
  2. B has not failed:
    1. the network between A and B has a problem, or
    2. B is too busy to respond.

In either case, A and B must agree on whether B is considered failed and must stop serving. Failures are detected with a lease mechanism: a lease is an authorization with a timeout. A issues a lease to B, and B is allowed to provide service only while the lease is valid; otherwise it must stop serving on its own. When B's lease is about to expire, it requests a new lease from A.
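The lease idea can be sketched as follows; the durations and function names are illustrative, and real systems must also budget for clock drift between A and B:

```python
# Lease sketch: A grants B a lease with an expiry timestamp. B serves only
# while the lease is valid and must renew before expiry, otherwise it stops
# on its own. A declares B dead only after the lease has expired plus some
# clock slack, so both sides reach the same conclusion.

LEASE_SECONDS = 10.0
CLOCK_SLACK = 1.0  # margin for clock drift between A and B

def grant_lease(now):
    return now + LEASE_SECONDS        # expiry timestamp issued by A

def b_may_serve(now, expiry):
    return now < expiry               # B stops serving once its lease lapses

def a_considers_b_dead(now, expiry):
    return now > expiry + CLOCK_SLACK # A waits out the lease before failover

expiry = grant_lease(0.0)
assert b_may_serve(5.0, expiry)           # within the lease: B serves
assert not b_may_serve(11.0, expiry)      # lapsed: B has stopped itself
assert a_considers_b_dead(12.0, expiry)   # only now may A fail B over
```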

3.5.3 Failback

There are single-layer and double-layer structures.

Single layer: assume 3 replicas, and replica A fails:

  1. Temporary failure: the master controller waits for a period of time; if A comes back online, the failure was temporary.
  2. Permanent failure: for example, a damaged disk. An "add replica" operation is needed: a node is chosen to copy A's data and become a replacement replica.

The master controller itself may also fail, so it needs high availability: its state is synchronized to a standby in real time, and a distributed lock service, implemented with a protocol such as Paxos, decides which instance is the primary.

3.6 Extensibility

Means of implementation:

  1. Increase the number of replicas
  2. Add caches to improve read capacity
  3. Sharding data
  4. Replicate data to multiple data centers

3.6.1 Master Control Node

The master node maintains data distribution information and performs global scheduling: worker machine management, data placement, fault detection and recovery, load balancing, and so on.

  1. If, besides global scheduling, the master must also maintain the file system directory tree, memory capacity may become the bottleneck
  2. If the master only maintains the location information of data shards, it generally does not become a bottleneck

For example, GFS gives up support for small files, delegates read/write control to the worker ChunkServers, and has clients cache metadata, all to reduce pressure on the master.

If the master does become the bottleneck, a two-level structure can be adopted: add a layer of metadata nodes between the master and the workers. Each metadata node maintains only a portion of the file system's metadata rather than all of it. The master then only needs to maintain metadata about the metadata nodes, and is very unlikely to become a bottleneck.

3.6.2 Database expansion

Extensibility means:

  1. Master-slave replication
  2. Split vertically
  3. Split horizontally

Traditional DB capacity expansion faces the following issues:

  • Scaling is not flexible enough
  • Scaling is not automated enough
  • Adding a replica takes a long time

3.6.3 Heterogeneous Systems

Homogeneous system: the nodes in each group serve exactly the same data, so adding a replica requires migrating too much data.

Heterogeneous system: data is divided into many shards of similar size, and the replicas of each shard can be placed on any storage node in the cluster. After a node fails, the lost replicas are rebuilt by the whole cluster rather than by a few fixed storage nodes. Because the entire cluster participates in recovery, recovery time is short, and the larger the cluster, the more pronounced the advantage.

3.7 Distributed Protocols

The most representative:

  • Two-phase commit (2PC): Guarantees atomicity of operations across multiple nodes
  • Paxos: ensures that multiple nodes reach agreement on a decision, such as which node is the primary

3.7.1 Two-Phase Commit Protocol

Implement distributed transactions.

Two types of nodes:

  1. Coordinator (CN)
  2. Transaction participants (participants, cohorts, or workers or DN)

The protocol assumes that each node logs its operations persistently, so the log survives even if the node fails.


  1. Prepare phase. The coordinator notifies the transaction participants to prepare to commit or abort the transaction, and voting begins. During voting, each participant tells the coordinator its decision: agree (its local execution succeeded) or abort (its local execution failed).
  2. Commit phase. The coordinator makes a decision based on the first-phase votes: commit or abort. It notifies all participants to commit the transaction if and only if every participant voted to commit; otherwise it notifies all participants to abort. Each participant acts accordingly on receiving the coordinator's message.
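The two phases above can be sketched as follows; participant internals (local execution, logging) are abstracted away, and all names are ours, not the book's:

```python
# Two-phase commit sketch: the coordinator commits only if every
# participant votes "agree" in the prepare phase; otherwise everyone aborts.

def two_phase_commit(participants):
    # Phase 1: prepare / vote.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the decision is sent to every participant.
    for p in participants:
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit  # whether local execution succeeds
        self.state = "init"
    def prepare(self):
        self.state = "prepared" if self.can_commit else "failed"
        return self.can_commit
    def finish(self, decision):
        self.state = decision

ps = [Participant(True), Participant(True)]
assert two_phase_commit(ps) == "commit"
ps = [Participant(True), Participant(False)]  # one local failure
assert two_phase_commit(ps) == "abort"        # everyone aborts
```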

2 types of failures:

  • A transaction participant fails: set a timeout for each transaction; the transaction is aborted on timeout.
  • The coordinator fails: use a backup coordinator. If both fail permanently, participants cannot finish the transaction and keep waiting.

In summary, 2PC is a blocking protocol: it holds locks and blocks other updates while executing, and tolerates faults poorly. Most distributed storage systems have therefore dropped support for distributed transactions.

3.7.2 Paxos Protocol

3.7.3 Paxos and 2PC

The two protocols serve different purposes.

Paxos usage:

  1. Implement a global lock service, or a naming and configuration service, such as ZooKeeper
  2. Replicate user data to multiple data centers.

For the problem of 2PC, the common practice is to combine 2PC and Paxos to ensure the atomicity of operations on multiple data shards through 2PC, and to achieve consistency between multiple copies of a data shard through the Paxos protocol. In addition, the problem of coordinator downtime in the 2PC protocol is solved through the Paxos protocol.

3.8 Cross-data center deployment

Latency between data centers is high and unstable.

Difficulty: Data synchronization and service switchover.

There are three scenarios for cross-data center deployment:

  • Switch the cluster as a whole
  • A single cluster spans data centers
  • Paxos chooses the primary copy

3.8.1 Cluster switchover

The most common scenario.

Data synchronization between data centers may be strongly synchronous or asynchronous.

If it is asynchronous, the standby room data always lags behind the host room. When the host room fails, there are 2 options:

  1. Switch the service to the standby room and endure the risk of data loss
  2. Stop service until the host room is restored.

Therefore, if data synchronization is asynchronous, switchover between the primary and standby data centers must be manual, letting users choose between "losing data" and "stopping service" according to the characteristics of their business.

With strong synchronization, the data in the active and standby data centers is consistent, so switchover can be automatic.

3.8.2 A single cluster spans data centers

The entire cluster spans data centers, with only one master controller, which must maintain communication with the worker nodes in all data centers. When the master fails, the distributed lock service detects this and switches the backup node in data center 2 over to become the master.

When distributing data, the master must take data-center information into account and try to spread the replicas of each data shard across multiple data centers.

3.8.3 Paxos primary-replica election

Multiple replicas of each data shard form a Paxos replication group.

For example: replicas B1-B4 form a replication group, with B1 as the primary. When B1 fails, the other replicas try to take over as primary, and the Paxos protocol guarantees that only one of them succeeds.

  • Advantage: reduced dependence on the master controller

  • Disadvantage: high complexity

Posted on August 13, 2021