Reading Notes - Large-scale Distributed Storage Systems - Principle Analysis and Architecture Practice - I
This article was last updated on: July 24, 2024
🔖 Books:
“Large-scale Distributed Storage System: Principles Analysis and Architecture Practice”
By Yang Chuanhui
1.1 Distributed storage concept
Distributed storage system features:
- Scalable
- Low cost
- High performance
- Easy to use
The main challenge: data and state information must be persisted, and data consistency must be maintained during automatic migration, automatic fault tolerance, and concurrent reads and writes.
Technical points:
- Data distribution: How is data distributed across multiple servers so that the distribution stays uniform? Once data is spread across servers, how are cross-server reads and writes implemented?
- Consistency: How are multiple copies of the data replicated to multiple servers so that the replicas stay consistent even when failures occur?
- Fault tolerance: How is a server failure detected? How are data and services migrated automatically from a failed server to other servers in the cluster?
- Load balancing: How is load rebalanced automatically when servers are added and while the cluster is running normally? How are existing services kept unaffected during data migration?
- Transactions and concurrency control: How are distributed transactions implemented? How is multi-version concurrency control implemented?
- Ease of use: How should the external interface be designed so the system is easy to use? How should the monitoring system be designed so that the internal state of the system is exposed to O&M personnel in a convenient form?
- Compression/decompression: How should compression and decompression algorithms be chosen to fit the characteristics of the data? How should the storage space saved by compression be balanced against the CPU resources it consumes?
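The compression trade-off in the last point can be made concrete with a small measurement. The following is a minimal sketch, not something from the book: it assumes Python's standard `zlib` module as the codec, and the `measure_compression` helper and the sample records are hypothetical. It reports the compressed-to-original size ratio and the CPU time spent at several compression levels.

```python
import time
import zlib


def measure_compression(data: bytes, level: int) -> dict:
    """Compress `data` at the given zlib level; report size ratio and CPU time."""
    start = time.process_time()
    compressed = zlib.compress(data, level)
    elapsed = time.process_time() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(compressed) == data
    return {
        "level": level,
        "ratio": len(compressed) / len(data),
        "cpu_seconds": elapsed,
    }


# Hypothetical sample data: repetitive, log-like records.
sample = b"".join(
    b"user=%d action=read status=ok\n" % (i % 100) for i in range(20000)
)

results = [measure_compression(sample, level) for level in (1, 6, 9)]
for r in results:
    print(f"level={r['level']} ratio={r['ratio']:.3f} cpu={r['cpu_seconds']:.4f}s")
```

Higher levels usually trade more CPU time for a smaller output, but the actual numbers depend heavily on the data, so the measurement should be repeated on representative samples.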
1.2 Distributed Storage Classification
Types of data handled by distributed storage:
- Unstructured data: Office documents, text, images, audio, video.
- Structured data: Data stored in a relational database that can be represented as two-dimensional relational tables. The schema (attributes, data types, and relationships between data items) is separate from the content, and the schema must be defined in advance.
- Semi-structured data: For example, HTML documents. The data is self-describing: structure and content are mixed together with no clear boundary, and no schema needs to be defined in advance.
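The difference between structured and semi-structured data can be sketched in a few lines. This is an illustrative example only: it assumes an in-memory SQLite table for the structured case and uses JSON records (rather than the HTML mentioned above) for the semi-structured case. The table name, column names, and records are all hypothetical.

```python
import json
import sqlite3

# Structured: the schema is declared up front, separately from the rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.execute("INSERT INTO users (id, name, age) VALUES (?, ?, ?)", (1, "alice", 30))
structured_row = conn.execute("SELECT id, name, age FROM users").fetchone()

# Semi-structured: each record carries its own field names, so two records
# can have different fields and no schema is declared beforehand.
semi_structured = [
    json.loads('{"id": 1, "name": "alice", "age": 30}'),
    json.loads('{"id": 2, "name": "bob", "tags": ["admin"]}'),
]
field_sets = [sorted(record.keys()) for record in semi_structured]

print(structured_row)
print(field_sets)
```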
1.2.1 Distributed File System
Data is organized as objects with no association between them. Such data is generally referred to as blob (Binary Large Object) data.
Distributed file systems are also often used as the underlying storage for distributed table systems and distributed databases.
In general, distributed file systems store three types of data:
- Blob objects
- Fixed-length blocks
- Large files
1.2.2 Distributed key-value system
Stores semi-structured data with simple relationships and provides only primary-key-based CRUD (create, read, update, delete) operations.
It resembles a traditional hash table, can be seen as a simplified implementation of a distributed table system, and is generally used as a cache.
Common data distribution techniques: Consistent hashing.
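As an illustration of consistent hashing, here is a minimal sketch of a hash ring with virtual nodes. It is not taken from the book: the class name `ConsistentHashRing`, the choice of MD5 as the hash function, the default of 100 virtual nodes per server, and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes (illustrative names and defaults)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        # Use the first 8 bytes of the MD5 digest as a 64-bit ring position.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        # Each physical node is placed on the ring at `vnodes` positions.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The point of the technique shows up in `moved`: when a server is added, only the keys that now fall to the new server change owner, while all other keys stay where they were. With plain `hash(key) % N` placement, most keys would move when `N` changes.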
1.2.3 Distributed table system
Stores semi-structured data with more complex relationships. In addition to CRUD operations, it supports scanning a range of primary keys. Example: Amazon DynamoDB.
Rows in the same table are not required to contain the same columns.
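The two properties above, range scans over the primary key and rows with differing columns, can be sketched with a single-process stand-in. This is a toy model under stated assumptions, not how any real table system is implemented: the `SimpleTable` class, its method names, and the sample keys are hypothetical.

```python
import bisect


class SimpleTable:
    """In-memory stand-in for one table: rows kept in primary-key order."""

    def __init__(self):
        self._keys = []  # sorted primary keys
        self._rows = {}  # primary key -> {column name: value}

    def put(self, key: str, columns: dict) -> None:
        # Create the row if needed, then insert or overwrite the given columns.
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key: str):
        return self._rows.get(key)

    def delete(self, key: str) -> None:
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start: str, end: str):
        """Yield (key, row) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = SimpleTable()
# Rows in the same table carry different columns.
table.put("user#001", {"name": "alice", "email": "alice@example.com"})
table.put("user#002", {"name": "bob", "age": 41})
table.put("video#001", {"title": "intro"})

# Range scan over every primary key that starts with "user#".
user_rows = list(table.scan("user#", "user#~"))
print(user_rows)
```

Keeping keys sorted is what makes the range scan cheap. A hash-based key-value store, by contrast, would have to visit every key to answer the same query.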
1.2.4 Distributed Databases
Distributed databases extend single-node relational databases to a cluster and store structured data. Examples:
- MySQL Sharding cluster
- Amazon RDS
- Alibaba OceanBase
- Tencent TDSQL
- TiDB