Minio Architecture Introduction

This article was last updated on February 7, 2024.

Brief introduction

Minio is an object storage system written in Go and released under the Apache License v2.0. It is designed for massive data storage, artificial intelligence, and big data analytics; it is fully compatible with the Amazon S3 interface and well suited to storing large volumes of unstructured data, from tens of KB up to a maximum of 5 TB per object. It is a small and elegant piece of open-source distributed storage software.

Features

Simple and reliable: Minio adopts a simple and reliable cluster scheme. It abandons complex large-scale cluster scheduling and management, which reduces risk and avoids performance bottlenecks, and focuses on the product's core functions: highly available clusters, flexible expansion, and strong performance. Rather than one large, uniformly managed distributed cluster, it favors many small, easy-to-manage clusters, which can then be aggregated across data centers into a very large resource pool.

Functionally complete: Minio is cloud native and integrates well with Kubernetes, Docker, and Swarm orchestration systems for flexible deployment. Deployment itself is simple: a single executable with few parameters, and one command starts a Minio system. For performance, Minio deliberately has no metadata database, which avoids the metadata store becoming a bottleneck for the whole system and confines failures to a single cluster without involving others. Minio is fully compatible with the S3 interface, so it can also act as a gateway providing S3 access to the outside world. It uses both erasure coding and checksums to protect against hardware failure: even if you lose up to half of the drives, the data can still be recovered, and a distributed deployment likewise tolerates N/2 - 1 node failures for writes.

Architecture

Decentralized architecture

Minio adopts a decentralized shared-nothing architecture: object data is spread across multiple drives on different nodes, and a unified namespace is presented to clients through a load balancer or DNS round-robin, which also balances load between servers.

[Figure: Minio decentralized architecture diagram]

Unified namespace

Minio has two cluster deployment modes: the common local distributed cluster deployment, and federation-mode deployment. A local distributed cluster deploys Minio services on multiple local server nodes and forms them into a single distributed storage cluster that presents a unified namespace and standard S3 access. Federation-mode deployment logically combines multiple local Minio clusters into one unified namespace, enabling near-limitless expansion and management of data at massive scale; the member clusters can sit locally or in data centers spread across different regions.

[Figure: distributed deployment architecture, 32 nodes]

Distributed lock management

Like distributed databases, Minio faces data-consistency issues: while one client is reading an object, another client may be modifying or deleting it. To avoid inconsistency, Minio designed and implemented the dsync distributed lock manager to control data consistency:

  • A lock request from any node is broadcast to all online nodes in the cluster
  • If grants are received from N/2+1 nodes, the lock is successfully acquired
  • There is no master node; all nodes are peers, and a stale-lock detection mechanism between nodes determines node status and lock status
  • Because the design is deliberately simple, it is somewhat coarse: it supports at most 32 nodes and cannot avoid lock loss in certain scenarios, but it basically meets availability needs (see the quorum sketch after the benchmark table below)
EC2 Instance Type     Nodes  Locks/server/sec    Total Locks/sec  CPU Usage
c3.8xlarge (32 vCPU)  8      min=2601, max=2898  21996            10%
c3.8xlarge (32 vCPU)  8      min=4756, max=5227  39932            20%
c3.8xlarge (32 vCPU)  8      min=7979, max=8517  65984            40%
c3.8xlarge (32 vCPU)  8      min=9267, max=9469  74944            50%
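The quorum logic itself is small. Below is a minimal, illustrative Go sketch of the N/2+1 grant counting described above; it is not dsync's actual API, and the node interface and function names are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// node abstracts one peer in the cluster. In dsync this would be an
// RPC call to the peer; here it is a stub (illustrative assumption).
type node interface {
	GrantLock(resource string) bool
}

// acquire broadcasts a lock request to every node and succeeds only
// if a quorum (N/2 + 1) of them grant the lock.
func acquire(nodes []node, resource string) bool {
	var (
		mu     sync.Mutex
		grants int
		wg     sync.WaitGroup
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			if n.GrantLock(resource) {
				mu.Lock()
				grants++
				mu.Unlock()
			}
		}(n)
	}
	wg.Wait()
	return grants >= len(nodes)/2+1
}

// alwaysGrant is a trivial stand-in for a healthy, unlocked peer.
type alwaysGrant struct{}

func (alwaysGrant) GrantLock(string) bool { return true }

func main() {
	nodes := []node{alwaysGrant{}, alwaysGrant{}, alwaysGrant{}}
	fmt.Println(acquire(nodes, "bucket/object")) // true: 3 grants >= quorum of 2
}
```

In the real dsync, grant requests can time out and stale locks are detected and released between peers; the sketch omits those details.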

Data structure

The Minio object storage system organizes storage resources in a tenant - bucket - object hierarchy:

tenant - bucket - object

  • object: similar to an entry in a hash table; the name acts as the key and the content as the value
  • bucket: a logical abstraction over a number of objects; a container for objects
  • tenant: used to isolate storage resources; buckets and objects are created under a tenant
  • user: an account created under a tenant to access its buckets. The mc command-line tool provided by Minio can set each user's access permissions on each bucket (a usage sketch of the bucket/object model follows below)
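To make the bucket/object model concrete, here is a minimal sketch using the official minio-go SDK (v7). The endpoint, credentials, bucket name, and object name are all placeholders.

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials -- replace with your own.
	client, err := minio.New("minio.example.com:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	// bucket: a container for objects.
	if err := client.MakeBucket(ctx, "my-bucket", minio.MakeBucketOptions{}); err != nil {
		log.Fatal(err)
	}

	// object: the name is the key, the content is the value.
	body := strings.NewReader("hello minio")
	_, err = client.PutObject(ctx, "my-bucket", "greeting.txt", body, body.Size(),
		minio.PutObjectOptions{ContentType: "text/plain"})
	if err != nil {
		log.Fatal(err)
	}
}
```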

Unified domain name access

After the Minio deployment is extended with a new cluster or bucket, object storage client programs still need to access data objects through a unified domain name/URL; this is where etcd and CoreDNS come in.

[Figure: unified domain name access via etcd and CoreDNS]
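To make the mechanism concrete, here is a hedged sketch of the kind of record involved: the CoreDNS etcd plugin serves skydns-style entries out of etcd, so registering a bucket domain amounts to writing one key. The zone example.com, the addresses, and the default key layout below are assumptions for illustration, not Minio's exact internal format.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// CoreDNS's etcd plugin serves skydns-style records: the key path is
	// the domain reversed (bucket1.example.com -> /skydns/com/example/bucket1)
	// and the value names the host serving that bucket.
	key := "/skydns/com/example/bucket1"
	val := `{"host":"10.0.0.1","port":9000}` // placeholder cluster address

	if _, err := cli.Put(context.Background(), key, val); err != nil {
		log.Fatal(err)
	}
}
```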

Storage mechanism

Minio uses erasure coding and checksums to protect data from hardware failure and silent data corruption. Even if you lose up to half (N/2) of the drives, the data can still be recovered.

Erasure coding is a mathematical technique for recovering lost or damaged data. Erasure codes used in distributed storage systems currently fall into three categories: array codes (RAID 5, RAID 6, etc.), RS (Reed-Solomon) codes, and LDPC (low-density parity-check) codes. An erasure code extends N blocks of original data with M parity blocks and can restore the original data from any N of the N+M blocks; in other words, as long as no more than M blocks are lost, the data can still be recovered from the remainder.

Minio uses Reed-Solomon codes to split an object into N/2 data blocks and N/2 parity blocks. With 12 disks, for example, an object is divided into 6 data blocks and 6 parity blocks; any 6 disks can be lost (whether they hold data blocks or parity blocks) and the object can still be recovered from the data on the remaining disks.
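This 6+6 split is easy to reproduce with the klauspost/reedsolomon library, which Minio itself builds on. A minimal sketch: encode 6 data and 6 parity shards, discard any 6, and reconstruct.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 6 data shards + 6 parity shards, as with 12 disks.
	enc, err := reedsolomon.New(6, 6)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("minio object data "), 64)

	// Split the object into 6 data shards, then compute 6 parity shards.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing any 6 of the 12 disks.
	for _, i := range []int{0, 2, 4, 6, 8, 10} {
		shards[i] = nil
	}

	// The remaining 6 shards are enough to rebuild everything.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println(ok, err) // true <nil>
}
```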

In an N-node distributed Minio deployment, your data is safe as long as N/2 nodes are online; however, write operations require at least N/2+1 nodes.

After uploading a file to Minio, the information on the disk is as follows:

[Figure: files on disk after upload, showing xl.json and part.1]

Here xl.json is the object's metadata file, and part.1 is the object's first data shard. (Each node in a distributed deployment holds both kinds of file, i.e. data blocks and parity blocks.) When reading data, Minio computes a HighwayHash checksum for each block and verifies it, ensuring every block is correct. With erasure coding plus HighwayHash-based bit-rot protection, Minio achieves high data reliability.
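A minimal sketch of that verification, using the minio/highwayhash package: compute a checksum when a shard is written, re-compute it on read, and compare. The 32-byte key below is a placeholder; key handling inside Minio is its own concern.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

// checksum returns the HighwayHash-256 digest of a shard.
func checksum(key, shard []byte) []byte {
	h, err := highwayhash.New(key) // key must be exactly 32 bytes
	if err != nil {
		log.Fatal(err)
	}
	h.Write(shard)
	return h.Sum(nil)
}

func main() {
	key := make([]byte, 32) // placeholder key; Minio manages its own
	shard := []byte("erasure-coded shard contents")

	stored := checksum(key, shard) // computed at write time

	// At read time, re-hash and compare: a mismatch means bit rot.
	if !bytes.Equal(stored, checksum(key, shard)) {
		log.Fatal("bit rot detected: shard failed verification")
	}
	fmt.Println("shard verified")
}
```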

Lambda compute and continuous backup

Minio supports a lambda-compute-style notification mechanism: objects in a bucket support event notifications. The supported event types are object upload, object download, object deletion, and object replication. Currently supported event targets include Redis, NATS, AMQP, Kafka, MySQL, Elasticsearch, and others.

The object notification mechanism enhances Minio's extensibility, letting users build features Minio itself does not provide, such as metadata search or business-specific computation. The same mechanism also enables fast, efficient incremental backups.
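As a sketch of consuming such events, minio-go can subscribe to bucket notifications directly. The endpoint, credentials, and bucket name below are placeholders; each object-created event is exactly the hook an incremental-backup job would use.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	client, err := minio.New("minio.example.com:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""), // placeholders
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Stream object-created events for the bucket; each one could
	// trigger an incremental backup or a metadata-indexing step.
	events := client.ListenBucketNotification(context.Background(),
		"my-bucket", "", "", []string{"s3:ObjectCreated:*"})

	for info := range events {
		if info.Err != nil {
			log.Fatal(info.Err)
		}
		for _, rec := range info.Records {
			fmt.Println("new object:", rec.S3.Bucket.Name, rec.S3.Object.Key)
		}
	}
}
```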

Object Storage Gateway

Besides serving as a storage system in its own right, Minio can also act as a gateway whose back end is a distributed file system such as NAS or HDFS, or a third-party storage system such as S3 or OSS. Through the Minio gateway, these back ends gain an S3-compatible API, which eases management and portability, since the S3 API has become the de facto standard in the object storage world.

[Figure: multi-cloud gateway]

Users request storage resources through the unified S3 API. The S3 API router routes each request to the corresponding ObjectLayer, and each ObjectLayer implements all of the object-operation APIs for one storage system. For example, once GCS (Google Cloud Storage) implements the ObjectLayer interface, its operations on the back-end storage go through the GCS SDK: when a client asks for the bucket list through the S3 API, the implementation ultimately calls the GCS service via the GCS SDK, then packages the result into the standard S3 structures and returns it to the client.
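The pattern can be sketched in a few lines of Go. The interface below is a heavily cut-down, illustrative stand-in for Minio's real ObjectLayer (which has many more methods), and the GCS implementation is stubbed rather than calling the actual SDK.

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// objectLayer is an illustrative, cut-down version of the interface each
// back end (filesystem, GCS, S3, ...) would implement.
type objectLayer interface {
	ListBuckets(ctx context.Context) ([]string, error)
	GetObject(ctx context.Context, bucket, object string) (io.ReadCloser, error)
}

// gcsObjects would wrap the GCS SDK; stubbed here for illustration.
type gcsObjects struct{}

func (gcsObjects) ListBuckets(ctx context.Context) ([]string, error) {
	// A real implementation would call the GCS SDK and translate the
	// response into the S3 structures returned to the client.
	return []string{"bucket-a", "bucket-b"}, nil
}

func (gcsObjects) GetObject(ctx context.Context, bucket, object string) (io.ReadCloser, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

func main() {
	var layer objectLayer = gcsObjects{} // chosen by gateway configuration
	buckets, _ := layer.ListBuckets(context.Background())
	fmt.Println(buckets)
}
```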

