Public cloud best practices for cost reduction and efficiency increase

This article was last updated on: July 24, 2024

Preface

I have recently come across two cases. The first is an insurance system that, in order to go live quickly, was deployed entirely on the public cloud. Once production was officially running, the monthly bill reached hundreds of thousands. The business could not bear this expense, so the project managers, developers, O&M engineers, and architects spent three months migrating the system off the public cloud. Everyone involved was worn out, and the system’s iteration slowed down considerably.

The second is an e-commerce company whose bill was also very high right after moving to the cloud, close to 200,000 per month. After a round of “cost reduction and efficiency increase” optimization, the monthly cost dropped to about 40,000, and service quality improved.

These two stories tell us that in the cloud era we should make full use of the characteristics of the public cloud (elasticity, flexibility, multiple billing methods, and so on) to optimize costs as far as possible without reducing service quality.

Here are some best practices.

Best practices

Overall

Prefer public cloud services over self-built

In addition to IaaS (compute, storage, networking, etc.), the public cloud also provides PaaS (microservices, middleware, databases, big data, development kits, etc.) and SaaS services.

When a service offered by the public cloud (a MySQL database, for example) meets your requirements, we recommend that you use the cloud service instead of building your own.

The reasons, briefly:

| Comparison | Cost | Performance | Scalability | Maintainer | Reliability | Monitoring | Ease of use |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Self-built | High | Low | Weak | We | Low | None | Hard |
| Cloud services | Medium | High | High | Cloud providers | High | Yes | Easy |

  • Cost
    • Self-built: you pay for the staff who maintain and tune it, and meeting high-reliability requirements (possibly several servers) and high-performance requirements (possibly high-performance storage) drives the cost up.
    • Cloud services: relatively inexpensive thanks to economies of scale, resource pooling, and parameter tuning.
  • Performance
    • Self-built: you may not know every tuning parameter, and you may not be able to buy dedicated high-performance hardware at the same price.
    • Cloud services: performance tiers are clearly priced, and you can choose the configuration that fits your needs.
  • Scalability
    • Self-built: scaling is troublesome. It is done either manually or through battle-tested DevOps scripts, so scalability is weak.
    • Cloud services: many PaaS services can be upgraded with one click.
  • Maintainer
    • Self-built: we are responsible for everything ourselves.
    • Cloud services: the cloud provider backs the service with an SLA.
  • Reliability
    • Self-built: you may not implement all the best practices for the component’s cluster or high-availability mode.
    • Cloud services: reliability is ensured through highly available networking (even across AZs), multiple storage replicas, compute spread across physical servers, racks, AZs, and even regions, service monitoring and self-healing, and backups.
  • Monitoring
    • Self-built: either there is no monitoring, or it has to be built end to end, from collection to alert notification.
    • Cloud services: monitoring is available out of the box and integrates seamlessly with the public cloud’s monitoring service.
  • Ease of use
    • Self-built: there is generally no web interface; requests and operations go through offline processes, workflow platforms, or CLIs.
    • Cloud services: there is an easy-to-use web interface where most operations can be completed.

For example, running a database yourself involves all of the following:

  • O&M architecture
    • Plan for the volume of stored data and later expansion, and use a high-availability architecture
    • Failover on faults
  • Hardware and infrastructure deployment
    • Choose the server configuration, server model, and the corresponding disk array
    • Operating system environment and kernel settings
  • Database installation and optimization
    • Install, deploy, and configure the chosen database version
    • Tune the database configuration parameters
  • SQL statement optimization
    • Find slow queries and optimize SQL statements and indexes
  • Daily database backup and recovery
    • Backup
    • Hot or cold backup? Physical or logical?
    • Backup policy, cycle, and frequency

With a cloud database, the service takes care of these steps for you. Other PaaS services (middleware, big data, microservices, DevOps, and so on) are similar.

Be secure

The biggest risk on the public cloud is a data breach, so be sure to take security seriously. Security has many facets; see the security part of this article for details.

The advantage of the cloud is “distributed”

In a single-server comparison, a cloud host may perform worse. “Distributed” is the biggest advantage of cloud computing. In practice, do not chase single-machine performance; ensure high business performance through distributed design instead. As a best practice, use 4C8G as the standard server configuration, with 2C4G for low-end needs. Distribution lets you squeeze the most out of each server’s resources and so keep the final cost as low as possible.
Therefore, when selecting application servers on the cloud, more low-end cloud servers are generally better than fewer high-end cloud servers.
Likewise, for databases on the cloud, if the data volume is very large, we recommend a “distributed database” rather than building your own Oracle on the cloud.

The advantage of the cloud is “elasticity”

Therefore, on the cloud, do not purchase all resources for the business peak. Instead:

  • Buy resources that meet your daily needs
  • Ahead of peak periods, purchase some elastic resources to scale out elastically

This applies not only to servers but also to the network: if your system runs promotional events often and network load varies widely, choose “large bandwidth, pay-as-you-go” over “fixed bandwidth, fixed billing”.
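
A rough way to see the difference is to compare the two purchasing strategies directly. The sketch below is illustrative only: the function names are made up for this example, and every price, server count, and hour count is a hypothetical figure, not a quote from any provider.

```python
def peak_provisioned_cost(peak_servers: int, monthly_price: float) -> float:
    """Monthly cost if every server needed at peak is bought for the whole month."""
    return peak_servers * monthly_price


def elastic_cost(baseline_servers: int, monthly_price: float,
                 extra_servers: int, hourly_price: float,
                 peak_hours: float) -> float:
    """Monthly cost of a fixed baseline plus pay-as-you-go servers for peak hours only."""
    baseline = baseline_servers * monthly_price
    burst = extra_servers * hourly_price * peak_hours
    return baseline + burst


# Hypothetical numbers: 20 servers cover daily load, 50 are needed at peak,
# and peaks add up to 30 hours per month.
always_peak = peak_provisioned_cost(50, monthly_price=300.0)
elastic = elastic_cost(20, 300.0, extra_servers=30, hourly_price=1.0, peak_hours=30)

print(f"Provisioned for peak: {always_peak:.0f} per month")
print(f"Baseline + elastic:   {elastic:.0f} per month")
```

The gap depends entirely on how short the peaks are relative to the month; with long or frequent peaks, the hourly premium can erase the saving.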

Dynamic and static separation

Static: Put static resources on CDN + object storage, or on an NGINX server. Do not use the application server (such as Tomcat or Node.js) to serve static resources directly; that is wasteful, and each tool has its own specialty.
Dynamic: The typical architecture is LB - NGINX - Application Server - Redis - Database.

Evaluate your business volume before going to the cloud

Evaluate the business volume before going to the cloud and draw up a resource budget for planning resources on the cloud.
You can evaluate business volume through, among other ways:

  • Stress testing
  • Analysis of existing monitoring data

Commonly used traffic metrics are as follows:

| Indicator | Calculation period | Meaning |
| --- | --- | --- |
| PV | Day | Page View. For web services in a B/S architecture, the number of page visits in a day; every page that is opened or refreshed counts as one PV. |
| UV | Day | Unique Visitor. For web services in a B/S architecture, the number of users who visit the site in a day (based on cookies). |
| IP | Day | For web services in a B/S architecture, the number of distinct IP addresses that view pages in a day. |
| Number of users | | The number of registered users of the business system. |
| Number of active users | Day | Among registered users, the number who actually used the business system in a day; the same concept as UV. |
| Number of online users | Day | Among the day’s active users, the number who were online at the same time during a given period. |
| Number of concurrent users | | Among online users, the number who send requests to the server at a given moment. |

How to convert between business metrics and specific performance monitoring metrics is not covered here.
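
Purely as an illustration of what such a conversion can look like, the sketch below uses a common rule of thumb that is not from this article: assume 80% of the day’s page views arrive within 20% of the day, and treat one page view as one request (real pages usually trigger several). The function names and all numbers are hypothetical; stress-test results should supply the per-server figure.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_peak_qps(daily_pv: int, traffic_share: float = 0.8,
                      time_share: float = 0.2) -> float:
    """Estimate peak requests per second from daily page views.

    Assumes `traffic_share` of the day's traffic arrives within `time_share`
    of the day (an 80/20 rule of thumb, not a measured value).
    """
    return (daily_pv * traffic_share) / (SECONDS_PER_DAY * time_share)


def servers_needed(peak_qps: float, qps_per_server: float) -> int:
    """Round up to whole servers; qps_per_server should come from a stress test."""
    return math.ceil(peak_qps / qps_per_server)


# Hypothetical inputs: 8.64 million PV per day, 150 QPS per server in stress tests.
peak = estimate_peak_qps(8_640_000)
print(f"Estimated peak: {peak:.0f} QPS, servers needed: {servers_needed(peak, 150)}")
```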

Recommended public cloud products

These are the basic public cloud products: widely used, tried and tested, and practical.

  1. Cloud server
  2. Relational databases
  3. Load balancing
  4. Object storage
  5. VPC (Virtual Private Cloud): A private network
  6. CDN
  7. Redis
  8. Basic security products (such as security groups, ACLs, vulnerability scanning, WAF, DDoS protection, etc.)

Compute

Use mainly low- and mid-range cloud server configurations

Cloud servers are generally used to host applications, and we recommend mostly low- and mid-range configurations to avoid wasting resources.
As a rule, a configuration should not exceed 16C32G. The mainstream configurations are:

  • 4C8G or even lower
  • 8C16G
We recommend that you use Container Service

Container Service has many advantages, and we recommend it for stateless applications:

  • Resource granularity is finer: down to 0.1 core of CPU and megabytes of memory
  • Automatic scaling is more convenient
  • After scaling out, new pods are automatically added to the load balancer
Pay as you go

On the cloud, instead of purchasing all resources according to business peaks, we recommend that you buy resources that meet your daily needs

Elastic expansion

On the cloud, when you run a promotional event, you can purchase resources and scale out elastically through containers or through APIs + images + snapshots.

For example, for a mobile phone flash sale, 200 machines are started almost instantly and run for 2 hours to handle the load, at an IT resource cost of 600 yuan:

  1. Build the environment and create an image from it;
  2. Before the event, start 200 servers within seconds through the API;
  3. After the event, release the resources immediately through the API.

This is unthinkable in a traditional architecture, where the IT resources for an event such as “Double Eleven” must be prepared a month in advance.

The scenario above can also be handled dynamically with an “Auto Scaling Service” or “Container HPA”.
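
The arithmetic behind the flash sale example can be checked in a few lines. The 200 machines, 2 hours, and 600 yuan come from the example above; the always-on comparison assumes, purely for illustration, that the same hourly price would apply if the machines were kept for a 30-day month.

```python
HOURS_PER_MONTH = 30 * 24  # a 30-day month, for a rough comparison


def machine_hour_price(total_cost: float, machines: int, hours: float) -> float:
    """Average price of one machine for one hour."""
    return total_cost / (machines * hours)


# Figures from the flash sale example: 200 machines, 2 hours, 600 yuan.
price = machine_hour_price(600, 200, 2)

# Assumption: the same hourly price if the 200 machines stayed on all month.
always_on_cost = 200 * price * HOURS_PER_MONTH

print(f"{price} yuan per machine-hour")
print(f"Event only: 600 yuan; always on for a month: {always_on_cost:,.0f} yuan")
```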

Use DevOps tools like Ansible

Since the advantage of the cloud is “distributed” and there are many resources to manage, a batch DevOps tool like Ansible is essential for efficiency.
For a specific application, you can write a custom Ansible Playbook to automate batch installation and O&M.

Improve cloud deployment efficiency with images

First start one cloud server and apply system optimization, security hardening, and other measures required by your O&M standards. Then turn this cloud server into a base image and launch other servers with the same environment in batches from it, which greatly improves deployment efficiency.

Network

Complete ICP filing for the domain name first

The final step in migrating to the cloud is to point the domain name’s DNS record at the public IP address of the load balancer. The domain name must be filed (ICP filing) with the public cloud provider beforehand; if the domain name is resolved to the public cloud and only then found to be blocked, with business access denied, it becomes very troublesome. So file the domain name through the public cloud in advance, or, if it is already filed with another provider, transfer the ICP filing to the public cloud.

Recently the international situation has grown tense and Sino-US friction has escalated further. Prepare for the worst: the U.S. may cut off resolution of your .com domain.
In addition, the country has recently issued guidance against using domain names under foreign-controlled root domains as first-level domain names for infrastructure.
.cn is the national top-level domain; .com.cn, .net.cn, .org.cn, and so on are all options.

Do not give every server direct Internet access

The reasons are security (the attack surface becomes too large) and cost (public IPs cost money).
It is also unnecessary: for service traffic, inbound requests come in through the load balancer, and outbound requests go out through the NAT gateway.
For O&M access, we recommend VPN + jump server (small and medium-sized enterprises) or leased line + bastion host (large enterprises).

If servers need outbound Internet access, use a NAT gateway instead of binding a public IP address to each machine

Why: higher reliability and security.

Take advantage of low-cost, high-capacity pay-as-you-go bandwidth

For small and medium-sized enterprises, if your system runs promotional events often and network load varies widely, choose “large bandwidth, pay-as-you-go” over “fixed bandwidth, fixed billing”. For example, comparing “1 Gbps peak bandwidth, pay-as-you-go” with “100 Mbps fixed bandwidth”:

  • Fees may be lower.
  • The bandwidth ceiling is higher: if traffic exceeds 100 Mbps during an event, fixed bandwidth degrades the user experience, while a 1 Gbps peak can absorb it.

For example, before one customer moved to the cloud, 200 Mbps of dedicated telecom bandwidth in an IDC data center cost about 100 yuan per Mbps per month x 200 Mbps x 12 months = 240,000 yuan per year. On the cloud, BGP multi-line load balancer bandwidth with a 1 Gbps peak is several grades better in quality. In addition, bandwidth is billed as you go, which greatly reduces maintenance costs.
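
The IDC figure above is easy to reproduce, and the same few lines can be used to compare it against a pay-as-you-go bill. In this sketch the 200 Mbps and 100 yuan per Mbps per month come from the example; the cloud-side traffic volume and per-GB price are hypothetical placeholders, so substitute your provider’s actual pricing.

```python
def fixed_bandwidth_annual_cost(mbps: int, yuan_per_mbps_month: float,
                                months: int = 12) -> float:
    """Annual cost of fixed bandwidth billed per Mbps per month."""
    return mbps * yuan_per_mbps_month * months


def traffic_billed_annual_cost(gb_per_month: float, yuan_per_gb: float,
                               months: int = 12) -> float:
    """Annual cost when bandwidth is billed by the traffic actually used."""
    return gb_per_month * yuan_per_gb * months


# Figures from the IDC example: 200 Mbps at about 100 yuan per Mbps per month.
idc_cost = fixed_bandwidth_annual_cost(200, 100)

# Hypothetical cloud figures: 5,000 GB of traffic per month at 0.8 yuan per GB.
cloud_cost = traffic_billed_annual_cost(5000, 0.8)

print(f"IDC fixed bandwidth: {idc_cost:,.0f} yuan/year")
print(f"Cloud pay-as-you-go: {cloud_cost:,.0f} yuan/year")
```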

We recommend that you use the cloud provider’s software load balancer

Use the load balancer provided by the public cloud as a reverse proxy, so that clients do not connect directly to cloud servers and cause security and stability risks.

A load balancer also keeps the architecture flexible and scalable. A typical setup binds all domain names to the load balancer and then forwards to backend NGINX for finer layer 7 control, such as virtual hosting.

In high-concurrency cases, we recommend that you use Layer 4 load balancing

Use Layer 4 load balancing to ensure performance. In practice, under high concurrency, Layer 7 load balancing performs far worse than Layer 4: Layer 7 reaches concurrency only on the order of 10,000, while Layer 4 reaches hundreds of thousands or even millions. Therefore, for e-commerce and similar websites, prefer load balancing at the TCP layer. A Layer 4 LB can withstand 100,000 to 500,000 concurrent connections.

Be careful when adjusting DNS records

Users’ DNS caches and TTLs are beyond our control. After a domain name’s DNS record is changed, the change may have taken effect for some users but not for others.
To handle this, set up a 302 redirect on the original IP that sends users who still reach it to the new IP, which greatly improves their experience.
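
A minimal sketch of such a redirect, using only the Python standard library, is shown below. It is meant to run on the server behind the original IP. The target address 203.0.113.10 is a placeholder from a documentation range, and the helper names are made up for this example. Note that the redirect has to point at the new IP (or at a hostname that already resolves to it); redirecting to the same domain would send users with stale DNS straight back to the old server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical new address; replace with the real new public IP.
NEW_ORIGIN = "http://203.0.113.10"


def redirect_location(path: str, new_origin: str = NEW_ORIGIN) -> str:
    """Build the Location header value, preserving the requested path and query."""
    if not path.startswith("/"):
        path = "/" + path
    return new_origin.rstrip("/") + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET or HEAD with a 302 pointing at the new origin."""

    def _redirect(self) -> None:
        self.send_response(302)
        self.send_header("Location", redirect_location(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect


def serve(port: int = 8080) -> None:
    """Run the redirect server until interrupted."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

Calling `serve()` on the old server starts the redirect. In production the same rule is usually a one-line redirect in NGINX or on the load balancer; the Python version only makes the logic explicit.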

Large enterprises - DNS load balancing practices

This is for large-scale deployments. When there are one or two hundred cloud servers in the backend and a single load-balancing server becomes a bottleneck, use several of them and balance across them at the front with DNS. Typical examples: Taobao and the Alibaba Cloud official website.

One of the biggest problems with DNS is local DNS caching. To deal with it:

  1. Set a shorter DNS TTL so that changes take effect faster;
  2. Point DNS records at a load balancer IP, not a cloud server IP;
  3. If some users still have problems, guide them to clear their DNS cache or to bind the domain name in their local hosts file.

Intelligent resolution: essential in cross-region distributed architectures. Based on the client IP, DNS returns the IP address for the matching region and carrier.

Resolution by carrier line

For example, DNS records such as:

  • Default line: Telecom server IP
  • Netcom: Netcom IP
  • Mobile: Mobile IP
  • Education Network: Education Network IP
  • Overseas: Overseas IP

With a BGP line, it is simpler:

  • Default line: the public IP address of the load balancer

Resolution by region

For example, when a user requests a domain name, DNS automatically determines whether the visitor’s IP belongs to “Shanghai Unicom” or “Beijing Unicom”, and then returns the server IP address for “Shanghai Unicom” or “Beijing Unicom” accordingly to complete the resolution.

Overseas: you can choose “Overseas”, “Overseas (continent)”, or “Overseas (country)” to subdivide the resolution.

For example:

  • Overseas - Asia - Singapore line: the IP of the Singapore server
  • Overseas - North America - US line: the IP of the US server
  • Overseas - Europe - Germany line: the IP of the Germany server
  • Default line: the IP of the Singapore server
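
The overseas line table above amounts to a lookup with a default. The sketch below shows only that selection logic; it assumes the DNS provider has already classified the client IP into a line, and all IP addresses are placeholders from documentation ranges. The names `LINE_RECORDS` and `resolve` are made up for this example.

```python
from typing import Dict, Tuple

Line = Tuple[str, str, str]  # (scope, continent, country)

# Line -> server IP. Addresses are illustrative placeholders only.
LINE_RECORDS: Dict[Line, str] = {
    ("overseas", "asia", "singapore"): "192.0.2.10",
    ("overseas", "north-america", "us"): "198.51.100.10",
    ("overseas", "europe", "germany"): "203.0.113.10",
}

# The default line points at the Singapore server, as in the list above.
DEFAULT_IP = "192.0.2.10"


def resolve(line: Line) -> str:
    """Return the server IP for a client's line, falling back to the default line."""
    return LINE_RECORDS.get(line, DEFAULT_IP)


print(resolve(("overseas", "europe", "germany")))
print(resolve(("overseas", "oceania", "australia")))  # no record, so default line
```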

A CDN is the best practice for intelligent resolution.

Storage

Make good use of object storage services on the cloud

We recommend that you avoid NAS-like shared file storage services on the cloud and use an object storage service instead.
In a traditional environment, NAS is typically used in the following scenarios:

  • Load balancing: Workloads that use LB + multiple cloud servers (such as web servers). The servers need to access the same storage space so that they can share data.

    • Alternative: Use ordinary cloud data disks directly, and achieve batch deployment and data consistency through DevOps and similar tools.
  • Code sharing: Multiple cloud servers run the same deployed code. The code is placed in one storage space that all servers access at the same time, so it is managed centrally.

    • Alternative: Manage the code centrally in a code repository.
  • Log sharing: Applications on multiple cloud servers write logs to the same storage space for centralized log processing and analysis.

    • Alternative: Ship logs to object storage on a regular basis, choosing standard, infrequent-access, or archive storage according to how hot or cold the data is. Or use the cloud’s log service directly.
  • Enterprise office file sharing: An enterprise has common files that must be shared among several business groups and needs centralized shared storage for them.

    • Alternative: Use object storage instead.
  • Container Service: Deployed containers need shared access to a file data source, especially under container orchestration, where containers may drift between servers, so shared file access is particularly important.

    • Alternative: This scenario does require a cloud file system service.
  • Backup: To back up data to the cloud, you can mount a file system to store the backups.

    • Alternative: Back up to the “archive storage” tier of object storage to reduce costs further.
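
For the log-sharing alternative above, choosing a storage tier by how hot or cold the data is can be expressed as a simple rule. The sketch below is an assumption-laden illustration: the 30-day and 180-day thresholds and the tier names are hypothetical, and many object storage services can apply comparable rules automatically through lifecycle policies.

```python
def storage_class_for(age_days: int, warm_after: int = 30,
                      cold_after: int = 180) -> str:
    """Pick a storage tier from the age of a log file in days.

    The thresholds are hypothetical defaults; tune them to how often
    old logs are actually read.
    """
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days < warm_after:
        return "standard"
    if age_days < cold_after:
        return "infrequent-access"
    return "archive"


for age in (3, 45, 400):
    print(f"{age:>3} days old -> {storage_class_for(age)}")
```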

Incorrect usage: NGINX forwards to object storage over the Internet

In one customer scenario, static resources were placed in object storage, and front-end requests for them were forwarded to object storage through an NGINX reverse proxy. This approach is not recommended on the cloud because it causes several problems:

  • Traffic for static resources consumes the cloud server’s bandwidth. In medium to large web applications especially, this easily becomes a performance bottleneck.
  • NGINX forwards requests to object storage through a reverse proxy over the public network, which hurts transfer speed and performance.
  • The NGINX reverse proxy adds O&M cost, since the NGINX configuration files also have to be maintained.

Adding NGINX as a reverse proxy is therefore redundant and not recommended on the cloud. The customer ended up here mainly because the business code separated static resource requests by directory. Putting static resources on a separate second-level domain name would have raised cross-domain and other issues that the code did not handle well, which resulted in this awkward architecture. In the end the business code was adjusted, and static resources in object storage are now used as follows:

  • The business side uses a separate second-level domain name for static resources (such as <pic-cdn.e-whisper.com>), and all static resources are placed in object storage.
  • The second-level domain name for static resources is bound by CNAME directly to the object storage address (when traffic is low), which skips the redundant “NGINX as reverse proxy” step.
  • To further speed up access to static resources in object storage, put a CDN in front. CDN back-to-origin requests fetch the source data from object storage over the intranet. Compared with an NGINX reverse proxy requesting object storage over the public network, this is faster and more efficient, and in some cases cheaper.

Database

We recommend that you use the cloud provider’s database services directly and ensure high availability (such as cluster mode or multiple replicas) and data backup.
Preferring the cloud provider’s database services is cost-effective and efficient. If you buy cloud servers and run your own MySQL master-slave deployment, the later maintenance and management cost is very high: you must monitor and maintain the master and slave, and handle problems promptly to keep the business’s database reads and writes uninterrupted. With the cloud provider’s database services, this is automated: monitoring, backup, ongoing maintenance, and master-slave failover are all fully automatic.

Use a cross-AZ high-availability cluster for databases with particularly high reliability requirements

For databases with particularly high reliability requirements, choose a cluster solution with high availability across AZs. Redis, MongoDB, and MySQL, for example, all have such cross-AZ high-availability cluster solutions.

Choose the right database for your needs

There are many kinds of databases; choose according to your actual needs. Some of them are listed below:

  • Relational databases
    • MySQL
    • SQL Server
    • PostgreSQL
    • MariaDB
    • Distributed databases (such as OceanBase or TDSQL, etc.)
  • Non-relational: in-memory databases
    • Redis
    • Memcached
  • Document database: MongoDB
  • Column-oriented databases
    • HBase, etc.
  • Time series databases
    • InfluxDB
    • TSDB

CDN

Typical usage scenario: CDN + object storage

  • Data distribution: Suitable for apps, audio and video platforms, websites, and other services with heavy download activity. Combine CDN + object storage to host static content (audio, video, images, and so on) in object storage and push hot files to CDN edge nodes in advance, reducing download and access latency.
  • Website hosting: Suitable for static sites such as official websites. Host the site’s static resources in object storage, distribute them through the CDN, and use the domain name configured on the CDN as the visitors’ entry point, so a website can be built quickly.

Security

You must set strong passwords

Typical examples: MongoDB, Redis, and ES have no password or a weak password by default, and this has led to many rounds of large-scale data leaks. Be sure to set a strong password for these services.
Cloud servers, cloud accounts, relational databases, and so on likewise need strong passwords or stronger security measures.
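
One simple way to get a strong password is to generate it rather than invent it. The sketch below uses Python’s standard `secrets` module; the length, symbol set, and function name are choices made for this example, not requirements of any particular service.

```python
import secrets
import string

SYMBOLS = "!@#%^*-_=+"
ALPHABET = string.ascii_letters + string.digits + SYMBOLS


def generate_password(length: int = 24) -> str:
    """Generate a random password containing lower, upper, digit, and symbol characters."""
    if length < 12:
        raise ValueError("use at least 12 characters")
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # Retry until every character class is present.
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and any(c in SYMBOLS for c in candidate)):
            return candidate


print(generate_password())
```

Store the generated value in a password manager or the cloud provider’s secret management service rather than in code or configuration files.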

Client access must use HTTPS

This needs little explanation.

  • Request a certificate for the domain name and manage it on NGINX or the LB.
  • On the business side, keep HTTP port 80 and redirect 80 -> 443. Enable listeners on both port 80 and port 443 on the LB.

Be sure to configure security groups and ACLs

This is the most basic security protection.

Do not log in as root directly

Do not connect directly as root. Log in as an ordinary user and use sudo to switch to root when needed.

We recommend that you do not use port 22 for SSH exposed to the Internet

Avoid the default port 22 to reduce scanning. Certificate (key-based) authentication and similar measures are also recommended; they are not covered one by one here.

Don’t forget to claim your free security products

For example, every cloud server you activate comes with some free quota of “DDoS protection and host security protection”. With this basic protection, it is much safer than running unprotected.


https://e-whisper.com/posts/59535/
Author: east4ming
Posted on: August 25, 2021
Licensed under