Seven Steps to a Poem - Quickly Create Effective SLOs
This article was last updated on: July 24, 2024
Preface
The previous post, How to Configure SLO - Dongfeng Weiming Technical Blog (e-whisper.com), introduced some commonly used SLOs, but when you actually develop SLOs they do not necessarily fit real business needs. This article introduces an SLO best practice: how to create an effective SLO in 7 steps.
SLI and SLO definitions
In a previous article (SLA, SLO, SLI Definitions - “Translation” uses Prometheus and Grafana to implement SLO - Dongfeng Weiming Technology Blog (e-whisper.com)), we already introduced the definitions of SLI, SLO, and SLA. Here is a brief recap.
SLI
SLI: Service Level Indicator. It is a key indicator for understanding the health of a service and the cornerstone of setting up an SLO.
A typical SLI expression is as follows:
good events / all events * 100%
A typical SLI is: latency for HTTP requests
The expression is as follows:
HTTP requests with a response time under 5s / all requests * 100%
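As a rough illustration, this ratio can be computed from a batch of measured request durations. This is only a sketch: the function name `latency_sli`, the in-memory list of durations, and the `threshold_s` parameter are assumptions made for the example and do not come from any particular monitoring tool.

```python
from typing import Optional, Sequence


def latency_sli(durations_s: Sequence[float], threshold_s: float = 5.0) -> Optional[float]:
    """Return the percentage of requests faster than threshold_s.

    Hypothetical helper: durations_s is an assumed list of response
    times in seconds. Returns None when there are no requests, because
    the ratio is undefined in that case.
    """
    total = len(durations_s)
    if total == 0:
        return None
    # "Good events" are requests with a response time below the threshold.
    good = sum(1 for d in durations_s if d < threshold_s)
    return good / total * 100.0


# Three of these four requests finish in under 5 seconds.
print(latency_sli([0.2, 1.0, 4.9, 6.0]))  # 75.0
```

In a real system the counts of good and total events would usually come from the monitoring backend rather than from a list held in memory.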
SLO
SLO: Service Level Objective, i.e. the service level target. It is a goal we set for an SLI. An SLO is usually closely tied to a time window.
Typical SLOs are as follows:
- Target 99.9%
- Below that target 😡
- Greater than or equal to the target 😎
A typical example of an SLO: 95% of requests have a response time ≤ 5s
Error Budget
The purpose is to:
Develop and innovate within the error budget
Enable better cooperation through a common goal
- Once the error budget is exhausted, it is strictly enforced (i.e., deployments are blocked)
- As long as you stay within the error budget, innovation and higher-risk deployments are encouraged
The definition is:
1 - availability target
SREs and Devs work within the error budget.
For example, if the SLO is 99.5%, then the error budget is 1 - 99.5% = 0.5%.
The error budget for one month is then:
0.5% * 30d * 24h * 60min = 216 min
Taking the earlier example:
A 95% target leaves a 5% error budget;
The error budget for one month is:
5% * 30d * 24h * 60min = 2160 min
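The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `error_budget_minutes` and its parameters are hypothetical, and the 30-day month is the same simplification used in the calculation above.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int = 30) -> float:
    """Return the error budget in minutes for a time-based SLO.

    Hypothetical helper: slo_target_percent is the SLO target as a
    percentage (for example 99.5), window_days is the length of the window.
    """
    if not 0.0 <= slo_target_percent <= 100.0:
        raise ValueError("slo_target_percent must be between 0 and 100")
    # Error budget = 100% - target, expressed as a percentage.
    budget_percent = 100.0 - slo_target_percent
    window_minutes = window_days * 24 * 60
    return budget_percent * window_minutes / 100.0


print(error_budget_minutes(99.5))  # 216.0
print(error_budget_minutes(95.0))  # 2160.0
```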
Seven Steps to a Poem - Best Practices for Creating Effective SLOs
SLOs go beyond basic monitoring metrics; they are a powerful guiding tool for site reliability engineers (SREs) and DevOps platform teams, helping steer improvements to each organization’s CI/CD and production processes.
However, creating an effective SLO can be difficult. According to Dynatrace’s 2022 State of SRE Report, 99% of SREs say they face challenges in defining and creating SLOs. Identifying and implementing effective SLOs requires a well-thought-out, structured approach. Here are the recommended steps to implement the right SLOs for your platform and services.
- Get on the same page
- Identify and prioritize critical services that impact SLAs
- Identify internal stakeholders and align with different teams
- Identify key metrics to use as SLIs
- Identify critical SLOs
- Define an error budget
- Ensure proactive SLO monitoring and alerting
Step 1: Get on the same page
A Service Level Agreement (SLA) is a contractual, financially binding agreement between a supplier and its customer. These agreements define the service levels that customers and end users expect, making them an excellent starting point for understanding how IT can ensure overall business goals are met. Violating SLAs can result in financial penalties, impact revenue, and damage a company’s reputation. Therefore, it is critical to align SLOs with customer needs.
📝Notes
Service Definition: A service is any feature or functionality of a platform that is provided to external parties.
For example:
For the front end, different User Actions are different services;
For the backend, different HTTP requests may point to different services.
Within the scope of the relevant contracts and SLAs, SRE and DevOps need to agree on the terms and definitions to be used.
This eliminates the noise and makes it possible to determine priorities quickly using simple terms.
SRE or DevOps should work with development, operations, project management, and sales to identify the services needed to meet SLAs, especially those that customers interact with frequently or that would cause the most serious problems if they failed.
Step 2: Identify your customer base
Next, you need to determine:
- Which customers does your platform serve?
- How do they interact directly with your platform?
Customers can be people or other platforms that rely on your platform.
For example, different platform services might target different users:
- Internet users
- System administrators
- Call center customers
- Offline customers
- Temporary customers
- Partners
- …
Step 3: Identify services by customer base
All the requirements placed on the platform are eventually embodied in the services it provides.
Different services will target different customers.
This is a very important step, and defining the services takes some practice and open communication. Here’s a tip: services are easier to identify with architecture diagrams.
For example:
- For Internet users, Login is an important service
Step 4: Prioritize the service
Pick the top 2-3 services that matter most for each customer segment. For example, for the Internet customer base these are the Login and Checkout services. This helps focus on the key services.
Services are ranked by their impact on customers and on finances. For example, the Checkout service for purchasing products takes precedence over the Comparison service for comparing products.
Step 5: Segment customer goals
Customer objectives for services include:
- Availability requirements
- Performance requirements
- The amount of activity for the service
- Correctness of the service
Using an industry-standard framework is recommended. Work with your SRE and operations teams to understand which key metrics your observability platform provides and which metrics need to be tracked. There are many types of SLIs to choose from, such as Google’s four golden signals, RED metrics (Rate, Errors, Duration), or USE metrics (Utilization, Saturation, Errors).
Continuing the previous example, for:
- Internet customer base
  - Login service
    - Customer expectation: able to log in at any time
    - Customer expectation: hundreds of concurrent logins per minute
    - Customer expectation: fast logins
    - Customer expectation: logins succeed without errors
Step 6: Select specific indicators (SLIs)
Continuing the previous example, select the top 2-3 most important goals through multiple meetings, interviews, and conversations, and refine them:
- User base: Internet customer base
  - Service: Login service
    - Customer expectation: logins succeed without errors
      - Specific indicator: error rate
    - Customer expectation: fast logins
      - Specific indicator: response time (duration or latency)
    - ~~Customer expectation: able to log in at any time~~
    - ~~Customer expectation: hundreds of concurrent logins per minute~~
Step 7: Establish SLO
Once you have identified the key services and SLIs, you can create the SLO. Ensure that each goal can be measured against realistic, achievable thresholds set for a specific time frame (e.g., hours, weeks, months). An unrealistically high SLO threshold will be violated constantly. Conversely, a low threshold that is too easy to meet makes it difficult to know when an outage occurs. An SLO must make sense and drive business outcomes, not just exist as a goal to be achieved. A good way to determine thresholds is to look at the historical trends of how the service has been performing.
This is the final step. The SLO should be specific and something you can monitor.
It has a few key elements:
- Availability (target)
- Time range
- Specific Conditions (SLI)
Continuing the previous example:
- User base: Internet customer base
  - Service: Login service
    - Customer expectation: logins succeed without errors
      - SLI: Error rate
        - SLO: Error rate < 1% in the last 5min
    - Customer expectation: fast logins
      - SLI: Response time (duration or latency)
        - SLO: 95% of response times ≤ 5s in the last 1 month
    - ~~Customer expectation: able to log in at any time~~
    - ~~Customer expectation: hundreds of concurrent logins per minute~~
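One way to keep an SLO specific and checkable is to record its three elements (target, time range, and SLI condition) as data. The sketch below is only illustrative: the `SLO` class, its field names, and the `is_met` method are hypothetical, and the measured percentages passed in are made-up sample values rather than real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical record holding the three key elements of an SLO."""

    service: str
    sli: str               # the specific condition being measured
    target_percent: float  # the target, as a percentage of good events
    window: str            # the time range, e.g. "30d"

    def is_met(self, measured_percent: float) -> bool:
        """Return True when the measured SLI is at or above the target."""
        return measured_percent >= self.target_percent


# The response-time SLO from the login example, written as data.
login_latency = SLO(
    service="Login",
    sli="requests with response time <= 5s",
    target_percent=95.0,
    window="30d",
)

print(login_latency.is_met(96.4))  # True
print(login_latency.is_met(93.0))  # False
```

The error-rate SLO could be recorded the same way by treating requests without errors as the good events.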
Summary
To summarize, the key steps to creating an effective SLO:
- Who is my customer base?
- What are my services?
- What are the key indicators (SLIs) for these services?
- What is the SLO?
Finally, monitor.
Monitoring is an ongoing process to ensure you meet SLAs and business objectives. In addition to receiving alerts when SLO violations occur, a better and more proactive approach is to receive alerts when the error budget is being consumed faster than normal, that is, when the burn rate is too high. This lets you resolve potential issues before they cause problems. Either way, alerts should be routed to the right team or individual to speed up triage and reduce MTTR.
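A burn-rate check can be sketched as follows, assuming the common definition of burn rate as the observed error percentage divided by the error budget percentage (a burn rate of 1 uses up the budget exactly over the SLO window). The function names and the default alert threshold of 2.0 are assumptions for illustration, not recommended values.

```python
def burn_rate(error_percent: float, slo_target_percent: float) -> float:
    """Return how fast the error budget is being consumed.

    Hypothetical helper: error_percent is the observed percentage of bad
    events, slo_target_percent is the SLO target (for example 99.5).
    """
    budget_percent = 100.0 - slo_target_percent
    if budget_percent <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_percent / budget_percent


def should_alert(error_percent: float, slo_target_percent: float,
                 threshold: float = 2.0) -> bool:
    """Alert when the budget burns faster than the assumed threshold."""
    return burn_rate(error_percent, slo_target_percent) > threshold


# With a 99.5% SLO the budget is 0.5%, so a 2% error rate burns it 4x too fast.
print(burn_rate(2.0, 99.5))     # 4.0
print(should_alert(2.0, 99.5))  # True
print(should_alert(0.1, 99.5))  # False
```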