Service-Level Objectives (SLOs) are reliability goals for specific behaviors within your software applications and infrastructure. Those behaviors are represented through Service-Level Indicators (SLIs), which turn specific metrics into a ratio or threshold formula. Lastly, Service-Level Agreements (SLAs) can be seen as SLOs that were put into a contract, although in reality they are much more complex legal constructs that commit a vendor to a customer.
We’ve previously covered the exact differences between SLAs, SLIs, and SLOs on our blog, in How AWS Shares SLOs With You.
But… Why should one implement SLOs? Let’s have a look at some of the benefits that have been reported by teams who’ve implemented SLOs.
Why SLOs?
SLOs are a means to an end, not the end itself. At their core, they are a KPI on a set of reliability metrics. However, the process you go through to establish the culture, define what success looks like, and build SLO monitoring and automation creates a powerful mechanism that can positively impact many areas of a cloud-native organization.
Adopting SLOs at scale results in a decrease in the overall number of reliability issues, which saves the company money, protects brand reputation, and reduces customer churn, among other things. And on the way there, you’ll find several other benefits.
Establish Reliability Ownership
Improve Visibility And Decision Making
SLOs convert monitoring and observability data into accessible, contextual, and actionable insights. What has traditionally been a black box for every function except SRE/DevOps/Infra becomes simple and easy to understand for most people in the organization.
Reports and automations with SLOs can be configured to help these roles answer the following questions:
CTO
- How should I invest our budget & resources to meet our customer and business reliability needs?
- How is each engineering unit planning, measuring and performing on their reliability goals & responsibilities?
Engineering Manager
- What is the risk of doing a new release for the reliability goals our customers expect?
- What is the reliability performance of the services our team owns?
Software Engineer
- [When responding to an incident] What is this incident’s impact on our reliability goals?
- What have been the reliability levels of the components I have shipped?
Product Manager
- Should we prioritize investment in new features or tackle technical debt?
- Are we meeting the user journey acceptance criteria after software is released to production?
Finance
- How much are we losing due to forecasted and unforecasted reliability issues?
- What is the ROI of our reliability efforts?
Increase Engineering Velocity
By setting a reliability goal below 100%, an SLO provides room to fail, usually called the error budget. It can work as a safety net that allows teams to take more risks while there is enough buffer remaining. Conversely, it can tell teams that, according to the reliability goal previously set, there is no more room for degradations.
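To make the error budget concrete, here is a minimal sketch in Python, with made-up numbers and a hypothetical helper name, of how the remaining budget could be computed from an SLO target and the events observed so far in the window.

```python
def error_budget_remaining(slo_target: float, good_events: int, valid_events: int) -> float:
    """Return the fraction of the error budget still left in the current window.

    slo_target   -- reliability goal as a fraction, e.g. 0.999 for 99.9%
    good_events  -- events that met the SLI criteria so far in the window
    valid_events -- all events counted by the SLI so far in the window
    """
    allowed_bad = (1 - slo_target) * valid_events  # bad events the budget tolerates
    actual_bad = valid_events - good_events        # bad events observed so far
    if allowed_bad == 0:
        return 0.0
    return max(0.0, 1 - actual_bad / allowed_bad)

# A 99.9% weekly SLO over 1,000,000 events tolerates 1,000 bad events;
# with 400 failures so far, roughly 60% of the error budget remains.
print(round(error_budget_remaining(0.999, 999_600, 1_000_000), 3))  # -> 0.6
```

A team with most of its budget left can green-light a riskier change; a team at or near zero would hold back anything non-essential.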
The overly simplified view above is for a platform team that builds APIs used by other engineering teams. These APIs power three critical user journeys: user authentication, website loads, and payment transactions. By looking at this information, the platform team can confidently prioritize tasks in the following way:
- We will look into the overall health and the SLOs of the APIs used by the website, check whether they are the cause of the service-level degradation visible above, and block new releases that may impact that user journey
- We will deploy a new version of the authentication API to production. This is a major change but there’s plenty of room within the weekly SLO should we need to make minor fixes in production
- We will deploy an improvement to the payment processor API to production and observe if a rollback is required. There’s enough error budget within the SLO to run this experiment.
The hypothetical scenario above illustrates how SLOs can help prioritize work on new features vs. technical debt in a quick, data-driven, and customer-focused manner. This type of insight is particularly useful for product managers.
Reduce Alert Fatigue
Fixed-threshold alerts, whether based on infrastructure or application metrics, are difficult to keep up to date and representative of the customer experience. Over time, teams and on-call engineers learn to ignore these alerts because they lack context about the impact on customer experience.
SLOs, on the other hand, are almost timeless: if they were right once, they will likely remain right (note: you still need to make sure the SLIs stay relevant and up to date).
Your customers do not care if CPU consumption in a database cluster is at 80% (a cause-based alert). Your on-call engineers won’t be happy if they get paged with such a near-meaningless alert. A relevant alert is one that notifies an engineer when there is a degradation likely to cause the reliability goal to be missed (a symptom-based alert). Everyone cares if, for example, a service’s response latency in the last hour means that, at the current pace, the latency SLO will be breached for the week.
SLOs are by definition based on symptoms, not causes. As a consequence, SLO alerts are strictly symptom-based and therefore always relevant (from SoundCloud’s Alerting on SLOs like Pros). Making this kind of alert the default for incident management promotes a healthier on-call culture and improves talent retention.
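To sketch what a symptom-based alert can look like in practice, here is one simplified, single-window illustration of the burn-rate idea: compare how fast the error budget is being consumed against the pace that would exactly exhaust it by the end of the window. The threshold and numbers below are assumptions, not anyone’s production configuration.

```python
def burn_rate(slo_target: float, bad_events: int, valid_events: int) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent.

    A sustained burn rate of 1.0 uses up the whole budget by the end of the SLO
    window; values well above 1.0 mean the reliability goal is at risk.
    """
    if valid_events == 0:
        return 0.0
    observed_error_ratio = bad_events / valid_events
    allowed_error_ratio = 1 - slo_target
    return observed_error_ratio / allowed_error_ratio

# Illustrative paging rule: page when the last hour burns the budget more than
# 14x too fast (an assumed threshold; real setups typically combine several
# windows and thresholds).
bad_last_hour, valid_last_hour = 20, 1_000
if burn_rate(0.999, bad_last_hour, valid_last_hour) > 14:
    print("page on-call: error budget burning too fast, SLO at risk")
```

Because the check is expressed in terms of the budget rather than a raw resource metric, the page fires only when the reliability goal itself is actually in danger.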
Flag Degradations Earlier
SLOs reveal how components are actually performing relative to how we expect them to perform. Once your SLO setup accurately represents the reliability levels delivered to customers, SLOs become the best source of truth for flagging degradations that impact users.
What you see above is my attempt to help you visualize how cause-based alerts ignore behavior (and often time as well). Think of behavior as the typical SLI formula:
$$\frac{\text{Total Good Events}}{\text{Total Valid Events}} \times 100$$
The SLI starts at 100% and as errors occur during the time window, it decreases.
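For example, with made-up numbers, if 4 of the 10,000 valid events in the window so far were bad, the SLI sits at

$$\frac{9{,}996}{10{,}000} \times 100 = 99.96$$

i.e. 99.96%, and every additional bad event pulls it further below 100% for the remainder of the window.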
Cause represents a traditional utilization metric such as high system load or disk usage.
On the left-hand side is the old way of doing things: fixed-threshold alerts that lack information about the user-facing impact. On the right-hand side is the SLO approach: the SLO looks at behavior over time in reference to a goal, and the behavior, or SLI, degrades over time due to bad events (or simply errors) during the time window.
The on-call engineer receiving the threshold alert will typically need to (1) investigate where and how users are feeling the impact of this cause, (2) decide whether that impact is worth calling an incident, (3) inform the support team of the ongoing incident and/or update a public status page, and finally (4) work on a rollback, fix, or escalation. An SLO-based alert is able to automate (1), (2), and (3).
When paged by a burn-rate alert, the engineer immediately knows (a) what the customer impact is and (b) whether the service level is in danger. By looking at the SLI specification, they can check which metrics triggered the alert and then dig deeper into those metrics’ traces and/or logs. With all of this information, working on a fix or corrective action becomes much easier and faster.
In summary, SLO-based alerts carry information about the actual user impact and the reliability goal for the time window. Fixed-threshold alerts are arbitrary, difficult to maintain, and inherently worse at capturing reliability degradations that impact users.
Check out Google’s Alerting on SLOs chapter from the Site Reliability Workbook and Alerting on SLOs by Mads Hartmann.
Improve & Unify Observability
As you get started with SLOs, you might realize that certain metrics needed to truly represent the user experience through an SLI are missing. Certain user journeys contain dozens of steps across the client/frontend, the server, and the CDN, and most likely you don’t have them all properly instrumented yet. That might lead you to improve the instrumentation of your code in order to start generating accurate metrics to feed your SLOs.
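As one hedged example of what improving instrumentation can look like, the sketch below uses the Python prometheus_client library to emit two counters for a hypothetical checkout endpoint: one for all requests and one for requests that succeeded within a latency target, so a ratio SLI (good over valid events) can later be derived from them. The metric names, latency target, and handler are illustrative assumptions, not a prescribed convention.

```python
import time

from prometheus_client import Counter, start_http_server

# Two counters are enough to later derive a ratio SLI: good events / valid events.
CHECKOUT_REQUESTS = Counter(
    "checkout_requests_total", "All checkout requests (valid events)"
)
CHECKOUT_GOOD = Counter(
    "checkout_requests_good_total",
    "Checkout requests that succeeded within the latency target (good events)",
)

LATENCY_TARGET_SECONDS = 0.3  # assumed user-facing latency target


def handle_checkout(process_request) -> None:
    """Wrap a request handler and record good/valid events for the SLI."""
    CHECKOUT_REQUESTS.inc()
    start = time.monotonic()
    process_request()  # an exception here leaves the event valid but not good
    if time.monotonic() - start <= LATENCY_TARGET_SECONDS:
        CHECKOUT_GOOD.inc()


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the monitoring system to scrape
    handle_checkout(lambda: time.sleep(0.1))  # simulate one fast, successful request
```

From these two series, whichever monitoring or SLO tool consumes the metrics can compute the good-over-valid SLI described earlier.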
In addition to helping uncover gaps in observability, your SLO setup might end up being the only place where you can centralize insights from the multiple monitoring tools used in your organization. Quite commonly, several tools end up in use across different teams, whether due to acquisitions, the lack of a common observability strategy, or simply a culture of using the right tool for the job.
Centralizing raw data from all observability tools is generally an expensive, time-consuming, non-value-added activity. Cherry-picking the exact metrics you need from each of those tools into a centralized SLO platform, on the other hand, is straightforward, requires little engineering effort, and adds value from day one.
Rely.io is the reliability intelligence platform where engineering teams build their Service-Level Objective Center. Get in touch to learn more and to join our Beta!