Why implement the SLO methodology
SLOs are an essential part of a broader methodology known as Site Reliability Engineering (SRE). In the broader sense, SRE is simply a way to take software engineering principles and apply them to infrastructure and operations in order to create highly reliable systems. The assumption here is that operations are a software problem, so SRE should use software engineering approaches to solve it! Beyer et al. [2] mention the following fundamental principles that SRE aims for.
- Minimize toil and automate as much as you can: If a machine can perform an operation, it should! Toil refers to mundane, repetitive operational work that provides no enduring value and scales linearly with service growth. Time spent on operational tasks means time not spent on projects! Automation is a blessing. When done right, it means that a task is predictable, reliable, and effortless. Free up your personnel to focus on what matters.
- Aim for simplicity: Not only in your code but also in your approach to reliability. Software simplicity is a prerequisite to reliability. Systems are complex on their own, but you should still be able to manage, expand, and talk about them.
- Embrace risk: And finally, the most relevant one: reduce the cost of failure and allow yourself to move faster. Forget 100% reliability; it's expensive and unnecessary. Focus only on being reliable enough to meet your users' quality standards.
Hopefully, these sound appealing, but where to start? There are three concepts at the core of this methodology: Service Level Indicators (SLI), Service Level Objectives (SLO), and Error Budgets. Together they form what Hidalgo et al. [1] call the reliability stack.
Before we go over them, let us highlight the following: The SLO methodology is a continuous and iterative process. The main purpose of SLOs is to provide you with new, meaningful data that allows you to look into your service from your users' point of view. It empowers you to make better decisions that influence how the users experience your product. It won't make your services reliable on its own, and your SLOs will probably need to be rethought and adjusted as time goes on. Nevertheless, if used properly, SLOs can become one of the pillars of your decision-making process.
SLIs, SLOs and Error Budgets
Implementing SLOs in an organization
In addition to what has already been mentioned, SLOs provide you with a new way of discussing your services internally in a way that anyone can understand, regardless of their role in the organization. With the right mindset - an SLO-driven mindset - everyone, whether from marketing, engineering or management, can understand statements such as: "We are closer to breaking our budget this quarter than we ever were! What can we do to prevent that from happening?" or "We haven't broken our budget once in the last year, we're becoming more and more reliable!" or even "We only have 20 more minutes of downtime before we break our budget for the month, let's take a step back from introducing new features."
This ease of communication about highly technical matters allows you to greatly reduce friction between different departments within your company. Even so, you probably won't be able to arrive at your desk one day, shout “Let’s develop SLOs!”, and have your entire team stop what they're doing and jump on that wagon. You'll have to start small and be patient yet relentless. It's a mindset that grows with time and becomes more and more valuable as everyone in the team gets involved. Don’t be discouraged if change doesn’t happen immediately.
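To make the last of these statements concrete, here is a minimal Python sketch of the arithmetic behind it, assuming a time-based availability SLO. The 99.9% target, the 30-day window, the recorded downtime, and the function names are illustrative assumptions, not values taken from [1] or from any particular tool.

```python
from datetime import timedelta


def downtime_budget(slo_target: float, window: timedelta) -> timedelta:
    """Total downtime an availability target allows over a window.

    slo_target is a fraction such as 0.999 (i.e. 99.9% availability).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


def remaining_budget(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Downtime still available before the error budget is broken.

    A negative result means the budget has already been exceeded.
    """
    return downtime_budget(slo_target, window) - downtime_so_far


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% over 30 days allows about 43 minutes of
    # downtime; a little over 23 minutes have already been used this month.
    window = timedelta(days=30)
    used = timedelta(minutes=23, seconds=12)
    print("Total budget:    ", downtime_budget(0.999, window))
    print("Remaining budget:", remaining_budget(0.999, window, used))
```

Under these assumed numbers the remaining budget comes out at roughly 20 minutes, which is the kind of figure anyone in the company can reason about without knowing how the service is built.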
Continue discussing these tools and concepts with your team, experiment with new ideas, and continuously move towards better monitoring and reliability agreements. Look for small wins as you go through this process and engage with everyone around you. More importantly, iterate over everything! Make adjustments to your SLO configurations and targets to keep them up to date with the evolution of your product and the feedback you receive from your colleagues.
Soon, we'll have another blog post describing how you can bring the SLO mindset into your organization, but for now, we'll leave you with the step-by-step priority guide created by Hidalgo et al. [1]:
- Get buy-in: Communicate how SLOs work and get everyone in agreement that they provide value.
- Prioritize SLO work: Get the work on your roadmap, assign it to one or more people, and make it a priority.
- Implement your SLOs: Decide what SLIs to track, how to monitor them, and what level of reliability you want to provide, and learn how you’re performing against those targets.
- Use your SLOs: Decide as a team how to alert on your SLOs, how to use your error budget, and how to inform work priorities using your SLOs.
- Iterate on your SLOs: Discuss what is and isn’t working, add/remove/adjust your SLIs/SLOs, and continually revisit your SLOs to check that they reflect your stakeholders’ needs.
- Advocate for others to use SLOs: Use what you’ve learned to educate others about the benefits of SLOs.
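The "implement" step above can be illustrated with a small sketch. It assumes an event-based SLI (the ratio of good events to total events) and a target expressed as a fraction; the `SLO` class, the function names, and the example service and numbers are hypothetical and only meant to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A reliability target for one service or user journey (hypothetical model)."""

    name: str
    target: float  # fraction of events that must be good, e.g. 0.995


def sli(good_events: int, total_events: int) -> float:
    """Event-based SLI: the fraction of events that were good."""
    if total_events < 0 or not 0 <= good_events <= total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # No traffic means nothing failed; treating this as fully reliable
        # is a design choice, not a rule.
        return 1.0
    return good_events / total_events


def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """True if the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo.target


if __name__ == "__main__":
    checkout = SLO(name="checkout-latency", target=0.995)
    measured = sli(good_events=9_990, total_events=10_000)
    print(f"{checkout.name}: SLI={measured:.4f}, met={is_met(checkout, 9_990, 10_000)}")
```

What counts as a "good" event (for example, a request served successfully within some latency threshold) is exactly the decision the team has to make when choosing its SLIs.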
Setting up SLIs and SLOs @ Rely.io
We at Rely.io work every day to make your journey into SRE as convenient as possible, and we are always exploring and learning so that we can keep doing so! By using our platform, you'll be able to centralize all your SRE-related tasks in a way that enforces best practices and is easy to manage and expand.
You may configure the Services and User Journeys that are of importance to you, on top of which you may define your SLOs, so that you are always aware of the state of each part of your system. When defining an SLO you may also add other useful parameters such as its criticality level, its review date, and the person or team in your organization responsible for monitoring its performance.
Define your SLIs using as many metrics as you'd like and a variety of evaluation methods to tune them to your exact needs.
Visualize how the SLI, error budget, and the rate of error budget consumption evolve over time in order to make the best decisions possible at any given time.
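As a rough illustration of what "rate of error budget consumption" means, the sketch below computes a burn rate under a common definition: the observed error rate divided by the error rate the SLO allows. This is a generic sketch under that assumption, not a description of how the Rely.io platform computes it; the function names and numbers are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic observed, so no budget was consumed
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def days_until_exhausted(
    remaining_fraction: float, rate: float, window_days: float
) -> float:
    """Days until the budget runs out if the current burn rate persists."""
    if rate <= 0.0:
        return float("inf")  # nothing is being consumed
    return remaining_fraction * window_days / rate


if __name__ == "__main__":
    # Hypothetical: 50 failed requests out of 10,000 against a 99.9% target.
    rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
    print(f"Burn rate: {rate:.1f}x")
    print(f"Days left at this rate: {days_until_exhausted(1.0, rate, 30):.1f}")
```

Plotted over time, a number like this shows whether a service is merely using its budget or burning through it fast enough to warrant action.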
And we're just getting started! We are continuously improving our product offering and plan on adding many other features that will facilitate the implementation of SRE best practices within your company.
Bibliography
[1] "Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets" by Alex Hidalgo | 2020
[2] "The Site Reliability Workbook: Practical Ways to Implement SRE" by Betsy Beyer, Niall Richard Murphy, et al. | 2018