Over the past 5 years, I partnered with dozens of fast-growing startups while at AWS and later joined one myself at Unbabel. I’ve observed that engineering teams still need to invest large amounts of time in keeping systems reliable, no matter how much easier that became once public cloud providers showed up.
Cloud-based architectures are getting more complex and depend on more and more third-party services. Building reliable systems in 2022 is still incredibly hard for engineers. On top of this, SRE teams lack accurate data about how their work impacts customer journeys over time, positively or negatively.
For executive or business teams, the site reliability function is often oversimplified into a “how often were the systems down last quarter” kind of thesis. Although availability is massively important, it can’t be the only focus, as SREs have additional concerns such as latency, security, developer experience and infrastructure costs. Taken to the extreme, this rationale might bias engineers towards availability and fixing short-term issues. A shortsighted focus on basic threshold alerting can lead teams to neglect the tasks that actually ensure long-term reliability.
By agreeing on Service Level Objectives (SLOs) with the business, SRE teams can create data-driven goals or OKRs, accurately report on the starting metric and the corresponding results, and as a result show the organization the ROI of their work, all by tracking those SLOs on Rely.io. The thesis can now evolve into a longer-term, more holistic set of questions:
- How often did the systems go down this quarter compared with the previous one?
- Are we complying with the organization's agreed availability/latency/throughput goals?
- What was the ROI of the investments made in SRE?
- In which user journeys will we make reliability investments next quarter?
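To make the compliance question a bit more concrete, here is a minimal sketch of the arithmetic behind an availability SLO and its error budget. It is only an illustration: the `SloReport` class, its field names and the sample numbers are hypothetical, and this is not how Rely.io computes or exposes SLOs.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Hypothetical availability SLO over one reporting window."""

    target: float          # agreed objective, e.g. 0.99 means 99% of requests succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that counted as failures

    @property
    def achieved(self) -> float:
        """Fraction of requests that succeeded in the window."""
        if self.total_requests == 0:
            return 1.0  # assumption: no traffic means nothing was violated
        return 1 - self.failed_requests / self.total_requests

    @property
    def is_met(self) -> bool:
        """True when the achieved level is at or above the agreed target."""
        return self.achieved >= self.target

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_failures = (1 - self.target) * self.total_requests
        if allowed_failures == 0:
            # A 100% target (or no traffic) leaves no budget to spend.
            return 0.0
        return 1 - self.failed_requests / allowed_failures


# Made-up numbers for illustration: a 99% target with 40 failures in 10,000 requests.
report = SloReport(target=0.99, total_requests=10_000, failed_requests=40)
print(f"achieved: {report.achieved:.4f}, met: {report.is_met}, "
      f"budget left: {report.budget_remaining:.0%}")
```

Comparing `budget_remaining` from one quarter to the next is one simple way to put a number on the questions above, rather than counting outages alone.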
Behemoths like Google, Microsoft or Netflix follow an SLO-based approach to site reliability, relying on lots of internally built components created by the hundreds of site reliability engineers working at each of these companies.
In early 2021, I met José Velez for the first time and he told me he wanted to productize this framework to make it available to any organization, no matter its size. It’s been exciting to watch José and the rest of the team succeed at it. The Rely.io platform is up & running in pre-beta with over 10 organizations using it weekly to:
- Create SLOs in a few clicks — really easy, I’ve even done it myself already!
- Prioritize between new features and technical debt using SLO data
- Reduce on-call engineers’ alert fatigue
- Capture issues that are slowly depleting their error budget
- Integrate their monitoring tools, such as Amazon CloudWatch, Prometheus, New Relic, Elasticsearch and/or Datadog (with more coming soon)
Now, I’m joining Rely.io to lead our Go-To-Market activities and I’m really looking forward to speaking with engineering teams around the world about their reliability efforts!
If you are an engineer or engineering leader looking after site reliability / DevOps, please reach out if you’d like to learn more.