tl;dr — AWS shares availability SLOs for dozens of their services. Full list available here
Reliability Glossary
“SLIs drive SLOs which inform SLAs”
From Google Cloud Tech
Service-Level Agreement (SLA)
A promise, usually written into a contract between two parties, of the acceptable performance of a service over a certain time period. Failure to meet the promise may result in penalties, such as refunds or issuing of compensation credits.
Example SLA: A Software as a Service (SaaS) API provider promises that 99% of all responses are delivered within 100ms, tracked on a monthly basis.
Service-Level Indicator (SLI)
The actual metric used to measure performance.
Example SLI: On the SLA described above, the SLI is the percentage of responses that were indeed delivered within 100ms in the month. So if this SLI is at 98%, the provider is not in compliance with the SLA.
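To make the example concrete, here is a minimal sketch of how that SLI could be computed from a month of recorded response times. The function names and the sample data are hypothetical, and I am assuming that "within 100ms" includes responses that take exactly 100ms.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """Return the percentage of responses delivered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no responses recorded for this period")
    # Assumption: a response at exactly the threshold counts as "within" it.
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return 100.0 * within / len(latencies_ms)


def meets_target(sli_percent, target_percent=99.0):
    """True when the measured SLI satisfies the promised percentage."""
    return sli_percent >= target_percent


# Hypothetical month: 98 fast responses and 2 slow ones.
sample = [40.0] * 98 + [250.0] * 2
sli = latency_sli(sample)
print(sli, meets_target(sli))  # 98.0 False
```

With 98 out of 100 responses under the threshold, the SLI is 98%, which falls short of the 99% promised in the SLA.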
Service-Level Objective (SLO)
Mathematically, the SLO and SLA are the same concept. SLAs are associated with customer-facing promises around reliability and performance. SLOs, while having the same formula, are an internal mechanism for increased productivity and accountability across teams and services.
Example SLO: An SLA should always be backed by a stricter internal SLO that is monitored continuously. For instance, the 99% SLA above should be paired with a higher SLO, such as 99.5%.
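One way to picture the relationship is as two thresholds on the same SLI, with the SLO acting as an early warning before the SLA is at risk. The sketch below uses the 99.5% and 99% figures from the examples above; the function name and the status labels are my own invention, not a standard.

```python
def classify(sli_percent, slo_percent=99.5, sla_percent=99.0):
    """Classify a measured SLI against an internal SLO and a customer SLA."""
    if slo_percent < sla_percent:
        raise ValueError("the internal SLO should be at least as strict as the SLA")
    if sli_percent >= slo_percent:
        return "healthy"       # both the SLO and the SLA are met
    if sli_percent >= sla_percent:
        return "slo_breached"  # internal warning: act before customers are affected
    return "sla_breached"      # contractual promise missed


for measured in (99.7, 99.2, 98.0):
    print(measured, classify(measured))
```

The middle state is the useful one: the SLA is still intact, but the team already knows it needs to act.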
SLAs in SaaS
Does Amazon really share SLOs with customers?
Spoiler alert: Yes, they do.
Amazon Web Services (AWS) counts some really high-profile organizations among its customers, including major healthcare providers and governments worldwide. As you’d expect, SLAs are table stakes for AWS the same way they are for any other major public cloud provider.
If you visit AWS’ SLA repository, you’ll see that they list 147 services with publicly available SLAs (at the time of writing). What many people don’t know is that many of those services also have public SLOs, mostly for availability. AWS calls it their Availability Design Goal and makes it accessible here.
Having worked at AWS myself, I remember how spectacular it was to talk about the 11 9's of durability. I’m talking about Amazon Simple Storage Service (S3) being designed to provide 99.999999999% of durability over a given year. In simple terms, that means if you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years — S3 FAQs.
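The arithmetic behind that statement is easy to check. The sketch below is a simplified model, assuming the durability figure can be read as an independent annual loss probability of 1 in 10^11 per object; the function names are hypothetical. It uses exact fractions to avoid floating-point noise at this many decimal places.

```python
from fractions import Fraction


def expected_annual_losses(object_count, nines=11):
    """Expected number of objects lost per year at the given number of nines."""
    # Assumption: durability of 99.999999999% means each object has a
    # 1 in 10**11 chance of being lost in a given year.
    annual_loss_probability = Fraction(1, 10 ** nines)
    return object_count * annual_loss_probability


def years_per_lost_object(object_count, nines=11):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count, nines)


print(expected_annual_losses(10_000_000))  # 1/10000
print(years_per_lost_object(10_000_000))   # 10000
```

Ten million objects at 11 9's works out to one ten-thousandth of an object lost per year, or one object every 10,000 years, which matches the figure in the S3 FAQs.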
I only recently learned about the SLO methodology, but I later realized that I was already familiar with the concept because of S3. S3 is also a great example for explaining the difference between an SLA and an SLO. Let’s look at the language AWS uses to talk about availability:
The S3 Standard storage class is designed for 99.99% availability
The key term here is “designed for”, which identifies this percentage as the SLO. AWS also shares the availability SLA (or contractual SLO) and, as you’d expect, it is lower than the availability design goal: 99.9% versus 99.99%.
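The gap between those two numbers looks small, but it is a factor of ten in tolerated downtime. Here is a rough sketch of the conversion, assuming a 30-day measurement period purely for illustration (the actual period is whatever the SLA document specifies); the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted in the period at the given availability."""
    # Assumption: a 30-day period by default, chosen only for illustration.
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - availability_percent) / 100.0


for target in (99.9, 99.99):
    print(f"{target}% allows {allowed_downtime_minutes(target):.2f} minutes")
```

Under these assumptions the 99.9% SLA tolerates roughly 43 minutes of unavailability in 30 days, while the 99.99% design goal tolerates roughly 4. That difference is the buffer between the internal target and the contractual promise.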
The 11 9’s of durability I was so enthusiastic about earlier, on the other hand, don’t make it into S3’s SLA. Don’t get me wrong: data durability might not be part of the contractual SLA, but the figure is still impressive, and AWS deserves praise for making the durability SLO publicly available.
Conclusion
SLOs are a powerful mechanism for measuring and managing reliability as complexity increases in software, infrastructure and engineering processes. Most organizations should not follow AWS’s example of sharing them externally. However, products, services and user journeys should be represented and monitored through different kinds of internal SLOs, such as availability and latency. When these SLOs have owners and are taken into account for decision making, teams are able to move faster and build systems that are more resilient.
If you are a B2B SaaS company, ensure that you’re monitoring SLOs that are stricter than your customer SLAs. Otherwise, you won’t be able to take proactive measures to fix issues that will eventually impact your SLA compliance. Furthermore, by always keeping a record of each SLI’s performance, you can easily verify compliance every time a customer submits a service credit request.