How AWS Shares SLOs With You


António Araújo

Go To Market Lead

Rely.io

April 21, 2022


tl;dr: AWS shares availability SLOs for dozens of its services. The full list is available here.

Reliability Glossary

“SLIs drive SLOs which inform SLAs”

From Google Cloud Tech

Service-Level Agreement (SLA)

A promise, usually written into a contract between two parties, about the acceptable performance of a service over a given time period. Failure to meet the promise may result in penalties, such as refunds or compensation credits.

Example SLA: A Software as a Service (SaaS) API provider promises that 99% of all responses are delivered within 100ms, measured on a monthly basis.

Service-Level Indicator (SLI)

The actual metric used to measure performance.

Example SLI: For the SLA described above, the SLI is the percentage of responses that were actually delivered within 100ms during the month. So if this SLI is at 98%, the provider is not in compliance with the SLA.
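As a rough illustration of the SLI in this example, the sketch below computes the share of responses delivered within the threshold and compares it with the 99% target. The function name `latency_sli`, the sample latencies, and the choice to count a response at exactly 100ms as compliant are all assumptions made for illustration, not part of any provider's tooling.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """Return the fraction of responses delivered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one response to compute an SLI")
    # Assumption: a response at exactly the threshold counts as compliant.
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


# Hypothetical month of traffic: 98 fast responses and 2 slow ones.
sample_latencies_ms = [40.0] * 98 + [250.0] * 2
sli = latency_sli(sample_latencies_ms)
sla_target = 0.99  # the 99% promise from the example SLA

print(f"SLI = {sli:.2%}, SLA met: {sli >= sla_target}")
```

With this made-up sample the SLI comes out at 98%, which falls short of the 99% promise, matching the non-compliance case described above.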

Service-Level Objective (SLO)

Mathematically, an SLO and an SLA are the same kind of target: a threshold that an SLI must meet over a time period. The difference is the audience. SLAs are customer-facing promises about reliability and performance. SLOs use the same formula but serve as an internal mechanism for increased productivity and accountability across teams and services.

Example SLO: An SLA should always be backed by a stricter internal SLO, so that teams notice a problem before the contractual promise is at risk. For instance, the 99% SLA above could be paired with an SLO of 99.5%.
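One way to picture the relationship between the two thresholds is a small classifier that places a measured SLI relative to both. This is only a sketch: the function name `reliability_status`, the status labels, and the sample measurements are hypothetical, while the 99% and 99.5% figures come from the example.

```python
def reliability_status(sli, slo=0.995, sla=0.99):
    """Classify a measured SLI against an internal SLO and a contractual SLA."""
    if slo < sla:
        raise ValueError("the internal SLO should be at least as strict as the SLA")
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "healthy"


# Hypothetical monthly measurements.
for measured in (0.998, 0.992, 0.98):
    print(f"{measured:.1%}: {reliability_status(measured)}")
```

The middle state is the useful one: the internal objective has been missed, so the team can react while the customer-facing promise still holds.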



SLAs in SaaS

Why do you need a SaaS SLA?

As a SaaS customer, you need an SLA because the vendor’s expected reliability and performance must match your own requirements. For example, if you offer a 99% monthly uptime SLA to your customers, you probably shouldn’t use a cloud database provider that only guarantees 98% availability.
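The reasoning here can be checked with a little arithmetic. If your service cannot work without a dependency, and failures are assumed to be independent, the best availability you can expect is roughly the product of the individual availabilities. The sketch below uses the 99% and 98% figures from the example. The 99.5% figure for your own code and the function name `serial_availability` are hypothetical, and treating a guaranteed level as the actual availability is a simplification.

```python
import math


def serial_availability(*availabilities):
    """Availability of a chain in which every component must be up,
    assuming failures are independent."""
    return math.prod(availabilities)


own_service = 0.995   # hypothetical availability of your own code
database = 0.98       # the provider guarantee from the example
promised_sla = 0.99   # what you promise your customers

achievable = serial_availability(own_service, database)
print(f"achievable: {achievable:.4f}, covers SLA: {achievable >= promised_sla}")
```

Even with near-perfect code of your own, the combined figure cannot exceed the weakest dependency, which is why the vendor's commitment has to be at least as strong as yours.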

Furthermore, an SLA ensures that the right incentives and commitments are in place. On the one hand, it shows that the provider is prepared to mitigate incidents should they occur; on the other hand, it creates a mechanism to terminate the relationship or be compensated if the vendor fails to comply with the SLA.

If you provide a service, however, it is also in your interest to offer an SLA, because it lets you manage customer expectations: you state how often the service may be down and what happens when it is. By codifying the minimum requirements for each level of service and clearly specifying the service parameters, you ensure clear communication and transparency about what is being offered. In Business-to-Business (B2B) SaaS, an SLA is mandatory if you’re selling to large companies. Additionally, a company that speaks clearly and confidently about its SLAs is more likely to impress buyers.

Examples of public SaaS SLAs

Google Translate SLA

Types of SLIs: Availability

Compensation for downtime starts at: <99.9%

Twilio SLA

Types of SLIs: Availability

Compensation for downtime starts at: <99.95%

Microsoft Azure CosmosDB SLA

Types of SLIs: Availability, Throughput, Consistency, Latency

Compensation for downtime starts at: <99.9%


Does Amazon really share SLOs with customers?

Spoiler alert: Yes, they do.

Amazon Web Services (AWS) counts some really high-profile organizations among its customers, including major healthcare providers and governments worldwide. As you’d expect, SLAs are table stakes for AWS, just as they are for any other major public cloud provider.

If you visit AWS’ SLA repository, you’ll see that it lists 147 services with publicly available SLAs (at the time of writing). What many people don’t know is that many of those services also have public SLOs, mostly for availability. AWS calls this the Availability Design Goal and makes it accessible here.

Having worked at AWS myself, I remember how spectacular it was to talk about the 11 nines of durability. I’m talking about Amazon Simple Storage Service (S3) being designed to provide 99.999999999% durability over a given year. In simple terms, that means if you store 10,000,000 objects with Amazon S3, you can on average expect to lose a single object once every 10,000 years (source: S3 FAQs).
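The 10,000-year figure follows directly from the durability percentage. The sketch below reproduces the arithmetic. The function name `years_per_lost_object` is hypothetical, and exact fractions are used only to avoid floating-point rounding when subtracting a number this close to 1.

```python
from fractions import Fraction


def years_per_lost_object(object_count, annual_durability):
    """Average number of years between single-object losses.

    annual_durability is passed as a decimal string so that Fraction
    represents all eleven nines exactly.
    """
    annual_loss_probability = 1 - Fraction(annual_durability)
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


# 99.999999999% durability written as a fraction of 1 (eleven nines).
years = years_per_lost_object(10_000_000, "0.99999999999")
print(years)  # 10000
```

In other words, each object has a 1 in 100 billion chance of being lost in a year, so ten million objects lose 0.0001 objects per year on average, or one every 10,000 years.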

I only recently learned about the SLO methodology, but I soon realized that I was already familiar with the concept because of S3. S3 is also a great example for explaining the difference between an SLA and an SLO. Let’s look at the language AWS uses to talk about availability:

The S3 Standard storage class is designed for 99.99% availability

The key term here is “designed for”, which identifies this percentage as the SLO. The availability SLA (or contractual SLO) is also shared by AWS and, as you’d expect, it is lower than the availability design goal: 99.9% versus 99.99%.
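To get a feel for the gap between the two percentages, it helps to translate them into permitted downtime. The sketch below does this for a 30-day period. The function name `allowed_downtime_minutes` and the 30-day window are assumptions for illustration, and AWS defines its own measurement method in the SLA itself, so treat the numbers as a rough comparison rather than contractual figures.

```python
def allowed_downtime_minutes(availability, days=30):
    """Minutes of downtime permitted in a period at a given availability."""
    total_minutes = days * 24 * 60
    return (1 - availability) * total_minutes


sla_budget = allowed_downtime_minutes(0.999)    # contractual commitment
slo_budget = allowed_downtime_minutes(0.9999)   # availability design goal

print(f"SLA 99.9%:  {sla_budget:.2f} min per 30 days")
print(f"SLO 99.99%: {slo_budget:.2f} min per 30 days")
```

The design goal leaves roughly a tenth of the downtime that the SLA tolerates (about 4 minutes against about 43), which is the buffer between the internal target and the contractual promise.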

The 11 nines of durability I was so enthusiastic about earlier, on the other hand, don’t make it into S3’s SLA. Data durability might not be part of the contractual SLA, but it is still impressive, and AWS deserves praise for making the durability SLO publicly available.

Conclusion

SLOs are a powerful mechanism for measuring and managing reliability as complexity increases in software, infrastructure, and engineering processes. The large majority of organizations should not share them externally the way AWS does. However, products, services, and user journeys should be represented and monitored through different kinds of internal SLOs, such as availability and latency. When these SLOs have owners and are taken into account in decision making, teams are able to move faster and build more resilient systems.

If you are a B2B SaaS company, make sure you’re monitoring SLOs that are stricter than your customer SLAs. Otherwise, you won’t be able to take proactive measures to fix issues before they affect your SLA compliance. Furthermore, by always keeping a record of each SLI’s performance, you can easily verify compliance every time a customer submits a service credit request.
