Reliability from different lenses
Product and engineering teams share the goal of providing a superior user experience through their apps, but they approach this from two distinct perspectives.
For product teams, this objective translates into the careful design of user journeys: the series of actions a user takes to accomplish a particular goal. Each action in a journey is expected to produce a dependable, predictable, and consistent response, ensuring a reliable and seamless experience.
Engineering teams, on the other hand, are entrusted with building and maintaining the infrastructure and software that support an application. They are tasked with keeping a large number of services running efficiently, each of which exposes several operations. For instance, an API hosts several endpoints that perform different functions. The reliability of each endpoint must be evaluated separately to establish the reliability of the service as a whole.
In essence, product teams view reliability as the assurance of quality in user journeys and the actions within them, while engineering teams perceive it as guaranteeing that services run smoothly and perform their operations accurately. These perspectives converge because every user action triggers one or more service operations.
Ultimately, reliability is about ensuring the quality of the underlying engineering systems. But viewing it through these dual lenses facilitates a clearer connection between business responsibilities and technical considerations, enhancing coordination, observability, and communication.
The Product Catalog
Steering Reliability Initiatives: A Practical Example
Let’s now transition to a practical application of this knowledge and use the product catalog as the backbone of a structured reliability initiative.
Imagine you're an SRE at a prominent streaming company. Your mission? Launch a reliability drive across the organization. A streaming company's core objectives revolve around two pivotal user journeys: 'Subscription Purchase', which should offer a frictionless subscription process, and 'Content Streaming', which should guarantee smooth content delivery. In this context, let's zero in on the 'Subscription Purchase' user journey. This journey comprises several steps:
- Accessing the subscription purchase form
- Submitting it
- Being redirected to the purchase summary page.
As we know by now, complex technical machinery may lie beneath these seemingly simple interactions. The diagram below visualizes the dependencies between user actions, the service operations they trigger, and the downstream services those operations rely on.
Identifying these dependencies can be tricky. Regardless of the method, the outcome is a comprehensive understanding of which operations to monitor to ensure the reliability of the entire user journey. The final leap? Implementing Service Level Indicators (SLIs) to track these operations and setting the targets they should meet, their Service Level Objectives (SLOs), based on the desired user experience.
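To make the idea of walking from a user action down to the operations that need monitoring more concrete, here is a minimal Python sketch. Every action, service, and operation name in it is hypothetical and chosen for illustration; it does not reproduce the dependencies in the diagram, and it is not how any particular tool discovers them.

```python
from collections import deque

# Hypothetical catalog: the operations each user action calls directly.
ACTION_TO_OPERATIONS = {
    "open_purchase_form": ["web.GET /subscribe"],
    "submit_purchase_form": ["web.POST /subscribe"],
    "view_purchase_summary": ["web.GET /summary"],
}

# Hypothetical downstream dependencies between service operations.
OPERATION_DEPENDENCIES = {
    "web.GET /subscribe": ["plans.list_plans"],
    "web.POST /subscribe": ["payments.charge", "accounts.activate"],
    "payments.charge": ["fraud.score"],
    "web.GET /summary": ["accounts.get_subscription"],
}


def operations_to_monitor(action):
    """Return every operation reachable from a user action (breadth-first)."""
    seen = []
    queue = deque(ACTION_TO_OPERATIONS.get(action, []))
    while queue:
        operation = queue.popleft()
        if operation in seen:
            continue  # already visited through another path
        seen.append(operation)
        queue.extend(OPERATION_DEPENDENCIES.get(operation, []))
    return seen


for op in operations_to_monitor("submit_purchase_form"):
    print(op)
```

Each operation returned by such a traversal is a candidate for its own SLI and SLO.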
Establishing SLOs at this stage enables a more streamlined configuration process. It crystallizes what constitutes a business-impacting problem and provides clarity to all stakeholders involved.
- Referencing the diagram presented earlier, it's clear that when a user submits their payment, six service operations are triggered (the ones colored in red).
- With the help of the SLO wizard's previews, we can see the weekly and monthly volumes of this user journey.
- Given this volume, the challenge becomes determining what constitutes a problem that requires action. Is it when 20, 10, or even just 1 user struggles with payments in a month? This pivotal decision point calls for a comprehensive team discussion. Our platform leverages industry benchmarks and norms for various use cases to facilitate these conversations and guide teams towards an aligned definition of what constitutes a satisfactory user experience.
- As a basic guideline, this information can inform the SLO target values of the six underlying service operations.
- These targets, together with the event volume, determine the number of errors an SLO's error budget can absorb. For instance, with an event volume of 10,000 and a reliability target of 99.9%, your error budget accommodates up to 10 errors (10,000 × 0.1%), so each error consumes 10% of the budget.
- By using historical data to preview the SLO, you can retroactively assess if the target you've set is realistic. More importantly, by setting SLOs for all operations, you can swiftly pinpoint the operations that are weighing your reliability down.
- Fortunately, this process is more straightforward than it sounds. The moment you add a data source, Rely ingests your telemetry data and matches it against a curated list of reliability templates developed by field experts to generate out-of-the-box recommendations. This allows you to create dozens of SLOs in minutes with a few simple clicks.
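The relationship between event volume, target, and error budget described in the list above can be sketched in a few lines of Python. The function names and the sample history are assumptions made for illustration, not part of any product API, and the numbers are not real telemetry.

```python
from fractions import Fraction


def error_budget(event_volume, target_percent):
    """Number of failed events an SLO tolerates for a given volume.

    target_percent is passed as a string (e.g. "99.9") so the arithmetic
    stays exact instead of picking up floating-point noise.
    """
    allowed_failure_ratio = 1 - Fraction(target_percent) / 100
    return event_volume * allowed_failure_ratio


def budget_consumed(errors, event_volume, target_percent):
    """Fraction of the error budget used up by the observed errors."""
    budget = error_budget(event_volume, target_percent)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return Fraction(errors) / budget


def target_was_met(history, target_percent):
    """Check each past window, given as (events, errors), against the target."""
    return [
        errors <= error_budget(events, target_percent)
        for events, errors in history
    ]


# Illustrative numbers only, not real telemetry.
past_weeks = [(2500, 1), (2400, 4), (2600, 2)]

print(error_budget(10_000, "99.9"))        # 10
print(budget_consumed(1, 10_000, "99.9"))  # 1/10
print(target_was_met(past_weeks, "99.9"))  # [True, False, True]
```

Running the same check over historical windows is one simple way to judge whether a proposed target is realistic before committing to it: a target that past data would have missed most weeks is probably too strict, or the operation needs work.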
With SLOs in place, the product catalog shows its true value, enabling stakeholders to identify the health of both services and user journeys at a glance. It provides a dedicated health dashboard for each of these entities, offering comprehensive insights into the functionality and performance of your product.
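As a rough illustration of how per-operation SLO status could roll up into a journey-level view, the sketch below treats a journey as healthy only when every operation it depends on is meeting its SLO. This roll-up rule, the function name, and the operation names are assumptions for the example; the source does not describe how the catalog actually computes health.

```python
def journey_health(operation_status):
    """Summarize a journey from {operation name: is its SLO currently met?}."""
    failing = sorted(op for op, ok in operation_status.items() if not ok)
    if not failing:
        return "healthy"
    # List the offending operations so the weak link is visible at a glance.
    return "degraded: " + ", ".join(failing)


# Hypothetical status snapshot for the 'Subscription Purchase' journey.
subscription_purchase = {
    "web.POST /subscribe": True,
    "payments.charge": False,
    "accounts.activate": True,
}

print(journey_health(subscription_purchase))  # degraded: payments.charge
```

A stricter or more lenient rule, such as weighting operations by traffic, would fit the same shape.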
Try out Rely.io
- If you want to know more or see Rely in action, book a demo with us.
- If you want to discuss industry best practices, learn how your peers implement them, and contribute to our product direction, join our Slack Community.