What is Performance Monitoring?
Developing an amazing product is only part of the job. Ensuring that it works as intended and meets your quality standards is a whole other story. It is normal for companies of all types to face observability issues, where it is hard to truly understand the internal state of a complex system. These issues can compromise your ability to measure key performance metrics about the usability of your product, and therefore your ability to grow to the next level.
Performance Monitoring is the art of ensuring the quality and performance of your services. It involves generating accurate performance data, processing it into actionable information, and choosing how to provide that information to agents that can act upon it, via dashboards, alerts, reports, etc.
Many modern services provide metrics out-of-the-box that you can use for monitoring purposes. But as you scale, and if you want to dig deeper and create insights that are customized to your service, you'll need to instrument your code base.
Site Reliability Engineering (SRE) emerged as the state-of-the-art set of practices for monitoring services. Its three main value propositions are:
- Ensuring quality standards
- Mitigating problems before they become incidents
- Helping you understand where to focus engineering efforts
In this article we’ll go over how to instrument your code base with an SRE mindset using Prometheus, one of the industry-standard tools for system monitoring. We’ll divide this into two sections: the first will inform you about what data you’re trying to generate and why; the second will provide a high-level introduction to Prometheus and elaborate on how you can easily achieve what was described in the first part.
The SRE mindset
Using Prometheus to instrument your service
Prometheus allows you to effortlessly instrument your code base and generate insightful metrics. Once you know its basics, it’ll be relatively easy to apply the RED method to any of your services. Prometheus has been extensively explained in many other articles, so we won’t go over every small detail here, just the basics. There are four kinds of metrics that you can create:
- Counters - a cumulative metric that can only increase or be reset (you can use a counter to represent the number of requests served, tasks completed, or errors)
- Gauges - a numerical value that can arbitrarily go up and down
- Histograms - samples of observations, counted in configurable buckets according to their raw values
- Summaries - samples of observations, similar to histograms, but reporting configurable quantiles instead of buckets
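As a quick illustration, here is a minimal sketch of how each type is created with the official Python client (the metric names and bucket boundaries below are purely hypothetical):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Hypothetical metrics, one of each type
requests_total = Counter("requests_total", "Number of requests served")
jobs_in_progress = Gauge("jobs_in_progress", "Jobs currently being processed")
request_latency = Histogram(
    "request_latency_seconds",
    "Request latency in seconds",
    buckets=(0.1, 0.5, 1.0, 2.5, 5.0),
)
response_size = Summary("response_size_bytes", "Size of responses in bytes")
```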
You’ll only need two metric types to fully implement the RED method: counters (to keep track of your requests and errors) and histograms (to measure durations). Another concept that’s important to understand is the way Prometheus identifies its metrics: through labels. Labels are key-value pairs that, together with the metric’s name, identify a unique time series. This is the industry standard because it allows for a very precise yet modular data structure that eases aggregation and decoupling across many different dimensions. Notice the following example, with the metric name source_requests_total and the label keys env, job, source, monitor, service, instance and task_arn.
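A series for that metric might look something like this (the label values below are purely illustrative):

```
source_requests_total{env="prod", job="payments", source="api", monitor="prometheus", service="payment-processor", instance="10.0.1.17:8000", task_arn="arn:aws:ecs:..."} 1027
```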
Let’s jump into a practical example using Python, which hopefully makes it easier to understand. For this we’ll be using the official Python client for Prometheus. Imagine that you’re an engineer tasked with measuring the performance of your company’s payment processing service, which looks something like this:
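Something along the lines of the following sketch; the function and exception names (process_payment, execute_payment, KnownPaymentError) are hypothetical stand-ins:

```python
import logging

logger = logging.getLogger(__name__)

class KnownPaymentError(Exception):
    """An error the service knows how to handle (e.g. a declined card)."""

def process_payment(payment_request):
    try:
        # Attempt the actual job
        return execute_payment(payment_request)
    except KnownPaymentError as exc:
        # Known failure: handle it and signal the error with -1
        logger.warning("Payment failed with a known error: %s", exc)
        return -1
    except Exception:
        # Unknown failure: log it and re-raise so a higher level can handle it
        logger.exception("Payment failed unexpectedly")
        raise
```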
It’s a complex service in the background, but if you really think about it, its main workflow is pretty vanilla: you try to perform a certain job; if you fail with a known error, you handle it and return an error code (-1 in this case); if you fail with an unknown error, you just log it and re-raise the exception so another level of your code base can handle it. This logic allows for three mutually exclusive final states:
- Either the service terminates successfully
- Or it fails, but the error is handled and doesn’t break execution
- Or it fails miserably, while raising an exception and breaking execution
Let’s create a general Prometheus logging function that covers all three states. But first, we start with our imports and with the initialization of the three metrics we wish to measure:
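A minimal sketch with the official client could look like this (the metric names and label keys are assumptions for this example, not something prescribed by Prometheus):

```python
from prometheus_client import Counter, Histogram

# Label keys shared by all three metrics (the values are set later)
LABEL_KEYS = ["service", "env"]

REQUESTS = Counter(
    "payment_requests_total",
    "Total number of payment requests received",
    LABEL_KEYS,
)
ERRORS = Counter(
    "payment_errors_total",
    "Total number of payment requests that ended in an error",
    LABEL_KEYS,
)
DURATION = Histogram(
    "payment_request_duration_seconds",
    "Time spent processing a payment request, in seconds",
    LABEL_KEYS,
)
```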
Then, we can already define the functions that feed data into these metrics, as well as the label values that characterize them.
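For instance, something like the following (the label values are again just illustrative):

```python
# Label values that characterize this particular service
LABEL_VALUES = {"service": "payment-processor", "env": "prod"}

def emit_success_event(duration_seconds):
    # A successful request: count it and record how long it took
    REQUESTS.labels(**LABEL_VALUES).inc()
    DURATION.labels(**LABEL_VALUES).observe(duration_seconds)

def emit_error_event(duration_seconds):
    # A failed request: count it both as a request and as an error
    REQUESTS.labels(**LABEL_VALUES).inc()
    ERRORS.labels(**LABEL_VALUES).inc()
    DURATION.labels(**LABEL_VALUES).observe(duration_seconds)
```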
Now, regarding the actual instrumentation logic: while not mandatory, it is good practice to keep your instrumentation code as abstracted from your actual service code as possible, in order to avoid undesired entanglements and dependencies. Instrumentation should be a silent observer that isn’t felt and doesn’t interfere, yet captures the desired details of your service’s performance so you can act upon this data. As such, we’ll make use of Python decorators, which serve as a nice abstraction layer that extends your code’s capabilities:
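One possible sketch of that decorator, consistent with the explanation that follows, is:

```python
import functools
import time

def prometheus_logger(error, duration_seconds):
    # Emit the metric events defined above, depending on whether there was an error
    if error:
        emit_error_event(duration_seconds)
    else:
        emit_success_event(duration_seconds)

def error_validator(response):
    # Our service signals a handled failure by returning -1
    return response == -1

def instrumentation_log(target_function):
    @functools.wraps(target_function)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            response = target_function(*args, **kwargs)
        except Exception:
            # Unhandled failure: record it, then let the exception propagate
            prometheus_logger(error=True,
                              duration_seconds=time.perf_counter() - start)
            raise
        # Handled path: decide success vs. error based on the response
        prometheus_logger(error=error_validator(response),
                          duration_seconds=time.perf_counter() - start)
        return response
    return wrapper
```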
There’s a lot happening here, so let’s explain it bit by bit.
- We start with our logging function, prometheus_logger, which simply calls the previously defined logic to emit metric events according to whether there was an error or not.
- Then we have our error_validator that decides whether the response returned by our service constitutes a success or not.
- And finally we have our Python decorator, instrumentation_log, which provides a wrapper around our service. It basically just calls whichever function it is decorating (received as target_function) and then applies some simple logic to decide which Prometheus events it should emit.
Applying our instrumentation to our service is then only a matter of adding a single line of code above our service’s main function. As clean as that, perfectly abstracted:
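Using the hypothetical names from the sketches above, that single line is just the decorator:

```python
@instrumentation_log
def process_payment(payment_request):
    ...  # body unchanged from the earlier sketch
```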
By enabling some more complex labeling and by adapting the error_validator function to each particular use case, this logic can easily be re-used to cover any scenario where you want to apply the RED method.
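For example, a hypothetical HTTP-based service could swap in a validator that inspects status codes instead of a -1 return value:

```python
def http_error_validator(response):
    # Treat server-side errors (5xx) as failed requests
    return response.status_code >= 500
```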
Your first SLI
It's time to take it home! Assuming that you have set up your Prometheus instance and that you’re collecting your instrumentation metrics (we won’t go into much detail about that here, as it would fall outside the scope of this article, but there are lots of resources that can help you), you are now ready to set up your first SLI. You can actually just sign up to your detech.ai workspace and be done with this in a few seconds. This is what we’ll use to showcase the insights you could generate.
So this is what the instrumentation end-goal looks like:
Don’t be fooled! The errors are there, they just look way too tiny compared to successful requests. If we hide the requests metric, you can clearly see them.
We then use these two beauties to calculate the underlying SLI, which can be seen below. What are you looking at exactly? As mentioned, the SLI is simply the proportion of good events over valid events for a given period of time. For this example, we chose a rolling window of 7 days, which means that each SLI measurement uses the event data of the last 7 days. As good events occur, the SLI increases towards 100%; as the amount of errors increases, the SLI is pushed towards zero. It’s simple, yet effective.
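For reference, a PromQL expression along these lines approximates the same ratio, assuming the hypothetical metric names from the earlier sketches:

```
1 - (
  sum(increase(payment_errors_total[7d]))
  /
  sum(increase(payment_requests_total[7d]))
)
```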
Outages, where severe incidents take down your system, are marked by a sudden decrease in your SLI as errors flood in. Bugs or poorly implemented parts of your code that break for some edge cases are more silent, but can be identified by occasional recurring drops of your indicator.
This little chart can be the first brick of your monitoring journey, providing the window of observability your organization is craving. As mentioned before, this is actually just the beginning of a vast sea of possibilities; there are other SRE concepts that are built on top of your SLI. You can set up objectives based on your quality standards to monitor how close you are to breaking them, which implicitly creates a budget for the amount of errors that you are allowed to endure (for example, a 99.9% objective over 30 days leaves a budget of 0.1% of events that may be bad).
This also lets you know how fast or slowly you are consuming that budget and make appropriate decisions ahead of time, and you can use it to set up alerts so you’re notified whenever things start to break.
We at detech.ai are here to make your life as easy as possible and help you throughout your SRE journey. You handle the instrumentation, we handle the monitoring.