Mastering Prometheus Recording Rules

Working with Prometheus queries can feel like navigating a maze. You’ve likely wrestled with complex PromQL while trying to extract exactly the metrics you need. Prometheus Recording Rules simplify this: they store the results of frequently used queries as new time series, so you work with precomputed data and your queries become simpler and faster.

What are Prometheus Recording Rules?

Think of Prometheus Recording Rules as a way to build custom metrics from existing ones. A rule takes the result of a PromQL query and stores it as a new time series, which can then be queried with much less effort. This lets you compute expensive, complex expressions ahead of time and query the result as a simple, precomputed value, which makes queries faster and simplifies dashboard and alert definitions.

A recorded metric behaves like any other metric in Prometheus: it can be queried, graphed, and used in alerts. The only difference is that its data is derived from a query rather than scraped directly from a target. This also lets you create metrics that your targets don’t expose directly, or that need a fair amount of calculation to be useful.

Why are Prometheus Recording Rules Useful?

The main advantage is simplicity and speed. You take a complex query, have Prometheus evaluate it on a schedule, and store the result. You then query that result as a simple metric, which makes dashboard creation and alert definition more straightforward and less resource-intensive.

Faster Queries

Without Recording Rules, your dashboards could be running complex PromQL queries every few seconds. This can tax your Prometheus server and slow down your dashboards. With Recording Rules, the complex query is computed once per evaluation, and your dashboards query a much lighter metric. The result is a smoother, faster dashboard.

Simplified Dashboards

Recording Rules simplify your dashboards by moving complex PromQL out of the graphs. Panels just reference precomputed metrics, which makes dashboards easier to read, share, and maintain. This is especially handy when you work in a team, because everybody uses the same named metrics.

Better Alerting

Complex alert rules can be a challenge to debug. With Recording Rules, you can break complex alert logic down into simpler pieces. This makes your alerts easier to understand and maintain, and you don’t have to rewrite the same logic over and over again.

Conserving Resources

Recording Rules are not just about ease and speed; they are also about efficiency. An expensive query is evaluated once per evaluation interval instead of once per panel on every dashboard refresh, which cuts the query load on your Prometheus server. The trade-off is that every recorded series takes up storage like any other series, so record what you actually reuse.

How do Prometheus Recording Rules Work?

Recording Rules live in rule files, which are separate from the main Prometheus configuration. You tell Prometheus where to find those files in its configuration. Here’s a quick look at how they work.

Configuration

First, you define your recording rules in a YAML file. Each rule defines:
– a new metric name
– the PromQL query that produces that metric
– optionally, a set of extra labels to attach to the result

Evaluation

Prometheus periodically runs the queries defined in your recording rules and stores the results as new time series. How often this happens is set by evaluation_interval in the global section of the Prometheus configuration (the default is one minute), and each group can override it with its own interval setting.
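
As a rough sketch of the per-group override, the rule file below evaluates one group every 30 seconds regardless of the global evaluation_interval. The group name, metric names, and the 30s value are placeholders chosen for illustration.

```yaml
# Rule file (for example rules/my_rules.yml; the file name is an assumption).
groups:
  - name: my_recording_rules
    # Optional: overrides global.evaluation_interval for this group only.
    interval: 30s
    rules:
      - record: my_new_metric
        expr: sum(my_metric)
```

An interval much shorter than the scrape interval of the source metrics mostly re-records the same values, so it usually makes sense to keep the two close.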

Storage

The results of your recording rules are stored like any other metric in Prometheus. That means they are available for querying, graphing, and alerting in exactly the same way.

Querying

You can query the new metrics by name, as you would any other metric. This simplifies dashboards and alerts, because each one references a single metric instead of repeating a complex query.

Creating Prometheus Recording Rules

Let’s get down to the actual way of doing things. The structure of a recording rule file is simple and easy to understand.

Recording Rule File Structure

You start with a group. Each group holds a set of related recording rules. Rules within a group run sequentially at a regular interval and share the same evaluation time, so a group is a natural home for rules that belong together or build on each other.

A rule_files section in the Prometheus config tells Prometheus where your rule files are.

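Here is a basic structure example. Note that it spans two files: rule_files goes in the main Prometheus configuration, while groups goes in a rule file (the name my_rules.yml is just an example).

# prometheus.yml (main configuration)
rule_files:
  - "rules/*.yml"

# rules/my_rules.yml (a rule file matched by the glob above)
groups:
  - name: my_recording_rules
    rules:
    - record: my_new_metric
      expr: sum(my_metric)
    - record: my_second_new_metric
      expr: avg(my_other_metric)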

In this example:
* rule_files, set in the main Prometheus configuration, tells Prometheus to load every .yml file in the rules/ directory.
* groups, set in a rule file, holds the named rule groups. Here there is one group, my_recording_rules.
* rules is the list of recording rules in that group. Each rule has the following fields:
* record is the name of the new metric. This is the name you query later.
* expr is the PromQL query that computes the result.

Defining a Simple Recording Rule

Now, let’s dive a little deeper into a simple recording rule and how it is defined:

groups:
  - name: my_recording_rules
    rules:
    - record: my_summed_metric
      expr: sum(my_original_metric)

In this example:
– record: The name of the new metric (my_summed_metric) that you will use in your dashboards or alerts.
– expr: The PromQL expression (sum(my_original_metric)), which calculates the sum of all values of the metric my_original_metric.

The new metric, my_summed_metric, can then be queried just like a metric scraped from a target.

Using Labels in Recording Rules

Labels are a key part of working with Prometheus, and Recording Rules are no different. Labels give your metrics context. Let’s extend the above example so the recorded metric keeps its instance and job labels:

groups:
  - name: my_recording_rules
    rules:
    - record: my_summed_metric
      expr: sum(my_original_metric) by (instance, job)

The by (instance, job) part ensures that the sum is calculated for each combination of instance and job labels.

So, this query:
sum(my_original_metric)
will output a single value: the sum of my_original_metric across every instance and every job that sends this metric to Prometheus. This is not that useful, because you can’t tell which instances or jobs contributed to it.

While this query:
sum(my_original_metric) by (instance, job)
will output a value for every combination of instance and job: one for instance my-server-a and job my-backend, another for instance my-server-b and job my-frontend, and so on. This makes the metric useful, because you know where each value comes from.

This gives you a more granular view of your data and makes it easier to filter the results based on your specific needs. You also keep all the features of Prometheus labels, such as filtering and further aggregation.
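
Besides the labels that come out of the expression, a rule can attach static labels through its optional labels field. The sketch below assumes a made-up team label; neither the label name nor its value comes from a real setup.

```yaml
groups:
  - name: my_recording_rules
    rules:
      - record: my_summed_metric
        expr: sum(my_original_metric) by (instance, job)
        # Static labels added to every series this rule writes.
        # "team: backend" is a hypothetical value for illustration.
        labels:
          team: backend
```

Each recorded series would then carry instance, job, and team, so it can be filtered with a selector such as my_summed_metric{team="backend"}.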

Working with Rates

Rates matter most for counters. A counter is a cumulative value that only ever increases (or resets to zero when the process restarts), so its raw value says little on its own. To graph a counter or alert on it, you calculate its rate of increase.

Here is how you can calculate a metric that represents the rate of increase per second of another metric, in a Recording Rule:

groups:
  - name: my_recording_rules
    rules:
    - record: my_metric_rate
      expr: rate(my_counter_metric[5m])

In this example:
– rate(my_counter_metric[5m]) calculates the per-second average rate of increase of my_counter_metric over the last 5 minutes.

This rate is far more useful for graphing and alerting than the raw counter value, which isn’t very descriptive and is hard to alert on.

Complex Recording Rule Example

Let’s see a more complex real-world example of using a Recording Rule that makes a composite metric out of multiple other metrics.

groups:
  - name: my_complex_rules
    rules:
    - record: my_requests_per_second
      expr: sum(rate(my_http_requests_total[5m])) by (job, instance)
    - record: my_average_request_duration
      expr: sum(rate(my_http_request_duration_seconds_sum[5m])) by (job, instance) / sum(rate(my_http_request_duration_seconds_count[5m])) by (job, instance)
    - record: my_requests_duration_per_second
      expr: my_requests_per_second * my_average_request_duration

In this example:
– my_requests_per_second is computed from my_http_requests_total, and it represents the number of requests per second per job and instance.
– my_average_request_duration is computed from the my_http_request_duration_seconds_sum and my_http_request_duration_seconds_count metrics, representing the average duration of the requests in seconds per job and instance.
– my_requests_duration_per_second multiplies the two recorded metrics to give the total request time handled per second, per job and instance. Because rules in a group run in order, it can build on the two rules defined before it.

You can chart my_requests_duration_per_second to see how much request time each instance is handling, and alert on my_average_request_duration to catch slow services.

Best Practices for Prometheus Recording Rules

Even though Recording Rules are a powerful way to improve your monitoring, you need to follow some best practices to get the most out of them.

Naming Conventions

Use descriptive names for your new metrics, so that they can be easily identified and understood. For example, instead of new_metric_1, use http_requests_per_second. Make the naming scheme consistent and easy to understand for the whole team.
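
The Prometheus documentation suggests a level:metric:operations scheme for recorded metrics, where level lists the labels the result is aggregated to, metric is the source metric name (with _total dropped when a rate is taken), and operations lists the functions applied. A small sketch, reusing the example counter from earlier in this guide:

```yaml
groups:
  - name: http_rules
    rules:
      # level = job, metric = my_http_requests, operations = rate5m
      - record: job:my_http_requests:rate5m
        expr: sum by (job) (rate(my_http_requests_total[5m]))
```

The colons make recorded metrics easy to tell apart from scraped ones at a glance, and the name alone tells a teammate how the value was produced.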

Avoid Redundancy

Ensure you’re not recomputing data that already exists. It is better to reuse existing recording rules rather than create similar rules. This can make your recording rules easier to understand, debug, and share with the team.

Keep Rules Simple

Start with simpler rules and increase complexity if necessary. This makes it easier to debug and manage the rules. Complex rules can be hard to read and debug.

Review and Update

Periodically review and update your rules to make sure they match your monitoring needs. You might not need all of your rules as you change your systems, so deleting old rules is also a good practice.

Document Your Rules

Add a comment on each rule, explaining what it is doing. The team will appreciate the time you take to do this. Clear documentation will help you and others understand the rules later, and can help you debug things.
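
Rule files are plain YAML, so ordinary # comments work. One possible layout, with placeholder wording you would adapt to your own rules:

```yaml
groups:
  - name: my_recording_rules
    rules:
      # Per-second request rate for each job and instance, averaged over 5m.
      # Used by: (list the dashboards or alerts that depend on this rule)
      - record: my_requests_per_second
        expr: sum(rate(my_http_requests_total[5m])) by (job, instance)
```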

Test Your Rules

Test your rules before deploying them. Run each expression in the Prometheus expression browser to confirm it returns the data you expect, and use promtool check rules to validate the syntax of the rule file.
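
Rules can also be unit tested with promtool test rules, which feeds synthetic series into the rules and checks the recorded output. The sketch below assumes a rule file named my_rules.yml that contains the my_summed_metric rule with by (instance, job); the file names and the path label are made up for the example.

```yaml
# rules_test.yml (hypothetical name); run with: promtool test rules rules_test.yml
rule_files:
  - my_rules.yml   # assumed to define my_summed_metric

evaluation_interval: 1m

tests:
  - interval: 1m
    # Two input series on the same instance and job, constant values 1 and 2.
    input_series:
      - series: 'my_original_metric{job="my-backend", instance="my-server-a", path="/a"}'
        values: '1+0x5'
      - series: 'my_original_metric{job="my-backend", instance="my-server-a", path="/b"}'
        values: '2+0x5'
    promql_expr_test:
      - expr: my_summed_metric
        eval_time: 5m
        exp_samples:
          # The rule sums away "path", leaving one series with value 1 + 2.
          - labels: 'my_summed_metric{instance="my-server-a", job="my-backend"}'
            value: 3
```

Keeping a test like this next to the rule file lets you catch a broken expression or a missing label before the rule reaches a running server.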

Debugging Prometheus Recording Rules

Debugging Recording Rules can be a headache if not handled well. Here are some things you can check.

Query the Metric

If your recording rules aren’t working as they should, start by querying the new metric directly in the Prometheus expression browser and check whether values are being recorded at all. If the metric isn’t there, confirm on the Rules page of the Prometheus UI that the rule file was loaded, then look at the query itself.

Check your Expr

Check that your expr returns the value you expect. Run the exact expression from your recording rule on its own to see whether it returns the proper data. You can do this in the Prometheus expression browser, or with a tool like promtool.

Check the Labels

Check to see if your labels are correct. Make sure the by() part contains the correct labels, as these labels will define how your metrics are broken down.

Review the Prometheus Logs

Check the Prometheus logs to see if there are errors related to recording rules. This will be helpful when debugging syntax errors or any other configuration issue. You should also check the logs for errors related to query timeouts.

When to Use Prometheus Recording Rules

Knowing when to use Recording Rules is key to making the most of them. Here are some common use cases:

Complex Calculations

When you need to calculate complex metrics out of other metrics, Recording Rules are the way to go. You can use rates, averages, and a host of other functions to get the exact metrics you want.

Aggregated Metrics

When your use case requires you to aggregate metrics from many sources into one, Recording Rules can help. This includes things like summing metrics by different labels, or dividing one metric by another to create a ratio.

Simplifying Queries

When your PromQL queries get too complex for dashboards and alerts, Recording Rules can be a great help. You can create a new metric that represents what you need.

Historical Data

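If you need longer-term summaries of existing metrics, Recording Rules are the right tool. With functions like avg_over_time and max_over_time you can record the average or maximum of a metric over the last hour, day, or even week. Keep in mind that a recorded series only starts when the rule is first loaded; it is not filled in for earlier time ranges automatically.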
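
A minimal sketch of such summary rules, assuming my_original_metric is a gauge. The recorded names and the 5m group interval are choices made for this example, not requirements.

```yaml
groups:
  - name: my_history_rules
    # Long windows change slowly, so a relaxed interval is usually enough.
    interval: 5m
    rules:
      # Average of each series over the trailing hour.
      - record: my_original_metric:avg_over_time_1h
        expr: avg_over_time(my_original_metric[1h])
      # Maximum of each series over the trailing day.
      - record: my_original_metric:max_over_time_1d
        expr: max_over_time(my_original_metric[1d])
```

Both functions work per series, so the recorded metrics keep all the labels of the source metric.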

When Not to Use Prometheus Recording Rules

Like every tool, Recording Rules are not meant for every use case. Here are some instances when they might not be the best option:

Simple Queries

If the query is already simple, there is little reason to create a recording rule for it. Creating a recording rule for a simple query may even add some overhead and make things more difficult.

Ad-Hoc Analysis

For ad-hoc analysis, Recording Rules are not really needed. For a one-time query, it is better to run the PromQL directly in the Prometheus expression browser.

Highly Volatile Metrics

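Every series a recording rule outputs is stored as a new time series. If the underlying metrics have label sets that change constantly, or the rule keeps many high-cardinality labels, the rule can produce a large number of short-lived series. That makes it expensive to evaluate and store, and may hurt the overall performance of your Prometheus instance.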

Combining Recording Rules with Other Prometheus Features

Recording Rules are a good feature on their own. But when combined with other Prometheus features, they become an indispensable part of your monitoring toolkit.

Alerting

Recording Rules can make complex alert logic simpler. By pre-computing metrics, your alerts can become faster to evaluate, and easier to understand.
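
As an illustration, an alerting rule can reference a recorded metric directly instead of repeating the full expression. The alert name, the 0.5 second threshold, the 10m duration, and the severity label below are all assumptions for the example.

```yaml
groups:
  - name: my_alert_rules
    rules:
      - alert: SlowRequests
        # my_average_request_duration is the recorded metric from the
        # complex example earlier in this guide.
        expr: my_average_request_duration > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average request duration above 0.5s on {{ $labels.instance }}"
```

The alert expression stays a one-line comparison, and any change to how the duration is computed happens in the recording rule alone.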

Dashboards

Recording Rules reduce the complexity of dashboards, making them easier to read and manage. This greatly enhances the experience of users that work with your dashboards.

Service Discovery

Recording rule expressions run against whatever series currently match, so targets added through service discovery are picked up automatically. New services get their recorded metrics without any change to the rule files, which saves a lot of time.

Federation

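With federation, a global Prometheus server scrapes selected series from other Prometheus servers. A common pattern is to record aggregated metrics on each server and federate only those, which gives you a global view of your systems without copying every raw series.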
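
A sketch of the scrape configuration on the global server, assuming the recorded metrics on the source servers follow a job: name prefix and that the source servers are reachable at the placeholder host names shown:

```yaml
# prometheus.yml on the global server
scrape_configs:
  - job_name: federate
    # Keep the job and instance labels as recorded on the source servers.
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull only recorded series whose names start with "job:".
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'   # placeholder host names
          - 'source-prometheus-2:9090'
```

Federating only pre-aggregated series keeps the global server small, while the detailed raw data stays on the servers that scraped it.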

Advanced Techniques for Prometheus Recording Rules

Here are some advanced techniques that you can use to take Recording Rules to the next level.

Dynamic Labels

You can generate new labels based on existing ones, using functions like label_replace. This allows you to re-label existing metrics, which can help when you want to aggregate your data using specific labels.
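
For example, label_replace can derive a host label from the instance label by dropping the port. This is a sketch that assumes instance values look like my-server-a:9100; the recorded metric name and the host label are made up.

```yaml
groups:
  - name: my_relabel_rules
    rules:
      - record: my_original_metric_by_host
        # Arguments: input vector, destination label, replacement,
        # source label, regex. The regex must match the whole value;
        # series whose instance has no ":port" suffix get no host label.
        expr: |
          sum by (host) (
            label_replace(my_original_metric, "host", "$1", "instance", "([^:]+):.*")
          )
```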

Multi-Tenancy

You can use Recording Rules to generate metrics per tenant. This is useful for managed services and multi-tenant systems, where you want to break down your metrics by the tenant that is consuming the service.

Custom Functions

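PromQL does not let you define your own functions, so recording rules fill that role. For a very specific use case, record an intermediate result under its own name and build further rules on top of it, the way a function would wrap reusable logic. Long chains of rules get complex quickly, so only do this when necessary.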

Real-World Examples of Prometheus Recording Rules

Here are some more real-world examples of Recording Rules in different environments and setups:

Web Application Monitoring

For a web application you can use Recording Rules to compute metrics like:
– Requests per second
– Error rates per endpoint
– Average request duration per endpoint
– Number of active users
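
As one possible shape for the error-rate item, the rule below records the share of requests that returned a 5xx status, per job and endpoint. It assumes the example counter my_http_requests_total carries status and endpoint labels, which this guide has not established.

```yaml
groups:
  - name: my_web_rules
    rules:
      - record: job_endpoint:my_http_requests:error_ratio_rate5m
        # Failed requests divided by all requests over the same window.
        expr: |
          sum by (job, endpoint) (rate(my_http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job, endpoint) (rate(my_http_requests_total[5m]))
```

The result is a value between 0 and 1. An endpoint with no 5xx series in the window produces no sample at all rather than a 0.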

Database Monitoring

For a database you can use Recording Rules to compute metrics like:
– Number of active connections
– Average query time
– Number of slow queries
– Cache hit rate

Infrastructure Monitoring

For your infrastructure you can compute metrics such as:
– CPU usage by host
– Memory usage by host
– Network throughput
– Disk utilization
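
If your hosts run node_exporter on Linux, the first two items could be sketched as follows. The recorded metric names are only suggestions.

```yaml
groups:
  - name: my_infrastructure_rules
    rules:
      # Fraction of CPU time not spent idle, averaged across all cores.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Fraction of memory in use, based on the kernel's "available" estimate.
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```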

Batch Job Monitoring

For batch jobs you can use Recording Rules to compute things like:
– Number of jobs run per hour, day and week
– Average job execution time
– Number of successful and failed jobs

Prometheus Recording Rules: A Summary

Prometheus Recording Rules are a vital part of managing and monitoring your systems effectively. By precomputing your metrics, you can streamline dashboards, simplify alerting, and save resources on your Prometheus server. Use this guide as a starting point for adding recording rules to your own monitoring setup.

Mastering Prometheus Queries: Are Recording Rules the Missing Piece?

Now that you’ve explored Prometheus Recording Rules, you can see how they can change the way you monitor your systems. They let you extract insights from complex metrics and turn them into actionable data. If you’ve been struggling with complex queries and slow dashboards, mastering Recording Rules may be the missing piece that takes your monitoring to the next level.