Bernard Pietraga

Practical solutions for dealing with high cardinality metrics


High cardinality metrics can be cumbersome to work with and can hurt your real-world observability, even though in theory they give you data with more dimensions. They are challenging because they involve a large number of distinct label values, and every distinct combination of values becomes its own time series. High cardinality leads to increased memory consumption, slower query performance and difficulties in interpreting and visualising the data, and individual queries against the monitoring system become expensive to compute.

This post is for users of systems that only work well with low to medium metric cardinality.

Remember that the choice of approach will depend on the specific characteristics of your data, the business context and the goals of your analysis. It’s important to experiment with different techniques and evaluate their effectiveness in overcoming the challenges posed by high cardinality metrics.

There is no "one size fits all" solution, as many issues with metrics depend on the architecture and on how the metrics are ingested. In most cases, when cardinality gets out of hand and scaling the metrics backend is no longer feasible, the solution is to reduce the cardinality itself. The big question is how to achieve this without losing too much observability. My plan is to present practical solutions, starting with simple ones and then moving on to more complex numerical approaches.

If you are new to the subject, Microsoft has a decent writeup on logging versus metrics versus tracing.

If you are interested in observability as a whole, the free Honeycomb edition of the O’Reilly book "Observability Engineering" by Charity Majors, Liz Fong-Jones and George Miranda will be a good resource for you. It touches on this topic at a fairly high level, and some of the ideas presented here, such as sampling, are covered in great detail in the book.

Examples of problems caused by high cardinality

A good example of this is VictoriaMetrics, where the default configuration fails to return query results for metrics whose cardinality exceeds 300,000 entries.

Another example might be that your AWS GetMetricData API calls become unbearably expensive, or queries simply fail with "too many data points requested in query A, please try to reduce the time range".

Google Cloud even has a good writeup on Stackdriver metrics that covers the topic of this article.

Last but not least, your Grafana dashboards built on PromQL queries may load slowly when there is no caching layer for intermediate data, such as Trickster.


Filters and transformations on metrics at your disposal

Before we look at ways to tackle and reduce cardinality, let’s go through the most common tools provided by metrics backends such as Prometheus. More sophisticated solutions will be discussed later.

  • Label filtering allows you to evaluate the labels used in your metric data and identify labels that have a high cardinality but are not necessary for your analysis. You can exclude or limit the use of such labels in your queries to reduce the cardinality. Avoid using high cardinality labels in the group_by clause unless necessary.

  • Label value filtering is slightly different. It analyses the values of individual labels and identifies those with high cardinality that are not essential to your analysis. You can filter out specific label values or limit their use in queries to reduce cardinality. For example, if a label has many different values, consider excluding some values or grouping similar values together.

  • Data aggregation. Use aggregation capabilities to reduce the cardinality of your data. Aggregating data over time intervals can significantly reduce the number of data points and improve query performance. Use functions such as sum(), avg() or max() to summarise data over specific time periods or by relevant dimensions.

  • Rollup and pre-aggregation are very useful. Prometheus supports recording rules, which allow you to pre-aggregate data at different resolutions and store it as separate time series. This helps to reduce cardinality by providing summarised data for longer periods. You can define recording rules in the Prometheus configuration or use tools such as VictoriaMetrics or Thanos to handle the pre-aggregation.

  • Downsampling involves reducing the resolution of your data by aggregating it into fewer data points over larger time intervals. This can significantly reduce the cardinality while still preserving the general trends in the data. Tools such as Thanos or VictoriaMetrics offer downsampling capabilities.

  • Relabelling allows you to change or remove labels in Prometheus without changing the original metric data. You can use relabelling to merge or remove labels to selectively reduce cardinality. Be careful when using relabelling as it can affect queries and alerting rules. A configuration sketch for both relabelling and pre-aggregation follows this list.

  • Federation can also help. If you have multiple backend instances in a federated setup, consider using federation to aggregate and reduce the cardinality of metrics across different instances. This allows you to have a centralised view of metrics while minimising the cardinality in each individual Prometheus server.
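
As a rough illustration of the relabelling and pre-aggregation items above, here is a minimal Prometheus configuration sketch. The metric and label names (http_requests_total, request_id, pod) and the file layout are assumptions made for the example, not taken from a real setup.

# prometheus.yml: drop a high cardinality label at scrape time
scrape_configs:
  - job_name: 'example-app'
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # remove the request_id label before samples are stored
      - action: labeldrop
        regex: request_id

# rules.yml: pre-aggregate away the pod label with a recording rule
groups:
  - name: rollup
    interval: 1m
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum without (pod) (rate(http_requests_total[5m]))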


Simple approaches to high cardinality

Here are the simplest, though not necessarily the best, solutions to the problem. Sometimes they are good enough. Skip ahead to the numerical section for more interesting approaches.

Dropping labels responsible for high cardinality

The simplest way to reduce cardinality is to drop the labels or values that introduce it. While this sounds trivially simple, the decision usually involves offloading some of the observability to systems that provide traceability. This is acceptable if it doesn’t interfere too much with your ability to debug and measure systems. It is an unfortunate compromise, as potentially valuable data may be missing when you need it.

Let’s look at a PromQL-style example with example_metric. It has 3 labels, each with 2 possible values. This gives us a set with a total cardinality of 8.

example_metric{ "a"=1, "b"=1, "c"=1 }
example_metric{ "a"=0, "b"=1, "c"=1 }
example_metric{ "a"=0, "b"=0, "c"=1 }
example_metric{ "a"=0, "b"=1, "c"=0 }
example_metric{ "a"=1, "b"=1, "c"=0 }
example_metric{ "a"=1, "b"=0, "c"=0 }
example_metric{ "a"=1, "b"=0, "c"=1 }
example_metric{ "a"=1, "b"=0, "c"=0 }

Let’s remove the label "c". Now we have a cardinality of 4.

example_metric{ "a"=1, "b"=0 }
example_metric{ "a"=1, "b"=1 }
example_metric{ "a"=0, "b"=0 }
example_metric{ "a"=0, "b"=1 }

By removing "c", you have just reduced the visibility of your system. This may mean that you need to include it elsewhere to maintain reasonable observability. This leads us to an alternative approach.

This can be done automatically on the metrics backend with label filtering.
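
At query time, the same effect can be approximated by aggregating the label away. A minimal PromQL sketch, assuming example_metric is a series where summing makes sense:

sum without (c) (example_metric)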

Splitting the metrics into individual ones

An alternative approach is to split the metric, which gives us two different metrics from one. If a query only targets one of the metrics resulting from the split, the metrics backend is under less stress. This gives us two metrics, each with a cardinality of 4, so we still end up with 8 unique time series in total. Here we have two metrics, example_metric_without_c and example_metric_with_c:

example_metric_without_c{ "a"=1, "b"=0 }
example_metric_without_c{ "a"=1, "b"=1 }
example_metric_without_c{ "a"=0, "b"=0 }
example_metric_without_c{ "a"=0, "b"=1 }

example_metric_with_c{ "a"=1, "b"=0 }
example_metric_with_c{ "a"=1, "b"=1 }
example_metric_with_c{ "a"=0, "b"=0 }
example_metric_with_c{ "a"=0, "b"=1 }

This approach is useful in systems with microservices, where the service name can be included in the metric name instead of a label. In Grafana, for example, this leads to faster queries. The total number of unique series stays the same, but the cardinality of each individual metric is smaller.

Specific value filtering

Instead of dropping the whole label, focus on the values. Analyse the values of individual labels and identify those with high cardinality that are not essential to your analysis.

You can filter out specific label values or limit their use in queries to reduce cardinality. For example, if a label has many different values, consider excluding some values or grouping similar values together.

Let us use the split metrics from the previous example, but this time we filter out the example_metric_without_c series where "b" equals 0. As a result, we only have 6 unique series.

example_metric_without_c{ "a"=1, "b"=1 }
example_metric_without_c{ "a"=0, "b"=1 }

example_metric_with_c{ "a"=1, "b"=0 }
example_metric_with_c{ "a"=1, "b"=1 }
example_metric_with_c{ "a"=0, "b"=0 }
example_metric_with_c{ "a"=0, "b"=1 }

Manually aggregating using domain knowledge

Consider reducing the dimensionality of the metric by aggregating or grouping similar values. For example, you can group rare or infrequent values into an ‘other’ category to simplify the data. Rollup and pre-aggregation can also be useful here.
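
One way to express such a grouping in PromQL is label_replace(), which sets a coarser label based on a regular expression. The metric and label names below (http_requests_total, endpoint, endpoint_group) are assumptions made for illustration; the inner call defaults every series to an "other" group and the outer call overrides the group for endpoints matching a known pattern:

sum by (endpoint_group) (
  label_replace(
    label_replace(http_requests_total, "endpoint_group", "other", "endpoint", ".*"),
    "endpoint_group", "users", "endpoint", "/api/v1/users/.*"
  )
)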

Changing sampling

If the high cardinality metric is causing performance problems, you may want to consider sampling a subset of the data for analysis. Random or stratified sampling techniques can help you work with a manageable subset while still capturing the essence of the data. This will certainly have an impact on your overall view.
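
One Prometheus-level approximation of this is to keep only a deterministic subset of scrape targets using hashmod relabelling. This is a sketch rather than a recommendation: the job name is made up, the 1-in-10 fraction is arbitrary, and note that it samples whole targets rather than individual series:

scrape_configs:
  - job_name: 'sampled-app'
    relabel_configs:
      # hash each target address into 10 buckets
      - source_labels: [__address__]
        modulus: 10
        target_label: __tmp_hash
        action: hashmod
      # keep only bucket 0, i.e. scrape roughly 1 in 10 targets
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep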

Binning or bucketing using domain knowledge

Binning or bucketing is interesting because it can be done manually and numerically. Instead of using individual values, you can group values into bins or buckets. This approach can help reduce the cardinality while preserving the overall patterns and trends in the data. You can define the binning strategy based on domain knowledge.
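
For example, with the Python prometheus_client library you can bucket raw latency values into a small, fixed set of ranges instead of exporting them behind a high cardinality label. The metric name and bucket boundaries below are assumptions standing in for domain knowledge:

from prometheus_client import Histogram

# bucket boundaries chosen from domain knowledge about typical request latency
REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency bucketed into a fixed set of ranges",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

# each observation is counted into the matching bucket
REQUEST_LATENCY.observe(0.42)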

Data aggregation in different time periods

For example, instead of analysing data at an individual user level, you can aggregate metrics at a daily, weekly or monthly level to gain insights at a higher level of granularity.
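
As a small PromQL illustration, assuming a http_requests_total counter, the query below summarises a whole day of traffic in a single value instead of keeping per-scrape resolution:

sum(increase(http_requests_total[1d]))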


Numerical approaches

This is where the fun begins. The goal is to retain as much useful data as we need, while making the computations efficient. Here are some approaches to dealing with high cardinality metrics.

I will cover out-of-the-box solutions like Top N analysis, as well as more complex ones like building your own model and data pipeline and using it later in the rollup and pre-aggregation phase.

Top N – averages, means, percentiles, outliers

This is an out-of-the-box technique that is common in toolkits. Focus on the top N values that contribute most to the metric: you prioritise the most important and influential values and ignore the rest, thereby reducing cardinality. In Prometheus, you can perform a Top N analysis first and later adjust your setup so that only those series are kept.

Analysing the data – Top N

Construct a PromQL (Prometheus Query Language) query that retrieves the metric data you want to analyse. Specify the metric name and any necessary filters to narrow down the dataset.

Apply an aggregation function such as sum(), avg() or max() to the metric data.

Use the topk() function in PromQL to sort the aggregated data based on the metric value and restrict the results to the top N values. The topk() function takes two arguments: the number of top values (N) you want to focus on and the metric expression.
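
Putting these steps together, a minimal query could look like the one below. The metric name, label and rate window are assumptions for the example:

topk(10, sum by (service) (rate(http_requests_total[5m])))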

Applying a transformation – Top N

After analysing the data, you can transform it in the metrics backend. You have several tools at your disposal to apply rollup and pre-aggregation, and you can record only the series you are interested in, in this case the top N.
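
A hedged sketch of what this could look like as a Prometheus recording rule, reusing the assumed query from above. Keep in mind that the set of series returned by topk() can change between evaluations, so the recorded data is an approximation:

groups:
  - name: top_n_rollup
    rules:
      - record: service:http_requests:top10_rate5m
        expr: topk(10, sum by (service) (rate(http_requests_total[5m])))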

Let’s move on to the next approach.

Numerical binning or bucketing

Binning or bucketing has already been mentioned. Consider using advanced analytics techniques such as clustering or anomaly detection to identify patterns or outliers within the high cardinality metric. These techniques can help uncover hidden insights and provide more meaningful analysis of the data.

In the numerical approach, you can define the binning strategy using statistical methods such as equal width, equal frequency or clustering algorithms. You may be interested in K-means clustering (https://en.wikipedia.org/wiki/K-means_clustering).
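
As a quick sketch of the first two strategies, assuming the raw values have already been exported into a pandas Series:

import pandas as pd

values = pd.Series([3, 7, 12, 18, 25, 31, 44, 58, 71, 90])

# equal width: 4 bins covering ranges of identical size
equal_width = pd.cut(values, bins=4)

# equal frequency: 4 bins holding roughly the same number of points
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())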

However, Prometheus and other popular toolkits do not provide built-in functionality for k-means clustering of metrics. Prometheus is primarily focused on collecting, storing and monitoring time series data.

To perform k-means clustering on Prometheus metrics, you would typically need to export the relevant metric data from Prometheus and use external machine learning or data analysis tools.

Employing TensorFlow to do clustering

TensorFlow provides a nice tutorial on importing data from Prometheus; the same can be done with PyTorch.

Let’s use some of the code from there and modify it to do K-means clustering with 3 clusters. This is just a code example:

import tensorflow as tf
import tensorflow_io as tfio

n_samples = 10   # number of sliding-window samples to cluster
n_steps = 8      # length of each sliding window (assumed value)

# read enough points from Prometheus to build n_samples windows of n_steps values
dataset = tfio.experimental.IODataset.from_prometheus(
    "go_memstats_sys_bytes", n_samples + n_steps - 1, endpoint="http://localhost:9090")

# take go_memstats_sys_bytes from the coredns job
dataset = dataset.map(lambda _, v: v['coredns']['localhost:9153']['go_memstats_sys_bytes'])

# find the max value and scale the values to [0, 1]
v_max = dataset.reduce(tf.constant(0.0, tf.float64), tf.math.maximum)
dataset = dataset.map(lambda v: v / v_max)

# take a sliding window: each window of n_steps values becomes one feature vector
dataset = dataset.window(n_steps, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda d: d.batch(n_steps))

# materialise the first n_samples windows as an (n_samples, n_steps) array of points
points = tf.stack(list(dataset.take(n_samples))).numpy()

def input_fn():
    # feed all points once per call, as the estimator API expects
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(points, dtype=tf.float32), num_epochs=1)

# Define the K-means clustering model
kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=3, use_mini_batch=False)

# Train the K-means model for a few iterations
for _ in range(10):
    kmeans.train(input_fn)

# Get the cluster assignments and cluster centers
cluster_assignments = list(kmeans.predict_cluster_index(input_fn))
cluster_centers = kmeans.cluster_centers()

# Print the cluster assignments
print("Cluster Assignments:")
for i, cluster in enumerate(cluster_assignments):
    print(f"Sample {i+1}: Cluster {cluster}")

The cluster assignments can later be used to write data back to Prometheus after the label structure has been modified based on the clustering. This brings us to the next paragraph.

Feature engineering

Explore ways to derive new features from the high cardinality metric. These derived features can capture important aspects of the data without directly using the original high cardinality metric. Feature engineering techniques such as one-hot encoding, frequency encoding or target encoding can be useful in transforming high cardinality metrics.
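
A small pandas sketch of one-hot and frequency encoding for a hypothetical high cardinality label column called endpoint:

import pandas as pd

df = pd.DataFrame({"endpoint": ["/users", "/orders", "/users", "/health", "/users"]})

# one-hot encoding: one binary column per distinct value
one_hot = pd.get_dummies(df["endpoint"], prefix="endpoint")

# frequency encoding: replace each value with how often it occurs
freq = df["endpoint"].map(df["endpoint"].value_counts(normalize=True))

print(one_hot)
print(freq)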

Dimensionality reduction using common mathematical techniques

Dimensionality reduction can be performed with common machine learning techniques. Use methods such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbour Embedding (t-SNE) to project the high-dimensional metric into a lower-dimensional space while preserving the most important patterns and relationships. This blog post has great examples of t-SNE.
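
A minimal scikit-learn sketch, assuming the metric data has already been exported into a samples-by-features matrix; the random matrix below is just placeholder data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))  # 100 samples with 20 dimensions (placeholder data)

# linear projection onto the 2 principal components with the most variance
X_pca = PCA(n_components=2).fit_transform(X)

# non-linear embedding into 2 dimensions for visualisation
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

print(X_pca.shape, X_tsne.shape)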

That is all for this short post about "Practical solutions for dealing with high cardinality metrics". Thanks for reading!