Bug 1995695 - Get insights on series churn during upgrades
Summary: Get insights on series churn during upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1995699 1998969
Reported: 2021-08-19 15:51 UTC by Damien Grisonnet
Modified: 2021-10-18 17:47 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1995699 1998969
Environment:
Last Closed: 2021-10-18 17:47:27 UTC
Target Upstream Version:
Embargoed:
arajkuma: needinfo-


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift cluster-monitoring-operator pull 1313 | 0 | None | Closed | [RFE] unable to register any clients to satellite 6 | 2022-04-21 18:59:51 UTC
Red Hat Product Errata RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:47:42 UTC

Description Damien Grisonnet 2021-08-19 15:51:28 UTC
Description of problem:

We want to gather information about series churn during upgrades, since it would tell us how much additional memory Prometheus has to handle after an update. It may also help detect high-cardinality metrics.

It would be great to gather this information via Telemetry to have this insight on customer clusters.

To do that, we could use the `scrape_series_added` and `scrape_samples_scraped` metrics from Prometheus. More information about this can be found in this blog post: https://www.robustperception.io/finding-churning-targets-in-prometheus-with-scrape_series_added.
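As a rough illustration of what `scrape_series_added` can tell us, the sketch below sums the per-scrape values for each job over a window and ranks the jobs by the number of new series they created. It mirrors what a `sum_over_time` aggregation would do in PromQL, but on plain Python data. The function name, the input shape, and the sample values are hypothetical and only serve the example.

```python
from collections import defaultdict


def series_added_per_job(scrapes):
    """Sum scrape_series_added per job and rank jobs by churn.

    `scrapes` is an iterable of (job, series_added) pairs, one per scrape
    (hypothetical input shape, not a Prometheus API).
    """
    totals = defaultdict(int)
    for job, added in scrapes:
        totals[job] += added
    # Jobs that created the most new series come first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up values standing in for scrapes collected during an upgrade.
scrapes = [
    ("kubelet", 120),
    ("node-exporter", 5),
    ("kubelet", 300),
    ("node-exporter", 0),
]
ranked = series_added_per_job(scrapes)
```

A job that stays at the top of this ranking throughout an upgrade is a likely candidate for a churning or high-cardinality target.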

In addition, this information will be very useful for future resiliency improvements: based on the data gathered via Telemetry, we will be able to set sane limits on Prometheus that prevent a malicious target from causing trouble.

Comment 1 Arunprasad Rajkumar 2021-08-20 08:06:49 UTC
@spasquie suggested using `scrape_samples_post_metric_relabeling` instead of `scrape_samples_scraped`.

For more details, see:

https://github.com/openshift/cluster-monitoring-operator/pull/1313#issuecomment-902490242
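One way to combine the two metrics is to relate the series added over a window to the average number of samples kept after metric relabeling. The sketch below only builds such a PromQL expression as a string. The ratio itself, the 1h window, the `job` grouping, and the function name are assumptions made for illustration; this is not necessarily the expression used in the linked pull request.

```python
def churn_ratio_query(window="1h", by="job"):
    """Build a PromQL expression relating new series to retained samples.

    Hypothetical helper: the window and grouping label are illustrative
    defaults, not values taken from the bug report.
    """
    # Total number of new series created by each group over the window.
    added = f"sum by ({by}) (sum_over_time(scrape_series_added[{window}]))"
    # Average number of samples kept per scrape after metric relabeling.
    kept = (
        f"sum by ({by}) "
        f"(avg_over_time(scrape_samples_post_metric_relabeling[{window}]))"
    )
    return f"{added} / {kept}"


query = churn_ratio_query()
```

A ratio well above 1 would suggest that a job replaced its whole set of series more than once during the window.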

Comment 11 errata-xmlrpc 2021-10-18 17:47:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759