Bug 1945849 - Unnecessary series churn when a new version of kube-state-metrics is rolled out
Summary: Unnecessary series churn when a new version of kube-state-metrics is rolled out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1925061 1945851
 
Reported: 2021-04-02 09:11 UTC by Simon Pasquier
Modified: 2021-07-27 22:57 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1945851
Environment:
Last Closed: 2021-07-27 22:57:07 UTC
Target Upstream Version:
Embargoed:


Attachments
kube-state-metrics servicemonitor file (2.54 KB, text/plain)
2021-04-06 08:26 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1052 0 None closed Bug 1925061: Remove the "instance" and "pod" labels for kube-state-metrics metrics 2021-04-02 09:12:26 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:57:42 UTC

Description Simon Pasquier 2021-04-02 09:11:38 UTC
Description of problem:
When a new version of kube-state-metrics is rolled out (e.g. on upgrade or node drain), the pod and instance labels will change for all the series exported by kube-state-metrics. In practice these labels aren't useful because only one instance of kube-state-metrics is deployed.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from 4.7 to 4.8
2.
3.

Actual results:
kube-state-metrics generates new series for all its metrics because the values of the pod/instance labels have changed. This increases the memory usage of Prometheus until the old series are removed from the head (e.g. after 3 hours).
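For example, on the in-cluster Prometheus the extra series can be made visible with a query such as (illustrative only, any window longer than the rollout works):

count(count_over_time({job="kube-state-metrics"}[1h]))

which roughly doubles once a rollout falls inside the 1h window, while prometheus_tsdb_head_series shows the corresponding increase until the old series are dropped from the head.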

Expected results:
No additional series created.

Additional info:

Comment 1 Simon Pasquier 2021-04-02 09:12:26 UTC
Fixed in https://github.com/openshift/cluster-monitoring-operator/pull/1052
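For reference, a minimal sketch of the kind of ServiceMonitor change this implies for the https-main endpoint (illustrative fragment only; the actual configuration is in the PR above and in the servicemonitor file attached in comment 4):

spec:
  endpoints:
  - port: https-main
    honorLabels: true      # labels exposed by kube-state-metrics take precedence over target labels
    relabelings:
    # drop the target-level "pod" label (the kube-state-metrics pod itself);
    # metrics such as kube_pod_info keep the "pod" label they expose themselves
    - action: labeldrop
      regex: pod
    metricRelabelings:
    # "instance" is (re)set from __address__ after target relabeling,
    # so it is dropped at metric-relabeling time instead
    - action: labeldrop
      regex: instance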

Comment 3 Junqi Zhao 2021-04-06 08:23:45 UTC
Tested with 4.8.0-0.nightly-2021-04-05-174735. kube-state-metrics has two endpoints; searching the metrics exposed on port 8443, there is no instance label, but the label still exists for the metrics exposed on port 9443:
kube_pod_info{namespace="openshift-monitoring", pod=~"kube-state-metrics-.*"}
kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="ReplicaSet", created_by_name="kube-state-metrics-5759d57c7b", endpoint="https-main", host_ip="10.0.128.4", job="kube-state-metrics", namespace="openshift-monitoring", node="ci-ln-26d5gbt-f76d1-kj66g-worker-d-lzpgs", pod="kube-state-metrics-5759d57c7b-7c2g8", pod_ip="10.131.0.7", priority_class="system-cluster-critical", service="kube-state-metrics", uid="554935e6-6e73-48f5-8930-e4794eeb6a20"}

# oc -n openshift-monitoring get ep kube-state-metrics
NAME                 ENDPOINTS                         AGE
kube-state-metrics   10.131.0.7:8443,10.131.0.7:9443   149m

kube_state_metrics_watch_total{namespace="openshift-monitoring", pod=~"kube-state-metrics-.*"}
kube_state_metrics_watch_total{container="kube-rbac-proxy-self", endpoint="https-self", instance="10.131.0.7:9443", job="kube-state-metrics", namespace="openshift-monitoring", pod="kube-state-metrics-5759d57c7b-7c2g8", resource="*v1.ConfigMap", result="success", service="kube-state-metrics"}

Comment 4 Junqi Zhao 2021-04-06 08:26:29 UTC
Created attachment 1769515 [details]
kube-state-metrics servicemonitor file

Comment 5 Simon Pasquier 2021-04-06 08:59:40 UTC
@Junqi this is expected: in the case of kube_pod_info, the "pod" label represents the pod being monitored and is required to make sense of the metric. If you look at other kube_* metrics (for instance kube_node_role), you should see that the instance and pod labels are absent.
 
On a 4.5 cluster, 'kube_pod_info{namespace="openshift-monitoring", pod=~"kube-state-metrics-.*"}' returns:

kube_pod_info{created_by_kind="ReplicaSet",created_by_name="kube-state-metrics-c9b6d7fb7",endpoint="https-main",host_ip="10.0.171.129",instance="10.128.5.75:8443",job="kube-state-metrics",namespace="openshift-monitoring",node="ip-10-0-171-129.eu-west-3.compute.internal",pod="kube-state-metrics-c9b6d7fb7-82fd5",pod_ip="10.128.5.75",priority_class="system-cluster-critical",service="kube-state-metrics",uid="3538a090-892f-41f6-88d1-474645029f03"}


As for the kube_state_metrics_watch_total metric, it is part of the internal metrics exposed by a different endpoint (on port 9443, while the "main" kube-state-metrics metrics are exposed on port 8443). Because this endpoint reports kube-state-metrics' own internal metrics, it makes sense to keep the instance and pod labels. Also, this endpoint reports only on the order of a hundred series compared to the :8443 endpoint, so the series churn in case of a rollout is acceptable.
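For example, the relative sizes of the two endpoints can be compared with:

count({job="kube-state-metrics", endpoint="https-main"})
count({job="kube-state-metrics", endpoint="https-self"})

the https-self count is much smaller than the https-main one.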

Comment 6 Junqi Zhao 2021-04-06 10:04:10 UTC
Based on Comment 5, the pod label existing for kube_pod_info is expected, and the instance and pod labels are absent from other kube_* metrics. Example:
kube_node_role result with 4.8.0-0.nightly-2021-04-05-174735:
kube_node_role{container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", node="ip-10-0-152-213.us-west-2.compute.internal", role="worker", service="kube-state-metrics"}

Result with 4.7.0-0.nightly-2021-04-05-201242, where the instance and pod labels are still present:
kube_node_role{container="kube-rbac-proxy-main", endpoint="https-main", instance="10.128.2.20:8443", job="kube-state-metrics", namespace="openshift-monitoring", node="ip-10-0-128-207.us-east-2.compute.internal", pod="kube-state-metrics-67c786b7f6-x5pcx", role="worker", service="kube-state-metrics"}

set to VERIFIED

Comment 9 errata-xmlrpc 2021-07-27 22:57:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

