Created attachment 1788215 [details]
prometheus graph

Description of problem:

Version-Release:
  Name:           4.7.12
  Digest:         sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e
  Created:        2021-05-20T18:48:00Z
  OS/Arch:        linux/amd64
  Manifests:      481
  Metadata files: 1
  Pull From:      quay.io/openshift-release-dev/ocp-release@sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e

Steps to Reproduce:
1. Create a cluster on 4.7.9 and upgrade it from 4.7.9 to 4.7.12.
2. Before the upgrade, add some data and run the query kube_pod_container_status_running in Prometheus; make sure data is present.
3. After the 4.7.9 to 4.7.12 upgrade, observe that no kube_pod_container_status_running data is present in Prometheus.

Actual results:
absent(kube_pod_container_status_running) returns 1 from the time of the upgrade onward, i.e. the metric is missing.

Expected results:
kube_pod_container_status_running continues to report data after the upgrade.

Additional info:
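For reference, the check can be run directly in the Prometheus UI with these queries (a minimal sketch; no cluster-specific labels assumed):

  # Before the upgrade this returns one series per running container;
  # after the 4.7.9 -> 4.7.12 upgrade it returns no data:
  kube_pod_container_status_running

  # Conversely, absent() returns 1 only while the metric has no series at all:
  absent(kube_pod_container_status_running)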
@jfajersk Here are the results of running the Prometheus query.

Link to the production cluster having the issue:
https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.enento-3scale.ze05.p1.openshiftapps.com/graph?g0.range_input=1h&g0.expr=kube_pod_container_status_running&g0.tab=0

Link to the support cluster where the first upgrade from 4.7.9 to 4.7.12 reproduced the issue:
https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.cee-osd4.qvgs.p1.openshiftapps.com/graph?g0.range_input=2d&g0.expr=kube_pod_container_status_running&g0.tab=0

A must-gather is not working for me on the cluster.
(In reply to mark freer from comment #1)
The kube_pod_container_status_running metric has been excluded from the list of metrics exposed by kube-state-metrics [1] because its cardinality is too high and it wasn't used in any dashboard or rule shipped by the monitoring stack. Depending on the use case, you might use another metric, like kube_pod_container_info or kube_pod_container_status_ready, to achieve the same result. Can you explain in more detail what you used the metric for?

In general, we don't provide any guarantees on metric stability.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/e3bce4162877ff73b499a5f2a715125f929b9948/assets/kube-state-metrics/deployment.yaml#L47
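For example, a per-namespace container count can be rebuilt from the metrics that are still exposed (a sketch only; the exact labels depend on the dashboard):

  # One series per container, value 1 -- counts all containers per namespace:
  count(kube_pod_container_info) by (namespace)

  # Value is 1 only while the container is ready -- counts ready containers per namespace:
  sum(kube_pod_container_status_ready) by (namespace)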
From an email conversation, the customer provided the following detail: the metric is used in a dashboard to show the number of pods running in a namespace, e.g.:

  count(rate(kube_pod_container_status_running{env=~'$env',org='paas',container!="POD"}[10m])) by (namespace)

I would recommend the kube_pod_container_status_ready metric for this purpose. Arguably it is the better metric anyway, since a pod can be running but not accepting traffic (making it not ready).

The query itself looks strange, though, and the customer may want to revisit it. I don't understand the use of the rate function here: as per the docs, rate "calculates the per-second average rate of increase of the time series in the range vector", but a status metric is always either 0 or 1, so rate is an odd choice. Unless I am overlooking something, I'd recommend something like sum(kube_pod_container_status_ready) by (namespace) (with the desired label matchers still to be added).
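With the label matchers from the customer's query put back in, the suggested replacement would look roughly like this (a sketch; env, org and container!="POD" are matchers from their environment, not standard kube-state-metrics labels):

  sum(kube_pod_container_status_ready{env=~'$env', org='paas', container!="POD"}) by (namespace)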
The customer ended up using the kube_pod_status_phase metric and it worked fine for their requirement (one possible form of that query is sketched below). Closing this accordingly; feel free to re-open if any more questions arise.

Closing this as NOTABUG, since the removal was intentional and other metrics can replace the removed one.
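For completeness, one way to get the per-namespace running-pod count from kube_pod_status_phase (a sketch; the phase label values are the kube-state-metrics defaults):

  sum(kube_pod_status_phase{phase="Running"}) by (namespace)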