Bug 1966104 - [OSD] kube_pod_container_status_running stopped working after upgrade
Summary: [OSD] kube_pod_container_status_running stopped working after upgrade
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jan Fajerski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-31 11:46 UTC by Ron Green
Modified: 2021-12-27 10:30 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-30 08:44:31 UTC
Target Upstream Version:
Embargoed:
rogreen: needinfo-


Attachments
prometheus graph (341.45 KB, image/png)
2021-05-31 11:46 UTC, Ron Green


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1955478 | 1 | high | CLOSED | Drop high-cardinality metrics from kube-state-metrics which aren't used | 2021-11-03 09:11:06 UTC
Red Hat Bugzilla | 1955482 | 1 | high | CLOSED | [4.7] Drop high-cardinality metrics from kube-state-metrics which aren't used | 2021-08-20 15:43:15 UTC
Red Hat Bugzilla | 1955483 | 1 | high | CLOSED | [4.6] Drop high-cardinality metrics from kube-state-metrics which aren't used | 2021-09-20 13:25:51 UTC
Red Hat Knowledge Base (Solution) | 6273601 | 0 | None | None | None | 2021-08-20 15:42:56 UTC

Description Ron Green 2021-05-31 11:46:58 UTC
Created attachment 1788215
prometheus graph

Description of problem:
Version-Release:
Name:           4.7.12
Digest:         sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e
Created:        2021-05-20T18:48:00Z
OS/Arch:        linux/amd64
Manifests:      481
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e

Steps to Reproduce:
1. Create a cluster on 4.7.9.
2. Add some data to Prometheus and run the query kube_pod_container_status_running in Prometheus; make sure data is present before the upgrade.
3. Upgrade the cluster from 4.7.9 to 4.7.12 and see that no kube_pod_container_status_running data is present in Prometheus afterwards.

Actual results:
absent(kube_pod_container_status_running) returns 1 from the time of the upgrade onward, i.e. the kube_pod_container_status_running metric is missing from that point.
Expected results:
kube_pod_container_status_running continues to return data after the upgrade.
Additional info:

Comment 1 mark freer 2021-05-31 12:18:44 UTC
@jfajersk Here are the results of running the Prometheus query kube_pod_container_status_running on two clusters.

Production cluster having the issue: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.enento-3scale.ze05.p1.openshiftapps.com/graph?g0.range_input=1h&g0.expr=kube_pod_container_status_running&g0.tab=0

Support cluster where the first upgrade from 4.7.9 to 4.7.12 reproduced the issue: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.cee-osd4.qvgs.p1.openshiftapps.com/graph?g0.range_input=2d&g0.expr=kube_pod_container_status_running&g0.tab=0

must-gather is not working for me on this cluster.

Comment 4 Simon Pasquier 2021-06-16 07:38:41 UTC
The kube_pod_container_status_running metric has been excluded from the list of metrics exposed by kube-state-metrics [1] because its cardinality is too high and it wasn't used in any dashboard or rule shipped by the monitoring stack. Depending on the use case, you might use another metric, like kube_pod_container_info or kube_pod_container_status_ready, to achieve the same result. Can you explain in more detail what you used the metric for?

In general, we don't provide any guarantee on metrics stability.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/e3bce4162877ff73b499a5f2a715125f929b9948/assets/kube-state-metrics/deployment.yaml#L47

Comment 6 Jan Fajerski 2021-06-29 13:11:33 UTC
From an email conversation:

The customer provided the following detail:
This metric is used in the dashboard to show the number of pods running in the namespace.
Example: count (rate(kube_pod_container_status_running{env=~'$env',org='paas',container!="POD"} [10m])) by (namespace)


I would recommend the kube_pod_container_status_ready metric for this purpose. Arguably this is a better metric anyway, since a pod can be running but not accepting traffic (making it not ready).

This query looks strange though; maybe the customer wants to revisit it. I don't understand the purpose of the rate function here. As per the docs, rate "calculates the per-second average rate of increase of the time series in the range vector". A status metric, however, is either 0 or 1, so rate is a strange choice here.
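To illustrate the concern about rate on a 0/1 status metric, here is a minimal Python sketch. It is not PromQL and not how Prometheus implements rate(): counter-reset handling and extrapolation are left out, and the helper names and sample data are hypothetical. Under those assumptions it suggests that count(rate(...)) by (namespace) counts every series that has at least two samples in the window, whether or not the container is actually running.

```python
from collections import Counter


def simple_rate(samples):
    """Simplified per-second rate: (last - first) / elapsed seconds.

    Assumption: no counter-reset handling or extrapolation, unlike PromQL.
    Returns None when the window holds fewer than two samples, standing in
    for rate() producing no result for such a series.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def count_rate_by_namespace(series):
    """Mimic count(rate(metric[window])) by (namespace)."""
    counts = Counter()
    for labels, samples in series:
        # count() only cares that a result exists, not what its value is.
        if simple_rate(samples) is not None:
            counts[labels["namespace"]] += 1
    return dict(counts)


# Hypothetical data: (timestamp_seconds, value) pairs inside a 10m window.
series = [
    # A container that is running for the whole window.
    ({"namespace": "app", "container": "web"}, [(0, 1), (300, 1), (600, 1)]),
    # A container that is not running for the whole window.
    ({"namespace": "app", "container": "worker"}, [(0, 0), (300, 0), (600, 0)]),
]

rates = [simple_rate(samples) for _, samples in series]
counts = count_rate_by_namespace(series)
print(rates)   # both rates are zero because neither value changes
print(counts)  # both series are counted, running or not
```

In this sketch both series yield a rate of zero and both are counted, so the result reflects the number of series rather than the number of running containers.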

Unless I'm overlooking something here, I'd recommend something like sum(kube_pod_container_status_ready) by (namespace) (the desired label matchers still need to be added).
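As a rough illustration of what sum(...) by (namespace) does with a 0/1 gauge, the following Python sketch adds up sample values per namespace. The `sum_by` helper and the sample label sets are hypothetical, and the sketch ignores label matchers entirely.

```python
from collections import defaultdict


def sum_by(instant_vector, label):
    """Mimic sum(metric) by (label) over (labels, value) pairs."""
    totals = defaultdict(float)
    for labels, value in instant_vector:
        totals[labels[label]] += value
    return dict(totals)


# Hypothetical instant vector: one sample per container, 1 = ready, 0 = not ready.
ready = [
    ({"namespace": "app", "pod": "web-1", "container": "web"}, 1.0),
    ({"namespace": "app", "pod": "web-2", "container": "web"}, 0.0),
    ({"namespace": "db", "pod": "pg-0", "container": "postgres"}, 1.0),
]

ready_per_namespace = sum_by(ready, "namespace")
print(ready_per_namespace)
```

Because the metric is reported per container, the sum counts ready containers; a pod with several containers contributes one sample per container.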

Comment 7 Jan Fajerski 2021-06-30 08:44:31 UTC
The customer ended up using the kube_pod_status_phase metric, and it worked fine for their requirements. Closing this as NOTABUG since the removal was intentional and other metrics can replace the removed one. Feel free to re-open if any more questions arise.

