Bug 1966104

Summary: [OSD] kube_pod_container_status_running stopped working after upgrade
Product: OpenShift Container Platform
Reporter: Ron Green <rogreen>
Component: Monitoring
Assignee: Jan Fajerski <jfajersk>
Status: CLOSED NOTABUG
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: unspecified
Version: 4.7
CC: alegrand, anisal, anpicker, aos-bugs, erooth, gferrazs, jfajersk, kakkoyun, mfreer, oarribas, pkrupa, pnair, spasquie
Target Milestone: ---
Flags: rogreen: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-06-30 08:44:31 UTC
Type: Bug
Regression: ---

Attachments: prometheus graph

Description Ron Green 2021-05-31 11:46:58 UTC
Created attachment 1788215
prometheus graph

Description of problem:
The kube_pod_container_status_running metric stopped returning data after upgrading from OCP 4.7.9 to 4.7.12.

Version-Release:
Name:           4.7.12
Digest:         sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e
Created:        2021-05-20T18:48:00Z
OS/Arch:        linux/amd64
Manifests:      481
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e

Steps to Reproduce:
1. Create a cluster on 4.7.9 and upgrade it from 4.7.9 to 4.7.12.
2. Add some data to Prometheus and run the query kube_pod_container_status_running in Prometheus; make sure data is present before the upgrade.
3. Observe that no kube_pod_container_status_running data is present in Prometheus after the OCP 4.7.9 to 4.7.12 upgrade.
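
One way to confirm the metric disappeared (a minimal check, assuming access to the cluster's Prometheus UI): absent() returns 1 when no series match the inner expression, and returns nothing when at least one series exists.

  absent(kube_pod_container_status_running)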

Actual results:
absent(kube_pod_container_status_running) returns 1 from the time of the upgrade.

Expected results:
kube_pod_container_status_running continues to return data after the upgrade.
Additional info:

Comment 1 mark freer 2021-05-31 12:18:44 UTC
@jfajersk Here are the results of running the Prometheus query kube_pod_container_status_running:

Link: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.enento-3scale.ze05.p1.openshiftapps.com/graph?g0.range_input=1h&g0.expr=kube_pod_container_status_running&g0.tab=0 (cluster in production having the issue)

Link: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.cee-osd4.qvgs.p1.openshiftapps.com/graph?g0.range_input=2d&g0.expr=kube_pod_container_status_running&g0.tab=0 (support cluster where the first upgrade from 4.7.9 to 4.7.12 reproduced the issue)

A must-gather is not working for me on the cluster.

Comment 4 Simon Pasquier 2021-06-16 07:38:41 UTC
The kube_pod_container_status_running metric has been excluded from the list of metrics exposed by kube-state-metrics [1] because its cardinality is too high and it wasn't used in any dashboard or rule shipped by the monitoring stack. Depending on the use case, you might use another metric, like kube_pod_container_info or kube_pod_container_status_ready, to achieve the same result. Can you explain in more detail what you used the metric for?

In general, we don't provide any guarantees about metric stability.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/e3bce4162877ff73b499a5f2a715125f929b9948/assets/kube-state-metrics/deployment.yaml#L47
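
For illustration, per-namespace sketches using the suggested replacement metrics (label selectors omitted; exact usage depends on the environment):

  count(kube_pod_container_info) by (namespace)         # containers per namespace
  sum(kube_pod_container_status_ready) by (namespace)   # ready containers per namespace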

Comment 6 Jan Fajerski 2021-06-29 13:11:33 UTC
From an email conversation:

The customer provided the following detail:
This metric is used in a dashboard to show the number of pods running in the namespace.
ex: count(rate(kube_pod_container_status_running{env=~'$env',org='paas',container!="POD"}[10m])) by (namespace)

I would recommend the kube_pod_container_status_ready metric for this purpose. Arguably this is a better metric anyway, since a pod can be running but not accepting traffic (making it not ready).

This query looks strange though; the customer may want to revisit it. I don't understand the purpose of the rate function here. As per the docs, rate "calculates the per-second average rate of increase of the time series in the range vector". A status metric, however, is either 0 or 1, so rate is a strange choice here.

Unless I'm overlooking something, I'd recommend something like sum(kube_pod_container_status_ready) by (namespace) (the desired label matchers still need to be added).
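
A sketch with the label matchers carried over from the customer's query (the env, org, and container labels are taken from their query and assumed to exist in their environment):

  sum(kube_pod_container_status_ready{env=~'$env',org='paas',container!="POD"}) by (namespace)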

Comment 7 Jan Fajerski 2021-06-30 08:44:31 UTC
The customer ended up using the kube_pod_status_phase metric, which worked fine for their requirement. Closing this as NOTABUG since the removal was intentional and other metrics can replace the removed one; feel free to re-open if any more questions arise.
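
For illustration, a minimal per-namespace count of running pods with kube_pod_status_phase (a sketch, not necessarily the customer's exact query; kube_pod_status_phase exposes one 0/1 series per pod per phase):

  sum(kube_pod_status_phase{phase="Running"}) by (namespace)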