Bug 1966104

Summary: [OSD] kube_pod_container_status_running stopped working after upgrade
Product: OpenShift Container Platform
Reporter: Ron Green <rogreen>
Component: Monitoring
Assignee: Jan Fajerski <jfajersk>
Status: CLOSED NOTABUG
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: unspecified
Version: 4.7
CC: alegrand, anisal, anpicker, aos-bugs, erooth, gferrazs, jfajersk, kakkoyun, mfreer, oarribas, pkrupa, pnair, spasquie
Target Milestone: ---
Flags: rogreen: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-06-30 08:44:31 UTC
Type: Bug
Regression: ---

Attachments: prometheus graph

Description Ron Green 2021-05-31 11:46:58 UTC
Created attachment 1788215
prometheus graph

Description of problem:
The kube_pod_container_status_running metric stopped returning data after upgrading from OCP 4.7.9 to 4.7.12.

Version-Release:
Name:           4.7.12
Digest:         sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e
Created:        2021-05-20T18:48:00Z
OS/Arch:        linux/amd64
Manifests:      481
Metadata files: 1
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e

Steps to Reproduce:
1. Create a cluster on 4.7.9 and upgrade it from 4.7.9 to 4.7.12.
2. Add some data to Prometheus and run the query kube_pod_container_status_running in Prometheus; make sure data is present before the upgrade.
3. Observe that no kube_pod_container_status_running data is present in Prometheus after the OCP 4.7.9 to 4.7.12 upgrade.
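
One way to confirm the metric disappeared (a minimal check, assuming access to the cluster's Prometheus UI): absent() returns 1 when no series match the inner expression, and returns nothing when at least one series exists.

  absent(kube_pod_container_status_running)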

Actual results:
absent(kube_pod_container_status_running) returns 1 from the time of the upgrade.

Expected results:
kube_pod_container_status_running continues to return data after the upgrade.
Additional info:

Comment 1 mark freer 2021-05-31 12:18:44 UTC
@jfajersk Here are the results of running the Prometheus query kube_pod_container_status_running:

Link: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.enento-3scale.ze05.p1.openshiftapps.com/graph?g0.range_input=1h&g0.expr=kube_pod_container_status_running&g0.tab=0 (cluster in production having the issue)

Link: https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.cee-osd4.qvgs.p1.openshiftapps.com/graph?g0.range_input=2d&g0.expr=kube_pod_container_status_running&g0.tab=0 (support cluster where the first upgrade from 4.7.9 to 4.7.12 reproduced the issue)

A must-gather is not working for me on the cluster.

Comment 4 Simon Pasquier 2021-06-16 07:38:41 UTC
The kube_pod_container_status_running metric has been excluded from the list of metrics exposed by kube-state-metrics [1] because its cardinality is too high and it wasn't used in any dashboard or rule shipped by the monitoring stack. Depending on the use case, you might use another metric, like kube_pod_container_info or kube_pod_container_status_ready, to achieve the same result. Can you explain in more detail what you used the metric for?

In general, we don't provide any guarantees about metric stability.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/e3bce4162877ff73b499a5f2a715125f929b9948/assets/kube-state-metrics/deployment.yaml#L47
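
For illustration, per-namespace sketches using the suggested replacement metrics (label selectors omitted; exact usage depends on the environment):

  count(kube_pod_container_info) by (namespace)         # containers per namespace
  sum(kube_pod_container_status_ready) by (namespace)   # ready containers per namespace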

Comment 6 Jan Fajerski 2021-06-29 13:11:33 UTC
From an email conversation:

The customer provided the following detail:
This metric is used in a dashboard to show the number of pods running in the namespace.
ex: count(rate(kube_pod_container_status_running{env=~'$env',org='paas',container!="POD"}[10m])) by (namespace)

I would recommend the kube_pod_container_status_ready metric for this purpose. Arguably this is a better metric anyway, since a pod can be running but not accepting traffic (making it not ready).

This query looks strange though; the customer may want to revisit it. I don't understand the purpose of the rate function here. As per the docs, rate "calculates the per-second average rate of increase of the time series in the range vector". A status metric, however, is either 0 or 1, so rate is a strange choice here.

Unless I'm overlooking something, I'd recommend something like sum(kube_pod_container_status_ready) by (namespace) (the desired label matchers still need to be added).
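
A sketch with the label matchers carried over from the customer's query (the env, org, and container labels are taken from their query and assumed to exist in their environment):

  sum(kube_pod_container_status_ready{env=~'$env',org='paas',container!="POD"}) by (namespace)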

Comment 7 Jan Fajerski 2021-06-30 08:44:31 UTC
The customer ended up using the kube_pod_status_phase metric, which worked fine for their requirement. Closing this as NOTABUG since the removal was intentional and other metrics can replace the removed one; feel free to re-open if any more questions arise.
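
For illustration, a minimal per-namespace count of running pods with kube_pod_status_phase (a sketch, not necessarily the customer's exact query; kube_pod_status_phase exposes one 0/1 series per pod per phase):

  sum(kube_pod_status_phase{phase="Running"}) by (namespace)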