Created attachment 1788215 [details]
prometheus graph

Description of problem:

Version-Release:
  Name:           4.7.12
  Digest:         sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e
  Created:        2021-05-20T18:48:00Z
  OS/Arch:        linux/amd64
  Manifests:      481
  Metadata files: 1
  Pull From:      quay.io/openshift-release-dev/ocp-release@sha256:2029c5779202293f23418d47a1a823c4e4c8539c1ab25e8bda30d48335b4892e

Steps to Reproduce:
1. Create a cluster on 4.7.9 and upgrade it from 4.7.9 to 4.7.12.
2. Before the upgrade, add some data and run the query kube_pod_container_status_running in Prometheus; make sure data is present.
3. After the 4.7.9 to 4.7.12 upgrade, observe that no kube_pod_container_status_running data is present in Prometheus.

Actual results:
absent(kube_pod_container_status_running) returns 1 from the time of the upgrade onward, i.e. the metric is missing.

Expected results:
kube_pod_container_status_running continues to report data after the upgrade.

Additional info:
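For reference, the check can be run directly in the Prometheus UI with these queries (a minimal sketch; no cluster-specific labels assumed):

  # Before the upgrade this returns one series per running container;
  # after the 4.7.9 -> 4.7.12 upgrade it returns no data:
  kube_pod_container_status_running

  # Conversely, absent() returns 1 only while the metric has no series at all:
  absent(kube_pod_container_status_running)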
@jfajersk Here are the results of running the Prometheus query.

Link to the production cluster having the issue:
https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.enento-3scale.ze05.p1.openshiftapps.com/graph?g0.range_input=1h&g0.expr=kube_pod_container_status_running&g0.tab=0

Link to the support cluster where the first upgrade from 4.7.9 to 4.7.12 reproduced the issue:
https://prometheus-route-redhat-rhoam-middleware-monitoring-operator.apps.cee-osd4.qvgs.p1.openshiftapps.com/graph?g0.range_input=2d&g0.expr=kube_pod_container_status_running&g0.tab=0

A must-gather is not working for me on the cluster.
(In reply to mark freer from comment #1)
The kube_pod_container_status_running metric has been excluded from the list of metrics exposed by kube-state-metrics [1] because its cardinality is too high and it wasn't used in any dashboard or rule shipped by the monitoring stack. Depending on the use case, you might use another metric, like kube_pod_container_info or kube_pod_container_status_ready, to achieve the same result. Can you explain in more detail what you used the metric for?

In general, we don't provide any guarantees on metric stability.

[1] https://github.com/openshift/cluster-monitoring-operator/blob/e3bce4162877ff73b499a5f2a715125f929b9948/assets/kube-state-metrics/deployment.yaml#L47
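For example, a per-namespace container count can be rebuilt from the metrics that are still exposed (a sketch only; the exact labels depend on the dashboard):

  # One series per container, value 1 -- counts all containers per namespace:
  count(kube_pod_container_info) by (namespace)

  # Value is 1 only while the container is ready -- counts ready containers per namespace:
  sum(kube_pod_container_status_ready) by (namespace)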
From an email conversation, the customer provided the following detail: the metric is used in a dashboard to show the number of pods running in a namespace, e.g.:

  count(rate(kube_pod_container_status_running{env=~'$env',org='paas',container!="POD"}[10m])) by (namespace)

I would recommend the kube_pod_container_status_ready metric for this purpose. Arguably it is the better metric anyway, since a pod can be running but not accepting traffic (making it not ready).

The query itself looks strange, though, and the customer may want to revisit it. I don't understand the use of the rate function here: as per the docs, rate "calculates the per-second average rate of increase of the time series in the range vector", but a status metric is always either 0 or 1, so rate is an odd choice. Unless I am overlooking something, I'd recommend something like sum(kube_pod_container_status_ready) by (namespace) (with the desired label matchers still to be added).
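With the label matchers from the customer's query put back in, the suggested replacement would look roughly like this (a sketch; env, org and container!="POD" are matchers from their environment, not standard kube-state-metrics labels):

  sum(kube_pod_container_status_ready{env=~'$env', org='paas', container!="POD"}) by (namespace)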
The customer ended up using the kube_pod_status_phase metric and it worked fine for their requirement (one possible form of that query is sketched below). Closing this accordingly; feel free to re-open if any more questions arise.

Closing this as NOTABUG, since the removal was intentional and other metrics can replace the removed one.
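For completeness, one way to get the per-namespace running-pod count from kube_pod_status_phase (a sketch; the phase label values are the kube-state-metrics defaults):

  sum(kube_pod_status_phase{phase="Running"}) by (namespace)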