Description of problem:
sum_irate (which works in OCP 4.9) is not working in OCP 4.8, specifically in the CPU graph in the Deployment section.

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.8

How reproducible:
Very; tested on Quicklab.

Steps to Reproduce:
1. Query
~~~
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
~~~

Actual results:
No datapoints found.

Expected results:
Graph of CPU usage.

Additional info:
sum_rate works on 4.8 and doesn't work on 4.9.
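Until a record rule for sum_irate exists on 4.8, a client can probe which rule variant the cluster exposes (e.g. via the `/api/v1/label/__name__/values` endpoint) and fall back accordingly. A minimal Python sketch of that selection logic, assuming the rule definition quoted later in this thread — `choose_cpu_query`, the constants, and the cluster-label-free raw expression are hypothetical names for illustration, not the console's actual implementation:

```python
RULE_IRATE = "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
RULE_RATE = "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate"

# Raw expression the 4.9+ recording rule evaluates (simplified: aggregation
# only, no node join), usable directly on a cluster where the rule is absent.
RAW_EXPR = (
    "sum by (namespace, pod, container) ("
    'irate(container_cpu_usage_seconds_total{job="kubelet",'
    'metrics_path="/metrics/cadvisor",image!=""}[5m]))'
)

def choose_cpu_query(available_metrics):
    """Pick the first recording rule the cluster exposes, else inline the raw expression."""
    for candidate in (RULE_IRATE, RULE_RATE):
        if candidate in available_metrics:
            return candidate
    return RAW_EXPR
```

`available_metrics` would be the set of names returned by the label-values API; on a 4.8 cluster only the sum_rate rule is present, so the sketch selects that.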
FYI: checked with 4.8.46, there is no node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate in the cluster:

~~~
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
~~~

no result

Also checked in 4.11.0-0.nightly-2022-07-26-154822; there, node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is defined in:

~~~
# oc -n openshift-monitoring get prometheusrules kubernetes-monitoring-rules -oyaml
...
    - name: k8s.rules
      rules:
      - expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
          ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
            1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
          )
        record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
~~~

but there is no such record rule in 4.8.46:

~~~
# oc get prometheusrules -A -oyaml | grep "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
~~~

no result

Then we should consider whether we need to backport the record rule to 4.8.
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is in the 4.9 file, https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/assets/control-plane/prometheus-rule.yaml#L530-L538, but is not found in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/assets/control-plane/prometheus-rule.yaml
That metric is defined in jsonnet/vendor/github.com/kubernetes-monitoring/kubernetes-mixin/rules/apps.libsonnet, which we use through the dependency chain CMO -> kube-prometheus -> node_exporter/docs/node-mixin -> kubernetes-monitoring/kubernetes-mixin.

In jsonnet/jsonnetfile.json in the CMO repo we changed the version of kube-prometheus between releases: branch release-4.8 depends on "release-0.8" and branch release-4.9 depends on "release-0.9". Transitive dependency resolution leads to [jsonnetfile.lock.json for 4.8](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/7d3bb79a4983052d421264a7e0f3c9b0d4a22268/rules/apps.libsonnet) that defines `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate`, and [jsonnetfile.lock.json for 4.9](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/fb9d8ed4bc4a3d6efac525f72e8a0d2c583a0fe2/rules/apps.libsonnet) that replaces that rule with a rule for `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`. That explains why the irate version is available in 4.9 and the rate version is available in 4.8.

However, those differences are not reflected in the UI. As Simon pointed out to me, https://github.com/openshift/console/pull/10396/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 for release-4.9 was cherry-picked to release-4.8 in https://github.com/openshift/console/pull/10496/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42, but without replacing irate with rate.
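The dependency-pinning argument above can be checked mechanically: each branch's jsonnetfile.lock.json records which commit of kubernetes-mixin it resolved. A small Python sketch of that lookup — `pinned_version` is a hypothetical helper, the lock-file shape (`dependencies` -> `source.git.remote` plus `version`) is assumed from jsonnet-bundler's format, and `sample_lock` is trimmed illustration data, not the full file:

```python
import json

def pinned_version(lock_text, repo_substring):
    """Return the pinned version for the first dependency whose git remote matches."""
    lock = json.loads(lock_text)
    for dep in lock.get("dependencies", []):
        remote = dep.get("source", {}).get("git", {}).get("remote", "")
        if repo_substring in remote:
            return dep.get("version")
    return None

# Trimmed example resembling the release-4.8 lock file.
sample_lock = """
{
  "dependencies": [
    {
      "source": {"git": {"remote": "https://github.com/kubernetes-monitoring/kubernetes-mixin.git", "subdir": ""}},
      "version": "7d3bb79a4983052d421264a7e0f3c9b0d4a22268"
    }
  ]
}
"""
```

Running `pinned_version` against the lock files of both branches would surface the two different kubernetes-mixin commits linked above.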
The bug doesn't exist on 4.12, but we need QE to verify that the CPU usage dashboards return data.
Checked in 4.12.0-0.nightly-2022-08-01-110026: no issue for CPU usage of deployments, see the attached picture. Also checked the dashboards on the console; the following dashboards use "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate", and there is no issue for these dashboards either.

4.12, namespace: openshift-config-managed
- configmap: grafana-dashboard-k8s-resources-cluster, dashboard: "Kubernetes / Compute Resources / Cluster"
- configmap: grafana-dashboard-k8s-resources-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Pods)"
- configmap: grafana-dashboard-k8s-resources-node, dashboard: "Kubernetes / Compute Resources / Node (Pods)"
- configmap: grafana-dashboard-k8s-resources-pod, dashboard: "Kubernetes / Compute Resources / Pod"
- configmap: grafana-dashboard-k8s-resources-workload, dashboard: "Kubernetes / Compute Resources / Workload"
- configmap: grafana-dashboard-k8s-resources-workloads-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Workloads)"

@Simon, since the target release is 4.1.20, shall we close it as NOTABUG?
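A verification pass like the one above can be scripted: pull the dashboard ConfigMaps and check which ones reference the recording rule. A Python sketch under assumed inputs — `dashboards_using` is a hypothetical helper taking a dict of ConfigMap name to dashboard JSON string (as `oc get configmap -o json` would yield from `.data`), and the sample entries are illustrative, not real ConfigMap contents; a plain substring check suffices because recording-rule names are distinctive tokens:

```python
def dashboards_using(metric, configmaps):
    """Return the sorted names of ConfigMaps whose dashboard JSON mentions the metric."""
    return sorted(name for name, data in configmaps.items() if metric in data)

# Illustrative sample data (trimmed stand-ins for real dashboard JSON).
sample = {
    "grafana-dashboard-k8s-resources-cluster":
        '{"expr": "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"}',
    "grafana-dashboard-k8s-resources-pod":
        '{"expr": "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"}',
    "grafana-dashboard-other":
        '{"expr": "instance:node_cpu_utilisation:rate1m"}',
}
```

On a 4.8 cluster the same check against `sum_irate` would come back empty, matching the symptom in the original report.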
(In reply to Junqi Zhao from comment #7)
> @Simon, since the target release is 4.1.20, shall we close it as NOTABUG?

should be "since the target release is 4.12.0"
@Junqi, we need to fix this in 4.8, so we'll have to verify all releases until 4.9. Maybe WORKSFORME is a better choice?
Since the issue does not exist in 4.12, closing it as WORKSFORME.
While this issue was closed as WORKSFORME in OCP v4.9 and newer, it still exists in OCP v4.8, where the CPU metrics for deployments are unavailable. Do you want us to create a new Jira ticket for this bug, or should we expect that it will not be fixed since OCP v4.8 is in "Maintenance Support"?
@rludva See https://bugzilla.redhat.com/show_bug.cgi?id=2112999 for the fix for 4.8. The PR for branch release-4.8, https://github.com/openshift/console/pull/11917, was merged today.