Bug 2109509
Summary: | sum_irate doesn't work in OCP 4.8 | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Danila Kiselev <dkiselev>
Component: | Management Console | Assignee: | Juan Rodriguez <jrodrig>
Status: | CLOSED WORKSFORME | QA Contact: | Junqi Zhao <juzhao>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.8 | CC: | amuller, anpicker, dgautam, jrodrig, rludva, spasquie
Target Milestone: | --- | |
Target Release: | 4.12.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | wip | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 2112912 | Environment: |
Last Closed: | 2022-08-03 03:09:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2112912 | |
Description
Danila Kiselev
2022-07-21 13:00:11 UTC
FYI: checked with 4.8.46; there is no node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate in the cluster:

    # token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
    # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
    no result

Also checked in 4.11.0-0.nightly-2022-07-26-154822, where node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is defined:

    # oc -n openshift-monitoring get prometheusrules kubernetes-monitoring-rules -oyaml
    ...
    - name: k8s.rules
      rules:
      - expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
          ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
            1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
          )
        record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

But there is no such record rule in 4.8.46:

    # oc get prometheusrules -A -oyaml | grep "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
    no result

So we should consider whether we need to backport the record rule to 4.8.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is in the 4.9 file: https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/assets/control-plane/prometheus-rule.yaml#L530-L538
It is not found in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/assets/control-plane/prometheus-rule.yaml

That metric is defined in jsonnet/vendor/github.com/kubernetes-monitoring/kubernetes-mixin/rules/apps.libsonnet, which we use through the dependency chain CMO -> kube-prometheus -> node_exporter/docs/node-mixin -> kubernetes-monitoring/kubernetes-mixin.
In jsonnet/jsonnetfile.json in the CMO repo we changed the version of kube-prometheus between releases: branch release-4.8 depends on "release-0.8" and branch release-4.9 depends on "release-0.9". Transitive dependency resolution leads to [jsonnetfile.lock.json for 4.8](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/7d3bb79a4983052d421264a7e0f3c9b0d4a22268/rules/apps.libsonnet) that defines `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate`, and to [jsonnetfile.lock.json for 4.9](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/fb9d8ed4bc4a3d6efac525f72e8a0d2c583a0fe2/rules/apps.libsonnet) that replaces that rule with a rule for `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`. That explains why the irate version is available in 4.9 and the rate version is available in 4.8.

However, those differences are not reflected in the UI. As Simon pointed out to me, https://github.com/openshift/console/pull/10396/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 for release-4.9 was cherry-picked for release-4.8 in https://github.com/openshift/console/pull/10496/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42, but without replacing irate with rate.

The bug doesn't exist on 4.12, but we need QE to verify that the CPU usage dashboards return data.
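The version split described above (the `sum_rate` rule in 4.8, the `sum_irate` rule from 4.9 on) can be written down as a small shell helper, for example when scripting the per-release checks. This is only a minimal sketch and not part of any OpenShift tooling: the function names `expected_cpu_record` and `has_metric_name` are hypothetical, and the mapping for 4.10 is an assumption, since the thread only reports 4.8, 4.9, 4.11 and 4.12.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU usage recording rule name that this
# thread reports for a given OCP 4.x minor version (pass only the minor, e.g. 8).
expected_cpu_record() {
  minor="$1"
  prefix="node_namespace_pod_container:container_cpu_usage_seconds_total"
  case "$minor" in
    # 4.8 pulls in the kubernetes-mixin version that defines the rate rule.
    8) echo "${prefix}:sum_rate" ;;
    # 4.9, 4.11 and 4.12 are reported with the irate rule; 4.10 is assumed.
    9|1[0-2]) echo "${prefix}:sum_irate" ;;
    *) echo "unknown minor version: $minor" >&2; return 1 ;;
  esac
}

# Hypothetical helper: read metric names from stdin, one per line, and
# succeed only if one line matches the given name exactly.
has_metric_name() {
  grep -qxF -- "$1"
}

# Possible use on a cluster, reusing the curl call from the first comment
# and letting jq print one metric name per line:
#   ... 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
#     | jq -r '.data[]' | has_metric_name "$(expected_cpu_record 8)"
```

Matching whole lines with `grep -x` keeps a check for `sum_rate` from accidentally matching some longer metric name that merely contains it.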
Checked in 4.12.0-0.nightly-2022-08-01-110026: no issue for CPU usage of deployments, see the attached picture. Also checked the dashboards on the console. The following dashboards use "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate", and there is no issue for them either.

4.12, namespace: openshift-config-managed

- configmap: grafana-dashboard-k8s-resources-cluster, dashboard: "Kubernetes / Compute Resources / Cluster"
- configmap: grafana-dashboard-k8s-resources-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Pods)"
- configmap: grafana-dashboard-k8s-resources-node, dashboard: "Kubernetes / Compute Resources / Node (Pods)"
- configmap: grafana-dashboard-k8s-resources-pod, dashboard: "Kubernetes / Compute Resources / Pod"
- configmap: grafana-dashboard-k8s-resources-workload, dashboard: "Kubernetes / Compute Resources / Workload"
- configmap: grafana-dashboard-k8s-resources-workloads-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Workloads)"

@Simon, since the target release is 4.1.20, shall we close it as NOTABUG?

(In reply to Junqi Zhao from comment #7)
> @Simon, since the target release is 4.1.20, shall we close it as NOTABUG?

That should read: since the target release is 4.12.0.

@Junqi we need to fix this in 4.8, so we'll have to verify for all releases until 4.9. Maybe WORKSFORME is a better choice?

Since the issue does not exist in 4.12, closing it as WORKSFORME.

While this issue was closed as WORKSFORME in OCP v4.9 and newer, it still exists in OCP v4.8, and the metrics are unavailable for CPU and deployments. Do you want to create a new Jira ticket for this bug, or should we expect that this bug will not be fixed because OCP v4.8 is in "Maintenance support"?

@rludva see https://bugzilla.redhat.com/show_bug.cgi?id=2112999 for the fix for 4.8.

https://github.com/openshift/console/pull/11917 for branch release-4.8 was merged today.