Bug 2109509 - sum_irate doesn't work in OCP 4.8
Summary: sum_irate doesn't work in OCP 4.8
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Juan Rodriguez
QA Contact: Junqi Zhao
URL:
Whiteboard: wip
Depends On:
Blocks: 2112912
Reported: 2022-07-21 13:00 UTC by Danila Kiselev
Modified: 2022-08-29 13:19 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2112912 (view as bug list)
Environment:
Last Closed: 2022-08-03 03:09:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Danila Kiselev 2022-07-21 13:00:11 UTC
Description of problem:

sum_irate (which works in OCP 4.9) does not work in OCP 4.8, specifically in the CPU graph in the Deployment section.


Version-Release number of selected component (if applicable):
Openshift Container Platform 4.8


How reproducible:
Very, tested on Quicklab


Steps to Reproduce:

1. Query 

~~~
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
~~~


Actual results:

No datapoints found.


Expected results:

Graph of cpu usage.


Additional info:

sum_rate works on 4.8 and doesn't work on 4.9.

Comment 2 Junqi Zhao 2022-07-27 10:09:11 UTC
FYI: checked with 4.8.46, no node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate in cluster
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
no result

also checked in 4.11.0-0.nightly-2022-07-26-154822, node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is defined in 
# oc -n openshift-monitoring get prometheusrules kubernetes-monitoring-rules -oyaml
...
  - name: k8s.rules
    rules:
    - expr: |
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

but no such record rule in 4.8.46
# oc get prometheusrules -A -oyaml | grep "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
no result

So we should consider whether the recording rule needs to be backported to 4.8.
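
The check above can be sketched as a small reusable helper that reports which of the two recording-rule names shows up in a label-values response. This is only an illustration: the function name `cpu_rule_variant` is hypothetical, and it assumes the JSON returned by `/api/v1/label/__name__/values` (raw or pretty-printed) arrives on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc or Prometheus): reads the JSON from
# /api/v1/label/__name__/values on stdin and prints which CPU recording
# rule variant is present: sum_irate, sum_rate, or none.
cpu_rule_variant() {
  prefix='node_namespace_pod_container:container_cpu_usage_seconds_total'
  input=$(cat)
  # Match the quoted metric name so that sum_rate never matches sum_irate.
  if printf '%s\n' "$input" | grep -q "\"${prefix}:sum_irate\""; then
    echo sum_irate
  elif printf '%s\n' "$input" | grep -q "\"${prefix}:sum_rate\""; then
    echo sum_rate
  else
    echo none
  fi
}
```

For example, the output of the curl command in this comment could be piped into `cpu_rule_variant` instead of `jq | grep`.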

Comment 4 Juan Rodriguez 2022-08-01 09:50:54 UTC
That metric is defined in jsonnet/vendor/github.com/kubernetes-monitoring/kubernetes-mixin/rules/apps.libsonnet, which we use through the dependency chain CMO -> kube-prometheus -> node_exporter/docs/node-mixin -> kubernetes-monitoring/kubernetes-mixin. At jsonnet/jsonnetfile.json in CMO repo we changed the version of kube-prometheus between releases: branch release-4.8 depends on "release-0.8" and branch depends on "release-0.9". Transitive dependency resolution leads to [jsonnetfile.lock.json for 4.8](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/7d3bb79a4983052d421264a7e0f3c9b0d4a22268/rules/apps.libsonnet) that defines `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate` and [jsonnetfile.lock.json for 4.9](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/fb9d8ed4bc4a3d6efac525f72e8a0d2c583a0fe2/rules/apps.libsonnet) that replaces that rule with a rule for `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`. That explains why the irate version is available in 4.9 and the rate version is available in 4.8

However, those differences are not shown in the UI. As Simon pointed out to me, https://github.com/openshift/console/pull/10396/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 for release-4.9 was cherry picked for release-4.8 in https://github.com/openshift/console/pull/10496/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 but without replacing irate with rate
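
To make the mismatch easier to check, the rule names reported in this thread can be written down as a lookup. This is a sketch under stated assumptions: `expected_cpu_rule` is a hypothetical name, and only the versions actually mentioned in this bug (4.8, 4.9, 4.11, 4.12) are covered; anything else is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Hypothetical lookup: maps an OCP version string to the CPU recording
# rule reported for it in this bug. Versions not mentioned in the thread
# return "unknown".
expected_cpu_rule() {
  prefix='node_namespace_pod_container:container_cpu_usage_seconds_total'
  case "$1" in
    4.8|4.8.*)
      echo "${prefix}:sum_rate" ;;
    4.9|4.9.*|4.11|4.11.*|4.12|4.12.*)
      echo "${prefix}:sum_irate" ;;
    *)
      echo unknown ;;
  esac
}
```

A console query that uses a name other than the one printed for the cluster's version would be expected to return no datapoints, which is the symptom described in this bug.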

Comment 5 Simon Pasquier 2022-08-01 13:37:00 UTC
The bug doesn't exist on 4.12, but we need QE to verify that the CPU usage dashboards return data.

Comment 7 Junqi Zhao 2022-08-02 03:22:12 UTC
Checked in 4.12.0-0.nightly-2022-08-01-110026: no issue for CPU usage of deployments, see the attached picture. Also checked the dashboards on the console. The following dashboards use "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate", and there is no issue with them either:
4.12, namespace: openshift-config-managed
configmap: grafana-dashboard-k8s-resources-cluster, dashboard: "Kubernetes / Compute Resources / Cluster"
configmap: grafana-dashboard-k8s-resources-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Pods)"
configmap: grafana-dashboard-k8s-resources-node, dashboard: "Kubernetes / Compute Resources / Node (Pods)"
configmap: grafana-dashboard-k8s-resources-pod, dashboard: "Kubernetes / Compute Resources / Pod"
configmap: grafana-dashboard-k8s-resources-workload, dashboard: "Kubernetes / Compute Resources / Workload"
configmap: grafana-dashboard-k8s-resources-workloads-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Workloads)"

@Simon, since the target release is 4.1.20, shall we close it as NOTABUG?

Comment 9 Junqi Zhao 2022-08-02 03:26:22 UTC
(In reply to Junqi Zhao from comment #7)
> @Simon, since the target release is 4.1.20, shall we close it as NOTABUG?
should be
since the target release is 4.12.0

Comment 10 Simon Pasquier 2022-08-02 14:21:57 UTC
@Junqi we need to fix this in 4.8 so we'll have to verify for all releases until 4.9. Maybe WORKSFORME is a better choice?

Comment 11 Junqi Zhao 2022-08-03 03:09:28 UTC
Since the issue does not exist in 4.12, closing it as WORKSFORME.

Comment 12 Radomir Ludva 2022-08-22 08:24:37 UTC
While this issue was closed as WORKSFORME in OCP v4.9 and newer, it still exists in OCP v4.8, and the metrics are unavailable for CPU and deployments. Do you want to create a new Jira ticket for this bug, or should we expect that it will not be fixed because OCP v4.8 is in "Maintenance support"?

Comment 13 Juan Rodriguez 2022-08-29 13:19:32 UTC
@rludva see https://bugzilla.redhat.com/show_bug.cgi?id=2112999 for the fix for 4.8. https://github.com/openshift/console/pull/11917 for branch release-4.8 was merged today

