Bug 2109509

Summary:	sum_irate doesn't work in OCP 4.8
Product:	OpenShift Container Platform	Reporter:	Danila Kiselev <dkiselev>
Component:	Management Console	Assignee:	Juan Rodriguez <jrodrig>
Status:	CLOSED WORKSFORME	QA Contact:	Junqi Zhao <juzhao>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.8	CC:	amuller, anpicker, dgautam, jrodrig, rludva, spasquie
Target Milestone:	---
Target Release:	4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	wip
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2112912		Environment:
Last Closed:	2022-08-03 03:09:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2112912

Description Danila Kiselev 2022-07-21 13:00:11 UTC

Description of problem:

The sum_irate recording rule (which works in OCP 4.9) returns no data in OCP 4.8, specifically in the CPU graph in the Deployment section.


Version-Release number of selected component (if applicable):
Openshift Container Platform 4.8


How reproducible:
Very, tested on Quicklab


Steps to Reproduce:

1. Query 

~~~
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
~~~


Actual results:

No datapoints found.


Expected results:

Graph of cpu usage.


Additional info:

sum_rate works on 4.8 and doesn't work on 4.9.

Comment 2 Junqi Zhao 2022-07-27 10:09:11 UTC

FYI: checked with 4.8.46, no node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate in cluster
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
no result

also checked in 4.11.0-0.nightly-2022-07-26-154822, node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is defined in 
# oc -n openshift-monitoring get prometheusrules kubernetes-monitoring-rules -oyaml
...
  - name: k8s.rules
    rules:
    - expr: |
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

but no such record rule in 4.8.46
# oc get prometheusrules -A -oyaml | grep "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
no result

We should then consider whether the recording rule needs to be backported to 4.8.
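
A minimal sketch of the same check as a reusable function, assuming the rules have first been dumped to a local file (for example with `oc get prometheusrules -A -oyaml > rules.yaml`). The function name `cpu_rule_variants` and the file name are hypothetical; it only reports which of the two recording rule names discussed in this bug appear on a `record:` line.

```shell
# Hypothetical helper: list which CPU recording-rule variants a rules dump defines.
# Prints "sum_rate", "sum_irate", both, or nothing.
cpu_rule_variants() {
  rules_file="$1"
  base='node_namespace_pod_container:container_cpu_usage_seconds_total'
  for suffix in sum_rate sum_irate; do
    # Match only "record:" lines that end with the full rule name.
    if grep -q "record: ${base}:${suffix}[[:space:]]*\$" "$rules_file"; then
      echo "$suffix"
    fi
  done
}

# Example (assumes rules.yaml was produced by the oc command above):
#   cpu_rule_variants rules.yaml
```

Based on the findings in this comment, a 4.8 dump would be expected to lack `sum_irate`, and a 4.9 or later dump to include it.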

Comment 3 Junqi Zhao 2022-07-27 10:11:56 UTC

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is in 4.9 file:
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/assets/control-plane/prometheus-rule.yaml#L530-L538

not found in
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/assets/control-plane/prometheus-rule.yaml

Comment 4 Juan Rodriguez 2022-08-01 09:50:54 UTC

That metric is defined in jsonnet/vendor/github.com/kubernetes-monitoring/kubernetes-mixin/rules/apps.libsonnet, which we use through the dependency chain CMO -> kube-prometheus -> node_exporter/docs/node-mixin -> kubernetes-monitoring/kubernetes-mixin. In jsonnet/jsonnetfile.json in the CMO repo we changed the version of kube-prometheus between releases: branch release-4.8 depends on "release-0.8" and branch release-4.9 depends on "release-0.9". Transitive dependency resolution leads to [jsonnetfile.lock.json for 4.8](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/7d3bb79a4983052d421264a7e0f3c9b0d4a22268/rules/apps.libsonnet) that defines `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate`, and to [jsonnetfile.lock.json for 4.9](https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/jsonnet/jsonnetfile.lock.json) using a [version of kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/fb9d8ed4bc4a3d6efac525f72e8a0d2c583a0fe2/rules/apps.libsonnet) that replaces that rule with a rule for `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`. That explains why the irate version is available in 4.9 and the rate version is available in 4.8.
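
To compare the pinned kubernetes-mixin commit between branches, something like the following could be used. This is a sketch under assumptions: it assumes `jq` is installed and that the lock file follows the jsonnet-bundler layout where each entry under `.dependencies[]` carries `.source.git.remote` and `.version`. The function name `pinned_mixin_version` is hypothetical.

```shell
# Hypothetical helper: print the kubernetes-mixin commit pinned in a
# jsonnet-bundler lock file (assumed layout: .dependencies[].source.git.remote
# and .dependencies[].version).
pinned_mixin_version() {
  lock_file="$1"
  jq -r '
    .dependencies[]
    # Entries without a git source get an empty remote and are skipped.
    | select((.source.git.remote // "") | test("kubernetes-monitoring/kubernetes-mixin"))
    | .version
  ' "$lock_file"
}

# Example (run from a checkout of each CMO release branch):
#   pinned_mixin_version jsonnet/jsonnetfile.lock.json
```

Running it on release-4.8 and release-4.9 checkouts should show two different commits if the lock files match the assumed layout.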

However, those differences are not shown in the UI. As Simon pointed out to me, https://github.com/openshift/console/pull/10396/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 for release-4.9 was cherry-picked for release-4.8 in https://github.com/openshift/console/pull/10496/files#diff-cc6bcbaa4f8a10821369cfc3ac32ab167abd338cd628729709c3c6db49befd6eR42 but without replacing irate with rate.
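
For ad hoc querying across both releases, one option is a PromQL expression that prefers the irate rule and falls back to the rate rule through the `or` operator. This is only an illustration, not the change made in the console PRs above. The helper name `cpu_usage_query` and the namespace label filter are assumptions.

```shell
# Hypothetical helper: build a PromQL expression that returns the sum_irate
# series when they exist and otherwise the sum_rate series for the namespace.
cpu_usage_query() {
  namespace="$1"
  base='node_namespace_pod_container:container_cpu_usage_seconds_total'
  printf '%s:sum_irate{namespace="%s"} or %s:sum_rate{namespace="%s"}\n' \
    "$base" "$namespace" "$base" "$namespace"
}

# Example (assumes $token is set as in comment 2 and the command runs where
# the thanos-querier service is reachable):
#   curl -k -G -H "Authorization: Bearer $token" \
#     --data-urlencode "query=$(cpu_usage_query openshift-monitoring)" \
#     'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query'
```

Because `or` keeps every series from the left-hand side and adds only right-hand series with no matching label set, the expression yields data on a cluster that defines either rule.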

Comment 5 Simon Pasquier 2022-08-01 13:37:00 UTC

The bug doesn't exist on 4.12, but we need QE to verify that the CPU usage dashboards return data.

Comment 7 Junqi Zhao 2022-08-02 03:22:12 UTC

Checked in 4.12.0-0.nightly-2022-08-01-110026: no issue for CPU usage of deployments (see the attached picture). Also checked the dashboards on the console; the following dashboards use "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate" and show no issue either:
4.12, namespace: openshift-config-managed
configmap: grafana-dashboard-k8s-resources-cluster, dashboard: "Kubernetes / Compute Resources / Cluster"
configmap: grafana-dashboard-k8s-resources-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Pods)"
configmap: grafana-dashboard-k8s-resources-node, dashboard: "Kubernetes / Compute Resources / Node (Pods)"
configmap: grafana-dashboard-k8s-resources-pod, dashboard: "Kubernetes / Compute Resources / Pod"
configmap: grafana-dashboard-k8s-resources-workload, dashboard: "Kubernetes / Compute Resources / Workload"
configmap: grafana-dashboard-k8s-resources-workloads-namespace, dashboard: "Kubernetes / Compute Resources / Namespace (Workloads)"

@Simon, since the target release is 4.1.20, shall we close it as NOTABUG?

Comment 9 Junqi Zhao 2022-08-02 03:26:22 UTC

(In reply to Junqi Zhao from comment #7)
> @Simon, since the target release is 4.1.20, shall we close it as NOTABUG?
should be
since the target release is 4.12.0

Comment 10 Simon Pasquier 2022-08-02 14:21:57 UTC

@Junqi we need to fix this in 4.8 so we'll have to verify for all releases until 4.9. Maybe WORKSFORME is a better choice?

Comment 11 Junqi Zhao 2022-08-03 03:09:28 UTC

Since the issue does not exist in 4.12, closing it as WORKSFORME.

Comment 12 Radomir Ludva 2022-08-22 08:24:37 UTC

While this issue was closed as WORKSFORME for OCP v4.9 and newer, it still exists in OCP v4.8, where the CPU metrics for deployments are unavailable. Do you want to create a new Jira ticket for this bug, or should we expect that it will not be fixed because OCP v4.8 is in "Maintenance support"?

Comment 13 Juan Rodriguez 2022-08-29 13:19:32 UTC

@rludva see https://bugzilla.redhat.com/show_bug.cgi?id=2112999 for the fix for 4.8. https://github.com/openshift/console/pull/11917 for branch release-4.8 was merged today