Bug 2028071
| Summary: | [ACM Observability] Discrepancy in the values displayed by searching for "record" and the corresponding "expression" on grafana dashboard | ||
|---|---|---|---|
| Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | Mihir Lele <mlele> |
| Component: | Core Services / Observability | Assignee: | Chunlin Yang <chuyang> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | rhacm-2.3.z | CC: | cedric.girard, crizzo, feven, juhsu, lcao, llan |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | rhacm-2.4.5 | Flags: | bot-tracker-sync: rhacm-2.4.z+, bot-tracker-sync: needinfo+ |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-06-07 13:02:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Mihir Lele
2021-12-01 12:45:38 UTC
G2Bsync 988556605 comment morvencao Wed, 08 Dec 2021 07:11:32 UTC

G2Bsync This should be caused by the same issue as https://github.com/open-cluster-management/backlog/issues/18149, Bugzilla 2026577.

*** This bug has been marked as a duplicate of bug 2026577 ***

I am re-opening this bz. The customer upgraded to 2.4.2 but the issue still persists.

regards,
Mihir
G2Bsync 1064750497 comment
morvencao Fri, 11 Mar 2022 04:04:41 UTC
G2Bsync We need to know the scrape interval that the user set in the MCO CR; it can be retrieved by:
```
oc get mco observability -o jsonpath="{.spec.observabilityAddonSpec.interval}"
```
We changed the default value from `30` to `300` seconds since 2.4; customers may still be using the old default value (30s).
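For reference, a minimal sketch of where this interval lives in the MultiClusterObservability CR (field path taken from the `oc get` command above; the `apiVersion` and the 30-second value are assumptions for illustration, shown here only as an example of restoring the pre-2.4 default):

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec:
    # Scrape interval in seconds; the default changed from 30 to 300 in ACM 2.4.
    interval: 30
```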
I was unable to reproduce this, but I checked the chat history with Mihir: he reproduced it in an ACM env with an unstable managed cluster. I suspect this is related to the OpenShift Prometheus, so I have involved the OpenShift monitoring team in a Slack channel; I will get back here once we have investigation results.

@mlele Did you check whether the clock of the managed cluster is synchronized with the hub cluster? Metrics are time-series data, so a query on the hub cluster will differ from the same query on the managed cluster if the clocks are not synchronized.

I checked your newly added `recordingrule` entries in observability-metrics-custom-allowlist:
```
- record: custom_service_cost_hour
expr: 27.78/avg_over_time(clamp_min(kube_namespace_labels,scalar(count(kube_namespace_labels{namespace!~\"openshift.*|kube.*\"}))) [1h:1m])
- record: custom_platform_cost_hour
expr: avg_over_time(clamp_min(kube_namespace_labels,scalar(sum(kube_node_status_condition{condition=\"Ready\", status=\"true\", node!~\".*worker.*\"}==1))) [1h:1m]) * 0.37
```
`avg_over_time` calculates a distinct average value for every selected time series, so I believe the query results will differ when the raw samples are not synchronized. Have you tried another expression without `avg_over_time`? Can you reproduce the issue with that?
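To make the sampling-interval point concrete, here is a small synthetic sketch (invented signal and helper function, not cluster data or real PromQL) of how averaging the same signal sampled every 30s versus every 300s can produce very different results:

```python
# Illustration (synthetic data): why an avg_over_time-style average computed
# over sparse samples can differ from the same average over dense samples of
# the same underlying signal.

def avg_over_window(samples, window):
    """Average the sample values whose timestamps fall in [0, window)."""
    vals = [v for t, v in samples if 0 <= t < window]
    return sum(vals) / len(vals)

# A spiky signal sampled every 30s (as on the managed cluster's Prometheus):
# it reads 100 during minutes 0, 10, 20, ... and 0 otherwise.
dense = [(t, 100.0 if (t // 60) % 10 == 0 else 0.0) for t in range(0, 3600, 30)]

# The same signal seen only every 300s (the ACM hub default since 2.4) happens
# to land on the spikes disproportionately often.
sparse = [(t, v) for t, v in dense if t % 300 == 0]

print(avg_over_window(dense, 3600))   # → 10.0
print(avg_over_window(sparse, 3600))  # → 50.0
```

The two averages disagree even though both sample streams come from the same signal, which mirrors why the recording rule (evaluated against dense samples) and the hub-side expression (evaluated against sparse samples) diverge.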
@mlele Could you set the query type to RANGE when you query the original expression and the new record? That way we can use the history data to analyze whether there are gaps between samples.

I can't see any managed cluster or metrics after logging in to the env with my Google account. Could you give me the permission? BTW, it looks like the problematic recording rules (the 3rd and 4th records) do use `avg_over_time`, which calculates a distinct average value for every selected time series, and the query results will differ because the raw samples (data points) on the ACM side are much sparser than in the original OpenShift Prometheus.

I have added custom_sum_namespace_cpu_usage_cost in the configmap

I found that the `custom_sum_namespace_cpu_usage_cost` query from Grafana only contains data from the OCP 4.8 managed cluster. The metrics of OpenShift clusters changed starting with OCP 4.9: the metric `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate` used in the recording rule no longer exists on OCP 4.9+ clusters. You can change it to the irate version, `node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate`, and try again. I updated the expr for the recording rule `custom_sum_namespace_cpu_usage_cost` from
```
(sum(kube_pod_container_resource_requests{resource=\"cpu\"}) by (cluster, namespace)
>
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{}) by (cluster, namespace)
or
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{}) by (cluster, namespace)
)* 0.133
```
to
```
(sum(kube_pod_container_resource_requests{resource=\"cpu\"}) by (cluster, namespace)
>
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{}) by (cluster, namespace)
or
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{}) by (cluster, namespace)
)* 0.133
```
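As an aside, a toy sketch of why the `sum_rate` and `sum_irate` variants can report different values (simplified, assumed semantics on made-up counter samples; real Prometheus `rate()` also extrapolates to window boundaries):

```python
# Simplified model of Prometheus rate() vs irate() over one counter series:
# rate() averages the increase over the whole window, while irate() looks only
# at the last two samples, so it reacts faster to recent changes but is noisier.

def prom_rate(samples):
    """Per-second rate over the full window: (last - first) / elapsed."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

def prom_irate(samples):
    """Instant per-second rate from the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

# Counter grows by 1/s for four minutes, then by 10/s in the last minute.
samples = [(0, 0), (60, 60), (120, 120), (180, 180), (240, 240), (300, 840)]
print(prom_rate(samples))   # → 2.8
print(prom_irate(samples))  # → 10.0
```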
Now the recording-rule query and the original expression query return almost the same results, with only a slight difference.
After checking the range query data, I found the metrics-collector missed some data points; I am not sure whether this was caused by a network issue or something else.
For the other recording rules with `avg_over_time`, as I mentioned, the differing query results are expected, because the raw samples (data points) on the ACM side are much sparser than in the original OpenShift Prometheus.
@llan @smeduri Adding Marco and Subbarao to see if we have a plan to fix this in the future. For now, it is not easy to fix the recording rules that contain `avg_over_time`, because the evaluations happen in different places: one on the managed cluster with the original OpenShift raw samples, the other on the ACM hub side, where the raw samples are much sparser.

G2Bsync 1160908730 comment marcolan018 Mon, 20 Jun 2022 22:37:16 UTC

G2Bsync For the recording rules with `avg_over_time`, the recorded value will differ from the expression because they calculate the average over different original data sets, as LongLong @morvencao mentioned above. The recording rule calculates the average based on the data set in the OCP Prometheus, which has a scrape interval of about 30 seconds. On the Grafana side, querying the original expression calculates the average based on the data stored in the ACM hub, which is scraped at the default 300-second interval. This means the query result of the recording rule reflects the status of the metric more accurately, and this is the desired behavior: users can get more detail about the managed clusters without scraping too much time-series data from them, which is the purpose of the recording-rule feature.

To compare against the query result of a recording rule, the expression should be run on the OCP Prometheus side in the target managed cluster, not on the hub side; those two results should be the same or similar. If customers insist on getting the same or similar results for the recording-rule query and the related expression, they need to add the recording rule as a custom rule (https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.4/html-single/observability/index#creating-custom-rules), not in the custom allowlist.
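For readers following the custom-rule route, a sketch of the shape such a ConfigMap could take, based on the linked ACM 2.4 documentation (the ConfigMap name, namespace, and data key are taken from that doc and should be verified against your ACM version; the rule itself reuses the first record from this bug):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: custom-cost-rules
        rules:
          # Evaluated on the hub's Thanos ruler as a true recording rule,
          # rather than being pushed through the metrics allowlist.
          - record: custom_service_cost_hour
            expr: 27.78/avg_over_time(clamp_min(kube_namespace_labels,scalar(count(kube_namespace_labels{namespace!~"openshift.*|kube.*"}))) [1h:1m])
```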