Bug 2008166 - Recording Rules which uses "up" metrics showing incorrect output.
Summary: Recording Rules which uses "up" metrics showing incorrect output.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Metrics
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.1
Assignee: Erkan Erol
QA Contact: Debarati Basu-Nag
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-27 13:29 UTC by Satyajit Bulage
Modified: 2021-12-13 19:59 UTC (History)

Fixed In Version: v4.9.1-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 19:59:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt cluster-network-addons-operator pull 1053 0 None Merged Use non-empty values in labels for Prometheus 2021-10-26 06:18:46 UTC
Github kubevirt cluster-network-addons-operator pull 1058 0 None Merged [release-0.58] Use non-empty values in labels for Prometheus 2021-10-26 06:18:46 UTC
Github kubevirt kubevirt pull 6570 0 None Merged Use honorLabels instead of labelDrop 2021-10-13 11:23:51 UTC
Github kubevirt kubevirt pull 6588 0 None Merged Fix recording rules based on up metrics 2021-10-27 09:49:54 UTC
Github kubevirt kubevirt pull 6683 0 None Merged [release-0.44] Fix issues in up metrics and alerts based on namespace labels 2021-11-08 12:53:30 UTC
Red Hat Product Errata RHBA-2021:5091 0 None None None 2021-12-13 19:59:17 UTC

Internal Links: 2026431

Description Satyajit Bulage 2021-09-27 13:29:44 UTC
Description of problem: The following recording rules show incorrect values: "kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total"



Version-Release number of selected component (if applicable): 4.9.0


How reproducible: 100%


Steps to Reproduce:
1. Query the recording rules listed above.

Actual results: The queries return incorrect values.


Expected results: The recording rules above should return correct values.


Additional info:

Comment 3 Erkan Erol 2021-09-28 08:28:41 UTC
Some time ago I found a similar problem and opened a GitHub issue for it, but it seems it was ignored and eventually closed. See https://github.com/kubevirt/kubevirt/issues/5383

Comment 4 Erkan Erol 2021-09-29 06:41:48 UTC
We decided to fix this bug in 4.9.1.

Comment 5 Erkan Erol 2021-10-11 11:28:32 UTC
I found that the issue is more complicated than we thought. 

First of all, as I stated in the GitHub issue above (https://github.com/kubevirt/kubevirt/issues/5383), we use a labeldrop configuration for the namespace label so that namespace labels on workload metrics are kept as they are. When workload metrics carry the correct namespace (the namespace of the workload instead of the control plane's), non-admin users in OpenShift can see them. This was implemented in this PR: https://github.com/kubevirt/kubevirt/pull/3125. Since that PR, Prometheus doesn't add or change namespace labels on KubeVirt's metrics, and workload metrics have the correct namespaces (added by our control plane). As a side effect, metrics generated by Prometheus itself, such as "up", have no namespace label. This broke the recording rules and alert definitions that depend on those metrics, and it has to be fixed.

I talked with the monitoring team and they said it is safe to use "honorLabels" now. I did some tests with it and it gives the expected result: all metrics have namespace labels (including "up") and the workload metrics have the correct namespace label. I proposed to remove labeldrop and use honorLabels to fix this issue.
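
The difference between the two approaches can be illustrated with a small simulation. This is only a sketch of the label-merging behaviour, not the actual ServiceMonitor change: it assumes the labeldrop is applied to the target labels before the scrape, and the namespace, pod, and VM names are hypothetical. With honor_labels disabled, Prometheus keeps the target label and renames a conflicting scraped label to exported_<name>; with it enabled, the scraped label wins. The synthetic "up" series carries only the target labels.

```python
def labeldrop(labels, name):
    """Return a copy of labels without the given label (like a labeldrop rule)."""
    return {k: v for k, v in labels.items() if k != name}


def merge_labels(target_labels, scraped_labels, honor_labels):
    """Combine target labels with the labels exposed by a scraped metric."""
    merged = dict(target_labels)
    for name, value in scraped_labels.items():
        if name in target_labels and not honor_labels:
            # Conflict without honor_labels: target label wins, scraped one is renamed.
            merged["exported_" + name] = value
        else:
            # No conflict, or honor_labels is set: the scraped label is kept.
            merged[name] = value
    return merged


# Hypothetical labels: the control-plane pod being scraped and a workload metric it exposes.
target = {"namespace": "openshift-cnv", "pod": "virt-handler-abc"}
scraped = {"namespace": "my-vms", "name": "vm1"}

# Old approach: drop "namespace" from the target labels.
old_target = labeldrop(target, "namespace")
up_old = dict(old_target)  # "up" has no namespace label at all
workload_old = merge_labels(old_target, scraped, honor_labels=False)

# Proposed approach: keep the target labels and honor the scraped ones.
up_new = dict(target)  # "up" keeps the control-plane namespace
workload_new = merge_labels(target, scraped, honor_labels=True)

print(up_old, workload_old)
print(up_new, workload_new)
```

In both cases the workload metric ends up with the workload's namespace, but only the second case leaves a namespace label on "up" for rules and alerts to select on.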


Secondly, as you can see in the attached screenshot shared by Satya, the "up" metric appears twice per pod in the UI and in Prometheus: one series is UP and the other is DOWN. This is another issue we need to solve. When I checked the ServiceMonitor objects and the Prometheus configuration, I noticed that Prometheus adds all endpoints in the control plane's namespace twice, once for the ServiceMonitor of KubeVirt and once for that of cluster-network-addons. KubeVirt's one supports HTTPS and the other supports HTTP, which is why we observe two records, one UP and one DOWN (Prometheus cannot access endpoints with the wrong protocol). I contacted the monitoring team and they found an issue in the upstream prometheus-operator repository: https://github.com/prometheus-operator/prometheus-operator/issues/4325. Even when we fix the issue above, we will still observe wrong data for our recording rules and alerts. The workaround for that bug is to use labels that have non-empty values. Example: instead of prometheus.kubevirt.io: "", we can use prometheus.kubevirt.io: "true".
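
To make the duplicate-target effect concrete, here is a rough sketch with made-up "up" samples. The pod names and the "scheme" label are hypothetical and only stand in for whatever distinguishes the two scrape targets; the real rule and alert expressions are not reproduced here. It shows that every pod has two series and that any check for up == 0 flags every pod, even though all pods are healthy.

```python
from collections import defaultdict

# Hypothetical "up" samples: each pod is scraped through two targets,
# one over HTTPS (reachable, value 1) and one over HTTP (unreachable, value 0).
samples = [
    ({"pod": "virt-api-1", "scheme": "https"}, 1),
    ({"pod": "virt-api-1", "scheme": "http"}, 0),
    ({"pod": "virt-api-2", "scheme": "https"}, 1),
    ({"pod": "virt-api-2", "scheme": "http"}, 0),
]


def series_per_pod(samples):
    """Count how many "up" series exist for each pod."""
    counts = defaultdict(int)
    for labels, _value in samples:
        counts[labels["pod"]] += 1
    return dict(counts)


def pods_reported_down(samples):
    """Return pods that have at least one series with up == 0."""
    return sorted({labels["pod"] for labels, value in samples if value == 0})


print(series_per_pod(samples))
print(pods_reported_down(samples))
```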

Comment 6 Shirly Radco 2021-10-19 18:30:03 UTC
Erkan, please implement this workaround.

Comment 7 Erkan Erol 2021-10-21 09:21:26 UTC
I opened two PRs as a workaround for the prometheus-operator issue.

https://github.com/kubevirt/kubevirt/pull/6652
https://github.com/kubevirt/cluster-network-addons-operator/pull/1053

Comment 8 Ram Lavi 2021-10-25 16:24:42 UTC
CNAO stable branch PR https://github.com/kubevirt/cluster-network-addons-operator/pull/1058

Comment 9 Ram Lavi 2021-10-26 14:40:21 UTC
Fix merged downstream (d/s) on the CNAO branch cnv-4.9-rhel-8: https://code.engineering.redhat.com/gerrit/c/cluster-network-addons-operator/+/285141

Comment 12 Debarati Basu-Nag 2021-11-22 20:31:10 UTC
Verified against 4.9.1: I was able to query the following recording rules, and they show correct values: "kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total".

Comment 18 errata-xmlrpc 2021-12-13 19:59:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.9.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5091

