Description of problem:
The following recording rules are showing incorrect values: kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total

Version-Release number of selected component (if applicable):
4.9.0

How reproducible:
100%

Steps to Reproduce:
1. Query the recording rules listed above.

Actual results:
The queries return incorrect values.

Expected results:
The recording rules should show accurate values.

Additional info:
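For context, these recording rules aggregate the Prometheus-generated "up" metric per control-plane component. A minimal sketch of what one such rule can look like (the expression, namespace placeholder, and pod name pattern are illustrative assumptions, not the shipped definition):

    # Illustrative PrometheusRule entry; the real rule in KubeVirt may differ.
    - record: kubevirt_virt_api_up_total
      # Counts virt-api scrape targets that Prometheus can currently reach.
      # The rule returns wrong values if the "up" series carry a wrong or
      # missing namespace label.
      expr: sum(up{namespace="<install-namespace>", pod=~"virt-api-.*"}) or vector(0)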
Some time ago I found something similar and opened a GitHub issue for it, but it seems it was ignored and eventually closed. See https://github.com/kubevirt/kubevirt/issues/5383
We decided to fix this bug for 4.9.1.
I found that the issue is more complicated than we thought.

First, as I stated in the GitHub issue above (https://github.com/kubevirt/kubevirt/issues/5383), we use a labeldrop relabeling for the namespace label so that the namespace labels on workload metrics are kept as they are. When workload metrics carry the correct namespace (the namespace of the workload instead of the control plane), non-admin users in OpenShift can see them. This was implemented in the past with this PR: https://github.com/kubevirt/kubevirt/pull/3125 Since that PR, Prometheus does not add or overwrite namespace labels on KubeVirt's metrics, and workload metrics carry the correct namespaces (added by our control plane). As a side effect, the metrics generated by Prometheus itself, such as "up", have no namespace label at all, which broke the recording rules and alert definitions that depend on those metrics; this has to be fixed. I talked with the monitoring team and they said it is safe to use "honorLabels" now. I did some tests with it and it gives the expected result: all metrics have namespace labels (including "up"), and the workload metrics keep the correct namespace label. I propose to remove the labeldrop and use honorLabels to fix this issue.

Second, as you can see in the attached screenshot shared by Satya, the "up" metric appears twice per pod in the UI and in Prometheus: one series is UP and the other is DOWN. This is another issue we need to solve. When I checked the ServiceMonitor objects and the Prometheus configuration, I noticed that Prometheus adds all endpoints in the control plane's namespace twice, once for the kubevirt ServiceMonitor and once for the cluster-network-addons one. KubeVirt's endpoints serve HTTPS, the others serve HTTP, which is why we observe two series per endpoint, one UP and one DOWN (Prometheus cannot scrape an endpoint with the wrong protocol). I contacted the monitoring team and they found an issue in the upstream prometheus-operator repository. See https://github.com/prometheus-operator/prometheus-operator/issues/4325

Even when we fix the first issue, we will still observe wrong data in our recording rules and alerts because of that bug. The workaround is to use selector labels with non-empty values; for example, instead of prometheus.kubevirt.io: "", we can use prometheus.kubevirt.io: "true". A sketch combining both changes follows below.
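To illustrate both changes, here is a minimal sketch of a ServiceMonitor combining them (the object name, port name, and overall shape are assumptions for illustration, not the exact manifests from the PRs):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: kubevirt            # name is illustrative
    spec:
      selector:
        matchLabels:
          # Workaround for prometheus-operator issue 4325: a non-empty label
          # value so this selector only matches the intended Service.
          prometheus.kubevirt.io: "true"
      endpoints:
        - port: metrics         # port name is an assumption
          scheme: https
          # Keep the namespace label set by the scraped target instead of
          # letting Prometheus rewrite it; this replaces the previous
          # "action: labeldrop" relabeling for the namespace label.
          honorLabels: true

The Service objects selected by this monitor must carry the same non-empty label value (prometheus.kubevirt.io: "true") for the selector to match.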
Erkan, please implement this workaround.
I opened two PRs implementing the workaround for the prometheus-operator issue: https://github.com/kubevirt/kubevirt/pull/6652 https://github.com/kubevirt/cluster-network-addons-operator/pull/1053
CNAO stable branch PR: https://github.com/kubevirt/cluster-network-addons-operator/pull/1058
Fix merged downstream on CNAO branch cnv-4.9-rhel-8: https://code.engineering.redhat.com/gerrit/c/cluster-network-addons-operator/+/285141
Verified against 4.9.1: the following recording rules can be queried and show correct values: kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total
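For reference, verification amounts to running each rule name as a query, either in the OpenShift console metrics page or against the Prometheus API; the expected value equals the number of reachable pods of the component (an assumption based on the rule sketch earlier in this bug):

    # Each query should return a single series; e.g. a value of 2 for
    # kubevirt_virt_api_up_total with two healthy virt-api replicas.
    kubevirt_virt_api_up_total
    kubevirt_virt_operator_up_total
    kubevirt_virt_handler_up_total
    kubevirt_virt_controller_up_total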
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.9.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5091