Description of problem:
The following recording rules are showing incorrect values: kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total

Version-Release number of selected component (if applicable):
4.9.0

How reproducible:
100%

Steps to Reproduce:
1. Query the recording rules listed above.

Actual results:
The queries return incorrect values.

Expected results:
The recording rules should show accurate values.

Additional info:
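For context, these recording rules aggregate the Prometheus-generated "up" metric per control-plane component. A minimal sketch of what one such rule can look like (the expression, namespace placeholder, and pod name pattern are illustrative assumptions, not the shipped definition):

    # Illustrative PrometheusRule entry; the real rule in KubeVirt may differ.
    - record: kubevirt_virt_api_up_total
      # Counts virt-api scrape targets that Prometheus can currently reach.
      # The rule returns wrong values if the "up" series carry a wrong or
      # missing namespace label.
      expr: sum(up{namespace="<install-namespace>", pod=~"virt-api-.*"}) or vector(0)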
Some time ago I found something similar and opened a GitHub issue for it, but it seems it was ignored and eventually closed. See https://github.com/kubevirt/kubevirt/issues/5383
We decided to fix this bug for 4.9.1.
I found that the issue is more complicated than we thought.

First, as I stated in the GitHub issue above (https://github.com/kubevirt/kubevirt/issues/5383), we use a labeldrop relabeling for the namespace label so that the namespace labels on workload metrics are kept as they are. When workload metrics carry the correct namespace (the namespace of the workload instead of the control plane), non-admin users in OpenShift can see them. This was implemented in the past with this PR: https://github.com/kubevirt/kubevirt/pull/3125 Since that PR, Prometheus does not add or overwrite namespace labels on KubeVirt's metrics, and workload metrics carry the correct namespaces (added by our control plane). As a side effect, the metrics generated by Prometheus itself, such as "up", have no namespace label at all, which broke the recording rules and alert definitions that depend on those metrics; this has to be fixed. I talked with the monitoring team and they said it is safe to use "honorLabels" now. I did some tests with it and it gives the expected result: all metrics have namespace labels (including "up"), and the workload metrics keep the correct namespace label. I propose to remove the labeldrop and use honorLabels to fix this issue.

Second, as you can see in the attached screenshot shared by Satya, the "up" metric appears twice per pod in the UI and in Prometheus: one series is UP and the other is DOWN. This is another issue we need to solve. When I checked the ServiceMonitor objects and the Prometheus configuration, I noticed that Prometheus adds all endpoints in the control plane's namespace twice, once for the kubevirt ServiceMonitor and once for the cluster-network-addons one. KubeVirt's endpoints serve HTTPS, the others serve HTTP, which is why we observe two series per endpoint, one UP and one DOWN (Prometheus cannot scrape an endpoint with the wrong protocol). I contacted the monitoring team and they found an issue in the upstream prometheus-operator repository. See https://github.com/prometheus-operator/prometheus-operator/issues/4325

Even when we fix the first issue, we will still observe wrong data in our recording rules and alerts because of that bug. The workaround is to use selector labels with non-empty values; for example, instead of prometheus.kubevirt.io: "", we can use prometheus.kubevirt.io: "true". A sketch combining both changes follows below.
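To illustrate both changes, here is a minimal sketch of a ServiceMonitor combining them (the object name, port name, and overall shape are assumptions for illustration, not the exact manifests from the PRs):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: kubevirt            # name is illustrative
    spec:
      selector:
        matchLabels:
          # Workaround for prometheus-operator issue 4325: a non-empty label
          # value so this selector only matches the intended Service.
          prometheus.kubevirt.io: "true"
      endpoints:
        - port: metrics         # port name is an assumption
          scheme: https
          # Keep the namespace label set by the scraped target instead of
          # letting Prometheus rewrite it; this replaces the previous
          # "action: labeldrop" relabeling for the namespace label.
          honorLabels: true

The Service objects selected by this monitor must carry the same non-empty label value (prometheus.kubevirt.io: "true") for the selector to match.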
Erkan, please implement this workaround.
I opened two PRs implementing the workaround for the prometheus-operator issue: https://github.com/kubevirt/kubevirt/pull/6652 https://github.com/kubevirt/cluster-network-addons-operator/pull/1053
CNAO stable branch PR: https://github.com/kubevirt/cluster-network-addons-operator/pull/1058
Fix merged downstream on CNAO branch cnv-4.9-rhel-8: https://code.engineering.redhat.com/gerrit/c/cluster-network-addons-operator/+/285141
Verified against 4.9.1: the following recording rules can be queried and show correct values: kubevirt_virt_api_up_total, kubevirt_virt_operator_up_total, kubevirt_virt_handler_up_total, kubevirt_virt_controller_up_total
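For reference, verification amounts to running each rule name as a query, either in the OpenShift console metrics page or against the Prometheus API; the expected value equals the number of reachable pods of the component (an assumption based on the rule sketch earlier in this bug):

    # Each query should return a single series; e.g. a value of 2 for
    # kubevirt_virt_api_up_total with two healthy virt-api replicas.
    kubevirt_virt_api_up_total
    kubevirt_virt_operator_up_total
    kubevirt_virt_handler_up_total
    kubevirt_virt_controller_up_total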
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.9.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:5091