Bug 2241904

Summary: Metric cnv:vmi_status_running:count shows no datapoints found
Product: Container Native Virtualization (CNV)
Reporter: Akriti Gupta <akrgupta>
Component: Metrics
Assignee: Assaf Admi <aadmi>
Status: CLOSED DUPLICATE
QA Contact: Natalie Gavrielov <ngavrilo>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.13.5
CC: aadmi, jvilaca, sradco, stirabos
Target Milestone: ---
Flags: akrgupta: needinfo+, akrgupta: needinfo+
Target Release: 4.13.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-10-05 10:53:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
cnv:vmi_status_running:count none

Description Akriti Gupta 2023-10-03 10:43:33 UTC
Created attachment 1991815
cnv:vmi_status_running:count

Description of problem: With VMs running on the cluster, the metric cnv:vmi_status_running:count fails to appear; no values are found.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a VM.
2. Check the metric cnv:vmi_status_running:count.

Actual results:
No Datapoints Found

Expected results:
The metric value shows the number of VMs running.

Additional info:

Comment 1 Assaf Admi 2023-10-04 08:25:48 UTC
Hi, using CNV v4.13.4, I don't encounter this issue. Once the VMs were created and started running, it took about 30 seconds for cnv:vmi_status_running:count to appear with the correct value. Prometheus evaluates rules every 1m by default, so the delay makes sense to me.

Akriti, any chance you evaluated cnv:vmi_status_running:count right after running the first VMs, without waiting long enough? 
If not, assuming you have a cluster with this issue, it would be really useful if you could attach the output of the following command:
"oc get prometheusrule prometheus-k8s-rules-cnv -n openshift-cnv -o yaml"

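For anyone triaging the same symptom, the requested output can also be checked programmatically. The sketch below is only an illustration: it assumes the object is fetched with "-o json" instead of "-o yaml" so that it parses with the standard library, and it relies on the usual PrometheusRule layout (spec.groups[].rules[] entries carrying "record" and "expr"). The helper names are hypothetical.

```python
import json


def find_recording_rule(prometheus_rule, record_name):
    """Return the expr of the recording rule called record_name, or None.

    prometheus_rule is the PrometheusRule object already parsed into a dict.
    Alerting rules (which have "alert" instead of "record") are skipped.
    """
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("record") == record_name:
                return rule.get("expr")
    return None


def find_recording_rule_in_json(text, record_name):
    """Same lookup, starting from the raw JSON text printed by oc."""
    return find_recording_rule(json.loads(text), record_name)
```

One way to use it is to pipe the output of "oc get prometheusrule prometheus-k8s-rules-cnv -n openshift-cnv -o json" into a script that calls find_recording_rule_in_json(sys.stdin.read(), "cnv:vmi_status_running:count"). A None result would mean the rule is missing from the object; an expression would mean the rule is defined, so the next thing to check is the metric it reads.
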
Comment 2 Assaf Admi 2023-10-04 08:30:14 UTC
Akriti, it would also be useful if you can specify the CNV version you encountered this issue with.

Comment 4 Assaf Admi 2023-10-04 11:11:41 UTC
The cnv:vmi_status_running:count recording rule expression is:
sum(kubevirt_vmi_phase_count{phase="running"}) by (node,os,workload,flavor)

I can now confirm that the kubevirt_vmi_phase_count metric is not working at all, and this breaks the cnv:vmi_status_running:count recording rule, which is built on it.
The issue was probably introduced in https://github.com/kubevirt/kubevirt/pull/10424. The first impacted version is v4.13.5.rhel9-20, according to http://cnv-version-explorer.apps.cnv2.engineering.redhat.com/?cPRs=10424.
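
To illustrate why a missing kubevirt_vmi_phase_count leaves the recording rule with no datapoints, here is a rough Python model of what the expression computes. It is a simplification rather than Prometheus code: samples are passed as (labels, value) pairs, the function name is hypothetical, and the label values used with it are made up.

```python
from collections import defaultdict

# Labels kept by the "by (node,os,workload,flavor)" clause.
GROUP_BY = ("node", "os", "workload", "flavor")


def vmi_status_running_count(samples):
    """Model of sum(kubevirt_vmi_phase_count{phase="running"}) by (...).

    samples is an iterable of (labels_dict, value) pairs for the source
    metric. A label that is absent on a sample is treated as an empty
    string, which is how PromQL groups on missing labels.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        if labels.get("phase") != "running":
            continue  # the selector drops every other phase
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        totals[key] += value
    return dict(totals)
```

With no input samples the function returns an empty dict, which corresponds to the "No Datapoints Found" result from the report: the rule has nothing to aggregate, so it emits nothing instead of a zero.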

Joao, any idea what could be the root cause?

Comment 5 Shirly Radco 2023-10-05 08:09:52 UTC
As part of the fix for this bug, please add an upstream test to verify that the metric exists and that its value is correct.
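
As a rough idea of what such a check has to assert (this is not the upstream test, only a sketch), the snippet below parses the JSON body that the Prometheus HTTP API returns for an instant query of cnv:vmi_status_running:count. It distinguishes "no datapoints" from a numeric total. The function name is hypothetical, and fetching the response from the cluster is left out.

```python
import json


def running_vmi_total(response_text):
    """Return the summed value of an instant-query result, or None if empty.

    response_text is the JSON body of a Prometheus /api/v1/query response
    whose result type is an instant vector. Each sample carries its value
    as [timestamp, "string"], so the string is converted to float.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    result = payload["data"]["result"]
    if not result:
        return None  # the metric has no datapoints at all
    return sum(float(sample["value"][1]) for sample in result)
```

A test built on this would first assert that the return value is not None (the metric exists) and then compare it with the number of VMs the test started.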

Comment 6 Assaf Admi 2023-10-05 10:53:59 UTC

*** This bug has been marked as a duplicate of bug 2240675 ***

Comment 7 Red Hat Bugzilla 2024-02-03 04:25:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days