Created attachment 1930043
vCPU number is not correct

Description of problem:
vCPU number is not correct in Virtualization -> Overview

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Visit Virtualization -> Overview
2.
3.

Actual results:

Expected results:

Additional info:
Hi Guohua, can you please specify why the vCPU number is incorrect? IMHO it seems to be correct and consistent with the displayed graph. The number you've marked in the attachment shows the maximum reached in the last day (24h?), not the current value, as you can see in the graph. WDYT? Thanks!
Hi Hilda, can you explain what the number means there? The number looks too big.
Hi Ronen, could you drop in and comment in the bug to tell us why you think the vCPU number does not make sense?
@gouyang I agree, the number is too big. The screenshot shows 3 VMs; even if they are large, with 16 CPUs each, the number should be 48 vCPUs, not over 3,000. @hstastna is it possible this is not vCPUs but millicores?
So I just found that the number is really buggy, but still not sure about the expected result. Phillip is the best person for that, he's gonna explain more and take this bug asap. So let's be patient now.
@rsdeor I apologize. I discovered the root issue after we discussed this bug previously and let it slip through the cracks. The problem is that I didn't realize the disconnect between what was expected in the design and what the metric being used provides. We don't currently have a metric that provides the number of vCPUs in use. We have two vCPU metrics that would work for the charts: kubevirt_vmi_vcpu_seconds and kubevirt_vmi_vcpu_wait_seconds. The vCPU seconds metric is what's used in the metric charts currently. We use the vCPU wait seconds metric in at least two charts in the kubevirt dashboard, so I think it's a good candidate for use in the metric charts card.

-------------------------------

kubevirt_vmi_vcpu_seconds [1]: Total amount of time spent in each state by each vcpu. Where id is the vcpu identifier and state can be one of the following: [OFFLINE, RUNNING, BLOCKED]. Type: Counter.

kubevirt_vmi_vcpu_wait_seconds [2]: Amount of time spent by each vcpu while waiting on I/O. Type: Counter.

[1] https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md#kubevirt_vmi_vcpu_seconds
[2] https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md#kubevirt_vmi_vcpu_wait_seconds
@sradco can you help us with which metric should be used here?
I believe that the query should be count (kubevirt_vmi_vcpu_seconds{state="running", namespace="<namespace>"})
We may also be able to use count(kubevirt_vmi_vcpu_wait_seconds{namespace="<namespace>"}). The query in the previous comment filters by state; we should check whether that filter is actually needed. The metric descriptions can be found here: https://github.com/kubevirt/kubevirt/blob/main/docs/metrics.md#kubevirt_vmi_vcpu_seconds
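The two candidate queries differ only in the metric name and the label matchers, so they can be generated from one small helper. The following is a minimal sketch, not code from the plugin: the function name `vcpu_count_query` and its parameters are hypothetical, and leaving out the namespace matcher to count across the whole cluster is an assumption about how the cluster-wide variant would be written.

```python
from typing import Optional


def vcpu_count_query(
    metric: str,
    namespace: Optional[str] = None,
    state: Optional[str] = None,
) -> str:
    """Build a PromQL count() query over a per-vCPU KubeVirt metric.

    Hypothetical helper for illustration only. Label values are inserted
    verbatim, so callers must pass values that need no escaping.
    """
    matchers = []
    if state is not None:
        matchers.append(f'state="{state}"')
    if namespace is not None:
        matchers.append(f'namespace="{namespace}"')
    # With no matchers, the selector is just the bare metric name.
    selector = "{" + ", ".join(matchers) + "}" if matchers else ""
    return f"count({metric}{selector})"


# Namespace-scoped query on the vCPU seconds metric, filtered by state.
seconds_query = vcpu_count_query(
    "kubevirt_vmi_vcpu_seconds", namespace="default", state="running"
)

# Query on the wait seconds metric with no namespace matcher.
wait_query = vcpu_count_query("kubevirt_vmi_vcpu_wait_seconds")
```

Building both variants from the same helper makes it easy to run them side by side and check whether the state filter changes the result.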
@rsdeor I'm not sure which metric you'd like to display. Here are the relevant details from the link Shirly provided.

kubevirt_vmi_vcpu_seconds: Total amount of time spent in each state by each vcpu. Where id is the vcpu identifier and state can be one of the following: [OFFLINE, RUNNING, BLOCKED]. Type: Counter.

kubevirt_vmi_vcpu_wait_seconds: Amount of time spent by each vcpu while waiting on I/O. Type: Counter.

I'm not sure which would be more important to the user. We display the wait_seconds metric in the Top Consumers dashboard, but I'm not aware of any place in the UI where the vcpu_seconds metric is being used. Thoughts?
@phbailey when Shirly and I did a quick test yesterday, count on both metrics returned the same result, which matched the number of vCPUs. When we used kubevirt_vmi_vcpu_seconds we only looked at state="running" (see comment #9).
@rsdeor Ah, ok. I didn't realize you looked at those together and considered them one and the same. It's odd that they would return the same value since they're supposed to be counting the seconds spent in different states and not the number of vCPUs. I assume I shouldn't update the axis and header labels to indicate a unit of seconds since the metrics don't appear to be returning seconds?
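A likely explanation for the matching results: PromQL's count() returns the number of time series a selector matches and ignores their values. If each vCPU contributes exactly one series to the selection, the result is the vCPU count regardless of how many seconds each counter has accumulated. The sketch below imitates this in plain Python; the sample series, VM names, and values are invented for illustration, and the one-series-per-vCPU layout is an assumption rather than something confirmed from the KubeVirt source.

```python
from typing import Dict, List, Tuple

# One sample = (label set, counter value in seconds). All data is made up.
Series = Tuple[Dict[str, str], float]

SAMPLES: List[Series] = [
    ({"name": "vm-a", "id": "0", "state": "running"}, 1512.4),
    ({"name": "vm-a", "id": "1", "state": "running"}, 1498.9),
    ({"name": "vm-b", "id": "0", "state": "running"}, 803.2),
    ({"name": "vm-b", "id": "1", "state": "running"}, 799.7),
]


def select(samples: List[Series], **matchers: str) -> List[Series]:
    """Keep the series whose labels equal every given matcher."""
    return [
        s for s in samples
        if all(s[0].get(key) == value for key, value in matchers.items())
    ]


def count(samples: List[Series]) -> int:
    """Like PromQL count(): number of series, values are ignored."""
    return len(samples)


def total(samples: List[Series]) -> float:
    """Like PromQL sum(): adds up the counter values (seconds)."""
    return sum(value for _, value in samples)


running = select(SAMPLES, state="running")
vcpu_count = count(running)     # one series per vCPU
seconds_total = total(running)  # accumulated seconds, grows over time
```

Under these assumptions the count stays at the number of vCPUs while the summed counter keeps growing, which is consistent with labeling the chart in vCPUs rather than seconds.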
@phbailey we only ran a very basic test to check this, so please double-check. In both cases we got the number of vCPUs, not seconds. Try running the queries in your environment and verify the results both for the entire cluster and for a namespace.
My tests confirmed your result for both cluster and namespace. The PR has already merged and the 4.12 backport has been opened: https://github.com/kubevirt-ui/kubevirt-plugin/pull/1032.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3205