Created attachment 1636494
screenshot

Description of problem:
When a VM is started and then stopped, the VM Dashboard Utilization card displays historical data for Memory, Filesystem and Network, but not for CPU.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-02-092336

How reproducible:
100%

Steps to Reproduce:
1. Start a VM
2. Stop the VM

Actual results:
CPU data is not available; the other telemetry data remains displayed.

Expected results:

Additional info:
Data points are retrieved from Prometheus. There is currently no suitable CPU metric for VMs, so a pod metric is reused here; matching is done via the launcher-pod name.

When the VM is off, the launcher pod is unknown and no CPU data points are matched.
Data points for the other metrics are looked up by VM name, so they stay available.

We can:
- disable the utilization card when the VM is off
- add a CPU metric for VMs
- document this state and provide a suitable tooltip (but still a bad user experience, imo)

I think the best option is the 2nd one, but it will take long to implement (cross-team).
With the 1st option we will lose just the historical data, as no new data points are coming in while the VM is off. So this sounds like the best option for now.
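To illustrate why only the CPU series disappears, here is a minimal sketch of the two lookup strategies described above. The label names (`pod`, `name`), the metric names and the sample pod name are hypothetical and used only for illustration; this is not the actual console or Prometheus query code.

```python
from typing import Dict, List, Optional

# One stored time series, reduced to its labels (label names are hypothetical).
Series = Dict[str, str]

STORED: List[Series] = [
    {"metric": "cpu", "pod": "virt-launcher-myvm-abcde"},  # pod-level metric
    {"metric": "memory", "name": "myvm"},                  # VM-level metric
    {"metric": "network", "name": "myvm"},                 # VM-level metric
]


def match_series(metric: str, label: str, value: Optional[str]) -> List[Series]:
    """Return stored series of `metric` whose `label` equals `value`.

    If the value to match on is unknown (None), nothing can be matched.
    """
    if value is None:
        return []
    return [s for s in STORED if s["metric"] == metric and s.get(label) == value]


def utilization(vm_name: str, launcher_pod: Optional[str]) -> Dict[str, List[Series]]:
    """CPU is matched by launcher-pod name, the other metrics by VM name."""
    return {
        "cpu": match_series("cpu", "pod", launcher_pod),
        "memory": match_series("memory", "name", vm_name),
        "network": match_series("network", "name", vm_name),
    }


running = utilization("myvm", "virt-launcher-myvm-abcde")
stopped = utilization("myvm", None)  # VM is off: launcher pod is unknown
```

With the VM stopped, `stopped["cpu"]` is empty while `stopped["memory"]` and `stopped["network"]` still return the historical series, which matches the behaviour reported in this bug.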
(In reply to Marek Libra from comment #1)
> Data points are retrieved from Prometheus. There is currently no suitable CPU
> metric for VMs, so a pod metric is reused here; matching is done via the
> launcher-pod name.

Francesco: I think you were exposing the VM metrics to Prometheus. Do you know if there is some place where the missing metrics are being tracked?

> When the VM is off, the launcher pod is unknown and no CPU data points are
> matched.
> Data points for the other metrics are looked up by VM name, so they stay
> available.
>
> We can:
> - disable the utilization card when the VM is off
> - add a CPU metric for VMs
> - document this state and provide a suitable tooltip (but still a bad user
> experience, imo)
>
> I think the best option is the 2nd one, but it will take long to implement
> (cross-team).
> With the 1st option we will lose just the historical data, as no new data
> points are coming in while the VM is off. So this sounds like the best option
> for now.
(In reply to Tomas Jelinek from comment #2)
> (In reply to Marek Libra from comment #1)
> > Data points are retrieved from Prometheus. There is currently no suitable CPU
> > metric for VMs, so a pod metric is reused here; matching is done via the
> > launcher-pod name.
>
> Francesco: I think you were exposing the VM metrics to Prometheus. Do you
> know if there is some place where the missing metrics are being tracked?

Indeed I was, but I am not working on this anymore. sgott knows who is taking care of metrics now.
That said, I'm not aware of any tracking re: missing metrics.
Please also note that AFAIK there are no gaps re: metrics with respect to oVirt.
sgott, can you please suggest a suitable metric for tracking VM CPU utilization? If it is missing, is there a place to track it? Can it be added in the 4.3 time frame?
(In reply to Marek Libra from comment #4)
> sgott, can you please suggest a suitable metric for tracking VM CPU utilization?
> If it is missing, is there a place to track it? Can it be added in the 4.3
> time frame?

Just curious, could you please elaborate on why
https://github.com/kubevirt/kubevirt/blob/master/pkg/monitoring/vms/prometheus/prometheus.go#L89
isn't good enough? It is supposed to provide the raw data from which we can compute the CPU utilization.
IIUC, the unit of that metric is time. Per the design, however, cores or millicores are expected to be shown, to stay aligned with the other resource pages in the UI (namely pods).
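For reference, a time-based counter can be turned into cores by taking its rate of increase: CPU-seconds consumed per wall-clock second equals the average number of cores in use. The sketch below assumes the samples are cumulative CPU-seconds (the exact unit of the KubeVirt metric should be checked); the function names are illustrative only. It is similar in spirit to what the PromQL `rate()` function does.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, cumulative_cpu_seconds); the unit is an assumption
Sample = Tuple[float, float]


def cpu_cores(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Convert a cumulative CPU-time counter into average cores per interval."""
    usage = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or v1 < v0:
            # Skip out-of-order samples and counter resets (e.g. VM restart).
            continue
        usage.append((t1, (v1 - v0) / dt))
    return usage


def to_millicores(cores: float) -> float:
    """1 core == 1000 millicores."""
    return cores * 1000.0


samples = [(0.0, 0.0), (60.0, 30.0), (120.0, 90.0)]
cores = cpu_cores(samples)
```

In this example the VM used 30 CPU-seconds in the first minute (0.5 cores, or 500 millicores) and 60 CPU-seconds in the second minute (1 core).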
The needinfo flag was accidentally removed by the previous comment; adding it back.
Strictly speaking, I'm not sure that it's possible to collect VM CPU utilization outside of how we already approach this--because we can't guarantee that a guest user agent will be present. Thus we cannot provide a CPU metric for VMs in general (as requested in Comment #1). Furthermore, adding a pod metric to a VM and calling it a VM metric is really not a reasonable course of action at the KubeVirt API level. In my opinion "disable utilization card when the VM is off" is really the best long term solution here.
There is an additional issue that can be fixed along with the original one:
Once the VM is stopped, no new data points are created. The last data point in the graph stays stuck at the value from the time the VM was stopped. As the time axis neither contains labels nor reflects reality, the value (Y-axis) never drops to 0, as one would intuitively expect.

Both issues can be fixed by:
- showing "No datapoints found." for a VM that is off
- dropping (filtering out) data points with timestamps older than the creation timestamp of the VirtualMachineInstance object
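A minimal sketch of the two proposed fixes, assuming data points arrive as (timestamp, value) pairs and that the VirtualMachineInstance creation timestamp is available as an RFC 3339 string in UTC (absent when the VM is off). The function names are hypothetical; this is not the actual console code.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple, Union

DataPoint = Tuple[float, float]  # (unix_timestamp_seconds, value)
NO_DATA_MESSAGE = "No datapoints found."


def parse_creation_timestamp(ts: str) -> float:
    """Parse a UTC timestamp such as '2019-11-02T09:23:36Z' into Unix seconds."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def visible_points(points: List[DataPoint],
                   vmi_created: Optional[str]) -> List[DataPoint]:
    """Keep only data points recorded since the current VMI was created."""
    if vmi_created is None:
        # No VirtualMachineInstance object exists, so the VM is off.
        return []
    created = parse_creation_timestamp(vmi_created)
    return [(t, v) for t, v in points if t >= created]


def card_content(points: List[DataPoint],
                 vmi_created: Optional[str]) -> Union[List[DataPoint], str]:
    """Return the points to plot, or the empty-state message."""
    shown = visible_points(points, vmi_created)
    return shown if shown else NO_DATA_MESSAGE


points = [(50.0, 1.0), (100.0, 2.0), (150.0, 3.0)]
running = card_content(points, "1970-01-01T00:01:40Z")  # VMI created at t=100
stopped = card_content(points, None)
```

Here `running` keeps only the points at t=100 and t=150, and `stopped` is the "No datapoints found." message, so stale values from a previous run are never plotted.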
Verified on 4.4.0-0.nightly-2020-02-22-102956.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581