Bug 2072821 - Top Consumers of Storage Traffic in Kubevirt Dashboard giving unexpected numbers
Summary: Top Consumers of Storage Traffic in Kubevirt Dashboard giving unexpected numbers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Metrics
Version: 4.10.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.12.0
Assignee: Assaf Admi
QA Contact: Ohad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-07 03:05 UTC by Germano Veit Michel
Modified: 2023-01-24 13:37 UTC (History)

Fixed In Version: 4.12.0-568
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:09 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   CNV-17463       0        None      None    None     2022-11-22 19:42:16 UTC
Red Hat Product Errata  RHSA-2023:0408  0        None      None    None     2023-01-24 13:37:30 UTC

Description Germano Veit Michel 2022-04-07 03:05:33 UTC
Description of problem:

This is in Observe -> Dashboards -> Kubevirt -> Top Consumers of Storage Traffic

Version-Release number of selected component (if applicable):
4.10.6 + CNV 4.10.0

How reproducible:
Always

Steps to Reproduce:
1. Create a few VMs, make them boot
2. Inside the VM, generate some traffic
   # dd if=/dev/urandom of=/home/testfile bs=1M count=3000 oflag=direct status=progress
3. Go to Observe -> Dashboards -> Kubevirt -> Top Consumers of Storage Traffic and monitor

Actual results:
* Numbers seem off, much lower than expected.
* Some VMs seem to be missing

Expected results:
* Actual VM storage IO

Comment 2 Krzysztof Majcher 2022-04-14 08:28:12 UTC
Assaf, please assess if this can be easily addressed.

Comment 3 Assaf Admi 2022-07-28 10:13:31 UTC
Hi,

Top Consumers of Storage Traffic in the Kubevirt Dashboard uses 4h as the time range in its query:

sort_desc(topk(5, sum(rate(kubevirt_vmi_storage_read_traffic_bytes_total[4h]) + rate(kubevirt_vmi_storage_write_traffic_bytes_total[4h])) by (namespace, name)))>0

The dashboard at the bottom of Virtualization/Overview, by contrast, uses 5m as the time range in its query:

sort_desc(topk(5, sum(rate(kubevirt_vmi_storage_read_traffic_bytes_total[5m]) + rate(kubevirt_vmi_storage_write_traffic_bytes_total[5m])) by (namespace, name))) > 0


I believe this is why the numbers on Top Consumers of Storage Traffic were low after generating some traffic: its query computes the rate over a much longer time range, so a short burst of I/O is averaged out. When I changed its time range to 5m, I saw identical values.
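
As a rough illustration of why the window length matters, the following Python sketch approximates rate() as the counter increase divided by the window length. It assumes the whole 3000 MiB written by the dd command in the reproduction steps falls inside both windows and that the VM is otherwise idle. It ignores Prometheus extrapolation and counter resets, and the function name average_rate is made up for this example.

```python
MIB = 1024 * 1024


def average_rate(counter_increase_bytes: int, window_seconds: int) -> float:
    """Simplified stand-in for rate(): counter increase divided by window length."""
    return counter_increase_bytes / window_seconds


# Bytes written by: dd bs=1M count=3000
burst_bytes = 3000 * MIB

rate_5m = average_rate(burst_bytes, 5 * 60)
rate_4h = average_rate(burst_bytes, 4 * 60 * 60)

print(f"5m window: {rate_5m / MIB:.2f} MiB/s")
print(f"4h window: {rate_4h / MIB:.2f} MiB/s")
print(f"ratio: {rate_5m / rate_4h:.0f}x")
```

Under these assumptions the same burst appears roughly 48 times smaller in the 4h panel than in the 5m panel, which would match the "much lower than expected" numbers in the report.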

Comment 4 Germano Veit Michel 2022-08-02 00:27:46 UTC
Hi Assaf,

Same here, if I set it to 5m it does look correct. Should it just be changed to 5m then? Averaging over 4h does not make much sense if the X axis is not in 4h increments by default.

Comment 5 Shirly Radco 2022-08-02 13:21:10 UTC
The idea with 4 hours in the table panels dashboard is to get the top 5 VMs that consume the most storage resources during this time period.
In the UI it is possible to change this by choosing a different "Period" from the drop down list.
For the line charts we do use 5m. 
We can set the default period to 5m instead of 4h, but when I discussed this with Ronen Sde-Or we thought it would make more sense to check for a longer time.
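
A minimal sketch of how the selected period can change which VMs rank highest, assuming two hypothetical VMs: "steady-vm" writes about 1 MiB/s continuously, while "bursty-vm" wrote 3000 MiB in the last five minutes and was otherwise idle. The function top_consumers only loosely mimics sort_desc(topk(k, ...)) > 0 and is not the dashboard's implementation.

```python
MIB = 1024 * 1024


def top_consumers(increase_by_vm, window_seconds, k=5):
    """Rank VMs by average rate over the window, dropping idle ones."""
    rates = {
        vm: increase / window_seconds
        for vm, increase in increase_by_vm.items()
        if increase > 0
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical counter increases seen inside each window.
increase_5m = {"steady-vm": 300 * MIB, "bursty-vm": 3000 * MIB}
increase_4h = {"steady-vm": 14400 * MIB, "bursty-vm": 3000 * MIB}

top_5m = top_consumers(increase_5m, 5 * 60)
top_4h = top_consumers(increase_4h, 4 * 60 * 60)

print("5m:", [vm for vm, _ in top_5m])
print("4h:", [vm for vm, _ in top_4h])
```

With these made-up numbers the bursty VM leads the 5m ranking and the steady VM leads the 4h ranking, so the two periods answer different questions.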

Comment 6 Germano Veit Michel 2022-08-02 21:14:51 UTC
Right, I see the point of using an average of 4h for that purpose.
However, the way the information is presented is not clear.

Perhaps if the Dashboard renamed the columns based on the period selected it would be clearer?

For example, perhaps something like this?

Top Consumers of Storage Traffic

Namespace         Virtual Machine       Average Storage Traffic Usage Over {period}
openshift-cnv     rhel8                 7.14 KiB

It's hard for the user to know what is being averaged and what is instant without inspecting the query.

Comment 7 Ronen 2022-08-04 14:22:42 UTC
I agree with Germano, we need to make sure the user understands the information.

Comment 8 Shirly Radco 2022-08-04 16:01:53 UTC
Unfortunately there is no support for dynamic headers in the OCP UI.
I did ask to add support for panel description, but I don't see that it was implemented yet.

Should we change the default to 5m like the line charts so that they are aligned for now?

Comment 9 Ronen 2022-08-04 16:08:03 UTC
Shirly, is there a limit on the timeframe we can show if we use 5m instead of 4h?
If there is no limit, let's change it to 5m so that it aligns.

Comment 10 Shirly Radco 2022-08-16 15:24:59 UTC
We should align the default period to 5m and add a suffix to the tables explaining that the data is calculated based on the selected period.

Comment 11 Assaf Admi 2022-09-28 08:29:16 UTC
It was decided to align both the Top-Consumers dashboard and the Virtualization/Overview dashboard on a 30-minute time range. This was implemented in the following PRs:
https://github.com/kubevirt-ui/kubevirt-plugin/pull/885
https://github.com/kubevirt/monitoring/pull/94

Comment 12 SATHEESARAN 2022-10-18 08:07:39 UTC
Tested with CNV-4.11.1-20, dashboard under 'Observe' -> Dashboards -> Kubevirt -> Top Consumers of Storage Traffic,
still shows the 'period' dropdown.

Then confirmed with Assaf and Oren that the fix is not yet available in CNV 4.11.1.
Moving this bug back to the ASSIGNED state.

Comment 13 SATHEESARAN 2022-10-18 09:51:50 UTC
The fix was not available with 4.11.1-20 and so clearing the FIXED-IN-VERSION field.
The bug is now retargeted for 4.12 as the original fix is available in upstream master.

Comment 15 Ohad 2022-12-01 15:39:42 UTC
Tried on CNV-v4.12.0-745 and still get the same bug.

Comment 16 Assaf Admi 2022-12-07 10:35:15 UTC
I tried on CNV 4.12.0, and it seems to me that the bug was fixed.
Virtualization -> Overview -> Top consumers -> Storage throughput shows numbers very similar to those in the KubeVirt / Infrastructure Resources / Top Consumers dashboard (Top Consumers of Storage Traffic graph).
In addition, I could see in the cluster the changes that were done in the PRs that fixed the issue:
- I was able to see what https://github.com/kubevirt/monitoring/pull/94 changed in KubeVirt / Infrastructure Resources / Top Consumers dashboard
- I was able to see what https://github.com/kubevirt-ui/kubevirt-plugin/pull/885 changed in Virtualization -> Overview -> Top consumers. 

Ohad, can you please share the steps you used to verify this bug?

Comment 17 Krzysztof Majcher 2022-12-12 15:10:17 UTC
Moving to 4.12.1, as I doubt this is a blocker bug for 4.12.0.

Comment 18 Ohad 2022-12-13 11:40:15 UTC
Tested now on CNV 4.12; the bug is fixed.

Comment 22 errata-xmlrpc 2023-01-24 13:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

