1108144 – [RFE][Monitoring] Improve VM network usage report accuracy

Summary: [RFE][Monitoring] Improve VM network usage report accuracy

Keywords:
Status:	CLOSED DUPLICATE of bug 1349309
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine-dwh
Sub Component:
Version:	3.4.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Shirly Radco
QA Contact:	Pavel Stehlik
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-06-11 13:09 UTC by Amador Pahim
Modified:	2019-04-28 09:25 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-03 15:49:42 UTC
oVirt Team:	Metrics
Target Upstream Version:
Embargoed:
Flags:	sherold: Triaged+

Attachments
network usage stats propagation (22.71 KB, image/png) 2014-06-11 13:10 UTC, Amador Pahim

Description Amador Pahim 2014-06-11 13:09:39 UTC

The VM network usage information in the Admin Portal has several gaps that make it outdated or inaccurate:

- Every 5 seconds, vdsm calculates the percentage of network bandwidth used (relative to the vnic speed) within that period.

- The RHEV backend polls vdsm every 15 seconds, reads the current vdsm value for network usage, and updates the database.
** Here is the first gap: the backend skips 2/3 of the vdsm samples, and the data is not cumulative.

- The Admin Portal refreshes the network usage bar from the database every 5 seconds (default) if the browser has focus, or every 60 seconds if it does not.
** Here is the second gap: the Admin Portal, when in focus, reads the same information from the database 3 times, while the actual value has already changed on the host.
** And the third gap: the Admin Portal, when not in focus, takes 60 seconds to check the database. In the worst case the Admin Portal shows information that is 80 seconds old (vdsm[5] + backend[15] + Admin Portal[60]).
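
The first gap and the worst-case age can be illustrated with a small simulation. This is only a sketch: the sample values are made up, and the assumption that the backend keeps just the latest 5-second value at each 15-second poll is taken from the description above, not from the vdsm or engine code.

```python
# Intervals quoted in the description (seconds).
VDSM_INTERVAL = 5
ENGINE_INTERVAL = 15
PORTAL_UNFOCUSED_INTERVAL = 60

# Hypothetical per-5-second usage percentages computed by vdsm over one minute.
# The bursts (90, 85, 70) happen to fall between backend polls.
samples = [0, 90, 0, 0, 85, 0, 0, 70, 0, 0, 0, 0]


def engine_view(values, vdsm_interval, engine_interval):
    """Return the values the backend records if it reads only the
    most recent vdsm sample once per backend interval."""
    step = engine_interval // vdsm_interval
    # The backend sees the sample that is current at each poll:
    # indexes step-1, 2*step-1, ...
    return values[step - 1::step]


seen = engine_view(samples, VDSM_INTERVAL, ENGINE_INTERVAL)
dropped = len(samples) - len(seen)
worst_case_age = VDSM_INTERVAL + ENGINE_INTERVAL + PORTAL_UNFOCUSED_INTERVAL

print("recorded by backend:", seen)
print("dropped samples:", dropped, "of", len(samples))
print("worst-case age in seconds:", worst_case_age)
```

With these illustrative values every burst is dropped and the database only ever holds 0%, which matches the kind of display shown in the attached image.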

Note that in the attached image the Admin Portal (without focus) always shows 0%, even though there was high network load during a certain period.

Please consider having vdsm report the total number of transferred bytes (with a timestamp) and recording it in the database, and changing the frontend/REST API to calculate the network usage percentage between its own checks using the new database information.
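
A minimal sketch of the proposed calculation, assuming the database stores a cumulative byte counter and a timestamp per vnic. The `CounterReading` record and its field names are hypothetical and are not part of any existing vdsm or engine API.

```python
from dataclasses import dataclass


@dataclass
class CounterReading:
    """Hypothetical record: cumulative bytes and the time they were sampled."""
    total_bytes: int   # bytes transferred since the counter started
    timestamp: float   # seconds


def usage_percent(prev, curr, link_speed_mbps):
    """Average link usage between two readings, as a percentage of the
    configured vnic speed (given in megabits per second)."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("readings must be in increasing time order")
    bits_sent = (curr.total_bytes - prev.total_bytes) * 8
    capacity_bits = link_speed_mbps * 1_000_000 * elapsed
    return 100.0 * bits_sent / capacity_bits


# A consumer that checks only once a minute still sees the traffic that
# occurred in between, because the counter is cumulative.
earlier = CounterReading(total_bytes=0, timestamp=0.0)
later = CounterReading(total_bytes=750_000_000, timestamp=60.0)
print(usage_percent(earlier, later, link_speed_mbps=1000))  # about 10 percent
```

The point of the design is that each consumer (frontend, REST API, DWH) can pick its own check interval without losing traffic; a longer interval only gives a coarser average.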

Comment 1 Amador Pahim 2014-06-11 13:10:34 UTC

Created attachment 907672
network usage stats propagation

Comment 2 Dan Kenigsberg 2014-06-12 11:40:08 UTC

Amador, I do not believe that we can increase the rate at which Engine polls vdsm - we may choke the line and Engine itself. It does make sense to lower the vdsm sampling frequency so that it does not collect data that is bound to be dropped.

Comment 3 Amador Pahim 2014-06-12 11:51:02 UTC

(In reply to Dan Kenigsberg from comment #2)
> Amador, I do not believe that we can increase the vdsm polling rate - we may
> choke the line and Engine itself. It does make sense to lower the Vdsm
> period so that it does not poll data that is bound to be dropped.

Dan, I agree. I do not intend to increase the vdsm polling rate. I think vdsm should report the total of transmitted/received bytes and let the engine calculate the percentage by itself. This way, regardless of the engine polling rate, it will always be able to account for all the data at some point.

Comment 5 Lior Vernia 2014-08-13 09:07:27 UTC

Hi Amador,

I am still not clear on the issue. Two concrete questions:

1. What do you mean by the Admin Portal being updated only once every 60 seconds when it is not in focus? I think this is related either to your browser or to your OS; I don't think oVirt is aware in any way whether the focus is on its dialog.

2. What is the basic problem here? To my understanding, what you described means that the engine will update the data once every 15 seconds, and therefore reported data could be 15 seconds old. Is this the problem? Why does it matter, in the context of this bug, whether the engine calculates the rate or displays what's reported by vdsm?

Comment 6 Amador Pahim 2014-08-13 13:09:37 UTC

(In reply to Lior Vernia from comment #5)
> Hi Amador,
> 
> I am still not clear on the issue. Two concrete questions:
> 
> 1. What do you mean by the admin portal not being in focus being updated
> once every 60 seconds? I think this is related either to your browser or to
> your OS, I don't think oVirt is aware in any way whether the focus is on its
> dialog.

Makes sense. But this is not the issue here.

> 
> 2. What is the basic problem here? To my understanding, what you described
> means that the engine will update the data once every 15 seconds, and
> therefore reported data could be 15-second old. Is this the problem? Why
> does it matter, in the context of this bug, whether the engine calculates
> the rate or displays what's reported by vdsm?

The main issue is the high probability (2/3) that network traffic is not reported anywhere. vdsm calculates the network load average every 5 seconds. Since the backend collects this information every 15 seconds, 2 in 3 vdsm samples are simply dropped.

I see 3 possible solutions:
#1 Change vdsm to check the network load average every 15 seconds.
#2 Change backend to collect from vdsm every 5 seconds.
#3 Report total_bytes and timestamp and let frontend calculate the average by itself.

#3 is the most accurate, since #1 and #2 will only transfer the "drops" from vdsm<->backend to backend<->frontend (whenever the interval between frontend checks is greater than 5 seconds for #1 and 15 seconds for #2.)
With #3, we can have the average calculated by frontend, between its checks.

Comment 7 Dan Kenigsberg 2014-09-03 08:31:21 UTC

Amador, we have already performed your option #1 in http://gerrit.ovirt.org/21257 which is part of ovirt-3.4.

Could you ask the customer to set vm_sample_net_interval=15 in /etc/vdsm/vdsm.conf (and restart vdsm) to verify whether it fixes the issue?
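
For reference, a sketch of what the suggested setting looks like and how to check that it lines up with the 15-second engine poll mentioned earlier in this bug. Placing the option under a `[vars]` section is an assumption here; please check the section used by the installed vdsm version. The snippet parses an in-memory sample, not the real /etc/vdsm/vdsm.conf.

```python
import configparser

# Illustrative vdsm.conf content. The [vars] section name is an assumption;
# the option name and value come from the comment above.
SAMPLE_CONF = """
[vars]
vm_sample_net_interval = 15
"""

# Engine polling period quoted in the bug description (seconds).
ENGINE_POLL_SECONDS = 15

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONF)
interval = parser.getint("vars", "vm_sample_net_interval")

# When both periods match, no vdsm sample is computed only to be discarded.
matches_engine = interval == ENGINE_POLL_SECONDS
print("vm_sample_net_interval =", interval, "matches engine poll:", matches_engine)
```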

Comment 8 Leo Liu 2014-09-05 00:09:23 UTC

(In reply to Dan Kenigsberg from comment #7)
> Amador, we have already performed your option #1 in
> http://gerrit.ovirt.org/21257 which is part of ovirt-3.4.
> 
> Could you suggest the customer to set vm_sample_net_interval=15 in his
> /etc/vdsm/vdsm.conf (and restart vdsm) to verify if it fixes his issue?

Hello Dan,

Thanks for your update. 

I asked the customer to try this, but the issue is still there. Please see the customer's feedback:

The results of this test are unsatisfactory. I find that if I stream a 
large (5GB) file I do indeed now see some traffic showing up in the 
console display. However, for "real world" usage it still shows 0%.

Investigating further I find that the streaming test shows results for 
all hypervisors, regardless of whether or not the vm_sample_net_interval 
variable has been set. I will mention however that all the hypervisors 
have been upgraded since the last time I tried the same test, so 
something has been improved in more recent versions.

Asking the blindingly obvious: The CPU and memory displays show real 
data. Why is the same mechanism used to obtain that data not being used 
for the network interface?

What is your suggestion? Thanks!

Comment 9 Lior Vernia 2014-09-15 08:34:38 UTC

Hello Liu,

It's not actually the same mechanism for each of these measures. But it doesn't really matter, as the one used for network usage should work independently of the others :)

The difference between "real world" usage and large file streaming leads me to suspect that perhaps the speed of the VM interface is defined to be quite large, so that "real world" usage turns out to be negligible when computing percentage. Could you please corroborate/refute?
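
To make the suspicion above concrete, here is a rough calculation of how throughput maps to a percentage of link speed. The traffic figures are invented for illustration, and the truncation to a whole percent is an assumption about how the value might be displayed, not a confirmed behaviour of the engine.

```python
def percent_of_link(bytes_per_second, link_speed_mbps):
    """Throughput as a percentage of the configured interface speed
    (speed given in megabits per second)."""
    return 100.0 * bytes_per_second * 8 / (link_speed_mbps * 1_000_000)


# Hypothetical "real world" load: 0.5 MB/s on a 1 Gb interface.
light = percent_of_link(500_000, 1000)
# Hypothetical large file stream: 100 MB/s on the same interface.
heavy = percent_of_link(100_000_000, 1000)

# If the display truncates to whole percent (assumption), light traffic reads 0.
print(light, int(light))
print(heavy, int(heavy))
```

Under these assumptions anything below roughly 1.25 MB/s on a 1 Gb interface would round down to 0%, while a large stream would be clearly visible.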

Could you also attach the output of running either "vdsClient -s 0 getAllVmStats" or "vdsClient -s 0 getVdsStats" while the VM has some "real world" usage, but the engine shows nothing? This is to see if vdsm is indeed reporting some traffic that the engine is missing.

Lior.

Comment 10 Leo Liu 2014-10-15 00:04:20 UTC

(In reply to Lior Vernia from comment #9)
> Hello Liu,
> 
> It's not actually the same mechanism for each of these measures. But it
> doesn't really matter, as the one used for network usage should work
> independently of the others :)
> 
> The difference between "real world" usage and large file streaming leads me
> to suspect that perhaps the speed of the VM interface is defined to be quite
> large, so that "real world" usage turns out to be negligible when computing
> percentage. Could you please corroborate/refute?
> 
> Could you also attach the output of running either "vdsClient -s 0
> getAllVmStats" or "vdsClient -s 0 GetVdsStats" while the VM has some "real
> world" usage, but the engine shows nothing? This is to see if vdsm is indeed
> reporting some traffic that the engine is missing.
> 
> Lior.

Hello Lior,

Thanks for your update.

I asked the customer the same questions and got this feedback:

The speed of the interface is set to 1Gb. This is hardly "quite large". 
The reality of the matter is that the VM displays 0% but when it has 
significant traffic the corresponding hypervisor, which has a 10Gb 
interface, *does* display traffic.

The same issue has been asked about on the oVirt Users mailing list, so 
it's certainly not a problem unique to us.

Running your command may be great in theory but fails in practice. I am 
not about to monitor network traffic on the VMs in the hope of getting 
the timing for such a command just right.

Your opening sentence bothers me. Why isn't the same mechanism used? The 
other displays work as expected. The network display, using a different 
mechanism, doesn't work correctly. Surely that's a clue pointing to a 
solution?

An update on this one:

We have just created a new VM, which is functioning as a log collector. 
In that role there is very significant network traffic initially. I was 
able to capture the attached screenshots, which very clearly show that 
the network traffic shown for the VM is nothing like that shown for the 
host. Both have a 1Gb interface.

Although the traffic shown for the host was considerably higher seconds 
earlier, it had dropped down to only 10% by the time I captured the 
screenshot, which was taken just after capturing the one for the VM.

Please note that the VM in question is the only one on that host, so 
all traffic, other than the completely insignificant management 
overhead, is generated through the VM.

I think the fundamental issue is CPU and memory displays show real 
data. Why is the same mechanism used to obtain that data not being used 
for the network interface?

Please look into the issue and help it move forward. Thanks!

Leo

Comment 11 Dan Kenigsberg 2014-10-22 11:29:27 UTC

Please help us further debug the issue by supplying the output of

  vdsClient -s 0 getVdsStats

repeatedly, while the VM generates traffic. I'd like to see if Vdsm fails to report the vnic-specific data.

Comment 19 Yaniv Kaul 2015-11-17 13:16:50 UTC

1. I'm still unsure why this is on DWH?
2. I'm still unsure why the customer saw '0' in his tests. Makes little sense to me - unless there was really low BW.
3. As for missing 2/3 of the stats - I agree we need to think of a solution here, as we clearly lose some granularity - that may or may not be required. If it isn't, let's move to 15 seconds on the VDSM. If it is, I suggest that VDSM either averages the last 3 samples (of 5 seconds each) or sends 3 samples back to the engine every 15 seconds.
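
A sketch of the first suggestion in point 3 (averaging the last 3 samples of 5 seconds each), using a fixed-size window. The `RollingAverage` class is hypothetical and only illustrates the idea; it is not taken from vdsm.

```python
from collections import deque


class RollingAverage:
    """Keep the most recent `window` samples and report their mean."""

    def __init__(self, window=3):
        # A deque with maxlen discards the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def add(self, value):
        self._samples.append(value)

    def average(self):
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)


# Three hypothetical 5-second usage percentages within one 15-second poll.
avg = RollingAverage(window=3)
for sample in [0, 90, 0]:
    avg.add(sample)

reported = avg.average()
print(reported)  # 30.0 instead of the 0 that a "latest value only" read returns
```

The second suggestion (sending all 3 samples every 15 seconds) would keep the 5-second granularity at the cost of a larger report.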

Either way, not sure where the bug is. Dan - can you look into this?

Comment 20 Dan Kenigsberg 2015-11-17 16:10:21 UTC

We (Lior, Yaniv, Amador, me) spent quite a lot of time trying to fully understand this issue, and I am not sure we have.

But please note that with bug 1066570 verified in rhev-3.6 we have a much more reliable API. With cumulative report, a missed sample only means lost granularity.

I believe that DWH should be modified to consume the new API and compute the tx/rx rate as a derivative of the cumulative data.
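
As a rough illustration of "a missed sample only means lost granularity": the rate can be derived from consecutive cumulative readings, and skipping one reading merges two intervals without losing any bytes. The data below is invented, and counter resets (for example after a VM restart) are not handled in this sketch.

```python
def rates(readings):
    """Given time-ordered (timestamp_seconds, cumulative_bytes) pairs,
    return the average bytes/second for each consecutive pair."""
    result = []
    for (t0, b0), (t1, b1) in zip(readings, readings[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        result.append((b1 - b0) / (t1 - t0))
    return result


# Hypothetical cumulative counter sampled every 15 seconds.
full = [(0, 0), (15, 3000), (30, 33000), (45, 36000)]
# Same counter, but the poll at t=15 was missed.
missed = [full[0], full[2], full[3]]

print(rates(full))    # [200.0, 2000.0, 200.0]
print(rates(missed))  # [1100.0, 200.0]

# The total transferred is identical in both cases.
print(full[-1][1] - full[0][1], missed[-1][1] - missed[0][1])
```

The burst in the second interval is still accounted for when a poll is missed; it is only averaged over 30 seconds instead of 15.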

Comment 21 Shirly Radco 2015-11-18 07:44:39 UTC

We already added the collection of total RX/TX byte statistics to dwh and reports in 3.6.

https://bugzilla.redhat.com/show_bug.cgi?id=1215587
https://bugzilla.redhat.com/show_bug.cgi?id=1228991

In reports we added the fields to the ad hoc domain, and the user can create a calculated column based on the fields to calculate the amount of network used.

So this might be a duplicate.
Can we close it or is there anything else missing?

Comment 22 Yaniv Lavi 2015-11-19 15:54:01 UTC

(In reply to Shirly Radco from comment #21)
> We already added the Collection of total RX/TX byte statistics to dwh and
> reports in 3.6.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1215587
> https://bugzilla.redhat.com/show_bug.cgi?id=1228991
> 
> In reports we added the fields to the ad hoc domain and the user can create
> a calculated column based on the fields to calculate the amount of network
> used. 
> 
> So this might be duplicate.
> Can we close it or is there anything else missing?

This is about making the collection more accurate, and we will be looking to resolve this by providing a metrics store in RHEV 4.

Comment 23 Yaniv Lavi 2016-08-03 15:49:42 UTC


*** This bug has been marked as a duplicate of bug 1349309 ***
