Created attachment 912865 [details]
Image of Network indicator

Description of problem:
The "Network" indicator under the host tab is not giving a true representation of the percentage of the network in use. I have 2 bonded 1Gb interfaces. While I was transferring at a speed of approximately 970Mbps, the indicator showed 97% usage. It should show a little less than 50% if my NICs are bonded in mode 4.

Version-Release number of selected component (if applicable):
3.4.1

How reproducible:
100%

Steps to Reproduce:
1. Bond 2 or more NICs
2. Start a large data transfer, either by migrating a disk or via scp from the CLI
3. Watch the indicator

Actual results:
Network indicator shows incorrect data

Expected results:
Show the aggregate result from both NIC bond members

Additional info:
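Rough arithmetic behind the expected value, assuming the mode 4 bond aggregates both 1Gb links into roughly 2000Mbps of usable capacity: 970 / 2000 ≈ 48.5%, whereas measuring against a single 1Gb member gives 970 / 1000 = 97%, which is what the indicator shows.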
Created attachment 912940 [details]
Caching monitoring stats

The engine seems to be caching monitoring stats even across engine restarts.
More info from vdsClient -s 0 getVdsStats:

    ksmCpu = 0
    ksmPages = 64
    ksmState = False
    memAvailable = 26894
    memCommitted = 4161
    memFree = 27540
    memShared = 750366
    memUsed = '86'
    momStatus = 'active'
    netConfigDirty = 'False'
I agree with the reporter that the percentage under "usage" is misleading - to me, the usage percentage should mean total used bandwidth out of total available physical bandwidth. We currently display the utilisation of the most utilised interface - the original reason for this was to emphasize to the administrator that a "bottleneck" has been reached. Such a bottleneck might be less noticeable if it's only one of 4 interfaces, for example: utilisation would then show under 25%, although the strain on that specific network would be significant.

Nir, I would love to hear your opinion on this - should we show the actual network usage, which would be more accurate but less effective as an alarm for the administrator? Or maybe just change the title of the column?
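To make the trade-off concrete, here is a minimal sketch of the two definitions being discussed, assuming per-interface speed and current rate (both in Mbps) are available from the host statistics; the structure and values below are illustrative only, not the engine's actual fields:

    # Each entry: (speed_mbps, rate_mbps) for one reported interface.
    interfaces = [(1000, 970), (1000, 10)]  # e.g. two 1Gb links, one nearly saturated

    # Current behaviour: utilisation of the most utilised interface.
    most_utilised = max(rate / speed * 100 for speed, rate in interfaces)

    # Alternative: total used bandwidth out of total physical bandwidth.
    aggregate = (sum(rate for _, rate in interfaces)
                 / sum(speed for speed, _ in interfaces) * 100)

    print(round(most_utilised), round(aggregate))  # 97 vs 49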
Maurice, I would love to hear your opinion on this as well.
I think that the best way is to show the actual network usage and reflect the current status of the NICs. We need to think about the best way to display the information for bond and VLAN devices in the same manner. We already generate an event once a NIC exceeds the defined threshold, so I guess that could be used to alert on bottlenecks and have the user check/expand the link.
* VLAN devices - I would ignore these, as their speed doesn't mean anything anyway (so there's no reliable way to compute a percentage). Any traffic on a VLAN device should also be counted on its underlying bond/interface.

* Bonds - we already compute the bond's speed, supposedly correctly, as a function of the bonding mode and the underlying interfaces' speeds (see the sketch below). So I would take a bond into account as an interface, and ignore its underlying interfaces as independent devices.

Relying on the threshold event sounds good enough to me.
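For illustration, a rough sketch of how a bond's effective speed could be derived from its mode and its slaves' speeds. This is only an assumption about the usual convention (active-backup-style modes carry traffic on a single slave, aggregating modes on all of them), not the engine's actual code:

    # Hypothetical helper. Linux bonding modes 1 (active-backup) and
    # 3 (broadcast) effectively offer the capacity of a single link;
    # modes 0, 2, 4 (802.3ad), 5 and 6 aggregate the slaves' speeds.
    SINGLE_LINK_MODES = {1, 3}

    def bond_speed(mode, slave_speeds):
        if not slave_speeds:
            return 0
        if mode in SINGLE_LINK_MODES:
            return max(slave_speeds)
        return sum(slave_speeds)

    # Reporter's setup: mode 4 with two 1Gb slaves -> 2000 Mbps,
    # so ~970 Mbps of traffic is roughly 48.5%, not 97%.
    print(bond_speed(4, [1000, 1000]))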
Just some additional information: the calculation was performed under the assumption that the NICs are operating in full-duplex mode (as duplex isn't currently reported by VDSM, and half-duplex should be uncommon in a virtualization environment).
Dan, I would just like to verify that what I said in Comment 6 isn't nonsense.
So it seems I misread the original intent of the reporter. For that specific case, only a minor tweak on the engine side is required, to disregard the network usage of slave interfaces underlying a bond.

There's still an issue with displaying the "true" aggregate network usage of a host, since we can't really filter out interfaces that aren't connected to any network but are in the "up" state. That situation isn't uncommon, and counting such interfaces would usually lower the reported usage significantly, which would be misleading. So for now let's address only the smaller fix.
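A minimal sketch of what the smaller fix could look like, assuming the engine can tell which reported devices are bond slaves or VLANs; the field names and sample values below are made up for illustration:

    # Hypothetical reported devices: name, type flags, speed and rate in Mbps.
    devices = [
        {"name": "bond0",     "slave": False, "vlan": False, "speed": 2000, "rate": 980},
        {"name": "eth0",      "slave": True,  "vlan": False, "speed": 1000, "rate": 970},
        {"name": "eth1",      "slave": True,  "vlan": False, "speed": 1000, "rate": 10},
        {"name": "bond0.100", "slave": False, "vlan": True,  "speed": 0,    "rate": 500},
    ]

    def host_network_usage(devices):
        # Skip bond slaves (counted through their master) and VLANs (no
        # meaningful speed); keep bond masters and standalone NICs.
        relevant = [d for d in devices
                    if not d["slave"] and not d["vlan"] and d["speed"]]
        if not relevant:
            return 0
        return max(d["rate"] / d["speed"] * 100 for d in relevant)

    print(round(host_network_usage(devices)))  # 49 instead of the saturated slave's 97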
Comment 6 isn't nonsense, but I'm not sure it is directly related to the reporter's plight. The original issue stems from referring to the speed of a slave (which is saturated) instead of the aggregated bond speed. The total speed of a network should be taken from the underlying bond (if there is one) or NIC (if not), or an utterly fake value (for a nicless network). Note that a slave may be saturated while the bond master is not (e.g. a single TCP connection hogging a slave), and this condition should probably be reported to the user.
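On that last point, a hedged sketch of how a saturated slave could be flagged even when the bond's aggregate usage looks fine; the threshold value and data layout are illustrative only:

    # Hypothetical check: report slaves that are saturated while the bond
    # master is not (e.g. a single TCP connection hogging one slave).
    THRESHOLD_PCT = 95  # illustrative threshold

    def saturated_slaves(bond):
        bond_usage = bond["rate"] / bond["speed"] * 100
        if bond_usage >= THRESHOLD_PCT:
            return []  # the bond itself already trips the normal alert
        return [(s["name"], round(s["rate"] / s["speed"] * 100))
                for s in bond["slaves"]
                if s["rate"] / s["speed"] * 100 >= THRESHOLD_PCT]

    bond0 = {"speed": 2000, "rate": 980,
             "slaves": [{"name": "eth0", "speed": 1000, "rate": 970},
                        {"name": "eth1", "speed": 1000, "rate": 10}]}
    print(saturated_slaves(bond0))  # [('eth0', 97)]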
Verified on - oVirt Engine Version: 3.5.0-0.0.master.20140821064931.gitb794d66.el6
oVirt 3.5 has been released and should include the fix for this issue.