Created attachment 912865 [details]
Image of Network indicator

Description of problem:
The "Network" indicator under the host tab is not giving a true representation of the percentage of the network in use. I have 2 bonded 1Gb interfaces. While I was transferring at a speed of approximately 970Mbps, the indicator showed 97% usage. It should show a little less than 50% if my NICs are bonded in mode 4.

Version-Release number of selected component (if applicable):
3.4.1

How reproducible:
100%

Steps to Reproduce:
1. Bond 2 or more NICs
2. Start a large data transfer, either by migrating a disk or via scp from the CLI
3. Watch the indicator

Actual results:
Network indicator shows incorrect data

Expected results:
Show the aggregate result from both NIC bond members

Additional info:
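Rough arithmetic behind the expected value, assuming the mode 4 bond aggregates both 1Gb links into roughly 2000Mbps of usable capacity: 970 / 2000 ≈ 48.5%, whereas measuring against a single 1Gb member gives 970 / 1000 = 97%, which is what the indicator shows.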
Created attachment 912940 [details]
Caching monitoring stats

The engine seems to be caching monitoring stats even across engine restarts.
More info from vdsClient -s 0 getVdsStats:

    ksmCpu = 0
    ksmPages = 64
    ksmState = False
    memAvailable = 26894
    memCommitted = 4161
    memFree = 27540
    memShared = 750366
    memUsed = '86'
    momStatus = 'active'
    netConfigDirty = 'False'
I agree with the reporter that the percentage under "usage" is misleading - to me, the usage percentage should mean total used bandwidth out of total available physical bandwidth. We currently display the utilisation of the most utilised interface - the original reason for this was to emphasize to the administrator that a "bottleneck" has been reached. Such a bottleneck might be less noticeable if it's only one of 4 interfaces, for example: utilisation would then show under 25%, although the strain on that specific network would be significant.

Nir, I would love to hear your opinion on this - should we show the actual network usage, which would be more accurate but less effective as an alarm for the administrator? Or maybe just change the title of the column?
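To make the trade-off concrete, here is a minimal sketch of the two definitions being discussed, assuming per-interface speed and current rate (both in Mbps) are available from the host statistics; the structure and values below are illustrative only, not the engine's actual fields:

    # Each entry: (speed_mbps, rate_mbps) for one reported interface.
    interfaces = [(1000, 970), (1000, 10)]  # e.g. two 1Gb links, one nearly saturated

    # Current behaviour: utilisation of the most utilised interface.
    most_utilised = max(rate / speed * 100 for speed, rate in interfaces)

    # Alternative: total used bandwidth out of total physical bandwidth.
    aggregate = (sum(rate for _, rate in interfaces)
                 / sum(speed for speed, _ in interfaces) * 100)

    print(round(most_utilised), round(aggregate))  # 97 vs 49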
Maurice, I would love to hear your opinion on this as well.
I think that the best way is to show the actual network usage and reflect the current status of the NICs. We need to think about the best way to display the information for bond and VLAN devices in the same manner. We already generate an event once a NIC exceeds the defined threshold, so I guess that could be used to alert on bottlenecks and have the user check/expand the link.
* VLAN devices - I would ignore these, as their speed doesn't mean anything anyway (so there's no reliable way to compute a percentage). Any traffic on a VLAN device should also be counted on its underlying bond/interface.

* Bonds - we already compute the bond's speed, supposedly correctly, as a function of the bonding mode and the underlying interfaces' speeds (see the sketch below). So I would take a bond into account as an interface, and ignore its underlying interfaces as independent devices.

Relying on the threshold event sounds good enough to me.
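For illustration, a rough sketch of how a bond's effective speed could be derived from its mode and its slaves' speeds. This is only an assumption about the usual convention (active-backup-style modes carry traffic on a single slave, aggregating modes on all of them), not the engine's actual code:

    # Hypothetical helper. Linux bonding modes 1 (active-backup) and
    # 3 (broadcast) effectively offer the capacity of a single link;
    # modes 0, 2, 4 (802.3ad), 5 and 6 aggregate the slaves' speeds.
    SINGLE_LINK_MODES = {1, 3}

    def bond_speed(mode, slave_speeds):
        if not slave_speeds:
            return 0
        if mode in SINGLE_LINK_MODES:
            return max(slave_speeds)
        return sum(slave_speeds)

    # Reporter's setup: mode 4 with two 1Gb slaves -> 2000 Mbps,
    # so ~970 Mbps of traffic is roughly 48.5%, not 97%.
    print(bond_speed(4, [1000, 1000]))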
Just some additional information: the calculation was performed under the assumption that the NICs are operating in full-duplex mode (as duplex isn't currently reported by VDSM, and half-duplex should be uncommon in a virtualization environment).
Dan, I would just like to verify that what I said in Comment 6 isn't nonsense.
So it seems I misread the original intent of the reporter. For that specific case, only a minor tweak on the engine side is required, to disregard the network usage of slave interfaces underlying a bond.

There's still an issue with displaying the "true" aggregate network usage of a host, since we can't really filter out interfaces that aren't connected to any network but are in the "up" state. That situation isn't uncommon, and counting such interfaces would usually lower the reported usage significantly, which would be misleading. So for now let's address only the smaller fix.
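A minimal sketch of what the smaller fix could look like, assuming the engine can tell which reported devices are bond slaves or VLANs; the field names and sample values below are made up for illustration:

    # Hypothetical reported devices: name, type flags, speed and rate in Mbps.
    devices = [
        {"name": "bond0",     "slave": False, "vlan": False, "speed": 2000, "rate": 980},
        {"name": "eth0",      "slave": True,  "vlan": False, "speed": 1000, "rate": 970},
        {"name": "eth1",      "slave": True,  "vlan": False, "speed": 1000, "rate": 10},
        {"name": "bond0.100", "slave": False, "vlan": True,  "speed": 0,    "rate": 500},
    ]

    def host_network_usage(devices):
        # Skip bond slaves (counted through their master) and VLANs (no
        # meaningful speed); keep bond masters and standalone NICs.
        relevant = [d for d in devices
                    if not d["slave"] and not d["vlan"] and d["speed"]]
        if not relevant:
            return 0
        return max(d["rate"] / d["speed"] * 100 for d in relevant)

    print(round(host_network_usage(devices)))  # 49 instead of the saturated slave's 97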
Comment 6 isn't nonsense, but I'm not sure it is directly related to the reporter's plight. The original issue stems from referring to the speed of a slave (which is saturated) instead of the aggregated bond speed. The total speed of a network should be taken from the underlying bond (if there is one) or NIC (if not), or an utterly fake value (for a nicless network). Note that a slave may be saturated while the bond master is not (e.g. a single TCP connection hogging a slave), and this condition should probably be reported to the user.
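On that last point, a hedged sketch of how a saturated slave could be flagged even when the bond's aggregate usage looks fine; the threshold value and data layout are illustrative only:

    # Hypothetical check: report slaves that are saturated while the bond
    # master is not (e.g. a single TCP connection hogging one slave).
    THRESHOLD_PCT = 95  # illustrative threshold

    def saturated_slaves(bond):
        bond_usage = bond["rate"] / bond["speed"] * 100
        if bond_usage >= THRESHOLD_PCT:
            return []  # the bond itself already trips the normal alert
        return [(s["name"], round(s["rate"] / s["speed"] * 100))
                for s in bond["slaves"]
                if s["rate"] / s["speed"] * 100 >= THRESHOLD_PCT]

    bond0 = {"speed": 2000, "rate": 980,
             "slaves": [{"name": "eth0", "speed": 1000, "rate": 970},
                        {"name": "eth1", "speed": 1000, "rate": 10}]}
    print(saturated_slaves(bond0))  # [('eth0', 97)]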
Verified on - oVirt Engine Version: 3.5.0-0.0.master.20140821064931.gitb794d66.el6
oVirt 3.5 has been released and should include the fix for this issue.