Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 996678

Summary:	[Monitoring] Engine not reporting the data used to generate the Network Load Graph.
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Udayendu Sekhar Kar <ukar>
Component:	ovirt-engine-webadmin-portal	Assignee:	Lior Vernia <lvernia>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Meni Yakove <myakove>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.2.0	CC:	acathrow, asegundo, audgiri, bazulay, danken, ecohen, iheim, lpeer, lvernia, masayag, mpavlik, myakove, Rhev-m-bugs, rpai, ukar, yeylon, ylavi
Target Milestone:	---	Keywords:	Triaged
Target Release:	3.4.0
Hardware:	x86_64
OS:	Linux
Whiteboard:	network
Fixed In Version:	ovirt-3.4.0-alpha1	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2014-07-10 13:00:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Network	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	675560, 1007860
Bug Blocks:

Description Udayendu Sekhar Kar 2013-08-13 17:00:41 UTC

Description of problem:
Admin Portal (and REST API) not reporting the data used to generate the network graph. The Network Load Graph in Administrator Portal is not reflecting the biggest value from "txRate" or "rxRate" from all interfaces listed, bond interfaces included. RHEV Administrator Portal only shows Tx and Rx for physical devices.

Version-Release number of selected component (if applicable):
RHEVM 3.2

How reproducible:
100%

Steps to Reproduce:
1. In RHEV 3.2 use 10Gbps network infrastructure.
2. Install the required guest tool to show the %network utilization in the rhevm GUI through webadmin portal.
3. In the rhevm GUI the value of %network utilization is 10 times the actual use. If the actual use is 2%, the GUI shows 20%.

Actual results:
It's not showing the actual utilization of the network.

Expected results:
It should show the actual utilisation of the network in the rhevm GUI.

Comment 1 Udayendu Sekhar Kar 2013-08-13 17:04:48 UTC

Considering the data below:

# grep network vdsClient_-s_0_getVdsStats
	network = {'bond4': {'macAddr': '', 'name': 'bond4', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond0': {'macAddr': '', 'name': 'bond0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '2.1', 'rxRate': '2.0', 'txErrors': '0', 'state': 'up', 'speed': '1000', 'rxDropped': '68602'}, 'bond1': {'macAddr': '', 'name': 'bond1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond2': {'macAddr': '', 'name': 'bond2', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond3': {'macAddr': '', 'name': 'bond3', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'eth1': {'macAddr': '', 'name': 'eth1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'up', 'speed': '10000', 'rxDropped': '68602'}, 'eth0': {'macAddr': '', 'name': 'eth0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.2', 'rxRate': '0.2', 'txErrors': '0', 'state': 'up', 'speed': '10000', 'rxDropped': '0'}}

The Network Load Graph in Administrator Portal is not reflecting the biggest value from "txRate" or "rxRate" from all interfaces listed, bond interfaces included. RHEV Administrator Portal only shows Tx and Rx for physical devices (eth0 and eth1 in this case).

As per the above data, we have:

bond4
- txRate: 0.0
- rxRate: 0.0
bond0
- txRate: 2.1
- rxRate: 2.0
bond1
- txRate: 0.0
- rxRate: 0.0
bond2
- txRate: 0.0
- rxRate: 0.0
bond3
- txRate: 0.0
- rxRate: 0.0
eth1
- txRate: 0.0
- rxRate: 0.0
eth0
- txRate: 0.2
- rxRate: 0.2

And Admin Portal will show:
- Network Load Graph: 2% (reflecting bond0 txRate 2.1, truncated)
- eth0 Tx: < 1  (0.2 is less than 1%, no Mbps reported)
- eth0 Rx: < 1  (0.2 is less than 1%, no Mbps reported)
- eth1 Tx: < 1  (0.0 is less than 1%, no Mbps reported)
- eth1 Rx: < 1  (0.0 is less than 1%, no Mbps reported)

So, the issue here is Admin Portal (and REST API) not reporting the data used to generate the graph.
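
As a rough illustration of the point above (a sketch, not engine code), the following picks the interface with the largest txRate/rxRate from the getVdsStats data quoted in this comment. The helper name busiest_interface is hypothetical, the dictionary is trimmed to the fields used, and the truncation to an integer is assumed from the 2% shown in the graph.

```python
# Rates and states copied from the getVdsStats output above; other fields omitted.
network = {
    "bond4": {"txRate": "0.0", "rxRate": "0.0", "state": "down", "speed": "1000"},
    "bond0": {"txRate": "2.1", "rxRate": "2.0", "state": "up", "speed": "1000"},
    "bond1": {"txRate": "0.0", "rxRate": "0.0", "state": "down", "speed": "1000"},
    "bond2": {"txRate": "0.0", "rxRate": "0.0", "state": "down", "speed": "1000"},
    "bond3": {"txRate": "0.0", "rxRate": "0.0", "state": "down", "speed": "1000"},
    "eth1": {"txRate": "0.0", "rxRate": "0.0", "state": "up", "speed": "10000"},
    "eth0": {"txRate": "0.2", "rxRate": "0.2", "state": "up", "speed": "10000"},
}


def busiest_interface(stats):
    """Return (name, rate) for the interface with the largest txRate or rxRate."""
    peaks = {
        name: max(float(s["txRate"]), float(s["rxRate"]))
        for name, s in stats.items()
    }
    name = max(peaks, key=peaks.get)
    return name, peaks[name]


name, peak = busiest_interface(network)
graph_percent = int(peak)  # truncation is an assumption based on the 2% displayed
print(name, peak, graph_percent)
```

With this data the source of the graph value is bond0, which does not appear among the physical devices listed in the interfaces view.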

Comment 2 Amador Pahim 2013-08-13 17:28:53 UTC

Correction:

****
The Network Load Graph in Administrator Portal IS reflecting the biggest value from "txRate" or "rxRate" from all interfaces listed, bond interfaces included.
****

The graph is accurate. But if non-physical devices have txRate/rxRate greater than physical ones, we can't identify the graph source in Admin Portal/REST API.

Comment 3 Moti Asayag 2013-09-29 12:11:06 UTC

(In reply to Amador Pahim from comment #2)
> Correction:
> 
> ****
> The Network Load Graph in Administrator Portal IS reflecting the biggest
> value from "txRate" or "rxRate" from all interfaces listed, bond interfaces
> included.
> ****
> 
> The graph is accurate. But if non-physical devices have txRate/rxRate
> greater than physical ones, we can't identify the graph source in Admin
> Portal/REST API.

The non-physical devices (bridges/vlans) statistics aren't being reported by VDSM. See recently closed Bug 675560 - [vdsm] vdsm should monitor bond interfaces, sub-interfaces and bridges status.

Dan, could we consider reverting the decision not to report it ?

Comment 4 Dan Kenigsberg 2013-09-29 12:33:52 UTC

Sure. If there's a use case, that bug can be re-opened.

Comment 6 Moti Asayag 2014-02-17 21:11:20 UTC

After debugging the code, it seems that no extra work is needed on the engine side in order to persist the data and/or to reflect it to the user.

Meni, could you verify it on an environment with real data traffic?

Comment 7 Moti Asayag 2014-02-18 15:29:58 UTC

Trying to test this issue end-to-end (vdsm-to-restapi) didn't return the expected result:

All the non-physical devices reported TX/RX rate as 0 in the api.
The reason for it is their speed which isn't reported as part of the getVdsCaps as done for the nics.

As a result, the api refers to that unreported speed as 0, and returns the 0 value for those vlan devices.
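
One way such a zero could arise, sketched under the assumption that the API derives an absolute rate by scaling the reported percentage by the device speed. The function rate_to_mbps and this conversion are hypothetical and are not taken from the engine source.

```python
def rate_to_mbps(rate_percent, speed_mbps):
    """Hypothetical conversion from a percentage of link speed to Mbps.

    An unreported speed (None) is treated as 0, as described above.
    """
    speed = speed_mbps or 0
    return rate_percent / 100.0 * speed


nic_rate = rate_to_mbps(2.1, 1000)   # speed reported for a NIC
vlan_rate = rate_to_mbps(2.1, None)  # speed not reported for a vlan device
print(nic_rate, vlan_rate)
```

Whatever percentage vdsm reports, a speed of 0 collapses the result to 0.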

Reporting the speed of non-physical devices requires its own RFE and demands support on both the vdsm and the engine side.

However, the RX/TX values reported by vdsm for non physical interfaces are being stored in the engine database (in vds_interface_statistics table) and also being saved to the data-warehouse tables (dwh_host_interface_history_view).

I'm not sure if it is enough on its own to generate the required reports.
Yaniv could you elaborate about the usage of that table ? Will it be used for presenting the TX/RX of the host, regardless the type of the nic ?

Comment 8 Dan Kenigsberg 2014-02-18 16:22:34 UTC

A proper RFE would be bug 1066570: to dump "speed" altogether, and get everybody report not rxRate, but the actual rx_bytes (or KiB). The "speed" of a nic is a false entity, and it is even more so when it comes to vlan devices.

Until this happens, dwh should not use the static speeds reported by getVdsCaps - these are expected to be out of date. For example, bond devices may change their speeds as slaves come and go.

Comment 11 Dan Kenigsberg 2014-04-10 15:25:35 UTC

Yaniv, the question for you of comment 7 has been cleared by mistake in comment 10.

Is current db content (with speeds set to 0 for virtual devices) enough to generate the network graph?

Reporting speeds for virtual devices is not expected in 3.4, so if current db content (with speeds set to 0 for virtual devices) is not enough, this issue would have to wait.

Comment 12 Yaniv Lavi 2014-04-10 15:44:35 UTC

(In reply to Moti Asayag from comment #7)
> Yaniv could you elaborate about the usage of that table ? Will it be used
> for presenting the TX/RX of the host, regardless the type of the nic ?

Yes, it is used to show NIC usage, and the data needs to be reliable. This view is pulled every minute into the history db and then reported to users as host usage in different reports.

Comment 13 Dan Kenigsberg 2014-04-10 17:52:32 UTC

Is current db content (with speeds set to 0 for virtual devices) enough to generate the network graph?

Comment 14 Yaniv Lavi 2014-04-22 08:31:03 UTC

(In reply to Dan Kenigsberg from comment #13)
> Is current db content (with speeds set to 0 for virtual devices) enough to
> generate the network graph?

I can read 0 or null or any other value, but the data will not be correct and that is a problem.

Comment 15 Lior Vernia 2014-06-25 20:15:33 UTC

Dan, the "network graph" that the GUI shows displays percentage, so there's no escaping relying on the reported NIC speed.

However, I don't understand why the percentage of network utilisation is currently calculated as the percentage of the most encumbered NIC - to me, network utilisation of a host means total transfer rate on top of physical NICs divided by the total speed of the physical NICs.

By the above definition, I don't understand why we ever marked this to be blocked on the VDSM reporting traffic on virtual NICs.

Nor does this depend on the mentioned RFE of course, as the total percentage can be calculated as a weighted (by NIC speed) sum of utilisation percentages on all NICs (as I understand, what VDSM reports as rate is percentage?).

Does this make sense so far?

The one thing I'm not sure about is how to consider both Tx and Rx. Since most network cables are duplex nowadays, shouldn't utilisation be calculated as (Rx + Tx) / (2 * Speed)? The reasonable alternative would be to calculate max(Rx/Speed, Tx/speed), which sounds to me more meaningful for users.
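
To make the alternatives concrete, here is a small sketch of the two Tx/Rx formulas and of the speed-weighted host total mentioned earlier in this comment. The function names and the sample numbers are illustrative assumptions, not existing engine code or measured data.

```python
def duplex_average(rx, tx, speed):
    """(Rx + Tx) / (2 * Speed): treats both directions as one shared pool."""
    return (rx + tx) / (2 * speed)


def busiest_direction(rx, tx, speed):
    """max(Rx/Speed, Tx/Speed): reports the more loaded direction."""
    return max(rx / speed, tx / speed)


def weighted_host_load(nics):
    """Speed-weighted average of per-NIC utilisation percentages.

    nics is a list of (percent, speed_mbps) pairs for physical NICs.
    """
    total_speed = sum(speed for _, speed in nics)
    if total_speed == 0:
        return 0.0
    return sum(percent * speed for percent, speed in nics) / total_speed


# Illustrative link: 900 Mbps received, 100 Mbps sent, on a 1000 Mbps NIC.
print(duplex_average(900, 100, 1000))
print(busiest_direction(900, 100, 1000))
# Illustrative host: a 10000 Mbps NIC at 2% and a 1000 Mbps NIC at 50%.
print(weighted_host_load([(2.0, 10000), (50.0, 1000)]))
```

For the sample link the first formula gives 0.5 while the second gives 0.9, which shows how much a nearly saturated receive path is hidden by averaging the two directions.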

Comment 16 Lior Vernia 2014-06-26 06:57:46 UTC

After some further reflection, there is another question with bonds. It MIGHT be better when calculating the total load on a host to take bonding into consideration, i.e. when physical NICs are bonded to calculate the load on the bond (different calculation for different bonding modes) and only include the load on the bond (and not on the underlying NICs) when calculating total load on the host.

The question is what users are looking for - if I bond two NICs in a mode where only one is active, and the one is fully utilised, is that 100% utilisation or 50% utilisation? It's 100% utilisation under the constraints I had put, but 50% when considering the physical resources available.
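
The two readings of that example can be written out as follows. This is a sketch with assumed values (two 1000 Mbps NICs, one of them active and saturated), not a proposal for how bonding modes should actually be handled.

```python
# Assumed bond of two 1000 Mbps NICs in a mode where only one slave is active.
speeds_mbps = [1000, 1000]
traffic_mbps = [1000, 0]
active = [True, False]

# Utilisation relative to the capacity the bonding mode actually allows.
active_capacity = sum(s for s, a in zip(speeds_mbps, active) if a)
util_under_constraints = sum(traffic_mbps) / active_capacity

# Utilisation relative to all physical capacity in the bond.
util_of_physical = sum(traffic_mbps) / sum(speeds_mbps)

print(util_under_constraints, util_of_physical)
```

The same traffic yields 1.0 under the first definition and 0.5 under the second.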

Comment 17 Dan Kenigsberg 2014-06-26 10:37:48 UTC

Lior, you are asking good questions, but I think that the proper answer is in bug 1066570: we should stop messing with percentages of ill-defined speeds. We should report the actually-consumed bandwidth - per nic, per vnic, per vm and per host - instead.

Beyond that, in each and every moment, we should report the possibly-changing speed estimation of each nic (vdsm would need to report half/full duplex, which it does not monitor today), based on which we can auto-scale our graphs.

Setting the speed of the most encumbered nic as the 100% is just another awkward facet of the problem. It was invented as a hack to ignore vNICs that are defined but are not carrying traffic (with Amos gone, there is no one left to blame).

P.s. Lior, it's not so clear why speed is required to produce the graph. Vdsm (unfortunately) reports percentages, which are stored in Engine DB, which could have been displayed as they are.

Comment 18 Lior Vernia 2014-06-26 10:51:34 UTC

I agree, I didn't mean that necessarily the engine needs to depend upon the speed reported by a NIC, just that "someone" has to refer to "some speed" in order to represent utilisation in percentage.

If we do want to report network load on a host (let's not discuss VMs here, as this bug deals with hosts) as a percentage, then I don't think Bug 1066570 is a pre-requisite to fix this bug.

I agree that there's no reason for the engine to recalculate anything - I would indeed suggest to calculate the load as a weighted sum of the percentages reported by VDSM (be they true or not).

What do you think?

Comment 19 Lior Vernia 2014-07-03 16:56:29 UTC

I suggest that for the time being we just calculate load the same way we have been (most encumbered NIC), but only consider physical NICs. To my understanding this would satisfy Udayendu's request that it be possible to retrace the graph back to the statistics displayed in the interfaces subtab.

But before this fix is implemented - Dan, do I understand correctly there is no significance to the speed reported by bonds and VLAN devices? Is it safe to just disregard virtual devices?
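
A minimal sketch of that proposal, using rates from comment 1. How the engine distinguishes physical NICs is not shown in this bug, so the set of physical names is simply passed in by the caller, and host_load is a hypothetical helper.

```python
# Rates taken from the getVdsStats output in comment 1 (other fields omitted).
stats = {
    "bond0": {"txRate": "2.1", "rxRate": "2.0"},
    "eth0": {"txRate": "0.2", "rxRate": "0.2"},
    "eth1": {"txRate": "0.0", "rxRate": "0.0"},
}


def host_load(stats, physical_nics):
    """Largest txRate/rxRate among the given physical NICs only."""
    return max(
        (
            max(float(stats[name]["txRate"]), float(stats[name]["rxRate"]))
            for name in physical_nics
            if name in stats
        ),
        default=0.0,
    )


load = host_load(stats, {"eth0", "eth1"})
print(load)
```

With virtual devices ignored, the value comes from eth0 and can be matched against the figures shown for that NIC.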

Comment 20 Lior Vernia 2014-07-07 12:18:31 UTC

Udayendu,

If I understand the issue correctly, the problem is that bond speeds were being reported incorrectly (1,000 Mbps though the underlying NICs' speed was 10,000), therefore the computed load on the bond seemed to be much larger than it actually was.

Bug 1007860 solved this for 3.4 - the speed of a bond should be correctly related to its underlying NICs. I think this should eliminate the difficulty of retracing the NIC causing the displayed load graph. Do you agree?
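
The effect of the misreported bond speed can be checked with simple arithmetic. The 200 Mbps traffic figure below is an assumed example value, and utilisation_percent is a hypothetical helper.

```python
def utilisation_percent(traffic_mbps, speed_mbps):
    """Traffic as a percentage of the given link speed."""
    return 100.0 * traffic_mbps / speed_mbps


traffic = 200  # Mbps, assumed for illustration
with_reported_speed = utilisation_percent(traffic, 1000)   # bond speed as reported
with_actual_speed = utilisation_percent(traffic, 10000)    # underlying NIC speed
print(with_reported_speed, with_actual_speed)
```

The tenfold difference in speed produces the tenfold difference in displayed load (20% instead of 2%) described in the steps to reproduce.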

Comment 22 Lior Vernia 2014-07-09 06:55:44 UTC

Seems this specific issue should have indeed been solved by Bug 1007860.