Bug 980363 - Incorrect speed is displayed for the interface of the host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Edward Haas
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On: 1240719
Blocks:
 
Reported: 2013-07-02 07:35 UTC by GenadiC
Modified: 2019-05-16 13:08 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1240719 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:36:09 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
eth3 shows 0 speed while it is supposed to be 1000 (1.79 MB, image/png), posted 2013-07-02 07:35 UTC by GenadiC


Links
Red Hat Product Errata RHEA-2018:1488 (2018-05-15 17:38:11 UTC)
oVirt gerrit 28913, master, ABANDONED: engine: move speed property from VdsNetworkInterface to VdsNetworkStatistics
oVirt gerrit 82140, master, MERGED: net: Wait for link-up while query netinfo nic speed (2017-10-19 16:15:38 UTC)
oVirt gerrit 82199, master, MERGED: net: Extract nic speed handling to link.nic module (2017-10-02 20:41:38 UTC)

Description GenadiC 2013-07-02 07:35:00 UTC
Created attachment 767656 [details]
eth3 shows 0 speed while it is supposed to be 1000

Description of problem:
After attaching a network to the host NIC, a speed of 0 or 1 is displayed in the GUI and stored in the DB.
getVdsCaps and getVdsStats report the correct speed of 1000.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Create a network in the cluster
2. Attach the network to the host with SetupNetworks

Actual results:
The speed of the interface changes from 1000 to 0 or 1 in the DB and GUI

Expected results:
The speed should remain 1000

Additional info:
Output of getVdsStats for eth3:
'eth3': {'macAddr': '',
                            'name': 'eth3',
                            'rxDropped': '0',
                            'rxErrors': '0',
                            'rxRate': '0.0',
                            'speed': '1000',
                            'state': 'up',
                            'txDropped': '0',
                            'txErrors': '0',
                            'txRate': '0.0'}}

Comment 1 Dan Kenigsberg 2013-07-04 14:40:07 UTC
Could you attach relevant engine.log, or provide a hint on how this can be reproduced?

What happens when you refresh the host network info (now that we have a button for that)?

Comment 2 GenadiC 2013-07-08 10:26:01 UTC
There is no info in the engine log.
In the DB we see that the speed is incorrect (0 or 1), while getVdsStats and getVdsCaps show the correct speed of 1000.
Refreshing doesn't solve the issue either.

Comment 5 Moti Asayag 2013-09-09 08:04:44 UTC
In ovirt-engine-3.3 we introduced a new host-level action named 'refresh capabilities'. It refreshes the host capabilities and the actual network configuration.

Looking at Genadi's host, it appears that the immediate getCapabilities after the setupNetworks command reports the speed as '0', while a few seconds later it is reported as '1000'. This is a separate issue that can be investigated further on the VDSM side: how the speed is evaluated.

Using 'refresh capabilities' manually updates the speed of the host NICs correctly, and with Bug 999947 it will be done automatically.

Comment 6 Assaf Muller 2013-10-02 08:41:15 UTC
The bug occurs because, when (for example) deleting a network from a NIC, VDSM calls ifdown, removes the network configuration from the NIC, and then calls ifup. When ifup returns, VDSM reports that the setupNetworks command has finished and the engine sends a getVdsCaps.

The issue is that by the time the getVdsCaps reaches VDSM, the kernel still hasn't managed to bring the device up (even though ifup reported it was done), and when a device is down the speed is reported as 0.

In other words, it's a race condition.

Specifically for the reporting of speed, we could use alternative methods (currently we look at /sys/class/net/$device/speed, which is inaccessible when the device is down), but ethtool, for example, has its own issues.
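
For illustration, a minimal sketch of that sysfs read (hypothetical code, not VDSM's actual implementation); a down or still-negotiating device reads as -1, or the read fails outright:

def read_nic_speed(nic):
    """Return the link speed in Mbps, or 0 when the link is down/unreadable."""
    try:
        with open('/sys/class/net/%s/speed' % nic) as f:
            speed = int(f.read())
        # While the device is down or still negotiating, the kernel reports -1
        # (or the read fails outright on some drivers), so treat that as 0.
        return speed if speed > 0 else 0
    except (IOError, OSError, ValueError):
        return 0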

To actually solve the issue (which is broader than just the reporting of speed) we would have to make, as far as I can see, fairly drastic infrastructure changes, for example adding the ability for VDSM to push information to the engine. I suppose such a change will be considered for the next major release. For 3.4 the engine can poll VDSM every T seconds, as Moti has already suggested.

Comment 7 Jamie Bainbridge 2013-10-22 00:56:38 UTC
(In reply to Assaf Muller from comment #6)
> The bug is because...
> I suppose such a change will be considered for the next major release.

Ah, that's excellent, thank you very much for the root cause description.

Could we please log a new bug to make sure we investigate this race within the next major release? I didn't see a 4.0 in Version or Target so I'm not sure how to track that properly.

We should then be able to set this bug to CLOSED CURRENTRELEASE with the refresh fix provided by Bug 999947.

Comment 8 Assaf Muller 2014-02-19 16:43:50 UTC
Instead of closing this bug and opening another, I moved this bug's component to VDSM with a possible solution. Dan, Toni, if you dislike this solution, I think an alternative VDSM solution is the most feasible approach. If we fail to find a VDSM solution, we can close this bug and open another one with a different component and target release.

Here's a VDSM solution:

In VDSM's ifup function, after the ifup script returns, perform a 'wait' on the NIC. More specifically, poll its state in a busy loop with a reasonable timeout, and only exit the function after it is up (we know it should come up, because we just ifup'd it and ifup returned successfully). This will lengthen VDSM's ifup execution time by, say, a second, but it ensures that VDSM's ifup (and thus setupNetworks) returns only when the NIC is actually up, so the first getVdsCaps after setupNetworks finishes will have an UP NIC to sample. A rough sketch of the loop follows.
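
Something like this (a hypothetical illustration of the proposed wait, not an actual patch):

import time

def wait_for_link_up(nic, timeout=2.0, interval=0.1):
    """Poll the kernel's operstate until the NIC reports 'up' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open('/sys/class/net/%s/operstate' % nic) as f:
                if f.read().strip() == 'up':
                    return True
        except (IOError, OSError):
            pass  # the device may briefly disappear while being reconfigured
        time.sleep(interval)
    return False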

Comment 9 Antoni Segura Puimedon 2014-02-19 16:58:19 UTC
The only issue is that we have async ifupping, i.e., when the engine requests DHCP IP configuration for a network, we delegate bringing the interface up to another thread and return, so that the UI is not left waiting. IMHO, whatever the engine does in the DHCP case, it should do for the general case as well.

Comment 10 Assaf Muller 2014-02-19 17:12:51 UTC
When you do an ifup, does it return only after it gets a DHCP response, or after it sends the command to bring the NIC up but before the DHCP response arrives? I think that if ifup returns before it receives a DHCP response, we can switch over to synchronous ifups, add the loop that waits until the NIC is up, and then return.

Comment 12 Antoni Segura Puimedon 2014-02-19 17:32:43 UTC
We moved to async for ifup, i.e., we spawn a thread that will call dhcp and return immediately (probably even before dhclient has been called), so that the UI is more responsive. The pattern is roughly sketched below.

Moving everything to async and waiting is a possibility, of course, but that would undo the change vdsm was asked to make for UI responsiveness. All in all, the best option would be to make setupNetworks an asynchronous task and have vdsm mark it as done after the kernel has effectively brought up the device(s).
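
A minimal illustration of that async-ifup pattern (hypothetical code; _configure_and_ifup is a placeholder standing in for the real ifup path):

import threading

def _configure_and_ifup(nic):
    # Placeholder for the real work: write the ifcfg, run ifup/dhclient.
    pass

def ifup_async(nic):
    """Spawn a worker to bring the NIC up and return immediately,
    so the caller (and ultimately the UI) is not blocked on link negotiation."""
    t = threading.Thread(target=_configure_and_ifup, args=(nic,))
    t.daemon = True
    t.start()
    return t  # returns before the link is actually up: the race described above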

Comment 13 Dan Kenigsberg 2014-02-19 21:46:40 UTC
There's not much we can do within Vdsm if this is a DHCP-defined network. The engine may request that setupNetworks wait until the new device is up with blockingdhcp=True, or poll getVdsCaps until the change takes effect (or a timeout expires).

I've heard reports that even with blockingdhcp=True there's a race within the ifup script, which may return before the ifup'ed device is completely usable. In my opinion such an issue should be fixed within initscripts, iproute2, or the kernel, though we could hack around it with the tight loop described in comment 8.

Jamie, have you seen the issue happen with static addresses?
Can you reproduce this without Vdsm, i.e. by writing an ifcfg, calling ifup, and immediately reading /sys/class/net/DEVICE/speed? Something along the lines of the sketch below.
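
(A hypothetical standalone reproducer; the device name and sampling window are assumptions, and an ifcfg for the device is presumed to exist.)

import subprocess
import time

DEV = 'eth3'  # assumed test NIC

subprocess.check_call(['ifdown', DEV])
subprocess.check_call(['ifup', DEV])
# Sample the sysfs speed for ~3 seconds right after ifup returns.
for _ in range(30):
    try:
        with open('/sys/class/net/%s/speed' % DEV) as f:
            print(f.read().strip())  # expect -1 or 0 at first, then 1000
    except (IOError, OSError):
        print('speed unreadable while the link is down')
    time.sleep(0.1)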

Comment 14 Jamie Bainbridge 2014-02-19 23:33:22 UTC
I am not sure if this is limited to static or DHCP networks.

However, when calling ifup on a physical interface, it does take some time for /sys/class/net/DEVICE/speed to read the proper speed. On the interface I tried, the speed remained at "-1" for a second or two before going to "1000". I'd assume this is due to link-speed negotiation between the NIC and the switch port.

Comment 15 Moti Asayag 2014-02-20 21:24:15 UTC
This bug could be fixed on the engine side by using the speed that is now reported (since 3.4) for each device: as part of collecting the host's statistics, the engine should preserve the speed of the device.

Moving the bug back to the engine-backend component, where the fix should be done.

Comment 16 Moti Asayag 2014-05-22 19:37:19 UTC
The speed field should be moved from the vds_interface table to vds_statistics, which is updated via the getVdsStats verb, called every 15 seconds.

The data-warehouse tables/views should be updated accordingly.

Comment 17 Yaniv Lavi 2014-07-02 11:11:28 UTC
(In reply to Moti Asayag from comment #16)
> The speed field should be moved from the vds_interface table to
> vds_statistics, which is updated via the getVdsStats verb, called every 15
> seconds.
> 
> The data-warehouse tables/views should be updated accordingly.

Why do you want to move a slow-changing value (how many times a day do you change an interface's speed?) to a table that changes every 15 seconds? It makes no sense, and it will damage the data-warehouse collection process. Please consider another approach.

Fix the issue; don't do an ugly workaround just to get this value updated. It would still return 0 part of the time, which is incorrect.

Comment 18 Lior Vernia 2014-07-21 11:01:24 UTC
After an exhaustive design discussion, we will most likely ameliorate the situation by solving Bug 999947, hopefully soon.

This will mean that NICs won't stay at speed 0 for long (unless there's a real issue), but 0 will still be displayed while a NIC is being brought up. We can either close this bug or keep it open as a lower-priority bug to be solved further down the road (most likely on the vdsm side).

Comment 19 Barak 2015-05-04 09:49:56 UTC
Moti,

Do you think this could also be a candidate for a vdsm-engine event?

Comment 20 Barak 2015-05-04 09:54:49 UTC
In addition, can we avoid saving the speed into the DB when the value is 0?

Comment 21 Moti Asayag 2015-05-18 12:04:50 UTC
(In reply to Barak from comment #19)
> Moti,
> 
> Do you think this could also be a candidate for a vdsm-engine event?

Yes:

According to the vdsm developers' comments (comment 12 and comment 13), once ifup completes, the information about the interface's speed should be available.
The same 'ifup' that brings the interface up for the sake of obtaining an address from DHCP could be used here as well to report the entire interface configuration, so the engine would be able to update the interface on its side without issuing further calls to vdsm.

> In addition, can we avoid saving the speed into the DB when the value is 0?

Yes, there is no constraint on that field, and access to its value appears to be null-protected throughout; hence null might be a valid value for it.

If DWH expects it to always be non-null, that can be addressed by modifying dwh_host_interface_configuration_history_view.

Comment 22 Yaniv Lavi 2016-05-09 10:57:05 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 25 Dan Kenigsberg 2017-07-12 10:36:13 UTC
With https://gerrit.ovirt.org/#/c/70357/ merged, this should already be testable.

Comment 26 Edward Haas 2017-09-24 11:55:55 UTC
Could this issue be re-checked?

Comment 27 Michael Burman 2017-09-24 12:07:23 UTC
(In reply to Edward Haas from comment #26)
> Could this issue be re-checked?

1) The patch in comment #25 has nothing to do with this report.
2) The comments about dhcp and static bootproto are not related to the original report in any way.
3) The original report basically said that the UI and DB report speed '0' instead of speed 1000 (as reported in stats and caps) when attaching a network to a NIC (no bootproto is mentioned).

I can re-check that on current master we report speed 1000 in the UI and DB when the stats and caps report it properly.

Comment 28 Michael Burman 2017-09-24 12:43:01 UTC
This report does reproduce from time to time (every few attempts) when attaching a network to a NIC whose speed is 1000.

- When attaching the network, the speed changed from 1000 to 0 in the DB and was reported as N/A in the UI.
This is exactly what the reporter described, nothing more, nothing less. A manual refresh of capabilities does the trick, and the UI and DB report 1000 again as they should.

1) Starting point: NIC with speed 1000
2) Attach a network to the NIC (no bootproto)
3) Results:
UI - speed is N/A

DB - speed is zero

-[ RECORD 1 ]--------+-------------------------------------
id                   | 3ea316c2-036e-44f9-80f5-b671e64864a7
name                 | ens1f0
network_name         | sr-net
vds_id               | bcd15e3f-5ccb-4626-90b1-f65927ed9ae2
mac_addr             | 00:15:17:3d:cd:ce
is_bond              | f
bond_name            | 
bond_type            | 
bond_opts            | 
vlan_id              | 
speed                | 0

getStats - 
 }, 
        "ens1f0": {
            "rxErrors": "0", 
            "name": "ens1f0", 
            "tx": "2222022", 
            "txDropped": "0", 
            "sampleTime": 1506256842.166226, 
            "rx": "106918367", 
            "txErrors": "0", 
            "state": "up", 
            "speed": "1000", 
            "rxDropped": "221"

getCapabilities -
     "gateway": ""
        }, 
        "ens1f0": {
            "ipv6autoconf": false, 
            "addr": "", 
            "ipv6gateway": "::", 
            "dhcpv6": false, 
            "ipv6addrs": [], 
            "mtu": "1500", 
            "dhcpv4": false, 
            "netmask": "", 
            "ipv4defaultroute": false, 
            "ipv4addrs": [], 
            "hwaddr": "00:15:17:3d:cd:ce", 
            "speed": 1000, 
            "gateway": ""

Bottom line: nothing has changed since this bug was reported.

Comment 29 Yaniv Kaul 2017-10-26 11:36:15 UTC
Can this be moved to MODIFIED, or are additional patches needed?

Comment 30 Michael Burman 2017-10-26 12:02:28 UTC
The last patch that Edy provided for me has failed QA. I don't think he has added new patches since then.

Comment 31 Michael Burman 2017-10-26 12:03:48 UTC
(In reply to Michael Burman from comment #30)
> The last patch that Edy provided for me has failed QA. I don't think he
> has added new patches since then.

Sorry, please ignore this comment^^. I confused it with the bond speed bug.

Comment 32 rhev-integ 2017-11-02 13:39:38 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 34 rhev-integ 2017-11-02 21:09:04 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'vdsm'/Component 'ovirt-engine' mismatch]

For more info please contact: rhv-devops

Comment 35 Michael Burman 2017-11-07 09:23:49 UTC
Verified on - 4.2.0-0.4.master.el7 and vdsm-4.20.6-1.el7ev.x86_64

Comment 36 RHV bug bot 2017-12-06 16:17:16 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'vdsm'/Component 'ovirt-engine' mismatch]

For more info please contact: rhv-devops

Comment 37 RHV bug bot 2017-12-12 21:15:53 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'vdsm'/Component 'ovirt-engine' mismatch]

For more info please contact: rhv-devops

Comment 40 errata-xmlrpc 2018-05-15 17:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 41 Franta Kust 2019-05-16 13:08:21 UTC
BZ<2>Jira Resync

