Bug 980363 - Incorrect speed is displayed for the interface of the host
Status: MODIFIED
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Edward Haas
QA Contact: Meni Yakove
Depends On: 1240719
Blocks:
Reported: 2013-07-02 03:35 EDT by GenadiC
Modified: 2017-07-17 05:21 EDT
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1240719
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
eth3 shows 0 speed while it is supposed to be 1000 (1.79 MB, image/png)
2013-07-02 03:35 EDT, GenadiC


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 28913 master ABANDONED engine: move speed property from VdsNetworkInterface to VdsNetworkStatistics Never

Description GenadiC 2013-07-02 03:35:00 EDT
Created attachment 767656 [details]
eth3 shows 0 speed while it is supposed to be 1000

Description of problem:
After attaching a network to the host NIC, a speed of 0 or 1 is displayed in the GUI and stored in the DB.
getVdsCaps and getVdsStats report the correct speed of 1000.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Create network in the Cluster
2. Attach network to the host with SetupNetworks
3.

Actual results:
The speed of the interface changes from 1000 to 0 or 1 in the DB and GUI

Expected results:
The speed should remain 1000

Additional info:
Output of getVdsStats for eth3:
'eth3': {'macAddr': '',
                            'name': 'eth3',
                            'rxDropped': '0',
                            'rxErrors': '0',
                            'rxRate': '0.0',
                            'speed': '1000',
                            'state': 'up',
                            'txDropped': '0',
                            'txErrors': '0',
                            'txRate': '0.0'}}
Comment 1 Dan Kenigsberg 2013-07-04 10:40:07 EDT
Could you attach relevant engine.log, or provide a hint on how this can be reproduced?

What happens when you refresh the host network info (now that we have a button for that)?
Comment 2 GenadiC 2013-07-08 06:26:01 EDT
There is no info in the engine log.
In the DB we see that the speed is incorrect (0 or 1) and getVdsStats and getVdsCaps show the correct speed of 1000.
Refreshing doesn't solve the issue either.
Comment 5 Moti Asayag 2013-09-09 04:04:44 EDT
In ovirt-engine-3.3 we introduced a new action on host level named 'refresh capabilities'. It refreshes the host capabilities and the actual network configuration.

Looking at Genadi's host, it appears that the immediate getCapabilities after the setupNetworks command reports the speed as '0', while a few seconds later it is reported as '1000'. This is a separate issue that can be investigated further on the VDSM side, namely how the speed is evaluated.

Using the 'refresh capabilities' manually updates the speed of the host nics correctly, and with Bug 999947, it will be done automatically.
Comment 6 Assaf Muller 2013-10-02 04:41:15 EDT
The bug occurs because, when (for example) deleting a network from a NIC, VDSM calls ifdown, removes the network configuration from the NIC, and then calls ifup. When ifup returns, VDSM reports that the setupNetworks command has finished and the engine sends a getVdsCaps.

The issue is that by the time the getVdsCaps reaches VDSM, the kernel still hasn't managed to bring the device up (even though ifup reported it was done), and when a device is down the speed is reported as 0.

In other words, it's a race condition.

Specifically for the reporting of speed, we could use alternative methods (currently we read /sys/class/net/$device/speed, which is inaccessible when the device is down); however, ethtool (for example) has its own issues.

To actually solve the issue (which is larger in scope than just the reporting of speed) we'd have to make, as far as I can see, fairly drastic infrastructure changes, for example adding the ability for VDSM to push information to the engine. I suppose such a change will be considered for the next major release. For 3.4 the engine can poll VDSM every T seconds, as Moti has already suggested.
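The failure mode described above can be illustrated with a small sketch. This is a hypothetical helper, not actual VDSM code; the `sysfs_read` parameter is only a test seam so the behavior can be shown without real hardware:

```python
def nic_speed(device, sysfs_read=None):
    """Return the NIC speed in Mbps the way a sysfs-based reporter
    would, yielding 0 when the device is down or still negotiating.

    Hypothetical helper, not actual VDSM code; `sysfs_read` exists
    only so the behavior can be demonstrated without real hardware.
    """
    if sysfs_read is None:
        def sysfs_read(dev):
            # Raises EINVAL/ENOENT while the link is down -- this is
            # exactly the window the race described above falls into.
            with open('/sys/class/net/%s/speed' % dev) as f:
                return f.read().strip()
    try:
        speed = int(sysfs_read(device))
    except (OSError, IOError, ValueError):
        return 0
    # Some drivers report -1 (or 65535) during link negotiation.
    return speed if 0 < speed < 65535 else 0
```

A getVdsCaps arriving inside the down window thus sees 0 even though the NIC will shortly negotiate to 1000.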
Comment 7 Jamie Bainbridge 2013-10-21 20:56:38 EDT
(In reply to Assaf Muller from comment #6)
> The bug is because...
> I suppose such a change will be considered for the next major release.

Ah, that's excellent, thank you very much for the root cause description.

Could we please log a new bug to make sure we investigate this race within the next major release? I didn't see a 4.0 in Version or Target so I'm not sure how to track that properly.

We should then be able to set this bug to CLOSED CURRENTRELEASE with the refresh fix provided by Bug 999947.
Comment 8 Assaf Muller 2014-02-19 11:43:50 EST
Instead of closing this bug and opening another, I moved this bug's component to VDSM with a possible solution. Dan, Toni, if you dislike this solution, I think an alternative VDSM solution is the most feasible approach. If we fail to find a VDSM solution, we can close this bug and open another one with a different component and target release.

Here's a VDSM solution:

In VDSM's ifup function, after the ifup script returns, perform a 'wait' on the NIC. More specifically, poll its state in a busy loop with a reasonable timeout, and exit the function only after it's up (we know it should come up, because we just ifup'd it and ifup returned successfully). This will lengthen VDSM's ifup execution time by, say, a second, but it will ensure that VDSM's ifup (and thus setupNetworks) returns only when the NIC is actually up, so the first getVdsCaps after setupNetworks finishes will have an UP NIC to sample.
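The proposed wait could look roughly like the following. This is a sketch of the loop described above, not actual VDSM code; `read_state` is a hypothetical test seam:

```python
import time

def wait_for_link_up(device, timeout=2.0, interval=0.1, read_state=None):
    """Busy-poll the NIC's operstate until it reads 'up' or the
    timeout expires. Returns True if the link came up in time,
    False otherwise.

    Hypothetical sketch of the loop proposed above, not actual VDSM
    code; `read_state` can be stubbed out for offline demonstration.
    """
    if read_state is None:
        def read_state(dev):
            try:
                with open('/sys/class/net/%s/operstate' % dev) as f:
                    return f.read().strip()
            except (OSError, IOError):
                return 'down'  # device node not there yet
    deadline = time.time() + timeout
    while time.time() < deadline:
        if read_state(device) == 'up':
            return True
        time.sleep(interval)
    return False
```

ifup would call this right before returning, so setupNetworks completes only once the kernel actually reports the link as up.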
Comment 9 Antoni Segura Puimedon 2014-02-19 11:58:19 EST
The only issue is that we have async ifupping, i.e., when the engine requests DHCP IP configuration for a network, we delegate bringing the interface up to another thread and return, so that the UI is not left waiting. IMHO, whatever the engine does in the DHCP case, it would make sense for it to do in the general case as well.
Comment 10 Assaf Muller 2014-02-19 12:12:51 EST
When you do an ifup, does it return only after it gets a DHCP response, or after it sends the command to bring the NIC up but before the DHCP response arrives? I think that if ifup returns before it receives a DHCP response, then we can switch over to synchronous ifups, add the loop that waits until the NIC is up, and then return.
Comment 12 Antoni Segura Puimedon 2014-02-19 12:32:43 EST
We moved to async for ifup, i.e., we spawn a thread that will call dhcp and return immediately (so probably even before dhclient has been called) so that the UI would be more responsive.

Moving to all async and waiting is a possibility, of course, but that would undo the change vdsm was asked to make for UI responsiveness. All in all, the best option would be to make setupNetworks an asynchronous task and have vdsm mark it as done only after the kernel has effectively brought up the device(s).
Comment 13 Dan Kenigsberg 2014-02-19 16:46:40 EST
There's not much we can do within Vdsm if this is a dhcp-defined network. Engine may request setupNetworks to wait until the new device is up with blockingdhcp=True, or poll getVdsCaps until the change takes effect (or a timeout expires).

I've heard reports that even with blockingdhcp=True, there's a race within the ifup script, which may return before the ifup'ed device is completely usable. In my opinion, such an issue should be fixed within initscripts, iproute2, or the kernel, though we could hack around it with the tight loop described in comment 8.

Jamie, have you seen the issue happen with static addresses?
Can you reproduce this without Vdsm, i.e. by writing an ifcfg file, calling ifup, and immediately reading /sys/class/net/DEVICE/speed?
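A minimal repro along these lines might look like this. It is a hypothetical helper, not something shipped with Vdsm; `runner` and `reader` are test seams so the race can be demonstrated without touching a real interface:

```python
import subprocess

def speed_right_after_ifup(device, runner=None, reader=None):
    """Call ifup and immediately read the sysfs speed, mimicking what
    the engine's getVdsCaps effectively does right after setupNetworks.

    Hypothetical repro helper; `runner` and `reader` can be stubbed
    out so the race can be shown without a real interface.
    """
    if runner is None:
        runner = lambda cmd: subprocess.call(cmd)
    if reader is None:
        def reader(dev):
            with open('/sys/class/net/%s/speed' % dev) as f:
                return f.read().strip()
    runner(['ifup', device])  # ifup may return before the link is up
    return reader(device)     # so this can still read '-1' or fail
```

On real hardware, running this immediately after writing the ifcfg file would be expected to print '-1' or raise, matching the observation in the next comment.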
Comment 14 Jamie Bainbridge 2014-02-19 18:33:22 EST
I am not sure if this is limited to static or DHCP networks.

However, when calling ifup on a physical interface, it does take some time for /sys/class/net/DEVICE/speed to read the proper speed. On the interface I tried, the speed remains at "-1" for a second or two before going to "1000". I'd assume this is due to link speed negotiation between the NIC and the switchport.
Comment 15 Moti Asayag 2014-02-20 16:24:15 EST
This bug could be fixed on the engine side by using the speed reported (since 3.4) for each device: as part of collecting the host's statistics, the engine should preserve the speed for the device.

Moving the bug back to the engine-backend component, where the fix should be done.
Comment 16 Moti Asayag 2014-05-22 15:37:19 EDT
The speed field should be moved from the vds_interface table to vds_statistics, which is updated via the getVdsStats verb, called every 15 seconds.

The data-warehouse tables/views should be updated accordingly.
Comment 17 Yaniv Lavi (Dary) 2014-07-02 07:11:28 EDT
(In reply to Moti Asayag from comment #16)
> The field speed should be moved from the the vds_interface table to the
> vds_statistics which is updated via the getVdsStats verb, called every 15
> seconds.
> 
> The data-warehouse tables/views should be updated accordingly.

Why do you want to move a slowly changing value [how many times a day do you change interface speed?] to a table that changes every 15 seconds? It makes no sense and will damage the data warehouse collection process. Please consider another approach.

Fix the issue; don't do an ugly workaround to get this value updated. It would still return 0 part of the time, which is incorrect.
Comment 18 Lior Vernia 2014-07-21 07:01:24 EDT
After an exhaustive design discussion, we will most likely ameliorate the situation by solving Bug 999947, hopefully soon.

This will mean that NICs won't stay at speed 0 for long (unless there's a real issue), but 0 will still be displayed while a NIC is being brought up. We can either close this bug or keep it open as a lower-priority bug to be solved further in the future (most likely on the vdsm side).
Comment 19 Barak 2015-05-04 05:49:56 EDT
Moti,

Do you think this could also be a candidate for a vdsm-engine event?
Comment 20 Barak 2015-05-04 05:54:49 EDT
In addition, can we avoid saving the speed into the DB when the value is 0?
Comment 21 Moti Asayag 2015-05-18 08:04:50 EDT
(In reply to Barak from comment #19)
> Moti,
> 
> Do you think this can be also a candidate for vdsm-engine event ?

Yes:

According to the vdsm developers' comments (comment 12 and comment 13), once ifup completes, the speed of the interface should be available.
The same 'ifup' that brings the interface up to obtain an address via DHCP could also be used to report the entire interface configuration, so the engine would be able to update the interface on its side without issuing further calls to vdsm.

> In addition can we avoid saving the speed when the value is 0 into the DB ?

Yes, there is no constraint on that field, and access to its value seems to be usually null-protected. Hence null might be a valid value for it.

If DWH expects it to always be not null, it can be addressed by modifying dwh_host_interface_configuration_history_view.
Comment 22 Yaniv Lavi (Dary) 2016-05-09 06:57:05 EDT
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.
Comment 25 Dan Kenigsberg 2017-07-12 06:36:13 EDT
With https://gerrit.ovirt.org/#/c/70357/ merged, this should already be testable.
