Bug 1173929 - Vdsm reports wrong NIC state, Error while sampling stats
Summary: Vdsm reports wrong NIC state, Error while sampling stats
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Dan Kenigsberg
QA Contact: Michael Burman
Depends On:
Reported: 2014-12-14 08:03 UTC by Michael Burman
Modified: 2016-03-09 19:27 UTC

Fixed In Version: vdsm-4.17.0-632.git19a83a2.el7.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-03-09 19:27:50 UTC
oVirt Team: Network
Target Upstream Version:
ylavi: Triaged+

Attachments
vdsm-error while sampling (760.24 KB, application/x-gzip)
2014-12-14 08:03 UTC, Michael Burman

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2016:0362 | 0 | normal | SHIPPED_LIVE | vdsm 3.6.0 bug fix and enhancement update | 2016-03-09 23:49:32 UTC
oVirt gerrit | 36138 | 0 | master | MERGED | fixing race while sampling interfaces | Never
oVirt gerrit | 36685 | 0 | ovirt-3.5 | MERGED | fixing race while sampling interfaces | Never
oVirt gerrit | 40097 | 0 | None | NEW | Signs tests as broken due to wrong mocking | Never

Description Michael Burman 2014-12-14 08:03:59 UTC
Created attachment 968363
vdsm-error while sampling

Description of problem:
Vdsm reports wrong NIC state, Error while sampling stats.

After configuring ethtool on a host NIC (eth2) via the GUI, eth2 was reported as down, even after 'refresh capabilities'.
- The kernel reports that the NIC is up:
ip a| grep eth2
eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

- In the event log, eth2 was reported as down at 15:46: 'Interface eth2 on host orange-vdsc.qa.lab.tlv.redhat.com, changed state to down'

- connectivity.log reports eth2 and eth2.164 as down
- vdsStats reports eth2 and eth2.164 as down, even though no VLAN is attached to the NIC any more.
- vdsCaps reports eth2 without a VLAN
- In setupNetworks, no network is attached to the eth2 NIC

It seems that we have a race when an interface disappears while sampling its

Thread-12::ERROR::2014-12-10 15:48:11,842::sampling::534::vds::(run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 516, in run
    sample = self.sample()
  File "/usr/share/vdsm/virt/sampling.py", line 506, in sample
    hs = HostSample(self._pid)
  File "/usr/share/vdsm/virt/sampling.py", line 261, in __init__
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 261, in <genexpr>
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 112, in __init__
    self.speed = _getLinkSpeed(link)
  File "/usr/share/vdsm/virt/sampling.py", line 690, in _getLinkSpeed
    speed = netinfo.vlanSpeed(dev.name)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 224, in vlanSpeed
    vlanDevName = getVlanDevice(vlanName)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 756, in getVlanDevice
    vlanLink = getLink(vlan)
  File "/usr/lib/python2.6/site-packages/vdsm/ipwrapper.py", line 300, in getLink
    return Link.fromDict(netlink.get_link(dev))
  File "/usr/lib/python2.6/site-packages/vdsm/netlink.py", line 66, in get_link
IOError: [Errno 19] eth2.164 is not present in the system
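
The traceback above shows a link being listed and then disappearing before its speed could be read. The following is a minimal sketch of one way to tolerate that race; it is not the merged patch (gerrit 36138), whose contents are not shown in this report. The names `Link`, `sample_interfaces` and `sample_link` are hypothetical and are not Vdsm APIs.

```python
import errno
from collections import namedtuple

# Hypothetical stand-in for a link object; only `name` is used here.
Link = namedtuple("Link", ["name"])


def sample_interfaces(links, sample_link):
    """Sample every link, skipping devices that vanished after being listed.

    `links` is an iterable of objects with a `name` attribute, and
    `sample_link` is a callable that may raise IOError. Both are
    illustrative names, not Vdsm APIs.
    """
    samples = {}
    for link in links:
        try:
            samples[link.name] = sample_link(link)
        except IOError as e:
            # ENODEV (errno 19) means the device disappeared between
            # listing and sampling; any other error is still propagated.
            if e.errno != errno.ENODEV:
                raise
    return samples
```

With this approach a vanished VLAN device such as eth2.164 would simply be absent from the sample, rather than aborting the whole sampling run and leaving stale state for the remaining NICs.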

Version-Release number of selected component (if applicable):

Relevant host - orange-vdsc.qa.lab.tlv.redhat.com
Upgrade engine-
Relevant time: 2014-12-10 15:46:11

Comment 1 Lior Vernia 2014-12-14 12:55:41 UTC
Marking this for 3.5.z as we don't know how common this race is, and it can be quite annoying for users to encounter it. Based on Ido's input I understand this bug was introduced in 3.5, so no need to backport further. Dan, feel free to override me :)

Comment 2 Eyal Edri 2015-02-25 08:45:35 UTC
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Comment 3 Dan Kenigsberg 2015-02-25 09:26:07 UTC
The code has already been merged to the stable branch, and would be part of rhev-3.5.1. It solves a rare race, and has been tested not to cause regressions elsewhere. It does not need a specific z-stream QE.

Comment 4 Michael Burman 2015-04-21 05:40:25 UTC

On which version should this bug be tested? 3.6?
Does vdsm-4.17.0-632.git19a83a2.el7.x86_64 include this fix?


Comment 5 Dan Kenigsberg 2015-04-21 09:11:05 UTC
To find where this was fixed in the master branch, take note of the fixing patch, https://gerrit.ovirt.org/#/c/36138/.

`git log --grep 36138 19a83a2` shows that indeed it exists in your 19a83a2 build.

Comment 6 Michael Burman 2015-04-21 14:59:19 UTC
Dan, I need the exact QA build version to test this. Thanks.

Fixed in version must be provided when moving bugs to ON_QA.
If we have a build for qa, then fixed in version must be set.
We are not testing from nightly master any more.

Comment 7 Dan Kenigsberg 2015-04-21 16:27:37 UTC
As I said, vdsm-4.17.0-632.git19a83a2.el7.x86_64 includes the patch.
I also explained how you can verify this yourself in the future.

Comment 8 Michael Burman 2015-04-22 08:02:51 UTC
Thank you Dan,
I know I can verify this by myself, but it shouldn't be this way;
this information must be set when moving bugs to ON_QA, especially when there is a QA build.

Verified on - 3.6.0-0.0.master.20150412172306.git55ba764.el6 with

Comment 12 errata-xmlrpc 2016-03-09 19:27:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

