Bug 1173929

Summary: Vdsm reports wrong NIC state, Error while sampling stats
Product: Red Hat Enterprise Virtualization Manager
Reporter: Michael Burman <mburman>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED ERRATA
QA Contact: Michael Burman <mburman>
Severity: high
Docs Contact:
Priority: medium
Version: 3.5.0
CC: bazulay, danken, gklein, lpeer, lsurette, myakove, nyechiel, ybronhei, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rc
Flags: ylavi: Triaged+
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: vdsm-4.17.0-632.git19a83a2.el7.x86_64
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-09 19:27:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description                  Flags
vdsm-error while sampling    none

Description Michael Burman 2014-12-14 08:03:59 UTC
Created attachment 968363
vdsm-error while sampling

Description of problem:
Vdsm reports wrong NIC state, Error while sampling stats.

After configuring ethtool on a host NIC (eth2) via the GUI, eth2 is reported as down, even after 'refresh capabilities'.
- The kernel reports that the NIC is up:
ip a | grep eth2
eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

- In the event log, eth2 was reported as down at 15:46: 'Interface eth2 on host orange-vdsc.qa.lab.tlv.redhat.com, changed state to down'

- connectivity.log reports eth2 and eth2.164 as down
- vdsStats reports eth2 and eth2.164 as down, although no VLAN is actually attached to the NIC any more
- vdsCaps reports eth2 without a VLAN
- In setupNetworks, no network is attached to the eth2 NIC

It seems that we have a race when an interface disappears while sampling its
statistics.

Thread-12::ERROR::2014-12-10 15:48:11,842::sampling::534::vds::(run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 516, in run
    sample = self.sample()
  File "/usr/share/vdsm/virt/sampling.py", line 506, in sample
    hs = HostSample(self._pid)
  File "/usr/share/vdsm/virt/sampling.py", line 261, in __init__
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 261, in <genexpr>
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 112, in __init__
    self.speed = _getLinkSpeed(link)
  File "/usr/share/vdsm/virt/sampling.py", line 690, in _getLinkSpeed
    speed = netinfo.vlanSpeed(dev.name)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 224, in vlanSpeed
    vlanDevName = getVlanDevice(vlanName)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 756, in getVlanDevice
    vlanLink = getLink(vlan)
  File "/usr/lib/python2.6/site-packages/vdsm/ipwrapper.py", line 300, in getLink
    return Link.fromDict(netlink.get_link(dev))
  File "/usr/lib/python2.6/site-packages/vdsm/netlink.py", line 66, in get_link
    name)
IOError: [Errno 19] eth2.164 is not present in the system
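
The traceback suggests the link list is taken first and each link is sampled afterwards, so a device removed in between raises ENODEV (errno 19) and aborts the whole sample. A minimal sketch of one way to tolerate that, not the actual vdsm patch; `sample_links` and the `sample_one` callable are hypothetical names used only for illustration:

```python
import errno


def sample_links(links, sample_one):
    """Sample every link, skipping links that vanish while being sampled.

    `links` is an iterable of objects with a `name` attribute and
    `sample_one` is a callable returning a sample for one link.
    Both are assumptions of this sketch, not vdsm APIs.
    """
    samples = {}
    for link in links:
        try:
            samples[link.name] = sample_one(link)
        except (IOError, OSError) as e:
            # Only swallow "no such device"; anything else is a real error.
            if e.errno != errno.ENODEV:
                raise
            # The device disappeared between listing and sampling: skip it.
    return samples
```

With this shape, a stale eth2.164 would simply be left out of the sample instead of failing the sampling run for every interface.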


Version-Release number of selected component (if applicable):
3.5.0-0.23.beta.el6ev
vdsm-4.16.8.1-2.el6ev.x86_64

Relevant host: orange-vdsc.qa.lab.tlv.redhat.com
Upgrade engine: 10.35.161.37
Relevant time: 2014-12-10 15:46:11

Comment 1 Lior Vernia 2014-12-14 12:55:41 UTC
Marking this for 3.5.z as we don't know how common this race is, and it can be quite annoying for users to encounter it. Based on Ido's input I understand this bug was introduced in 3.5, so no need to backport further. Dan, feel free to override me :)

Comment 2 Eyal Edri 2015-02-25 08:45:35 UTC
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Comment 3 Dan Kenigsberg 2015-02-25 09:26:07 UTC
The code has already been merged to the stable branch, and will be part of rhev-3.5.1. It solves a rare race, and has been tested not to cause regressions elsewhere. It does not need a specific z-stream QE.

Comment 4 Michael Burman 2015-04-21 05:40:25 UTC
Dan,

On which version should this bug be tested? 3.6?
Does vdsm-4.17.0-632.git19a83a2.el7.x86_64 include this fix?

Thanks,

Comment 5 Dan Kenigsberg 2015-04-21 09:11:05 UTC
To find where this was fixed in the master branch, take note of the fixing patch: https://gerrit.ovirt.org/#/c/36138/

`git log --grep 36138 19a83a2` shows that indeed it exists in your 19a83a2 build.
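
This check can be scripted. A minimal sketch, assuming the build string embeds the short commit hash after `git` (as in `632.git19a83a2`); `commit_from_build` and `grep_command` are hypothetical helper names, not part of vdsm:

```python
import re


def commit_from_build(build):
    """Return the short commit hash embedded in a build string, or None."""
    match = re.search(r"\.git([0-9a-f]{7,40})\.", build)
    return match.group(1) if match else None


def grep_command(change_id, build):
    """Build the `git log --grep` command that looks for a change number
    in the history reachable from the commit the build was made from."""
    commit = commit_from_build(build)
    if commit is None:
        raise ValueError("no git hash found in %r" % build)
    return ["git", "log", "--grep", str(change_id), commit]
```

The resulting list can be passed to `subprocess.check_output` with `cwd` set to a vdsm clone. Non-empty output means a commit whose message mentions the change number is reachable from that build's commit.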

Comment 6 Michael Burman 2015-04-21 14:59:19 UTC
Dan, I need the exact QA build version to test this. Thanks.

Fixed in version must be provided when moving bugs to ON_QA.
If we have a build for qa, then fixed in version must be set.
We are not testing from nightly master any more.

Comment 7 Dan Kenigsberg 2015-04-21 16:27:37 UTC
As I said, vdsm-4.17.0-632.git19a83a2.el7.x86_64 includes the patch.
I also explained how you can verify this yourself in the future.

Comment 8 Michael Burman 2015-04-22 08:02:51 UTC
Thank you Dan,
I know I can verify this myself, but it shouldn't be this way.
This information must be set when moving bugs to ON_QA, especially when there is a QA build.

Verified on 3.6.0-0.0.master.20150412172306.git55ba764.el6 with
vdsm-4.17.0-632.git19a83a2.el7.x86_64

Comment 12 errata-xmlrpc 2016-03-09 19:27:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html