Created attachment 968363: vdsm-error while sampling

Description of problem:
Vdsm reports the wrong NIC state and logs "Error while sampling stats". After configuring ethtool on a host NIC (eth2) via the GUI, eth2 is reported as down, even after 'refresh capabilities'.

- The kernel reports that the NIC is up:
    ip a | grep eth2
    eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
- In the event log, eth2 was reported as down at 15:46: 'Interface eth2 on host orange-vdsc.qa.lab.tlv.redhat.com, changed state to down'
- connectivity.log reports eth2 and eth2.164 as down.
- vdsStats reports eth2 and eth2.164 as down, although no VLAN is attached to the NIC any more.
- vdsCaps reports eth2 without a VLAN.
- In setupNetworks, no network is attached to the eth2 NIC.

It seems that we have a race when an interface disappears while its statistics are being sampled.

Thread-12::ERROR::2014-12-10 15:48:11,842::sampling::534::vds:run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 516, in run
    sample = self.sample()
  File "/usr/share/vdsm/virt/sampling.py", line 506, in sample
    hs = HostSample(self._pid)
  File "/usr/share/vdsm/virt/sampling.py", line 261, in __init__
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 261, in <genexpr>
    (link.name, InterfaceSample(link)) for link in getLinks())
  File "/usr/share/vdsm/virt/sampling.py", line 112, in __init__
    self.speed = _getLinkSpeed(link)
  File "/usr/share/vdsm/virt/sampling.py", line 690, in _getLinkSpeed
    speed = netinfo.vlanSpeed(dev.name)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 224, in vlanSpeed
    vlanDevName = getVlanDevice(vlanName)
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 756, in getVlanDevice
    vlanLink = getLink(vlan)
  File "/usr/lib/python2.6/site-packages/vdsm/ipwrapper.py", line 300, in getLink
    return Link.fromDict(netlink.get_link(dev))
  File "/usr/lib/python2.6/site-packages/vdsm/netlink.py", line 66, in get_link
    name)
IOError: [Errno 19] eth2.164 is not present in the system

Version-Release number of selected component (if applicable):
3.5.0-0.23.beta.el6ev
vdsm-4.16.8.1-2.el6ev.x86_64

Relevant host: orange-vdsc.qa.lab.tlv.redhat.com
Upgrade engine: 10.35.161.37
Relevant time: 2014-12-10 15:46:11
Marking this for 3.5.z as we don't know how common this race is, and it can be quite annoying for users to encounter it. Based on Ido's input I understand this bug was introduced in 3.5, so no need to backport further. Dan, feel free to override me :)
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.
The code has already been merged to the stable branch and will be part of rhev-3.5.1. It solves a rare race and has been tested not to cause regressions elsewhere. It does not need specific z-stream QE.
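For readers unfamiliar with this kind of race: the traceback in the description shows a link being enumerated and then vanishing before its details are read, so the lookup fails with ENODEV (errno 19) and the whole sample is lost. The following is only a minimal sketch of one way to tolerate that, and is not the merged vdsm patch. FakeNetlink and sample_interfaces are hypothetical names made up for illustration; they are not part of vdsm.

```python
import errno


class FakeNetlink(object):
    """Hypothetical stand-in for a link lookup; NOT vdsm's netlink module."""

    def __init__(self, present):
        self._present = set(present)

    def get_link(self, name):
        # Mimic the error seen in the traceback: errno 19 (ENODEV).
        if name not in self._present:
            raise IOError(errno.ENODEV,
                          '%s is not present in the system' % name)
        return {'name': name, 'speed': 1000}


def sample_interfaces(names, get_link):
    """Sample every named link, skipping links that disappeared
    between enumeration and sampling."""
    samples = {}
    for name in names:
        try:
            samples[name] = get_link(name)
        except IOError as e:
            if e.errno != errno.ENODEV:
                raise  # unrelated failures must still surface
            # The device went away mid-sample; leave it out of this round.
    return samples


if __name__ == '__main__':
    # eth2.164 was enumerated earlier but has since been removed.
    netlink = FakeNetlink(['eth2'])
    print(sample_interfaces(['eth2', 'eth2.164'], netlink.get_link))
```

Only ENODEV is swallowed here, so any other I/O error still aborts the sample and gets logged as before.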
Dan, on which version should this bug be tested? 3.6? Does vdsm-4.17.0-632.git19a83a2.el7.x86_64 include this fix? Thanks,
To find where this was fixed in the master branch, take note of the fixing patch https://gerrit.ovirt.org/#/c/36138/. `git log --grep 36138 19a83a2` shows that it is indeed included in your 19a83a2 build.
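The same check can be scripted. This is a rough sketch that assumes a local vdsm git checkout and that the Gerrit change number appears somewhere in the commit message; the helper names are hypothetical. The match is purely textual, so an unrelated commit that happens to mention the same number would also count as a hit.

```python
import subprocess


def git_log_grep_command(pattern, revision):
    """Build a git command listing commits reachable from `revision`
    whose message matches `pattern`."""
    return ['git', 'log', '--oneline', '--grep=%s' % pattern, revision]


def revision_mentions(pattern, revision, repo_dir='.'):
    """Return True if any commit reachable from `revision` mentions
    `pattern` in its message. Needs git and a checkout at repo_dir."""
    output = subprocess.check_output(
        git_log_grep_command(pattern, revision), cwd=repo_dir)
    return bool(output.strip())


if __name__ == '__main__':
    # Run from inside a vdsm checkout (assumption).
    print(revision_mentions('36138', '19a83a2'))
```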
Dan, I need the exact QA build version to test this. Thanks. "Fixed in version" must be provided when moving bugs to ON_QA. If we have a build for QA, then "fixed in version" must be set. We are not testing from nightly master any more.
As I said, vdsm-4.17.0-632.git19a83a2.el7.x86_64 includes the patch. I also explained how you can verify this yourself in the future.
Thank you Dan, I know I can verify this myself, but it shouldn't be this way. This information must be set when moving bugs to ON_QA, especially when there is a QA build. Verified on 3.6.0-0.0.master.20150412172306.git55ba764.el6 with vdsm-4.17.0-632.git19a83a2.el7.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html