Bug 1113948
Summary: | Vdsm sampling threads unexpectingly stops with IOError ENODEV | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> | |
Component: | vdsm | Assignee: | Barak <bazulay> | |
Status: | CLOSED ERRATA | QA Contact: | Martin Pavlik <mpavlik> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 3.4.0 | CC: | akotov, bazulay, danken, dossow, fromani, iheim, jcoscia, lpeer, luvilla, myakove, nyechiel, oourfali, pzhukov, yeylon | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | 3.5.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | network | |||
Fixed In Version: | ovirt-3.5.0-beta1.1 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, when a device was removed while sampling network device data, it would raise an Exception, invalidating the whole sampling. Now, when failing to retrieve data for a network device due to it being removed, it is deleted from the output so that the rest of the sampling can continue.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1129377 (view as bug list) | Environment: | ||
Last Closed: | 2015-02-11 21:11:51 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1129377, 1142923, 1156165 |
Description
Roman Hodain
2014-06-27 10:14:50 UTC
Possible cause: m3-isp03-2014062608111403770284/var/log/vdsm/vdsm.log 46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats 46760 Traceback (most recent call last): 46761 File "/usr/share/vdsm/sampling.py", line 412, in run 46762 File "/usr/share/vdsm/sampling.py", line 401, in sample 46763 File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates 46764 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks 46765 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict 46766 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType 46767 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo 46768 IOError: [Errno 19] No such device 46769 VM Channels Listener::ERROR::2014-06-26 07:27:51,421::vmChannels::53::vds::(_handle_event) Received 00000011 on fileno 165 We got IOError ENODEV exception which is not caught. We catch IOError but raise it again it the exception not EOPNOTSUPP. lib/vdsm/ipwrapper.py: 171 try: 172 driver = _drvinfo(name) 173 except IOError as ioe: 174 if ioe.errno == errno.EOPNOTSUPP: 175 if name == 'lo': 176 detectedType = LinkType.LOOPBACK 177 else: 178 detectedType = LinkType.DUMMY 179 return detectedType 180 else: 181 raise # Reraise other errors like ENODEV <========= There is no other except section on the way up till the run method where the thread body is running. As this is not vm.TimeoutError, the exception is moving up and is caught out of the main while loop so the thread ends here. def run(self): 406 import vm 407 try: 408 # wait a bit before starting to sample 409 time.sleep(self.SAMPLE_INTERVAL_SEC) 410 while not self._stopEvent.isSet(): <=== While loop 411 try: 412 sample = self.sample() 413 self._samples.append(sample) 414 self._lastSampleTime = sample.timestamp 415 if len(self._samples) > self.AVERAGING_WINDOW: 416 self._samples.pop(0) 417 except vm.TimeoutError: 418 self._log.error("Timeout while sampling stats", 419 exc_info=True) 420 self._stopEvent.wait(self.SAMPLE_INTERVAL_SEC) 421 except: <=== We get the exception out of the loop 422 if not self._stopEvent.isSet(): 423 self._log.error("Error while sampling stats", exc_info=True) We get the links here: lib/vdsm/ipwrapper.py: 284 def getLinks(): 285 """Returns a list of Link objects for each link in the system.""" 286 return [Link.fromDict(data) for data in netlink.iter_links()] netlink.iter_links() should return all devices including the vnet devices. I am now just guessing but it is possible a VM was powered off just after we got the list of devices and thus the device disappeared. We see that VM 576c0b26-25ea-4884-bd07-f810ffaff87a was killed just about the time when we got the exception. vdsm.log: 46755 Thread-8080::WARNING::2014-06-26 07:27:50,046::vm::822::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration stalling: remaining (709MiB) > lowmark (632MiB). Refer to RHBZ#919201. 46756 Thread-8080::INFO::2014-06-26 07:27:50,047::vm::834::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration Progress: 160 seconds elapsed, 92% of data processed, 92% of mem processed ... 46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats ... 46772 VM Channels Listener::DEBUG::2014-06-26 07:27:51,433::guestIF::208::vm.Vm::(_connect) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Failed to connect to /var/lib/libvirt/qemu/channels/576c0b26-25ea-4884-bd07-f810ffaff87a.com.redha 46772 t.rhevm.vdsm with 111 ... 46782 libvirtEventLoop::DEBUG::2014-06-26 07:27:52,055::vm::5152::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::event Stopped detail 3 opaque None On top of the network issue, the sampling code should not give up so easily. I'll address this with a patch. I'll change the code so that if a device goes missing it is just deleted from the output but the rest remain. verified with ovirt beta vdsm-4.16.0-42.git3bfad86.fc19.x86_64 no sign of "Error while sampling stats" in logs of heavily used setup for last few days Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html |