Description of problem:

The vdsm sampling thread stops collecting statistics. This causes the host statistics (CPU, Memory, Network) in the WebAdmin portal to appear frozen.

Version-Release number of selected component (if applicable):

vdsm-4.14.7-3.el6ev.x86_64

How reproducible:

Sometimes; it is a question of timing.

Steps to Reproduce:
1. Stop a VM on the hypervisor.

Actual results:

Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 412, in run
  File "/usr/share/vdsm/sampling.py", line 401, in sample
  File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates
  File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks
  File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict
  File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType
  File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo
IOError: [Errno 19] No such device

Expected results:

The sampling thread remains running.
Possible cause:

m3-isp03-2014062608111403770284/var/log/vdsm/vdsm.log:

46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
46760 Traceback (most recent call last):
46761   File "/usr/share/vdsm/sampling.py", line 412, in run
46762   File "/usr/share/vdsm/sampling.py", line 401, in sample
46763   File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates
46764   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks
46765   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict
46766   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType
46767   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo
46768 IOError: [Errno 19] No such device
46769 VM Channels Listener::ERROR::2014-06-26 07:27:51,421::vmChannels::53::vds::(_handle_event) Received 00000011 on fileno 165

We got an IOError ENODEV exception which is not caught. We catch IOError but re-raise it if the errno is not EOPNOTSUPP.

lib/vdsm/ipwrapper.py:

171     try:
172         driver = _drvinfo(name)
173     except IOError as ioe:
174         if ioe.errno == errno.EOPNOTSUPP:
175             if name == 'lo':
176                 detectedType = LinkType.LOOPBACK
177             else:
178                 detectedType = LinkType.DUMMY
179             return detectedType
180         else:
181             raise  # Reraise other errors like ENODEV  <=========

There is no other except section on the way up to the run method, where the thread body is running. As this is not vm.TimeoutError, the exception propagates up and is caught outside the main while loop, so the thread ends here.
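One option would be to also tolerate ENODEV at this spot. The following is only a minimal sketch of that idea, not the actual vdsm patch: `detect_type`, the injectable `drvinfo` callable, and the simplified LOOPBACK/DUMMY constants are stand-ins for vdsm's `_detectType`, `_drvinfo`, and `LinkType`.

```python
import errno

# Hypothetical stand-ins for vdsm's LinkType constants.
LOOPBACK = 'loopback'
DUMMY = 'dummy'


def detect_type(name, drvinfo):
    """Sketch of _detectType that also tolerates ENODEV, i.e. a device
    that vanished between enumeration and the driver-info query."""
    try:
        return drvinfo(name)
    except IOError as ioe:
        if ioe.errno == errno.EOPNOTSUPP:
            # Same special-casing as the quoted code above.
            return LOOPBACK if name == 'lo' else DUMMY
        if ioe.errno == errno.ENODEV:
            return None  # device is gone; the caller can drop it
        raise  # anything else is still a real failure


def fake_drvinfo(name):
    # Simulate the failing query for a vnet device that just disappeared.
    raise IOError(errno.ENODEV, 'No such device')

print(detect_type('vnet3', fake_drvinfo))  # prints: None
```

With this shape, the caller sees `None` for a vanished device instead of an exception that unwinds all the way into the sampling thread.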
    def run(self):
406         import vm
407         try:
408             # wait a bit before starting to sample
409             time.sleep(self.SAMPLE_INTERVAL_SEC)
410             while not self._stopEvent.isSet():  <=== while loop
411                 try:
412                     sample = self.sample()
413                     self._samples.append(sample)
414                     self._lastSampleTime = sample.timestamp
415                     if len(self._samples) > self.AVERAGING_WINDOW:
416                         self._samples.pop(0)
417                 except vm.TimeoutError:
418                     self._log.error("Timeout while sampling stats",
419                                     exc_info=True)
420                 self._stopEvent.wait(self.SAMPLE_INTERVAL_SEC)
421         except:  <=== we get the exception out of the loop
422             if not self._stopEvent.isSet():
423                 self._log.error("Error while sampling stats", exc_info=True)

We get the links here:

lib/vdsm/ipwrapper.py:

284 def getLinks():
285     """Returns a list of Link objects for each link in the system."""
286     return [Link.fromDict(data) for data in netlink.iter_links()]

netlink.iter_links() should return all devices, including the vnet devices. I am just guessing now, but it is possible that a VM was powered off right after we got the list of devices, so the device disappeared before it could be inspected. We see that VM 576c0b26-25ea-4884-bd07-f810ffaff87a was killed at about the time we got the exception.

vdsm.log:

46755 Thread-8080::WARNING::2014-06-26 07:27:50,046::vm::822::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration stalling: remaining (709MiB) > lowmark (632MiB). Refer to RHBZ#919201.
46756 Thread-8080::INFO::2014-06-26 07:27:50,047::vm::834::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration Progress: 160 seconds elapsed, 92% of data processed, 92% of mem processed
...
46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
...
46772 VM Channels Listener::DEBUG::2014-06-26 07:27:51,433::guestIF::208::vm.Vm::(_connect) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Failed to connect to /var/lib/libvirt/qemu/channels/576c0b26-25ea-4884-bd07-f810ffaff87a.com.redhat.rhevm.vdsm with 111
...
46782 libvirtEventLoop::DEBUG::2014-06-26 07:27:52,055::vm::5152::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::event Stopped detail 3 opaque None
On top of the network issue, the sampling code should not give up so easily. I'll address this with a patch.
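The idea of a less fragile sampler can be sketched as a loop that logs any per-sample failure and keeps going, so a single transient error cannot kill the thread. This is only an illustration under assumed names (`sampling_loop`, `sample_fn`, the interval and window sizes are hypothetical), not the actual patch applied to sampling.py.

```python
import logging
import threading

log = logging.getLogger('sampling-sketch')


def sampling_loop(sample_fn, stop_event, interval=0.01, max_samples=5):
    """Sketch: any per-sample failure is logged and the loop continues,
    instead of letting the exception escape the while loop."""
    samples = []
    while not stop_event.is_set():
        try:
            samples.append(sample_fn())
            if len(samples) > max_samples:
                samples.pop(0)  # keep a sliding averaging window
        except Exception:
            log.exception('Error while sampling stats; retrying')
        stop_event.wait(interval)
    return samples
```

The key difference from the quoted run() method is that the broad except sits inside the loop, so an IOError from one sampling round only costs that one sample.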
I'll change the code so that if a device goes missing, it is simply dropped from the output while the rest of the links remain.
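That fix can be sketched roughly as follows; `iter_links` and `from_dict` are injectable stand-ins for vdsm's `netlink.iter_links()` and `Link.fromDict`, so this is an illustration of the approach rather than the merged patch.

```python
import errno


def get_links(iter_links, from_dict):
    """Sketch of a race-tolerant getLinks(): a device that disappears
    between enumeration and inspection is skipped, not fatal."""
    links = []
    for data in iter_links():
        try:
            links.append(from_dict(data))
        except IOError as ioe:
            if ioe.errno != errno.ENODEV:
                raise  # other errors are still real failures
    return links


def fake_from_dict(data):
    if data['name'] == 'vnet3':  # pretend this VM was powered off mid-scan
        raise IOError(errno.ENODEV, 'No such device')
    return data['name']

devices = [{'name': 'eth0'}, {'name': 'vnet3'}, {'name': 'lo'}]
print(get_links(lambda: iter(devices), fake_from_dict))  # prints: ['eth0', 'lo']
```

Compared with the original one-line list comprehension, building the list item by item is what makes it possible to skip just the vanished device.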
Verified with oVirt beta vdsm-4.16.0-42.git3bfad86.fc19.x86_64: no sign of "Error while sampling stats" in the logs of a heavily used setup for the last few days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html