Bug 1113948
| Summary: | Vdsm sampling threads unexpectingly stops with IOError ENODEV | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> | |
| Component: | vdsm | Assignee: | Barak <bazulay> | |
| Status: | CLOSED ERRATA | QA Contact: | Martin Pavlik <mpavlik> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 3.4.0 | CC: | akotov, bazulay, danken, dossow, fromani, iheim, jcoscia, lpeer, luvilla, myakove, nyechiel, oourfali, pzhukov, yeylon | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | 3.5.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | network | |||
| Fixed In Version: | ovirt-3.5.0-beta1.1 | Doc Type: | Bug Fix | |
| Doc Text: |
Previously, when a device was removed while sampling network device data, it would raise an Exception, invalidating the whole sampling. Now, when failing to retrieve data for a network device due to it being removed, it is deleted from the output so that the rest of the sampling can continue.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1129377 (view as bug list) | Environment: | ||
| Last Closed: | 2015-02-11 21:11:51 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | Network | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1129377, 1142923, 1156165 | |||
Possible cause:
m3-isp03-2014062608111403770284/var/log/vdsm/vdsm.log
46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
46760 Traceback (most recent call last):
46761 File "/usr/share/vdsm/sampling.py", line 412, in run
46762 File "/usr/share/vdsm/sampling.py", line 401, in sample
46763 File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates
46764 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks
46765 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict
46766 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType
46767 File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo
46768 IOError: [Errno 19] No such device
46769 VM Channels Listener::ERROR::2014-06-26 07:27:51,421::vmChannels::53::vds::(_handle_event) Received 00000011 on fileno 165
We got IOError ENODEV exception which is not caught. We catch IOError but raise
it again it the exception not EOPNOTSUPP.
lib/vdsm/ipwrapper.py:
171 try:
172 driver = _drvinfo(name)
173 except IOError as ioe:
174 if ioe.errno == errno.EOPNOTSUPP:
175 if name == 'lo':
176 detectedType = LinkType.LOOPBACK
177 else:
178 detectedType = LinkType.DUMMY
179 return detectedType
180 else:
181 raise # Reraise other errors like ENODEV <=========
There is no other except section on the way up till the run method where the thread
body is running. As this is not vm.TimeoutError, the exception is moving up and
is caught out of the main while loop so the thread ends here.
def run(self):
406 import vm
407 try:
408 # wait a bit before starting to sample
409 time.sleep(self.SAMPLE_INTERVAL_SEC)
410 while not self._stopEvent.isSet(): <=== While loop
411 try:
412 sample = self.sample()
413 self._samples.append(sample)
414 self._lastSampleTime = sample.timestamp
415 if len(self._samples) > self.AVERAGING_WINDOW:
416 self._samples.pop(0)
417 except vm.TimeoutError:
418 self._log.error("Timeout while sampling stats",
419 exc_info=True)
420 self._stopEvent.wait(self.SAMPLE_INTERVAL_SEC)
421 except: <=== We get the exception out of the loop
422 if not self._stopEvent.isSet():
423 self._log.error("Error while sampling stats", exc_info=True)
We get the links here:
lib/vdsm/ipwrapper.py:
284 def getLinks():
285 """Returns a list of Link objects for each link in the system."""
286 return [Link.fromDict(data) for data in netlink.iter_links()]
netlink.iter_links() should return all devices including the vnet devices. I am
now just guessing but it is possible a VM was powered off just after we got
the list of devices and thus the device disappeared.
We see that VM 576c0b26-25ea-4884-bd07-f810ffaff87a was killed just about the
time when we got the exception.
vdsm.log:
46755 Thread-8080::WARNING::2014-06-26 07:27:50,046::vm::822::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration stalling: remaining (709MiB) > lowmark (632MiB). Refer to RHBZ#919201.
46756 Thread-8080::INFO::2014-06-26 07:27:50,047::vm::834::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration Progress: 160 seconds elapsed, 92% of data processed, 92% of mem processed
...
46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
...
46772 VM Channels Listener::DEBUG::2014-06-26 07:27:51,433::guestIF::208::vm.Vm::(_connect) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Failed to connect to /var/lib/libvirt/qemu/channels/576c0b26-25ea-4884-bd07-f810ffaff87a.com.redha 46772 t.rhevm.vdsm with 111
...
46782 libvirtEventLoop::DEBUG::2014-06-26 07:27:52,055::vm::5152::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::event Stopped detail 3 opaque None
On top of the network issue, the sampling code should not give up so easily. I'll address this with a patch. I'll change the code so that if a device goes missing it is just deleted from the output but the rest remain. verified with ovirt beta vdsm-4.16.0-42.git3bfad86.fc19.x86_64 no sign of "Error while sampling stats" in logs of heavily used setup for last few days Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html |
Description of problem: The vdsm sampling thread stops collecting statistics. This causes the host statistics (CPU, Memory, Network) in the WebAdmin portal to get frozen. Version-Release number of selected component (if applicable): vdsm-4.14.7-3.el6ev.x86_64 How reproducible: sometimes. It is question of timing Steps to Reproduce: 1. Stop VM on the hypervisor Actual results: Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats Traceback (most recent call last): File "/usr/share/vdsm/sampling.py", line 412, in run File "/usr/share/vdsm/sampling.py", line 401, in sample File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo IOError: [Errno 19] No such device Expected results: The sampling thread remains runing.