Bug 1113948 - Vdsm sampling threads unexpectingly stops with IOError ENODEV
Summary: Vdsm sampling threads unexpectingly stops with IOError ENODEV
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: 3.5.0
Assignee: Barak
QA Contact: Martin Pavlik
URL:
Whiteboard: network
Depends On:
Blocks: 1129377 rhev3.5beta 1156165
TreeView+ depends on / blocked
 
Reported: 2014-06-27 10:14 UTC by Roman Hodain
Modified: 2019-06-13 08:00 UTC (History)
14 users (show)

Fixed In Version: ovirt-3.5.0-beta1.1
Doc Type: Bug Fix
Doc Text:
Previously, when a device was removed while sampling network device data, it would raise an Exception, invalidating the whole sampling. Now, when failing to retrieve data for a network device due to it being removed, it is deleted from the output so that the rest of the sampling can continue.
Clone Of:
: 1129377 (view as bug list)
Environment:
Last Closed: 2015-02-11 21:11:51 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 974043 0 None None None Never
Red Hat Product Errata RHBA-2015:0159 0 normal SHIPPED_LIVE vdsm 3.5.0 - bug fix and enhancement update 2015-02-12 01:35:58 UTC
oVirt gerrit 29442 0 master MERGED ipwrapper: do not report devices that disappear Never
oVirt gerrit 29546 0 ovirt-3.5 MERGED ipwrapper: do not report devices that disappear Never

Description Roman Hodain 2014-06-27 10:14:50 UTC
Description of problem:
The vdsm sampling thread stops collecting statistics. This causes the host
statistics (CPU, Memory, Network) in the WebAdmin portal to get frozen.

Version-Release number of selected component (if applicable):
     vdsm-4.14.7-3.el6ev.x86_64

How reproducible:
     sometimes. It is question of timing

Steps to Reproduce:
	1. Stop VM on the hypervisor 

Actual results:

	Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
	Traceback (most recent call last):
	File "/usr/share/vdsm/sampling.py", line 412, in run
	File "/usr/share/vdsm/sampling.py", line 401, in sample
	File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates
	File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks
	File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict
	File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType
	File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo
	IOError: [Errno 19] No such device

Expected results:
	
	The sampling thread remains runing.

Comment 1 Roman Hodain 2014-06-27 10:20:35 UTC
Possible cause:

m3-isp03-2014062608111403770284/var/log/vdsm/vdsm.log
  46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
  46760 Traceback (most recent call last):
  46761   File "/usr/share/vdsm/sampling.py", line 412, in run
  46762   File "/usr/share/vdsm/sampling.py", line 401, in sample
  46763   File "/usr/share/vdsm/sampling.py", line 388, in _updateIfidsIfrates
  46764   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 286, in getLinks
  46765   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 147, in fromDict
  46766   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 172, in _detectType
  46767   File "/usr/lib64/python2.6/site-packages/vdsm/ipwrapper.py", line 268, in _drvinfo
  46768 IOError: [Errno 19] No such device
  46769 VM Channels Listener::ERROR::2014-06-26 07:27:51,421::vmChannels::53::vds::(_handle_event) Received 00000011 on fileno 165

We got IOError ENODEV exception which is not caught. We catch IOError but raise
it again it the exception not EOPNOTSUPP.

lib/vdsm/ipwrapper.py:

171         try:                                                                    
172             driver = _drvinfo(name)                                             
173         except IOError as ioe:                                                  
174             if ioe.errno == errno.EOPNOTSUPP:                                   
175                 if name == 'lo':                                                
176                     detectedType = LinkType.LOOPBACK                            
177                 else:                                                           
178                     detectedType = LinkType.DUMMY                               
179                 return detectedType                                             
180             else:                                                               
181                 raise  # Reraise other errors like ENODEV             <=========

There is no other except section on the way up till the run method where the thread
body is running. As this is not vm.TimeoutError, the exception is moving up and
is caught out of the main while loop so the thread ends here.

def run(self):                                                              
406         import vm                                                               
407         try:                                                                    
408             # wait a bit before starting to sample                              
409             time.sleep(self.SAMPLE_INTERVAL_SEC)                                
410             while not self._stopEvent.isSet():                      <=== While loop            
411                 try:                                                            
412                     sample = self.sample()                                      
413                     self._samples.append(sample)                                
414                     self._lastSampleTime = sample.timestamp                     
415                     if len(self._samples) > self.AVERAGING_WINDOW:              
416                         self._samples.pop(0)                                    
417                 except vm.TimeoutError:                                         
418                     self._log.error("Timeout while sampling stats",             
419                                     exc_info=True)                              
420                 self._stopEvent.wait(self.SAMPLE_INTERVAL_SEC)                  
421         except:                                                     <=== We get the exception out of the loop
422             if not self._stopEvent.isSet():                                     
423                 self._log.error("Error while sampling stats", exc_info=True)

We get the links here:

lib/vdsm/ipwrapper.py:

284 def getLinks():                                                                 
285     """Returns a list of Link objects for each link in the system."""           
286     return [Link.fromDict(data) for data in netlink.iter_links()]  

netlink.iter_links() should return all devices including the vnet devices. I am
now just guessing but it is possible a VM was powered off just after we got
the list of devices and thus the device disappeared.

We see that VM 576c0b26-25ea-4884-bd07-f810ffaff87a was killed just about the
time when we got the exception.

vdsm.log:

 46755 Thread-8080::WARNING::2014-06-26 07:27:50,046::vm::822::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration stalling: remaining (709MiB) > lowmark (632MiB). Refer to RHBZ#919201.
 46756 Thread-8080::INFO::2014-06-26 07:27:50,047::vm::834::vm.Vm::(run) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Migration Progress: 160 seconds elapsed, 92% of data processed, 92% of mem processed
...
 46759 Thread-12::ERROR::2014-06-26 07:27:51,106::sampling::423::vds::(run) Error while sampling stats
...
  46772 VM Channels Listener::DEBUG::2014-06-26 07:27:51,433::guestIF::208::vm.Vm::(_connect) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::Failed to connect to /var/lib/libvirt/qemu/channels/576c0b26-25ea-4884-bd07-f810ffaff87a.com.redha  46772 t.rhevm.vdsm with 111
...
  46782 libvirtEventLoop::DEBUG::2014-06-26 07:27:52,055::vm::5152::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`576c0b26-25ea-4884-bd07-f810ffaff87a`::event Stopped detail 3 opaque None

Comment 2 Francesco Romani 2014-06-30 10:54:06 UTC
On top of the network issue, the sampling code should not give up so easily. I'll address this with a patch.

Comment 3 Antoni Segura Puimedon 2014-06-30 13:10:56 UTC
I'll change the code so that if a device goes missing it is just deleted from the output but the rest remain.

Comment 4 Martin Pavlik 2014-07-29 08:59:22 UTC
verified with ovirt beta vdsm-4.16.0-42.git3bfad86.fc19.x86_64


no sign of "Error while sampling stats" in logs of heavily used setup for last few days

Comment 9 errata-xmlrpc 2015-02-11 21:11:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0159.html


Note You need to log in before you can comment on or make changes to this bug.