Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 955432

Summary: vdsmd is unstable on a hypervisor and crashes frequently
Product: Red Hat Enterprise Virtualization Manager
Reporter: Kevein Liu <yaliu>
Component: vdsm
Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED DUPLICATE
QA Contact:
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.1.3
CC: abaron, bazulay, hateya, iheim, lpeer, michal.skrivanek, ybronhei, ykaul
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: network
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-04-23 08:50:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Kevein Liu 2013-04-23 03:48:30 UTC
Description of problem:
vdsmd is unstable on a hypervisor and crashes frequently.
As a result, RHEV-M loses its connection to the hypervisor and all VMs go into a non-responsive state. vdsmd appears to have been killed by a SIGSEGV signal:
~~~
Apr 22 10:24:26 h-p-27 kernel: vdsm[39698]: segfault at 20 ip 00000036dac09220 sp 00007f7ae61f9da8 error 4 in libpthread-2.12.so[36dac00000+17000]
~~~
The system logs contain other relevant messages:
~~~
Apr 22 15:02:12 h-p-27 abrt: detected unhandled Python exception in '/usr/share/vdsm/nwfilter.pyc'
Apr 22 15:02:12 h-p-27 abrt: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory
Apr 22 15:02:12 h-p-27 init: libvirtd main process (34563) killed by SEGV signal
Apr 22 15:02:12 h-p-27 init: libvirtd main process ended, respawning
~~~
~~~
Thread-266::ERROR::2013-04-22 14:06:36,837::utils::416::vm.Vm::(collect) vmId=`5e614c36-37e0-42ea-a7bc-9c65bb2c977b`::Stats function failed: <AdvancedStatsFunction _sampleNet at 0x15cd298>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 168, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error received hangup / error event on socket
Thread-194::ERROR::2013-04-22 14:06:36,838::utils::416::vm.Vm::(collect) vmId=`c19ea273-40b3-4df1-92e8-d343bc0c03d9`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x15cd1b8>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 137, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1849, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: internal error client socket is closed
~~~
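Both tracebacks show the same pattern: a single dead libvirt socket makes every AdvancedStatsFunction raise straight through on each sampling pass. A minimal sketch (stand-in names, not vdsm's actual code: `LibvirtError` and `FakeDomain` stand in for `libvirt.libvirtError` and a libvirt domain handle) of catching the error per sample so one broken NIC or socket does not abort the whole collection pass:

```python
# Hedged sketch: a stats collector that records per-sample failures
# instead of letting a dead backend connection crash the thread.
# LibvirtError and FakeDomain are hypothetical stand-ins.

class LibvirtError(Exception):
    """Stand-in for libvirt.libvirtError."""

class FakeDomain:
    """Stand-in for a libvirt domain handle."""
    def __init__(self, broken=False):
        self.broken = broken

    def interfaceStats(self, nic):
        if self.broken:
            # Mirrors "internal error received hangup / error event on socket"
            raise LibvirtError("received hangup / error event on socket")
        # rx_bytes, rx_packets, rx_errs, rx_drop, tx_bytes, tx_packets, ...
        return (100, 2, 0, 0, 50, 1, 0, 0)

def collect(dom, nics):
    """Sample per-NIC stats; skip NICs whose sampling fails and report
    the errors, rather than raising out of the collector loop."""
    samples, errors = {}, {}
    for nic in nics:
        try:
            samples[nic] = dom.interfaceStats(nic)
        except LibvirtError as e:
            errors[nic] = str(e)  # log and continue, do not crash the thread
    return samples, errors

samples, errors = collect(FakeDomain(broken=True), ["vnet0"])
```

This only contains the symptom; once the connection is gone every subsequent sample will fail too, which is why the self-fencing discussed below is still needed.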

Version-Release number of selected component (if applicable):
* libvirt-0.10.2-18.el6_4.2.x86_64
* vdsm-4.10.2-1.8.el6ev.x86_64
* rhevm-3.1.0-43.el6ev.noarch

How reproducible:
The issue occurs only occasionally, so we do not know how to reproduce it at this time.

Steps to Reproduce:
N/A
  
Actual results:
vdsmd and libvirtd crash, and RHEV-M loses its connection to the hypervisor.

Expected results:
vdsmd and libvirtd run stably.

Additional info:
The customer has uploaded the log-collector output to our FTP server:
ftp://dropbox.redhat.com/sosreport-LogCollector-m3-20130422141119-993b.tar.xz

Comment 1 Yaniv Bronhaim 2013-04-23 07:50:19 UTC
Bug 951576 appears to refer to the same issue. I would prefer to see the vdsm log to verify that it is the same case, but judging by the exception you copied, the problem is that vdsm does not perform self-fencing when the socket breaks (probably because of the same libvirt keepalive or SIGABRT bugs). The patch that fixes Bug 951576 should solve this case too.

Please attach the host's vdsm log.
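The self-fencing described above can be sketched as a watchdog that gives up after a number of consecutive connection failures so that init (which already respawns libvirtd, per the log above) can restart the daemon with a fresh connection. This is an illustrative sketch under assumed names and threshold, not the actual Bug 951576 patch:

```python
# Hedged sketch of daemon self-fencing: after enough consecutive
# backend failures, stop trying and let the supervisor (init/systemd)
# respawn the process. Names and threshold are hypothetical.

FENCE_THRESHOLD = 3  # consecutive failures tolerated before fencing

class SelfFencer:
    def __init__(self, threshold=FENCE_THRESHOLD):
        self.threshold = threshold
        self.failures = 0
        self.fenced = False

    def record_success(self):
        # A healthy sample resets the failure streak.
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.fence()

    def fence(self):
        # Real code would log and exit (e.g. raise SystemExit) so the
        # supervisor restarts the daemon; here we just record the decision.
        self.fenced = True

f = SelfFencer()
for _ in range(3):
    f.record_failure()  # three failures in a row trip the fence
```

The key design point is that transient errors (followed by a success) never trip the fence, while a persistently broken socket does, instead of the daemon looping on failed stats calls forever.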

Comment 2 Michal Skrivanek 2013-04-23 08:50:13 UTC
Actually, you need both bug 951576 and bug 951576. They are now fixed; please reopen if the attached vdsm.log proves other issues.

*** This bug has been marked as a duplicate of bug 951576 ***