Bug 955432 - vdsmd is unstable on a hypervisor and crashes frequently
Summary: vdsmd is unstable on a hypervisor and crashes frequently
Keywords:
Status: CLOSED DUPLICATE of bug 951576
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Nobody's working on this, feel free to take it
QA Contact:
URL:
Whiteboard: network
Depends On:
Blocks:
Reported: 2013-04-23 03:48 UTC by Kevein Liu
Modified: 2018-12-01 15:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-23 08:50:13 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:



Description Kevein Liu 2013-04-23 03:48:30 UTC
Description of problem:
vdsmd is unstable on a hypervisor and crashes frequently.
As a result, rhevm loses its connection to the hypervisor and all the VMs go into a non-responsive state. vdsmd appears to have been killed by a SEGV signal:
~~~
Apr 22 10:24:26 h-p-27 kernel: vdsm[39698]: segfault at 20 ip 00000036dac09220 sp 00007f7ae61f9da8 error 4 in libpthread-2.12.so[36dac00000+17000]
~~~
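For triage, the kernel line above can be broken down mechanically. The following is a minimal sketch, not part of the original report: the function name `parse_segfault` and the regular expression are illustrative assumptions, and the only input is the log line quoted above. It extracts the faulting address and computes the instruction pointer's offset inside the mapped library.

```python
import re

# Kernel segfault line copied from the report above.
LINE = ("Apr 22 10:24:26 h-p-27 kernel: vdsm[39698]: segfault at 20 "
        "ip 00000036dac09220 sp 00007f7ae61f9da8 error 4 "
        "in libpthread-2.12.so[36dac00000+17000]")

# Hypothetical pattern for this message layout; all numbers are hexadecimal
# except the PID.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfault(line):
    """Return a dict describing the segfault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "error_code": int(m.group("err"), 16),
        "object": m.group("obj"),
    }
    if m.group("base"):
        # Offset of the faulting instruction relative to the mapping base.
        info["offset"] = info["ip"] - int(m.group("base"), 16)
    return info


info = parse_segfault(LINE)
print(info["object"], hex(info["offset"]), hex(info["fault_addr"]))
```

For the line in this report the offset is 0x9220 into libpthread-2.12.so and the faulting address is 0x20. On x86, error code 4 denotes a user-mode read of a page that is not present, which is consistent with a read through a near-NULL pointer. The offset can be resolved to a symbol with a tool such as addr2line, assuming matching debuginfo for the library is available.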
Other useful information:
~~~
Apr 22 15:02:12 h-p-27 abrt: detected unhandled Python exception in '/usr/share/vdsm/nwfilter.pyc'
Apr 22 15:02:12 h-p-27 abrt: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory
Apr 22 15:02:12 h-p-27 init: libvirtd main process (34563) killed by SEGV signal
Apr 22 15:02:12 h-p-27 init: libvirtd main process ended, respawning
~~~
~~~
Thread-266::ERROR::2013-04-22 14:06:36,837::utils::416::vm.Vm::(collect) vmId=`5e614c36-37e0-42ea-a7bc-9c65bb2c977b`::Stats function failed: <AdvancedStatsFunction _sampleNet at 0x15cd298>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 168, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error received hangup / error event on socket
Thread-194::ERROR::2013-04-22 14:06:36,838::utils::416::vm.Vm::(collect) vmId=`c19ea273-40b3-4df1-92e8-d343bc0c03d9`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x15cd1b8>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 137, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1849, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: internal error client socket is closed
~~~

Version-Release number of selected component (if applicable):
* libvirt-0.10.2-18.el6_4.2.x86_64
* vdsm-4.10.2-1.8.el6ev.x86_64
* rhevm-3.1.0-43.el6ev.noarch

How reproducible:
The issue occurs only occasionally, so we don't know how to reproduce it at this time.

Steps to Reproduce:
N/A
  
Actual results:
vdsmd and libvirtd crashed, and rhevm lost its connection to the hypervisor.

Expected results:
vdsmd and libvirtd keep running without crashing.

Additional info:
The customer has uploaded the log collector output to our FTP server:
ftp://dropbox.redhat.com/sosreport-LogCollector-m3-20130422141119-993b.tar.xz

Comment 1 Yaniv Bronhaim 2013-04-23 07:50:19 UTC
Bug 951576 seems to refer to the same issue. I would prefer to see the vdsm log to verify that it's the same case, but judging from the exception you copied, the problem is that vdsm doesn't perform self-fencing when the socket to libvirt breaks (probably because of the same libvirt keepalive or SIGABRT bugs). The patch that fixes Bug 951576 should solve this case too.

Please attach the host's vdsm log.
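
To illustrate what self-fencing on a broken libvirt socket means here, the following is a rough sketch only. It is not the patch for Bug 951576 and not vdsm's actual code: the names `fence_on_broken_connection`, `BROKEN_SOCKET_HINTS` and `_terminate_self` are hypothetical, and matching on message text is an assumption made so the example runs without libvirt installed (real code would more likely inspect libvirt error codes). The idea is that a wrapper around libvirt calls detects a broken-connection error and asks the process to exit so that it can be restarted with a fresh connection.

```python
import functools
import os
import signal

# Message fragments taken from the libvirtError lines quoted in the description.
# Assumption: text matching stands in for a proper error-code check.
BROKEN_SOCKET_HINTS = (
    "client socket is closed",
    "received hangup / error event on socket",
)


def _terminate_self():
    """Default fencing action (hypothetical): ask the current process to exit."""
    os.kill(os.getpid(), signal.SIGTERM)


def fence_on_broken_connection(fence=_terminate_self, exc_type=Exception):
    """Decorator factory: call `fence` when the wrapped call fails because
    the connection looks broken, then re-raise the original error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exc_type as e:
                if any(hint in str(e) for hint in BROKEN_SOCKET_HINTS):
                    fence()
                raise
        return wrapper
    return decorator
```

A caller would wrap each libvirt-facing function with `@fence_on_broken_connection()`. Passing a different `fence` callable, as a test might, avoids actually signalling the process.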

Comment 2 Michal Skrivanek 2013-04-23 08:50:13 UTC
Actually, you need both bug 951576 and bug 951576. They are now fixed; please reopen if the attached vdsm.log reveals other issues.

*** This bug has been marked as a duplicate of bug 951576 ***

