Description of problem:

vdsmd is unstable on a hypervisor and crashes frequently. As a result, rhevm loses its connection to the hypervisor and all the VMs go into a non-responsive state.

vdsmd appears to have been killed by a SEGV signal:

~~~
Apr 22 10:24:26 h-p-27 kernel: vdsm[39698]: segfault at 20 ip 00000036dac09220 sp 00007f7ae61f9da8 error 4 in libpthread-2.12.so[36dac00000+17000]
~~~

Other relevant log entries:

~~~
Apr 22 15:02:12 h-p-27 abrt: detected unhandled Python exception in '/usr/share/vdsm/nwfilter.pyc'
Apr 22 15:02:12 h-p-27 abrt: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory
Apr 22 15:02:12 h-p-27 init: libvirtd main process (34563) killed by SEGV signal
Apr 22 15:02:12 h-p-27 init: libvirtd main process ended, respawning
~~~

~~~
Thread-266::ERROR::2013-04-22 14:06:36,837::utils::416::vm.Vm::(collect) vmId=`5e614c36-37e0-42ea-a7bc-9c65bb2c977b`::Stats function failed: <AdvancedStatsFunction _sampleNet at 0x15cd298>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 168, in _sampleNet
    netSamples[nic.name] = self._vm._dom.interfaceStats(nic.name)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error received hangup / error event on socket
Thread-194::ERROR::2013-04-22 14:06:36,838::utils::416::vm.Vm::(collect) vmId=`c19ea273-40b3-4df1-92e8-d343bc0c03d9`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x15cd1b8>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 137, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 515, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1849, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: internal error client socket is closed
~~~

Version-Release number of selected component (if applicable):

* libvirt-0.10.2-18.el6_4.2.x86_64
* vdsm-4.10.2-1.8.el6ev.x86_64
* rhevm-3.1.0-43.el6ev.noarch

How reproducible:

The issue occurs only occasionally, so we do not know how to reproduce it at this time.

Steps to Reproduce:

N/A

Actual results:

vdsmd and libvirtd crashed, and rhevm lost its connection to the hypervisor.

Expected results:

vdsmd and libvirtd keep running normally.

Additional info:

The customer has uploaded the log-collector archive to our FTP server:
ftp://dropbox.redhat.com/sosreport-LogCollector-m3-20130422141119-993b.tar.xz
Bug 951576 seems to refer to the same issue. I would prefer to see the vdsm log to verify that it is the same case, but judging by the exception you copied, the problem is that vdsm does not perform self-fencing when the socket breaks (probably because of the same libvirt keepalive or SIGABRT bugs). The patch that fixes Bug 951576 should solve this case too. Please attach the host's vdsm log.
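To illustrate what "self-fencing" on a broken socket could look like, here is a minimal sketch. It is not the patch from Bug 951576 and not vdsm's actual code. The names `self_fencing`, `is_broken_socket`, `terminate_self`, and `BROKEN_SOCKET_HINTS` are hypothetical, and matching on the error text is an assumption made only to keep the example self-contained; a real implementation would more likely inspect the libvirt error code and domain.

```python
import os
import signal
from functools import wraps

# Message fragments copied from the tracebacks in this report.
# Matching on text is an assumption made for this sketch only.
BROKEN_SOCKET_HINTS = (
    "received hangup / error event on socket",
    "client socket is closed",
)


def is_broken_socket(exc):
    """Return True when the error text matches one of the known hints."""
    message = str(exc)
    return any(hint in message for hint in BROKEN_SOCKET_HINTS)


def self_fencing(on_broken):
    """Decorator factory: run on_broken() when a wrapped call hits a dead socket.

    The original exception is always re-raised so callers still see it.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                if is_broken_socket(exc):
                    on_broken()
                raise
        return wrapper
    return decorator


def terminate_self():
    """Hypothetical fencing action: send SIGTERM to our own process.

    The idea is that the service manager then restarts the daemon with
    a fresh connection instead of leaving it running in a broken state.
    """
    os.kill(os.getpid(), signal.SIGTERM)
```

In this sketch a stats function such as `_sampleNet` would be wrapped with `@self_fencing(terminate_self)`, so the first call that fails with one of the socket errors above ends the process rather than logging the same failure on every sampling interval.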
Actually, you need both bug 951576 and bug 951576. They are now fixed; please reopen if the attached vdsm.log shows other issues.

*** This bug has been marked as a duplicate of bug 951576 ***