Hide Forgot
Description of problem: I am losing connection to the hypervisor from time to time. Version-Release number of selected component (if applicable): How reproducible: I have seen this behaviour few times. Steps to Reproduce: 1. Start a VM on PXE booting and stop it on boot process 2. Switch off the VM from GUI using the stop button 3. After a while the host move to unresponsive Actual results: The VM does not stop and the hypervisor lose communication to the manager. Expected results: The VM stops and we do not lose communication to the manager Additional info: On RHEVM, the VM is being displayed up. On the hypervisor the VM is still running. If I look in vdsm.log, i can see the following : Thread-15::DEBUG::2011-10-12 21:06:39,723::taskManager::69::TaskManager::(prepareForShutdown) Request to stop all tasks Thread-15::DEBUG::2011-10-12 21:06:40,915::task::492::TaskManager.Task::(_debug) Task b7ee3a22-716a-4fe7-a745-c9502f7e112b: finished: None Thread-15::DEBUG::2011-10-12 21:06:40,915::task::492::TaskManager.Task::(_debug) Task b7ee3a22-716a-4fe7-a745-c9502f7e112b: moving from state preparing -> state finished Thread-15::DEBUG::2011-10-12 21:06:40,916::resourceManager::786::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {} Thread-15::DEBUG::2011-10-12 21:06:40,916::resourceManager::821::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {} Thread-15::DEBUG::2011-10-12 21:06:40,916::task::492::TaskManager.Task::(_debug) Task b7ee3a22-716a-4fe7-a745-c9502f7e112b: ref 0 aborting False Thread-15::INFO::2011-10-12 21:06:40,917::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: prepareForShutdown, Return response: {'status': {'message': 'OK', 'code': 0}} Thread-15::ERROR::2011-10-12 21:06:40,917::utils::369::vm.Vm::(collect) vmId=`728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x1a2eb20> Traceback (most recent call last): File "/usr/share/vdsm/utils.py", line 366, in collect statsFunction() File "/usr/share/vdsm/utils.py", line 250, in __call__ retValue = self._function(*args, **kwargs) File "/usr/share/vdsm/libviNrtvm.py", line 85, in _sampleCpu state, maxMem, memory, nrVirtCpu, cpuTime = self._vm._dom.info() File "/usr/share/vdsm/libvirtvm.py", line 327, in f ret = attr(*args, **kwargs) File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper ret = f(*args, **kwargs) File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1406, in info if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self) libvirtError: Cannot write data: Broken pipe Thread-15::ERROR::2011-10-12 21:06:40,919::libvirtconnection::73::vds::(wrapper) connection to libvirt broken. taking vdsm down. Thread-15::DEBUG::2011-10-12 21:06:40,920::clientIF::145::vds::(prepareForShutdown) cannot run prepareForShutdown twice Thread-15::ERROR::2011-10-12 21:06:40,920::utils::369::vm.Vm::(collect) vmId=`728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe`::Stats function failed: <AdvancedStatsFunction _sampleNet at 0x1a2ec00> Traceback (most recent call last): File "/usr/share/vdsm/utils.py", line 366, in collect statsFunction() File "/usr/share/vdsm/utils.py", line 250, in __call__ retValue = self._function(*args, **kwargs) File "/usr/share/vdsm/libvirtvm.py", line 101, in _sampleNet netSamples[vmIface] = self._vm._dom.interfaceStats(vmIface) File "/usr/share/vdsm/libvirtvm.py", line 327, in f ret = attr(*args, **kwargs) File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper ret = f(*args, **kwargs) File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1412, in interfaceStats if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self) libvirtError: Cannot write data: Broken pipe Thread-15::DEBUG::2011-10-12 21:06:40,921::utils::345::vm.Vm::(run) vmId=`728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe`::Stats thread finished In libvirtd.log, i can see the following 21:06:21.708: 30721: debug : virDomainInterfaceStats:6445 : dom=0x7f8e8c001de0, (VM: name=test, uuid=728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe), path=vnet0, stats=0x7f8e9e071330, size=64 21:06:26.710: 30721: debug : virDomainInterfaceStats:6445 : dom=0x7f8e8c001de0, (VM: name=test, uuid=728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe), path=vnet0, stats=0x7f8e9e071330, size=64 21:06:31.712: 30721: debug : virDomainInterfaceStats:6445 : dom=0x7f8e8c001de0, (VM: name=test, uuid=728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe), path=vnet0, stats=0x7f8e9e071330, size=64 21:06:36.714: 30721: debug : virDomainGetInfo:3636 : dom=0x7f8e8c001de0, (VM: name=test, uuid=728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe), info=0x7f8e9e071350 21:06:36.714: 30721: error : virNetSocketWriteWire:938 : Cannot write data: Broken pipe 21:06:40.919: 30721: debug : virDomainInterfaceStats:6445 : dom=0x7f8e8c001de0, (VM: name=test, uuid=728fb1e2-d2e9-40bc-8dea-0ff5ca31b3fe), path=vnet0, stats=0x7f8e9e071330, size=64 21:06:40.919: 30721: error : virNetSocketWriteWire:938 : Cannot write data: Broken pipe In messages, we can see that libvirt is being respawn. Oct 12 21:06:33 rh3 init: libvirtd main process (30609) killed by SEGV signal Oct 12 21:06:33 rh3 init: libvirtd main process ended, respawning If I restart the vdsmd service, my system is connecting back to RHEVM, the GUI is displaying my VM as running and I can use the GUI to kill it. If I check, libvirt and vdsm are running.
Created attachment 527883 [details] sosreport from the hypervisor
Hi Yaniv, Pls find the sosreport as well as the version for libvirt / vdsm. libvirt-0.9.4-12 vdsm-4.9-96.1 Regards, Olivier
libvirt-0.9.4-12? It is know to be broken. Do not use it. I bet you'd find it's cores stacked in /var/log/core. vdsm.log has the following, which makes me suspect you are hitting Bug 744755. Thread-33::INFO::2011-10-12 10:51:36,287::storage_mailbox::434::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail Thread-33::INFO::2011-10-12 10:51:36,287::storage_mailbox::330::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['dd', 'of=/rhev/data-center/9a8843e8-eff2-11e0-b600-0017a4779c0c/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=8'] Thread-33::DEBUG::2011-10-12 10:51:36,288::storage_mailbox::334::Storage.Misc.excCmd::(_sendMail) 'dd of=/rhev/data-center/9a8843e8-eff2-11e0-b600-0017a4779c0c/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=8' (cwd None) Thread-33::DEBUG::2011-10-12 10:51:36,299::storage_mailbox::334::Storage.Misc.excCmd::(_sendMail) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.00145234 s, 2.8 MB/s\n'; <rc> = 0 MainThread::INFO::2011-10-12 10:51:37,079::vdsm::76::vds::(run) VDSM main thread ended. Waiting for 27 other threads... MainThread::INFO::2011-10-12 10:51:37,084::vdsm::79::vds::(run) <_MainThread(MainThread, started 140293428688640)> *** This bug has been marked as a duplicate of bug 744755 ***