+++ This bug was initially created as a clone of Bug #1003588 +++

Description of problem:
VM is not being automatically unpaused after an I/O error (iSCSI environment). A VM that goes to "paused" state because of I/O problems should be automatically unpaused once the I/O problem is solved. This does not happen in an iSCSI environment.

Version-Release number of selected component (if applicable):
is11. vdsm-4.12.0-72.git287bb7e.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a thin provisioned disk on iSCSI storage.
2. Install an OS on the VM.
3. Connect to the VM and start writing data to its disk (create a big file).
4. Block the connection to the storage domain from the SPM host while data is being written to the VM's disk.
5. See that the VM goes to "paused" state.
6. Resume the connection from the SPM host to the storage domain.

Actual results:
VM stays in "paused" mode.

Expected results:
Once the connection to the storage domain is resumed, the VM turns from "paused" to "active".

Additional info:
vdsm log attached.

--- Additional comment from Elad on 2013-09-24 10:36:34 EDT ---

This feature works for me. I checked on is16.

--- Additional comment from Ayal Baron on 2013-10-09 08:09:22 EDT ---

Auto resume depends on domain monitoring (a failed domain coming back up causes VMs to be unpaused). The situation here looks like the domain monitoring for this domain stopped for some reason (so naturally the VM wouldn't be resumed).
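To make the mechanism from the comment above concrete, here is a minimal, hedged Python sketch of the intended auto-resume flow: a domain monitor tracks each storage domain's validity, and on a transition from invalid back to valid it invokes a registered callback that resumes VMs paused with EIO. All class and function names here are illustrative simplifications, not actual vdsm code (the log below only shows that vdsm registers clientIF.contEIOVms as a domain state change callback).

```python
# Hypothetical sketch of the auto-resume mechanism (not actual vdsm code).

class DomainMonitor:
    def __init__(self):
        self._valid = {}      # domain UUID -> last observed validity
        self._callbacks = []  # state-change callbacks (e.g. contEIOVms)

    def register_state_change_callback(self, func):
        # vdsm does something similar at startup via
        # registerDomainStateChangeCallback (see the log lines below).
        self._callbacks.append(func)

    def on_check_result(self, domain_uuid, valid):
        # Called by the per-domain monitoring loop after each health check.
        # Only a change from a previously observed state fires the callbacks.
        previous = self._valid.get(domain_uuid)
        self._valid[domain_uuid] = valid
        if previous is not None and previous != valid:
            for func in self._callbacks:
                func(domain_uuid, valid)


class Vm:
    def __init__(self, vm_id, paused_on, pause_code=None):
        self.vm_id = vm_id
        self.paused_on = paused_on  # UUID of the domain that caused the pause
        self.pause_code = pause_code
        self.state = "Paused" if pause_code else "Up"

    def cont(self):
        # Resume the guest (in vdsm this would go through libvirt).
        self.state = "Up"
        self.pause_code = None


def cont_eio_vms(vms, domain_uuid, valid):
    # Simplified analogue of clientIF.contEIOVms: when a domain becomes
    # valid again, resume the VMs that were EIO-paused on that domain.
    if not valid:
        return
    for vm in vms:
        if vm.pause_code == "EIO" and vm.paused_on == domain_uuid:
            vm.cont()
```

The key design point for this bug: the callback only fires on an observed state *change*, so it depends on the domain monitor having been running while the domain was down.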
--- Additional comment from Katarzyna Jachim on 2013-10-16 06:05:10 EDT ---

From what I have noticed:

* automatic unpausing doesn't work if we have an environment with only one host and one storage domain and we block the connection between them - in such a case, after unblocking the connection, the DC is eventually up, but the VM stays paused
* if we have a DC with one storage domain and two hosts and the VM is running on the non-SPM host, unpausing works fine
* if we have a DC with one storage domain and two hosts and the VM is running on the SPM host, then automatic unpausing doesn't work

--- Additional comment from Federico Simoncelli on 2013-11-15 06:10:56 EST ---

Relevant log lines:

MainThread::INFO::2013-09-02 15:06:10,303::vdsm::101::vds::(run) (PID: 1957) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev purple-vds3.qa.lab.tlv.redhat.com (2.6.32-358.11.1.el6.x86_64)
MainThread::INFO::2013-09-02 15:06:13,081::logUtils::44::dispatcher::(wrapper) Run and protect: registerDomainStateChangeCallback(callbackFunc=<bound method clientIF.contEIOVms of <clientIF.clientIF instance at 0x7f0f0d2c8128>>)
Thread-22::INFO::2013-09-02 15:06:34,437::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='5849b030-626e-47cb-ad90-3ce782d831b3', hostID=1, scsiKey='5849b030-626e-47cb-ad90-3ce782d831b3', msdUUID='df72dcde-837d-4a82-8427-a7e77d2e239e', masterVersion=1, options=None)
libvirtEventLoop::INFO::2013-09-02 15:06:34,563::vm::4142::vm.Vm::(_onAbnormalStop) vmId=`de8f50ba-de90-45c4-84c8-66cb5bb2ffd1`::abnormal vm stop device virtio-disk0 error eio
Thread-28::DEBUG::2013-09-02 15:06:42,806::BindingXMLRPC::981::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '29516', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '1654317978148629509', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'EIO', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:0e:00:a7', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': 'de8f50ba-de90-45c4-84c8-66cb5bb2ffd1', 'displayType': 'qxl', 'cpuUser': '-0.21', 'disks': {u'vda': {'truesize': '1073741824', 'apparentsize': '1073741824', 'imageID': '29b702bf-4830-4856-b800-fabba21b6361'}, u'hdc': {'truesize': '0', 'apparentsize': '0'}}, 'monitorResponse': '0', 'statsAge': '1.81', 'elapsedTime': '1093', 'vmType': 'kvm', 'cpuSys': '0.77', 'appsList': [], 'guestIPs': ''}]}
Thread-170::DEBUG::2013-09-02 15:10:07,880::domainMonitor::151::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for df72dcde-837d-4a82-8427-a7e77d2e239e
MainThread::INFO::2013-09-02 15:28:52,873::vdsm::101::vds::(run) (PID: 15010) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev purple-vds3.qa.lab.tlv.redhat.com (2.6.32-358.11.1.el6.x86_64)
MainThread::INFO::2013-09-02 15:28:54,055::logUtils::44::dispatcher::(wrapper) Run and protect: registerDomainStateChangeCallback(callbackFunc=<bound method clientIF.contEIOVms of <clientIF.clientIF instance at 0x7f8827ea5128>>)
Thread-86::INFO::2013-09-02 15:30:40,432::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='5849b030-626e-47cb-ad90-3ce782d831b3', hostID=1, scsiKey='5849b030-626e-47cb-ad90-3ce782d831b3', msdUUID='df72dcde-837d-4a82-8427-a7e77d2e239e', masterVersion=1, options=None)
Thread-89::DEBUG::2013-09-02 15:30:48,200::domainMonitor::151::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for df72dcde-837d-4a82-8427-a7e77d2e239e
Thread-86::INFO::2013-09-02 15:30:49,698::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

It seems to me that when the host
is the SPM and the blocked domain is the master, what happens is:

- VM is up
- connection to the master domain is blocked
- vdsm is restarted (in order to release the SPM role)
- the domain state change callback is registered (but the pool is not connected)
- connectStoragePool keeps failing
- the connection to the master domain is unblocked
- connectStoragePool succeeds, but there is no event about the domains being accessible

I think this goes back to the old discussion about having a "domain is up" event also on connection.

With regard to comment 3, I'd add another test:

- one host only (SPM), two storage domains, block the connection to the non-master domain

In such a case the resume should work with the current code. I also think that this issue is not limited to iSCSI.

------------------------------

This bug was solved in oVirt 3.3 and re-broken in master.
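Federico's sequence above can be sketched in Python to show why no resume event fires in the SPM/master case. The names are hypothetical (not actual vdsm code): the point is that a freshly restarted monitor has no previous state for the domain, so its first check already sees the domain as valid and no invalid-to-valid transition is ever observed. The `emit_event_on_first_result` flag models the proposed "domain is up" event on connection.

```python
# Hedged sketch (hypothetical names, not actual vdsm code) of the failure
# sequence: vdsm restarts while the master domain is blocked, the domain
# is unblocked, and only then does monitoring start.

class DomainMonitor:
    def __init__(self, emit_event_on_first_result=False):
        self._valid = {}
        self._callbacks = []
        # Proposed fix from the comment above: also emit a "domain is up"
        # event on connection, i.e. on the first monitoring result,
        # not only on state transitions.
        self._emit_on_first = emit_event_on_first_result

    def register(self, func):
        self._callbacks.append(func)

    def on_check_result(self, uuid, valid):
        known = uuid in self._valid
        changed = self._valid.get(uuid) != valid
        self._valid[uuid] = valid
        if changed and (known or self._emit_on_first):
            for func in self._callbacks:
                func(uuid, valid)


events = []

# Current behavior: fresh monitor after the vdsm restart, first check
# already finds the domain back up -> no event, VMs stay paused.
mon = DomainMonitor()
mon.register(lambda u, v: events.append((u, v)))
mon.on_check_result("master-sd", True)

# With a "domain is up" event on connection, the callback fires and the
# EIO-paused VMs can be resumed.
fixed = DomainMonitor(emit_event_on_first_result=True)
fixed.register(lambda u, v: events.append((u, v)))
fixed.on_check_result("master-sd", True)
```

This also matches comment 3: on a non-SPM host the monitor keeps running across the outage, so it does observe the invalid-to-valid transition and the resume works.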
oVirt 3.4.0 alpha has been released, including the fix for this issue.
This is an automated message.

oVirt 3.4.1 has been released:
* should fix your issue
* should be available at your local mirror within two days

If problems still persist, please make note of it in this bug report.