+++ This bug was initially created as a clone of Bug #1063336 +++

Description of problem:

Environment:
- Setup with 2 vms (1 disk each), 2 storage domains.
- Each vm has a disk on a different SD.
- First vm with a snapshot.
- Second vm has the first vm's snapshot disk attached (through the backup API).

When blocking the connection from the host the vms are running on to the storage domain that contains the first vm's disk (i.e. the vm with the snapshot), the second vm should be paused due to EIO. The problem is that after unblocking the connection, trying to run the paused vm fails with an error.

* In the vdsm logs, trying to resume the paused vm also fails with an error:
-->

* Powering off the paused vm works, but then trying to start it fails with the following error:
'Changed state to Down: Requested operation is not valid: domain 'vm_0' is already active'.

Output **before** shutting down the paused vm:

# vdsClient -s 0 list
ba8a2610-aa49-4505-ab12-ca6b9ac724a3  19207  vm_0  Paused

* After powering off the paused vm, libvirt still shows *both* vms as running:

# virsh -r list
 Id    Name                           State
----------------------------------------------------
 27    vm_0                           running
 29    vm_1                           running

However, vdsm reports that there are no running vms, and so does the engine/UI:

# vdsClient -s 0 list
None

Restarting vdsm on the host solves the problem (vdsm kills both qemu processes after recovering from the restart).

Version-Release number of selected component (if applicable):
vdsm-4.14.1-2.el6.x86_64
libvirt-0.10.2-29.el6.1.x86_64
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:

Environment:
- Setup with 2 vms (1 disk each), 2 storage domains (on the same storage server).
- Each vm has a disk on a different SD.
- First vm with a snapshot.
- Second vm has the first vm's snapshot disk attached (through the backup API).

1. Run both vms on the same host
2. Block the connection from the host to the storage domain containing the first vm's disk
3. Both vms should enter the Paused state
4. Unblock the connection
5. Try to resume the vms, or power them off and rerun them

Actual results:
Step 5 fails as described above

Expected results:
The vms should recover from the Paused state once the connection is unblocked

Additional info:

--- Additional comment from Francesco Romani on 2014-02-10 12:33:07 EST ---

Hi, are the attached logs the complete ones?
Could you please provide some timing information for the steps you executed:
- at which time was the connection unblocked?
- at which time did you try to resume the VMs?

--- Additional comment from on 2014-02-12 04:57:22 EST ---

--- Additional comment from on 2014-02-12 04:57:44 EST ---

Hi, I attached new clean logs:
- connection was blocked at: 11:29:39
- VMs were paused at: 11:35:45
- connection was unblocked at: 11:36
- resumed the VMs at: 11:37:56
- powered off the VMs at: 11:38:28
- reran the VMs at: 11:39:41

--- Additional comment from Francesco Romani on 2014-02-12 05:23:42 EST ---

Thanks, the information you provided was very helpful and let me find the problem quickly.

The root cause of the issue is the following:

* If, in the recovery path, VDSM is unable to access any of the volumes, it raises an exception (see [1] below).
* The exception is handled, but the lookup of the domain in libvirt is skipped (it should happen just after the buildConfDevices step).
* As a result, a Vm object is wrongly registered inside VDSM, but without the established connection to libvirt, which is represented by its _dom attribute.
* This in turn causes the failure of the vmCont operation (see [2] below) and the misalignment of state between VDSM and libvirt.

Log excerpts demonstrating the above:

[1]

Thread-13::ERROR::2014-02-12 11:32:34,718::vm::1991::vm.Vm::(_normalizeVdsmImg) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Unable to get volume size for ede4f955-549c-4f97-b103-127ee2b3672d
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1987, in _normalizeVdsmImg
    drv['truesize'] = res['truesize']
KeyError: 'truesize'
Thread-13::INFO::2014-02-12 11:32:34,719::vm::2244::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2233, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3093, in _run
    devices = self.buildConfDevices()
  File "/usr/share/vdsm/vm.py", line 2061, in buildConfDevices
    self._normalizeVdsmImg(drv)
  File "/usr/share/vdsm/vm.py", line 1993, in _normalizeVdsmImg
    drv['volumeID'])
RuntimeError: Volume ede4f955-549c-4f97-b103-127ee2b3672d is corrupted or missing
Thread-13::DEBUG::2014-02-12 11:32:34,720::vm::2247::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::_ongoingCreations released

[2]

Thread-166::DEBUG::2014-02-12 11:37:56,327::BindingXMLRPC::970::vds::(wrapper) client [10.35.161.54]::call vmCont with ('d200dff9-5656-402f-8cc4-9e1a5c68e938',) {} flowID [28e177e]
Thread-166::ERROR::2014-02-12 11:37:56,327::BindingXMLRPC::989::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 973, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 209, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 154, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 2587, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/vm.py", line 3726, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'

--- Additional comment from Francesco Romani on 2014-02-12 06:13:17 EST ---

It is worth noting that VDSM was restarted while the storage was blocked/unavailable (following the times you provided), which then triggered the recovery and everything that followed:

MainThread::DEBUG::2014-02-12 11:30:45,946::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-02-12 11:30:45,946::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::INFO::2014-02-12 11:30:46,898::vmChannels::183::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2014-02-12 11:30:46,898::momIF::84::MOM::(stop) Shutting down MOM
MainThread::INFO::2014-02-12 11:30:46,899::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2014-02-12 11:30:46,901::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/01a745ee-26ee-4e73-a99a-05e791e9a5eb/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,902::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/74ee9bdc-d6cd-49af-bbd4-4e8e2a009a25/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,947::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-02-12 11:30:46,947::clientIF::207::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2014-02-12 11:30:47,294::domainMonitor::125::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2014-02-12 11:30:47,295::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
Thread-36::DEBUG::2014-02-12 11:30:47,295::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
VM Channels Listener::INFO::2014-02-12 11:30:47,595::vmChannels::178::vds::(run) VM channels listener thread has ended.
Thread-37::DEBUG::2014-02-12 11:30:53,137::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 0639c74b-0a19-47c4-919a-105902edefbc
MainThread::INFO::2014-02-12 11:31:27,034::vdsm::119::vds::(run) (PID: 10360) I am the actual vdsm 4.14.1-2.el6 aqua-vds5.qa.lab.tlv.redhat.com (2.6.32-431.1.2.el6.x86_64)

Lowering the priority

--- Additional comment from Francesco Romani on 2014-02-13 03:41:39 EST ---

draft patch posted on gerrit
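[Editorial note] To make the failure mode in the analysis above easier to follow, here is a minimal, self-contained Python sketch of the described control flow. It is not the actual vdsm code; the class and method names only mirror the ones that appear in the tracebacks in [1] and [2]:

# Simplified sketch of the recovery flow described above (not vdsm code).
class FakeVm(object):
    def __init__(self, recovery=False):
        self.recovery = recovery
        self._dom = None              # libvirt domain handle, set by the lookup step

    def buildConfDevices(self):
        # Storage is unreachable, so normalizing the drive config fails.
        raise RuntimeError("Volume ... is corrupted or missing")

    def _lookupDomain(self):
        # Stands in for the libvirt lookup that should populate self._dom.
        self._dom = object()

    def _startUnderlyingVm(self):
        try:
            self.buildConfDevices()
            self._lookupDomain()      # never reached when the line above raises
        except Exception:
            if not self.recovery:
                raise
            # "Skipping errors on recovery": the VM stays registered,
            # but self._dom is still None.

    def cont(self):
        # Mirrors _underlyingCont(): dereferences self._dom unconditionally.
        return self._dom.XMLDesc(0)

vm = FakeVm(recovery=True)
vm._startUnderlyingVm()               # exception swallowed, _dom never set
try:
    vm.cont()
except AttributeError as exc:
    print("vmCont failed: %s" % exc)  # 'NoneType' object has no attribute 'XMLDesc'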
backport of the fix to branch 3.3
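[Editorial note] The content of the gerrit patch is not shown here. Purely as a hypothetical illustration of the kind of guard that would address the symptom in excerpt [2] (an assumption, not necessarily what the actual fix does), vmCont could refuse to operate on a VM whose libvirt connection was never established instead of dereferencing a None _dom:

# Hypothetical guard, for illustration only -- not the actual gerrit patch.
def cont(vm):
    if vm._dom is None:
        # Return a proper error to the caller instead of crashing with an
        # AttributeError deep inside _underlyingCont().
        return {'status': {'code': 1,
                           'message': 'VM has no established libvirt domain'}}
    return vm._underlyingCont()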
This is an automated message. Moving to Closed CURRENTRELEASE since oVirt 3.3.4 has been released.