Created attachment 649110 [details]
logs

Description of problem:
We fail to cancel migration because, by the time the engine sends the command to VDSM, libvirt has already destroyed the domain on the source.

VM Mig is down. Exit message: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'.

Version-Release number of selected component (if applicable):
si24.4
vdsm-4.9.6-44.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

How reproducible:
Race

Steps to Reproduce:
1. Send cancel migration right when the VM finished migrating in VDSM.

Actual results:
We fail to cancel the migration with a "domain does not exist" error.

Expected results:
If we cannot solve the race, I suggest producing a better message for the user.

Additional info:
Logs attached.

engine:
2012-11-21 10:10:56,624 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Error code noVM and error message VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist
2012-11-21 10:10:56,625 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Command org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus    Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode      1
 mMessage   Virtual machine does not exist

vdsm:
libvirtEventLoop::DEBUG::2012-11-21 10:10:53,978::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/service ksmtuned retune' (cwd None)
Thread-35620::ERROR::2012-11-21 10:10:53,989::vm::631::vm.Vm::(_startUnderlyingVm) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 611, in _startUnderlyingVm
    self._waitForIncomingMigrationFinish()
  File "/usr/share/vdsm/libvirtvm.py", line 1650, in _waitForIncomingMigrationFinish
    self._connection.lookupByUUIDString(self.id),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2682, in lookupByUUIDString
    if ret is None:raise libvirtError('virDomainLookupByUUIDString() failed', conn=self)
libvirtError: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
Thread-35620::DEBUG::2012-11-21 10:10:53,995::vm::969::vm.Vm::(setDownStatus) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::Changed state to Down: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
We should do better with this message. Either change VDSM to report a meaningful, user-oriented explanation (like "migration most likely failed"), or add the engine's interpretation on top?
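A minimal sketch of the kind of translation suggested here, assuming we catch the libvirt error at the lookup that fails in the traceback above; the function name and message wording are hypothetical, not actual VDSM code:

import libvirt

def lookup_or_explain(conn, vm_uuid):
    # Return the domain, or raise an error with a user-oriented message
    # instead of libvirt's raw "Domain not found: no domain with matching uuid".
    try:
        return conn.lookupByUUIDString(vm_uuid)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # The domain was already torn down on this host: the migration
            # most likely completed (or the VM went down) before this call.
            raise RuntimeError('VM %s is no longer defined on this host: '
                               'migration most likely completed or failed'
                               % vm_uuid)
        raise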
It's not a perfect solution, but maybe the least we can do. Just to add:
1. It's definitely not a race we should solve (we shouldn't stop the world on that one).
2. Cross-component error reporting: returned errors are not parametrized (just a string) and therefore cannot be interpolated, internationalized, or parsed reliably enough to build logic on, so we are left with concatenating messages on top of them.
(In reply to comment #1)
> We should do better with this message. Either change VDSM to report a
> meaningful, user-oriented explanation (like "migration most likely failed"),
> or add the engine's interpretation on top?

But the migration did succeed in this case.

There are two cases in which libvirt will produce this error:
1. The source VM has crashed.
2. The migration has just completed.

In both cases the cancel migration has failed. What needs to be done is for the backend, on this error, to query both hosts for the VM.

If the VM is found on the destination, all is well; report that the migration could not be cancelled.
If the VM is not found on either host, it is actually a VM failure and should be reported as such and, if required, given rerun treatment.
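A hedged sketch of that backend decision, with list_vm_ids_on_host standing in for whatever "list" call the backend would use to ask a host which VMs it runs (the helper and its signature are assumptions, not an existing engine or VDSM API):

def classify_cancel_migration_failure(vm_id, source_host, all_hosts,
                                      list_vm_ids_on_host):
    # Decide what to report when cancel-migration fails with 'noVM'.
    for host in all_hosts:
        if host == source_host:
            continue
        if vm_id in list_vm_ids_on_host(host):
            # The VM is already running elsewhere: the migration completed,
            # only the cancellation failed.
            return 'Could not cancel migration: migration already completed'
    # No host reports the VM: treat it as a VM failure and, if required,
    # give it rerun treatment.
    return 'VM failed during migration'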
(In reply to comment #3)
> (In reply to comment #1)
> > We should do better with this message. Either change VDSM to report a
> > meaningful, user-oriented explanation (like "migration most likely failed"),
> > or add the engine's interpretation on top?
>
> But the migration did succeed in this case.
>
> There are two cases in which libvirt will produce this error:
> 1. The source VM has crashed.
> 2. The migration has just completed.
>
> In both cases the cancel migration has failed. What needs to be done is for
> the backend, on this error, to query both hosts for the VM.
>
> If the VM is found on the destination, all is well; report that the
> migration could not be cancelled.

The cancel migration action doesn't have a context, and the VM dynamic data itself is cleaned up very quickly, so I can't tell the destination host for sure.

> If the VM is not found on either host, it is actually a VM failure and
> should be reported as such and, if required, given rerun treatment.
(In reply to comment #4)
> The cancel migration action doesn't have a context, and the VM dynamic data
> itself is cleaned up very quickly, so I can't tell the destination host for
> sure.

I'll rephrase:
- If cancel migration is sent and the VM is not on the source host but is found on another host - cancel migration failed (hide the "domain not found" error).
- If cancel migration is sent and the VM is not found on any host - it is actually a VM failure and should be reported as such and, if required, given rerun treatment.

An alternative: have VDSM register for the migration-succeeded event and not forward the error.
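A rough sketch of that alternative, assuming VDSM listens for libvirt lifecycle events so a domain that disappeared because of a successful migration can be told apart from one that crashed (the bookkeeping here is illustrative only, not VDSM's actual event handling):

import libvirt

migrated_away = set()  # uuids of domains removed by a successful outgoing migration

def _lifecycle_event(conn, dom, event, detail, opaque):
    if (event == libvirt.VIR_DOMAIN_EVENT_STOPPED and
            detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED):
        # Remember that this domain vanished because migration succeeded, so a
        # later cancel/destroy can be answered without forwarding the raw
        # "Domain not found" error.
        migrated_away.add(dom.UUIDString())

def register(conn):
    # Requires a running libvirt event loop (e.g. virEventRegisterDefaultImpl()
    # plus a thread driving virEventRunDefaultImpl()).
    conn.domainEventRegisterAny(None,
                                libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                _lifecycle_event, None)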
> - If cancel migration is sent and the VM is not on the source host but is
>   found on another host - cancel migration failed (hide the "domain not
>   found" error).
> - If cancel migration is sent and the VM is not found on any host - it is
>   actually a VM failure and should be reported as such and, if required,
>   given rerun treatment.

That's problematic: the VM could already be in the rerun process, and the current behaviour does not cancel the engine's migration process, only the target host's migration task. So in this case the error that the migration failed because the domain doesn't exist is clear and true.

> An alternative: have VDSM register for the migration-succeeded event and not
> forward the error.
Verified on SF13: 60 migrations cancelled, with the time between start and cancel varying from 0.1 to 6 seconds in 0.1-second steps, using the following SDK script (api is an already-initialized oVirt SDK connection):

import time

vm_list = api.vms.list()
myvm = vm_list[0]
print myvm.name

sleep = 0.1
sleepinterval = 0.1
while sleep < 6:
    print sleep, myvm.name
    myvm.migrate()
    time.sleep(sleep)
    myvm.cancelmigration()
    sleep += sleepinterval
    time.sleep(60)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0888.html