Bug 878778 - engine [RACE]: cancel migration will fail because domain no longer exists in src by the time cancel is sent
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: unspecified
Hardware: x86_64 Linux
Priority: medium  Severity: medium
Target Milestone: ---
Target Release: 3.2.0
Assigned To: Roy Golan
QA Contact: Barak Dagan
Whiteboard: virt
Depends On: 922490
Blocks: 915537
 
Reported: 2012-11-21 03:32 EST by Dafna Ron
Modified: 2014-07-13 19:18 EDT
CC List: 12 users

See Also:
Fixed In Version: sf6
Doc Type: Bug Fix
Doc Text:
Cancelling a migration failed when the virtual machine had already migrated to the destination host or when the virtual machine crashed on the source host. When this happened, VDSM returned the error "Virtual machine does not exist" which was neither descriptive nor helpful. Now, when cancelling a migration fails, the user is prompted to try again and to check the virtual machine logs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 17:22:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---

Attachments
logs (987.21 KB, application/x-gzip)
2012-11-21 03:32 EST, Dafna Ron


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 11225 None None None Never

Description Dafna Ron 2012-11-21 03:32:02 EST
Created attachment 649110 [details]
logs

Description of problem:

We fail to cancel the migration because, by the time the engine sends the command to VDSM, libvirt has already destroyed the domain on the source.

VM Mig is down. Exit message: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'.

Version-Release number of selected component (if applicable):

si24.4
vdsm-4.9.6-44.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

How reproducible:

race

Steps to Reproduce:
1. Send cancel migration right as the VM finishes migrating in VDSM.

Actual results:

We fail to cancel the migration with a "domain does not exist" error.

Expected results:

If we cannot solve the race, I suggest creating a better message for the user.

Additional info: logs

engine: 

2012-11-21 10:10:56,624 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Error code noVM and error message VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist
2012-11-21 10:10:56,625 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Command org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         1
mMessage                      Virtual machine does not exist



vdsm:

libvirtEventLoop::DEBUG::2012-11-21 10:10:53,978::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/service ksmtuned retune' (cwd None)
Thread-35620::ERROR::2012-11-21 10:10:53,989::vm::631::vm.Vm::(_startUnderlyingVm) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 611, in _startUnderlyingVm
    self._waitForIncomingMigrationFinish()
  File "/usr/share/vdsm/libvirtvm.py", line 1650, in _waitForIncomingMigrationFinish
    self._connection.lookupByUUIDString(self.id),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2682, in lookupByUUIDString
    if ret is None:raise libvirtError('virDomainLookupByUUIDString() failed', conn=self)
libvirtError: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
Thread-35620::DEBUG::2012-11-21 10:10:53,995::vm::969::vm.Vm::(setDownStatus) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::Changed state to Down: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
Comment 1 Michal Skrivanek 2012-12-13 07:39:54 EST
We should do better with the message: either change VDSM to report a meaningful, user-oriented explanation (like "migration most likely failed") or add the engine's interpretation on top?
Comment 2 Roy Golan 2012-12-13 10:02:28 EST
It's not a perfect solution, but maybe the least we can do.

just to add:
1. It's definitely not a race we should solve (we shouldn't stop the world on that one).
2. Cross-component error reporting: returned errors are not parameterized (just a string) and therefore cannot be interpolated, internationalized, or parsed reliably enough to build logic on, so we are left with concatenating messages on top of them.
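
A rough illustration of the limitation in point 2, in Python rather than the engine's actual Java: the structures and field names below are hypothetical, but they show why a free-text message can only be matched or concatenated, while a parameterized error could carry stable fields to build logic and i18n on.

# Hypothetical sketch; these dicts are illustrative, not real engine/VDSM types.

# What effectively comes back today (compare the engine log above): a numeric
# code plus a free-text message.
string_error = {"code": 1, "message": "Virtual machine does not exist"}

# Fragile: any rewording on the VDSM side silently breaks this check, and the
# text cannot be translated or re-parameterized for the UI.
if "does not exist" in string_error["message"]:
    user_message = "Could not cancel migration: " + string_error["message"]

# A parameterized error (hypothetical) would expose stable fields instead.
parameterized_error = {
    "key": "VM_NOT_FOUND",
    "params": {"vmId": "a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e"},
}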
Comment 3 Simon Grinberg 2012-12-24 12:13:57 EST
(In reply to comment #1)
> we should do better for the message. Either change VDSM to report meaningful
> user-oriented explanation (like "migration most likely failed") or add
> engine's interpretation on top?

But migration did succeed in this case.

There are two cases when libvirt will provide this error:
1. The source VM had crashed.
2. The migration just completed. 

In both cases the cancel migration has failed. What needs to be done is for the backend, on error, to query both hosts for this VM.

If the VM is found on the destination, all is well; print "could not cancel migration".
If the VM is not found on either host, it is actually a VM failure and should be reported as such and, if required, given rerun treatment.
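
A minimal sketch, in Python rather than engine code, of the check proposed above, assuming the destination host is known (comment 4 below explains it is not readily available); find_vm_on_host() and report_audit_log() are hypothetical helpers, not real engine or VDSM APIs.

def handle_cancel_migration_failure(vm_id, dst_host):
    # Hypothetical helpers: find_vm_on_host(host, vm_id) -> bool,
    # report_audit_log(message) -> None.
    if find_vm_on_host(dst_host, vm_id):
        # The migration already completed; the cancel simply arrived too late.
        report_audit_log("Could not cancel migration: VM %s is already "
                         "running on %s" % (vm_id, dst_host))
    else:
        # Gone from the source (that lookup already failed) and not on the
        # destination: treat as a VM failure and apply rerun treatment.
        report_audit_log("VM %s failed during migration" % vm_id)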
Comment 4 Roy Golan 2012-12-27 09:50:18 EST
(In reply to comment #3)
> (In reply to comment #1)
> > we should do better for the message. Either change VDSM to report meaningful
> > user-oriented explanation (like "migration most likely failed") or add
> > engine's interpretation on top?
> 
> But migration did succeed in this case.
> 
> There are two cases when libvirt will provide this error:
> 1. The source VM had crushed. 
> 2. The migration just completed. 
> 
> In both cases the cancel migration has failed. What needs to be done is for
> the backend, on error query both hosts for this VMs. 
> 
> If the VM found on the destination, all is well, print could not cancel
> migration.

The cancel migration action doesn't have a context, and the VM dynamic data itself is cleaned up very quickly, so I can't tell the destination host for sure.

> If the VM not found on both, it's actually VM failure and should be reported
> as such and if required given rerun treatment.
Comment 5 Simon Grinberg 2012-12-27 11:15:20 EST
(In reply to comment #4)
> the cancel migration action doesn't have a context and the VM dynamic itself
> is cleaned very quickly so I cant tell the destination host for sure.

I'll rephrase:
- If cancel migration is sent, and the VM is not on the source host but is found on another host: cancel migration failed. (Hide the "domain not found" error.)
- If cancel migration is sent and the VM is not found on any host: it's actually a VM failure and should be reported as such and, if required, given rerun treatment.

An alternative: have VDSM register the migration-succeeded event and not forward the error.
Comment 6 Roy Golan 2012-12-30 03:22:03 EST
> - If cancel migration is sent, and the VM is not on the source host, but is
> found on another host - Cancel migration failed. (Hide the domain not found
> error)
> - If cancel migration is send and VM not found on any host - it's actually
> VM failure and should be reported as such and if required given rerun
> treatment.
It's problematic: the VM could already be in the rerun process, and the current behaviour does not cancel the engine migration process, only the migration task on the target host.
So in this case the error that the migration failed because the domain doesn't exist is clear and true.

> An alternative, have the VDSM register the migration succeed event and do
> not forward the error.
Comment 15 Barak Dagan 2013-04-11 11:51:03 EDT
Verified on SF13:
60 migrations were canceled; the time between start and cancel varied from 0.1 to 6 seconds in 0.1-second steps, using the following SDK script:

import time  # needed for time.sleep(); missing from the original paste

# 'api' is assumed to be an already-initialized ovirtsdk API connection.
vm_list = api.vms.list()
myvm = vm_list[0]
print myvm.name

sleep = 0.1
sleepinterval = 0.1
while sleep < 6:
    print sleep, myvm.name
    myvm.migrate()              # start the migration
    time.sleep(sleep)           # wait 0.1..6 s before cancelling
    myvm.cancelmigration()      # cancel while the migration may already be done
    sleep += sleepinterval
    time.sleep(60)              # let the VM settle before the next round
Comment 18 errata-xmlrpc 2013-06-10 17:22:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html
