Created attachment 649110 [details]
logs

Description of problem:
We fail to cancel migration because, by the time the engine sends the command to VDSM, libvirt has already destroyed the domain on the source.

VM Mig is down. Exit message: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'.

Version-Release number of selected component (if applicable):
si24.4
vdsm-4.9.6-44.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

How reproducible:
Race

Steps to Reproduce:
1. Send cancel migration right when the VM finished migrating in VDSM.

Actual results:
We fail to cancel the migration with a "domain does not exist" error.

Expected results:
If we cannot solve the race, I suggest producing a better message for the user.

Additional info:
Logs attached.

engine:
2012-11-21 10:10:56,624 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Error code noVM and error message VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist
2012-11-21 10:10:56,625 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-90) Command org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus    Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode      1
 mMessage   Virtual machine does not exist

vdsm:
libvirtEventLoop::DEBUG::2012-11-21 10:10:53,978::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/service ksmtuned retune' (cwd None)
Thread-35620::ERROR::2012-11-21 10:10:53,989::vm::631::vm.Vm::(_startUnderlyingVm) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 611, in _startUnderlyingVm
    self._waitForIncomingMigrationFinish()
  File "/usr/share/vdsm/libvirtvm.py", line 1650, in _waitForIncomingMigrationFinish
    self._connection.lookupByUUIDString(self.id),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2682, in lookupByUUIDString
    if ret is None:raise libvirtError('virDomainLookupByUUIDString() failed', conn=self)
libvirtError: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
Thread-35620::DEBUG::2012-11-21 10:10:53,995::vm::969::vm.Vm::(setDownStatus) vmId=`a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e`::Changed state to Down: Domain not found: no domain with matching uuid 'a6fac067-d4d9-44ad-b0cd-cdf521bb2a8e'
We should do better with this message. Either change VDSM to report a meaningful, user-oriented explanation (like "migration most likely failed"), or add the engine's interpretation on top?
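A minimal sketch of the kind of translation suggested here, assuming we catch the libvirt error at the lookup that fails in the traceback above; the function name and message wording are hypothetical, not actual VDSM code:

import libvirt

def lookup_or_explain(conn, vm_uuid):
    # Return the domain, or raise an error with a user-oriented message
    # instead of libvirt's raw "Domain not found: no domain with matching uuid".
    try:
        return conn.lookupByUUIDString(vm_uuid)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # The domain was already torn down on this host: the migration
            # most likely completed (or the VM went down) before this call.
            raise RuntimeError('VM %s is no longer defined on this host: '
                               'migration most likely completed or failed'
                               % vm_uuid)
        raise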
It's not a perfect solution, but maybe the least we can do. Just to add:
1. It's definitely not a race we should solve (we shouldn't stop the world on that one).
2. Cross-component error reporting: returned errors are not parametrized (just a string) and therefore cannot be interpolated, internationalized, or parsed reliably enough to build logic on, so we are left with concatenating messages on top of them.
(In reply to comment #1)
> We should do better with this message. Either change VDSM to report a
> meaningful, user-oriented explanation (like "migration most likely failed"),
> or add the engine's interpretation on top?

But the migration did succeed in this case.

There are two cases in which libvirt will produce this error:
1. The source VM has crashed.
2. The migration has just completed.

In both cases the cancel migration has failed. What needs to be done is for the backend, on this error, to query both hosts for the VM.

If the VM is found on the destination, all is well; report that the migration could not be cancelled.
If the VM is not found on either host, it is actually a VM failure and should be reported as such and, if required, given rerun treatment.
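A hedged sketch of that backend decision, with list_vm_ids_on_host standing in for whatever "list" call the backend would use to ask a host which VMs it runs (the helper and its signature are assumptions, not an existing engine or VDSM API):

def classify_cancel_migration_failure(vm_id, source_host, all_hosts,
                                      list_vm_ids_on_host):
    # Decide what to report when cancel-migration fails with 'noVM'.
    for host in all_hosts:
        if host == source_host:
            continue
        if vm_id in list_vm_ids_on_host(host):
            # The VM is already running elsewhere: the migration completed,
            # only the cancellation failed.
            return 'Could not cancel migration: migration already completed'
    # No host reports the VM: treat it as a VM failure and, if required,
    # give it rerun treatment.
    return 'VM failed during migration'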
(In reply to comment #3)
> (In reply to comment #1)
> > We should do better with this message. Either change VDSM to report a
> > meaningful, user-oriented explanation (like "migration most likely failed"),
> > or add the engine's interpretation on top?
>
> But the migration did succeed in this case.
>
> There are two cases in which libvirt will produce this error:
> 1. The source VM has crashed.
> 2. The migration has just completed.
>
> In both cases the cancel migration has failed. What needs to be done is for
> the backend, on this error, to query both hosts for the VM.
>
> If the VM is found on the destination, all is well; report that the
> migration could not be cancelled.

The cancel migration action doesn't have a context, and the VM dynamic data itself is cleaned up very quickly, so I can't tell the destination host for sure.

> If the VM is not found on either host, it is actually a VM failure and
> should be reported as such and, if required, given rerun treatment.
(In reply to comment #4)
> The cancel migration action doesn't have a context, and the VM dynamic data
> itself is cleaned up very quickly, so I can't tell the destination host for
> sure.

I'll rephrase:
- If cancel migration is sent and the VM is not on the source host but is found on another host - cancel migration failed (hide the "domain not found" error).
- If cancel migration is sent and the VM is not found on any host - it is actually a VM failure and should be reported as such and, if required, given rerun treatment.

An alternative: have VDSM register for the migration-succeeded event and not forward the error.
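A rough sketch of that alternative, assuming VDSM listens for libvirt lifecycle events so a domain that disappeared because of a successful migration can be told apart from one that crashed (the bookkeeping here is illustrative only, not VDSM's actual event handling):

import libvirt

migrated_away = set()  # uuids of domains removed by a successful outgoing migration

def _lifecycle_event(conn, dom, event, detail, opaque):
    if (event == libvirt.VIR_DOMAIN_EVENT_STOPPED and
            detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED):
        # Remember that this domain vanished because migration succeeded, so a
        # later cancel/destroy can be answered without forwarding the raw
        # "Domain not found" error.
        migrated_away.add(dom.UUIDString())

def register(conn):
    # Requires a running libvirt event loop (e.g. virEventRegisterDefaultImpl()
    # plus a thread driving virEventRunDefaultImpl()).
    conn.domainEventRegisterAny(None,
                                libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                _lifecycle_event, None)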
> - If cancel migration is sent and the VM is not on the source host but is
>   found on another host - cancel migration failed (hide the "domain not
>   found" error).
> - If cancel migration is sent and the VM is not found on any host - it is
>   actually a VM failure and should be reported as such and, if required,
>   given rerun treatment.

That's problematic: the VM could already be in the rerun process, and the current behaviour does not cancel the engine's migration process, only the target host's migration task. So in this case the error that the migration failed because the domain doesn't exist is clear and true.

> An alternative: have VDSM register for the migration-succeeded event and not
> forward the error.
Verified on SF13: 60 migrations cancelled, with the time between start and cancel varying from 0.1 to 6 seconds in 0.1-second steps, using the following SDK script (api is an already-initialized oVirt SDK connection):

import time

vm_list = api.vms.list()
myvm = vm_list[0]
print myvm.name

sleep = 0.1
sleepinterval = 0.1
while sleep < 6:
    print sleep, myvm.name
    myvm.migrate()
    time.sleep(sleep)
    myvm.cancelmigration()
    sleep += sleepinterval
    time.sleep(60)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0888.html