Created attachment 741425 [details]
logs

Description of problem:
I ran kill -9 on a VM during the cloneImageStructure step of storage live migration, which means the LVs had already been created on the other (target) domain. Trying to migrate the disk a second time (after re-running the VM) fails because the images already exist on the target.

Version-Release number of selected component (if applicable):
sf14

How reproducible:
100%

Steps to Reproduce:
1. kill -9 during the cloneImageStructure task of live storage migration
2. re-run the vm
3. try to migrate the disk to the same domain

Actual results:
The move fails because the image already exists on the target, and the only way to clean it up is by running lvremove on the VDS side.

Expected results:
The second move should succeed if the first move failed.

Additional info:
full logs

first move failure:

2013-04-29 10:44:36,412 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [46ed926] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,417 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,429 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: vds for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,444 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-04-29 10:44:36,445 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.
2013-04-29 10:44:36,448 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-42) SPMAsyncTask::ClearAsyncTask: Attempting to clear task f77778af-f651-4fee-9e4f-788e6987f658

second move failure:

2013-04-29 10:52:17,215 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-78) SPMAsyncTask::PollTask: Polling task 57d1151e-54b0-41ed-9365-a18a653e9fd8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'cleanSuccess'.
2013-04-29 10:52:17,218 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-78) BaseAsyncTask::LogEndTaskFailure: Task 57d1151e-54b0-41ed-9365-a18a653e9fd8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists
2013-04-29 10:52:17,219 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-78) EntityAsyncTask::EndActionIfNecessary: All tasks of entity b634a653-7042-4d19-aa24-296484e59bd0 has ended -> executing EndAction
2013-04-29 10:52:17,219 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-78) EntityAsyncTask::EndAction: Ending action for 1 tasks (entity ID: b634a653-7042-4d19-aa24-296484e59bd0): calling EndAction for action type LiveMigrateDisk.
2013-04-29 10:52:17,219 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-04-29 10:52:17,234 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [67b9687f] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,238 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,240 WARN [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: vds for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,247 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-04-29 10:52:17,247 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.
2013-04-29 10:52:17,250 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-42) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 57d1151e-54b0-41ed-9365-a18a653e9fd8

vdsm side failure:

57d1151e-54b0-41ed-9365-a18a653e9fd8::ERROR::2013-04-29 10:52:13,826::task::850::TaskManager.Task::(_setError) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1769, in cloneImageStructure
    image.Image(repoPath).cloneStructure(sdUUID, imgUUID, dstSdUUID)
  File "/usr/share/vdsm/storage/image.py", line 649, in cloneStructure
    self._createTargetImage(sdCache.produce(dstSdUUID), sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 517, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/share/vdsm/storage/blockSD.py", line 610, in createVolume
    volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 418, in create
    raise se.VolumeAlreadyExists(volUUID)
VolumeAlreadyExists: Volume already exists: ('b634a653-7042-4d19-aa24-296484e59bd0',)
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,828::task::869::TaskManager.Task::(_run) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::Task._run: 57d1151e-54b0-41ed-9365-a18a653e9fd8 () {} failed - stopping task
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,828::task::1194::TaskManager.Task::(stop) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::stopping in state running (force False)
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,829::task::974::TaskManager.Task::(_decref) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::ref 1 aborting True
What happens here is:

1. CloneImageGroupStructure is sent:

2013-04-29 10:44:18,218 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CloneImageGroupStructureVDSCommand] (pool-4-thread-48) [46ed926] START, CloneImageGroupStructureVDSCommand( storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = d326916a-89b6-41c8-9c14-4b9cf8a1c979, imageGroupId = 8220610b-1bee-4beb-918f-368cb7d4848a, dstDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8), log id: 5d6db40f

2. The VM crashes and the engine sends vmDestroy:

2013-04-29 10:44:27,802 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-28) START, DestroyVDSCommand(HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, vmId=6725d737-834d-4f41-b64d-78b2c4885ba2, force=false, secondsToWait=0, gracefully=false), log id: c15c965
2013-04-29 10:44:27,827 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-28) FINISH, DestroyVDSCommand, log id: c15c965

3. The CloneImageGroupStructure task ends with failure:

2013-04-29 10:44:36,390 INFO [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-04-29 10:44:36,412 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [46ed926] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand

CloneImageGroupStructure ends with failure because in LiveMigrateDiskCommand.java we check whether the VM isRunningOrPaused:

    public LiveMigrateDiskCommand(T parameters) {
        super(parameters);
        ...
        getParameters().setTaskGroupSuccess(getParameters().getTaskGroupSuccess()
                && getVm().getStatus().isRunningOrPaused());
    }

Since the VM is down, setTaskGroupSuccess is set to false, and that triggers internalEndWithFailure:

    private void internalEndWithFailure() {
        log.errorFormat("Ending command with failure: {0}", getClass().getName());

        if (hasTaskHandlers()) {
            if (hasStepsToRevert()) {
                getCurrentTaskHandler().endWithFailure();
                revertPreviousHandlers();
            }
            startPollingAsyncTasks();
        } else {
            endWithFailure();
        }
        rollbackQuota();
    }

For the current task handler (CloneImageGroupStructure) we call only endWithFailure(), instead of compensate(), which would remove the destination image.

One possible (or additional) fix is to call compensate() for the current task handler as well (which would be handy for ImportRepoImage too). In general, though, I think that removing:

    getParameters().setTaskGroupSuccess(getParameters().getTaskGroupSuccess()
            && getVm().getStatus().isRunningOrPaused());

would be a more appropriate fix, as there's no reason to fail the end action of CloneImageGroupStructure. We'd delegate the failure to the next handler (VmReplicateDiskStartTaskHandler) and benefit from compensate() being called by revertPreviousHandlers().
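To make the compensate() option concrete, here is a minimal sketch of that variant, based only on the snippets quoted above rather than the actual tree; the extra call is an assumption about how the handler API would be used, not a patch:

    private void internalEndWithFailure() {
        log.errorFormat("Ending command with failure: {0}", getClass().getName());

        if (hasTaskHandlers()) {
            if (hasStepsToRevert()) {
                getCurrentTaskHandler().endWithFailure();
                // Assumed addition: revert whatever the interrupted handler
                // already created (for CloneImageGroupStructure this would
                // remove the destination image), mirroring what
                // revertPreviousHandlers() already does for the handlers
                // that completed before it.
                getCurrentTaskHandler().compensate();
                revertPreviousHandlers();
            }
            startPollingAsyncTasks();
        } else {
            endWithFailure();
        }
        rollbackQuota();
    }

Either way the key point is the same: the destination LVs created by the interrupted CloneImageGroupStructure must be removed on failure, so that the second move does not hit VolumeAlreadyExists.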
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
is24.2 - disk successfully live migrated a second time after killing the qemu process during the first live migration
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html