Bug 957703

Summary: engine: can't live migrate vm's disk after a failure because image already exists in the target
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: ovirt-engine
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: high
Version: 3.2.0
Target Release: 3.3.0
CC: abaron, acanan, acathrow, amureini, fsimonce, iheim, jkt, lpeer, Rhev-m-bugs, scohen, yeylon
Flags: abaron: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: is21
Doc Type: Bug Fix
Doc Text:
When a virtual machine disappeared or crashed during a Live Storage Migration, the newly created image on the destination storage domain was not cleaned up, preventing future migrations. Now the images on the relevant storage domain are properly removed.
Last Closed: 2014-01-21 17:19:32 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 979753, 1019461, 1032811
Attachments: logs

Description Dafna Ron 2013-04-29 10:20:56 UTC
Created attachment 741425 [details]
logs

Description of problem:

I ran kill -9 on a VM during the cloneImageStructure step of live storage migration, which means the LVs had already been created on the other (target) domain.
When trying to migrate the disk a second time (after re-running the VM), we fail because the images already exist on the target.

Version-Release number of selected component (if applicable):

sf14

How reproducible:

100%

Steps to Reproduce:
1. kill -9 during the cloneImageStructure task of live storage migration
2. re-run the vm
3. try to migrate the disk to the same domain
  
Actual results:

We fail because the image already exists on the target, and the only way to clean it up is by running lvremove on the VDSM side.
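Until the cleanup is automated, the leftover LV has to be removed by hand on the VDSM host, as noted above. A minimal sketch of that workaround (the helper names are mine, and the UUIDs passed in would be the target storage domain UUID and the stale volume UUID; on block domains the domain UUID is the VG name and the volume UUID is the LV name):

```python
# Hedged sketch of the manual cleanup currently required on the VDSM host.
# Helper names and the argv layout are assumptions for illustration only.
import subprocess


def lvremove_argv(sd_uuid, vol_uuid):
    """Build the lvremove command line for a leftover destination LV."""
    return ["lvremove", "-f", "%s/%s" % (sd_uuid, vol_uuid)]


def remove_leftover_lv(sd_uuid, vol_uuid):
    """Run lvremove; requires root and LVM tools on the host."""
    subprocess.check_call(lvremove_argv(sd_uuid, vol_uuid))
```

This only removes the stale LV itself; any engine-side bookkeeping for the failed move is untouched.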

Expected results:

The second move should succeed even though the first move failed.

Additional info: full logs

first move failure: 

2013-04-29 10:44:36,412 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [46ed926] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,417 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,429 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: vds for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:44:36,444 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-04-29 10:44:36,445 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.
2013-04-29 10:44:36,448 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-42) SPMAsyncTask::ClearAsyncTask: Attempting to clear task f77778af-f651-4fee-9e4f-788e6987f658

second move failure: 

2013-04-29 10:52:17,215 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-78) SPMAsyncTask::PollTask: Polling task 57d1151e-54b0-41ed-9365-a18a653e9fd8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'cleanSuccess'.
2013-04-29 10:52:17,218 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-78) BaseAsyncTask::LogEndTaskFailure: Task 57d1151e-54b0-41ed-9365-a18a653e9fd8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists
2013-04-29 10:52:17,219 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-78) EntityAsyncTask::EndActionIfNecessary: All tasks of entity b634a653-7042-4d19-aa24-296484e59bd0 has ended -> executing EndAction
2013-04-29 10:52:17,219 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (QuartzScheduler_Worker-78) EntityAsyncTask::EndAction: Ending action for 1 tasks (entity ID: b634a653-7042-4d19-aa24-296484e59bd0): calling EndAction for action type LiveMigrateDisk.
2013-04-29 10:52:17,219 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-04-29 10:52:17,234 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [67b9687f] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,238 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: glusterVolume for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,240 WARN  [org.ovirt.engine.core.compat.backendcompat.PropertyInfo] (pool-4-thread-42) Unable to get value of property: vds for class org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
2013-04-29 10:52:17,247 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-04-29 10:52:17,247 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk succeeded, clearing tasks.
2013-04-29 10:52:17,250 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (pool-4-thread-42) SPMAsyncTask::ClearAsyncTask: Attempting to clear task 57d1151e-54b0-41ed-9365-a18a653e9fd8


vdsm side failure: 

57d1151e-54b0-41ed-9365-a18a653e9fd8::ERROR::2013-04-29 10:52:13,826::task::850::TaskManager.Task::(_setError) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1769, in cloneImageStructure
    image.Image(repoPath).cloneStructure(sdUUID, imgUUID, dstSdUUID)
  File "/usr/share/vdsm/storage/image.py", line 649, in cloneStructure
    self._createTargetImage(sdCache.produce(dstSdUUID), sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 517, in _createTargetImage
    srcVolUUID=volParams['parent'])
  File "/usr/share/vdsm/storage/blockSD.py", line 610, in createVolume
    volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 418, in create
    raise se.VolumeAlreadyExists(volUUID)
VolumeAlreadyExists: Volume already exists: ('b634a653-7042-4d19-aa24-296484e59bd0',)
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,828::task::869::TaskManager.Task::(_run) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::Task._run: 57d1151e-54b0-41ed-9365-a18a653e9fd8 () {} failed - stopping task
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,828::task::1194::TaskManager.Task::(stop) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::stopping in state running (force False)
57d1151e-54b0-41ed-9365-a18a653e9fd8::DEBUG::2013-04-29 10:52:13,829::task::974::TaskManager.Task::(_decref) Task=`57d1151e-54b0-41ed-9365-a18a653e9fd8`::ref 1 aborting True
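The VolumeAlreadyExists failure in the traceback above comes from a guard in the volume-creation path: vdsm refuses to create a volume whose UUID is already present on the domain, and the LV left behind by the aborted first attempt still occupies that UUID. A minimal sketch of that behavior (a simplified model, not the actual vdsm code):

```python
# Simplified model (not actual vdsm code) of the guard raised in
# volume.create(): a leftover volume from the aborted first migration
# blocks creation of the same volume UUID on the retry.
class VolumeAlreadyExists(Exception):
    pass


def create_volume(domain_volumes, vol_uuid):
    """Register vol_uuid on a domain, refusing to overwrite leftovers."""
    if vol_uuid in domain_volumes:
        raise VolumeAlreadyExists(vol_uuid)
    domain_volumes.add(vol_uuid)
    return vol_uuid
```

In the logs, the blocking UUID is b634a653-7042-4d19-aa24-296484e59bd0, created by the first cloneImageStructure attempt and never cleaned up.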

Comment 1 Federico Simoncelli 2013-10-04 13:31:33 UTC
What happens here is:

1. CloneImageGroupStructure is sent:

2013-04-29 10:44:18,218 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CloneImageGroupStructureVDSCommand] (pool-4-thread-48) [46ed926] START, CloneImageGroupStructureVDSCommand( storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = d326916a-89b6-41c8-9c14-4b9cf8a1c979, imageGroupId = 8220610b-1bee-4beb-918f-368cb7d4848a, dstDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8), log id: 5d6db40f

2. VM crashes and engine sends vmDestroy:

2013-04-29 10:44:27,802 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-28) START, DestroyVDSCommand(HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, vmId=6725d737-834d-4f41-b64d-78b2c4885ba2, force=false, secondsToWait=0, gracefully=false), log id: c15c965
2013-04-29 10:44:27,827 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (QuartzScheduler_Worker-28) FINISH, DestroyVDSCommand, log id: c15c965

3. CloneImageGroupStructure task ends with failure:

2013-04-29 10:44:36,390 INFO  [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-4-thread-42) EntityAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-04-29 10:44:36,412 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-4-thread-42) [46ed926] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand


CloneImageGroupStructure ends with failure because in LiveMigrateDiskCommand.java we check if the VM isRunningOrPaused:

    public LiveMigrateDiskCommand(T parameters) {
        super(parameters);
        ...
        getParameters().setTaskGroupSuccess(getParameters().getTaskGroupSuccess()
                && getVm().getStatus().isRunningOrPaused());
    }

Since the VM is down, taskGroupSuccess is set to false, which triggers internalEndWithFailure:

    private void internalEndWithFailure() {
        log.errorFormat("Ending command with failure: {0}", getClass().getName());
        if (hasTaskHandlers()) {
            if (hasStepsToRevert()) {
                getCurrentTaskHandler().endWithFailure();
                revertPreviousHandlers();
            }
            startPollingAsyncTasks();
        } else {
            endWithFailure();
        }
        rollbackQuota();
    }

and for the current task handler (CloneImageGroupStructure) we call only endWithFailure() instead of compensate(), which would remove the destination image.

One possible (or additional) fix is to call compensate() also for the current task handler (which would be handy for ImportRepoImage too).

In general, though, I think that removing:

        getParameters().setTaskGroupSuccess(getParameters().getTaskGroupSuccess()
                && getVm().getStatus().isRunningOrPaused());

would be a more appropriate fix, as there's no reason to fail the end action of CloneImageGroupStructure. We'll delegate the failure to the next handler (VmReplicateDiskStartTaskHandler) and benefit from compensate() being called by revertPreviousHandlers().
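The handler-chain behavior described above can be sketched as follows (a simplified model with assumed names, not the actual engine classes): compensate() runs only for *previous* handlers via revertPreviousHandlers(), so when CloneImageGroupStructure fails while it is still the current handler its cleanup never runs; letting the failure surface in the next handler instead makes the clone handler a previous one, so its compensate() is called.

```python
# Simplified model of the task-handler revert logic discussed above.
# Class and function names are illustrative, not the engine's own.
class Handler:
    def __init__(self, name):
        self.name = name
        self.compensated = False

    def end_with_failure(self):
        pass  # per-handler failure hook; does not clean up created images

    def compensate(self):
        self.compensated = True  # would remove the destination image


def internal_end_with_failure(handlers, current_index):
    # The current handler only gets endWithFailure(); previous handlers
    # are reverted with compensate() (cf. revertPreviousHandlers()).
    handlers[current_index].end_with_failure()
    for h in reversed(handlers[:current_index]):
        h.compensate()


clone = Handler("CloneImageGroupStructure")
replicate = Handler("VmReplicateDiskStartTaskHandler")

# Current behavior: failure is forced while the clone handler is still
# current, so its compensate() (image cleanup) never runs.
internal_end_with_failure([clone, replicate], current_index=0)
print(clone.compensated)  # False: destination image is left behind

# Proposed fix: the failure surfaces in the next handler, so the clone
# handler is now a previous handler and gets compensated.
clone2 = Handler("CloneImageGroupStructure")
replicate2 = Handler("VmReplicateDiskStartTaskHandler")
internal_end_with_failure([clone2, replicate2], current_index=1)
print(clone2.compensated)  # True: destination image is removed
```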

Comment 2 Charlie 2013-11-28 00:17:26 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata, please either remove it from the errata or set the requires_doc_text flag to minus (-); if you do not have permission to alter the flag, leave a "Doc Text" value of "--no tech note required".

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 3 Leonid Natapov 2013-11-28 08:12:56 UTC
is24.2: the disk was successfully live migrated a second time after killing the qemu process during the first live migration.

Comment 4 errata-xmlrpc 2014-01-21 17:19:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html