Bug 966448 - engine: cannot move images again if LSM failed without cleanup
Summary: engine: cannot move images again if LSM failed without cleanup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Federico Simoncelli
QA Contact: Leonid Natapov
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.4beta 1142926
 
Reported: 2013-05-23 10:11 UTC by Dafna Ron
Modified: 2016-02-10 17:55 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-12 14:04:04 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (310.00 KB, application/x-gzip)
2013-05-23 10:11 UTC, Dafna Ron

Description Dafna Ron 2013-05-23 10:11:51 UTC
Created attachment 752099 [details]
logs

Description of problem:

We do not always perform cleanup when LSM fails: although the engine rolls back, the VM images are left in the target domain.
If we then try to move the images again (live or offline), the operation fails with an image/volume already exists error from vdsm, which requires GSS involvement to resolve.
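
(Illustrative sketch only, not the actual engine code: the cleanup that is missing here would roughly have to delete the image group already created on the target domain once replication fails, so that a later move does not hit "Volume already exists". All class and method names below are hypothetical.)

// Illustrative sketch only -- hypothetical names, not the real ovirt-engine API.
public class LsmRollbackExample {

    // Hypothetical facade over the VDSM delete verb the engine already uses.
    interface VdsBroker {
        void deleteImageGroup(String storagePoolId, String storageDomainId,
                              String imageGroupId, boolean postZeros, boolean force);
    }

    // Cleanup that has to run when disk replication fails mid-LSM: remove the
    // image group that was already created on the destination domain, otherwise
    // the next move of the same disk fails with "Volume already exists".
    static void rollbackDestinationImage(VdsBroker vds, String storagePoolId,
                                         String dstDomainId, String imageGroupId) {
        vds.deleteImageGroup(storagePoolId, dstDomainId, imageGroupId,
                             false /* postZeros */, false /* force */);
    }
}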

Version-Release number of selected component (if applicable):

sf17.1

How reproducible:

100%

Steps to Reproduce:
1. Create a VM from a template on iSCSI storage and run it on an HSM host
2. Restart vdsm on the SPM while VmReplicateDiskFinishVDSCommand is running
3. Shut down the VM
4. Try to move the disk again offline

Actual results:

The move fails in vdsm with a "volume already exists" error.

Expected results:

We should provide a way for the user to decide whether to call GSS or deal with the conflict themselves (one example I can think of is a "conflict found" wizard).

Additional info:

Comment 1 Federico Simoncelli 2013-10-11 13:26:47 UTC
(In reply to Dafna Ron from comment #0)
> Steps to Reproduce:
> 1. create a vm from template on iscsi storage and run it on hsm host
> 2. restart vdsm on spm when VmReplicateDiskFinishVDSCommand is run

I'm not sure how the vdsm restart influenced the flow, and sadly the attached vdsm logs are not related to this bug (different time or host; there is no restart).
I can't imagine a reason why restarting vdsm on the SPM (once syncImageData has finished) should affect VmReplicateDiskFinish running on another host.

What I can see instead in the engine logs is VmReplicateDiskStart failing (not related to any restart):

2013-05-23 11:43:09,397 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-4-thread-50) [3a9a74d9] START, VmReplicateDiskStartVDSCommand(HostName = cougar02, HostId = a4af81ea-c847-4c9c-885e-cb763d5bf9ea, vmId=bf027b25-b9f1-4c42-bdfa-d7aca45f2405), log id: d890de2

...

2013-05-23 11:43:16,272 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Weird return value: StatusForXmlRpc [mCode=55, mMessage=Drive replication error]
2013-05-23 11:43:16,273 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Failed in VmReplicateDiskStartVDS method
2013-05-23 11:43:16,274 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error
2013-05-23 11:43:16,274 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]

We don't have any detail on this, as vdsm.log does not contain anything related to the drive replication.
After that, VmReplicateDiskFinish also fails, as expected:

2013-05-23 11:43:16,301 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-4-thread-50) [3a9a74d9] START, VmReplicateDiskFinishVDSCommand(HostName = cougar02, HostId = a4af81ea-c847-4c9c-885e-cb763d5bf9ea, vmId=bf027b25-b9f1-4c42-bdfa-d7aca45f2405), log id: 1ee32150
2013-05-23 11:43:16,325 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Weird return value: StatusForXmlRpc [mCode=55, mMessage=Drive replication error]
2013-05-23 11:43:16,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Failed in VmReplicateDiskFinishVDS method
2013-05-23 11:43:16,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Error code unexpected and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-05-23 11:43:16,325 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-50) [3a9a74d9] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]

The next move fails as there was no cleanup:

2013-05-23 12:17:03,273 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (pool-4-thread-50) [435a81d9] START, MoveImageGroupVDSCommand( storagePoolId = 7fd33b43-a9f4-4eb7-a885-e9583a929ceb, ignoreFailoverLimit = false, compatabilityVersion = 3.2, storageDomainId = 81ef11d0-4c0c-47b4-8953-d61a6af442d8, imageGroupId = 1ea5b233-255a-45ef-958a-499892624c3b, dstDomainId = 38755249-4bb3-4841-bf5b-05f4a521514d, vmId = 00000000-0000-0000-0000-000000000000, op = Move, postZero = false, force = false), log id: 6f37e09e

2013-05-23 12:17:17,145 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-88) BaseAsyncTask::LogEndTaskFailure: Task d40ece4f-b240-4872-abdf-565c8ecdf039 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Volume already exists


I tested this behavior again on the current master (git hash 44ee6e6), failing with the "Drive replication error" on VmReplicateDiskStart, and the destination image is now cleaned up properly.

2013-10-11 15:10:12,940 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-6-thread-49) [1fe559bd] Running command: LiveMigrateDiskCommand Task handler: VmReplicateDiskStartTaskHandler internal: false. Entities affected :  ID: 3995b4f0-f13e-4650-a1db-0afccc7fdd5e Type: Disk,  ID: 864df2d0-b022-435f-a9d1-b7ac0bd766bb Type: Storage
2013-10-11 15:10:12,942 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-6-thread-49) [1fe559bd] START, VmReplicateDiskStartVDSCommand(HostName = vm-rhev1, HostId = 43bd4b0f-310d-4ba3-b357-5a08f19cd683, vmId=b6b7783e-971a-4607-9493-bfd474bcf796), log id: 7e926841
2013-10-11 15:10:12,946 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-6-thread-49) [1fe559bd] Failed in VmReplicateDiskStartVDS method
2013-10-11 15:10:12,946 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-6-thread-49) [1fe559bd] Error code replicaErr and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error
2013-10-11 15:10:12,946 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-6-thread-49) [1fe559bd] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]

...

2013-10-11 15:10:12,950 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-6-thread-49) [1fe559bd] START, VmReplicateDiskFinishVDSCommand(HostName = vm-rhev1, HostId = 43bd4b0f-310d-4ba3-b357-5a08f19cd683, vmId=b6b7783e-971a-4607-9493-bfd474bcf796), log id: 38659aca
2013-10-11 15:10:12,954 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-6-thread-49) [1fe559bd] Failed in VmReplicateDiskFinishVDS method
2013-10-11 15:10:12,954 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-6-thread-49) [1fe559bd] Error code replicaErr and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-10-11 15:10:12,954 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-6-thread-49) [1fe559bd] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand return value
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]

...

2013-10-11 15:10:12,956 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-6-thread-49) [1fe559bd] Reverting task deleteImage, handler: org.ovirt.engine.core.bll.lsm.CreateImagePlaceholderTaskHandler
2013-10-11 15:10:12,979 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-49) [1fe559bd] START, DeleteImageGroupVDSCommand( storagePoolId = 98da1408-948d-4cab-9a8b-418914be9f07, ignoreFailoverLimit = false, storageDomainId = 864df2d0-b022-435f-a9d1-b7ac0bd766bb, imageGroupId = 3995b4f0-f13e-4650-a1db-0afccc7fdd5e, postZeros = false, forceDelete = false), log id: 51e45ee0
2013-10-11 15:10:14,095 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-49) [1fe559bd] FINISH, DeleteImageGroupVDSCommand, log id: 51e45ee0
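
(For reference, the revert shown above follows a generic "execute forward, revert completed steps in reverse order" pattern. The sketch below is a simplified, hypothetical illustration of that pattern, not the actual LiveMigrateDiskCommand or task-handler code.)

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified, hypothetical illustration of the revert flow in the log above;
// not the real ovirt-engine task-handler implementation.
public class RevertFlowExample {

    interface StepHandler {
        void execute() throws Exception;
        void revert();
    }

    // Run the LSM steps in order; if one fails, revert the completed steps in
    // reverse order (e.g. the image-placeholder step reverts by issuing
    // DeleteImageGroup against the destination storage domain).
    static void runWithRollback(List<StepHandler> handlers) {
        Deque<StepHandler> completed = new ArrayDeque<>();
        try {
            for (StepHandler handler : handlers) {
                handler.execute();
                completed.push(handler);   // remember what must be undone
            }
        } catch (Exception failure) {
            while (!completed.isEmpty()) {
                completed.pop().revert();  // e.g. delete the placeholder image group
            }
        }
    }
}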


Moving to ON_QA.

Comment 2 Leonid Natapov 2014-01-23 11:11:15 UTC
Verified on 3.4.0-0.5.beta1.el6. The destination image is cleaned up, and the move after failure succeeded.

Comment 3 Itamar Heim 2014-06-12 14:04:04 UTC
Closing as part of 3.4.0

