Bug 1016794

Summary: MoveDisksCommand is not completed if the SPM is lost between copy and delete
Product: [oVirt] ovirt-engine
Reporter: Federico Simoncelli <fsimonce>
Component: General
Assignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: medium
Docs Contact:
Priority: unspecified
Version: ---
CC: amureini, bugs, fsimonce, laravot, lpeer, ratamir, rbalakri, Rhev-m-bugs, scohen, srevivo, tnisan, ylavi
Target Milestone: ovirt-4.1.1
Keywords: TestOnly
Target Release: ---
Flags: rule-engine: ovirt-4.1+, rule-engine: planning_ack+, rule-engine: devel_ack+, ratamir: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-21 09:44:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
mvimage-delete-error.tar.gz (flags: none)

Description Federico Simoncelli 2013-10-08 17:16:28 UTC
Created attachment 809432: mvimage-delete-error.tar.gz

Description of problem:
If the SPM role is lost after the copy has finished successfully, the subsequent delete command may fail, leaving the image on both the source and the destination storage domains.

Version-Release number of selected component (if applicable):
Encountered upstream on git hash 178258b, but the relevant code was introduced in 421e8ec (core: move image group command) so it should be present since is2.

How reproducible:
It is unknown how often this happens in the real world; it is probably rare.
Following the timing described in the steps to reproduce triggers it 100% of the time.

Steps to Reproduce:
1. Move a disk from one storage domain to another.
2. Run kill -9 on vdsm after the copy has ended successfully but before the deleteImage command is issued.

Actual results:
DeleteImageGroupVDSCommand fails, leaving the image on the source domain as well:

2013-10-08 14:15:02,321 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 98da1408-948d-4cab-9a8b-418914be9f07, ignoreFailoverLimit = false, storageDomainId = c8c60dca-3ec8-4ea0-8135-d929070055cb, imageGroupId = c3164bb9-8bdb-4673-8675-a86943bebfe7, postZeros = false, forceDelete = false), log id: 34d2038b
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Failed in DeleteImageGroupVDS method
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Error code StoragePoolUnknown and error message IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,333 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-50) IrsBroker::Failed::DeleteImageGroupVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,337 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) FINISH, DeleteImageGroupVDSCommand, log id: 34d2038b
2013-10-08 14:15:02,337 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-50) Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',) (Failed with error StoragePoolUnknown and code 309)

Expected results:
The source image should be removed (retry when the SPM is up?).
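
To illustrate the "retry when the SPM is up" idea, here is a minimal Python sketch of a move modelled as copy-then-delete, where a failed delete is recorded and replayed once the SPM is available again. It is a toy model only, not ovirt-engine or vdsm code: FakeStorage, MoveDisk, pending_deletes, lose_spm_after_copy and on_spm_available are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


class StoragePoolUnknown(Exception):
    """Stand-in for the 'Unknown pool id, pool not connected' failure."""


@dataclass
class FakeStorage:
    """Toy storage model (hypothetical): domain name -> set of image ids."""
    spm_up: bool = True
    domains: dict = field(default_factory=dict)
    lose_spm_after_copy: bool = False  # simulates 'kill -9 vdsm' after the copy

    def copy_image(self, image_id, src, dst):
        self.domains.setdefault(dst, set()).add(image_id)
        if self.lose_spm_after_copy:
            self.spm_up = False

    def delete_image(self, image_id, domain):
        if not self.spm_up:
            raise StoragePoolUnknown(domain)
        self.domains[domain].discard(image_id)


@dataclass
class MoveDisk:
    """Hypothetical move command: copy, then delete the source image."""
    storage: FakeStorage
    pending_deletes: list = field(default_factory=list)

    def move(self, image_id, src, dst):
        self.storage.copy_image(image_id, src, dst)
        try:
            self.storage.delete_image(image_id, src)
        except StoragePoolUnknown:
            # Record the leftover instead of abandoning it on the source.
            self.pending_deletes.append((image_id, src))

    def on_spm_available(self):
        """Replay recorded deletes; keep the ones that still fail."""
        remaining = []
        for image_id, domain in self.pending_deletes:
            try:
                self.storage.delete_image(image_id, domain)
            except StoragePoolUnknown:
                remaining.append((image_id, domain))
        self.pending_deletes = remaining
```

In this model, calling move() with lose_spm_after_copy=True leaves the image on both domains with one pending delete; setting spm_up back to True and calling on_spm_available() removes the source copy.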

Additional info:
An audit log message is displayed:

2013-Oct-08, 14:15
Possible failure while deleting DiskToMove from the source Storage Domain BlockDomain1 during the move operation. The Storage Domain may be manually cleaned-up from possible leftovers (User:admin@internal).

Comment 1 Allon Mureinik 2014-06-16 12:57:41 UTC
This command will be overhauled in 3.6 anyway...

Comment 2 Yaniv Lavi 2015-10-22 08:20:19 UTC
Can we close this issue due to the SPM work?

Comment 3 Liron Aravot 2015-10-25 12:42:04 UTC
The reported issue isn't solved by the spm removal related work.

This bug suggests that when an operation fails (such as the deletion of the source image in this case), we may decide to retry it, since the operation might succeed later on.

Changing the header accordingly.

Comment 4 Liron Aravot 2015-10-25 12:56:38 UTC
Just to clarify - my comment isn't referring to how the retry mechanism will be implemented (there are multiple options for that). AFAIK the existing support today is for an unlimited number of retries only.

As part of the SPM removal work (as many flows are being rewritten to use the CoCo framework), we'll take the retries issue into consideration.
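
For illustration only, one of the possible options is a bounded retry with exponential backoff, in contrast to unlimited retries. The Python sketch below is a generic pattern under stated assumptions, not the CoCo framework or any existing engine API: retry_bounded, max_attempts and base_delay are hypothetical names chosen for this example.

```python
import time


def retry_bounded(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**n seconds after the n-th failed attempt (n from 0)
    and re-raises the last error once the attempts are exhausted.
    The sleep parameter is injectable so the backoff can be tested.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: the caller decides what to do with leftovers
            sleep(base_delay * 2 ** attempt)
```

A caller could wrap the source image deletion in retry_bounded and fall back to the existing audit log message once the attempts are exhausted.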

Comment 5 Red Hat Bugzilla Rules Engine 2015-11-30 19:00:04 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 Sandro Bonazzola 2016-05-02 10:03:39 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 7 Yaniv Lavi 2016-05-23 13:18:26 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 8 Yaniv Lavi 2016-05-23 13:22:30 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 9 Yaniv Lavi 2017-02-06 13:00:00 UTC
Could have been resolved with the HSM changes, please test.

Comment 10 Kevin Alon Goldblatt 2017-02-16 15:03:00 UTC
Tested with the following code:
-----------------------------------------------
ovirt-engine-4.1.1-0.1.el7.noarch
rhevm-4.1.1-0.1.el7.noarch
vdsm-4.19.5-1.el7ev.x86_64


Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. Move a disk from one storage domain to another.
2. Stop vdsm after the copy has ended successfully but before the deleteImage command is issued.


The remnants of the image are successfully cleaned up when vdsm comes up again after recovery (fencing).

Moving to VERIFIED!