Bug 1016794 - MoveDisksCommand is not completed if the SPM is lost between copy and delete
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified Unspecified
Priority: unspecified Severity: medium
Target Milestone: ovirt-4.1.1
Target Release: ---
Assigned To: Liron Aravot
QA Contact: Kevin Alon Goldblatt
Keywords: TestOnly
Depends On:
Blocks:
Reported: 2013-10-08 13:16 EDT by Federico Simoncelli
Modified: 2017-04-21 05:44 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-21 05:44:49 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑4.1+
rule-engine: planning_ack+
rule-engine: devel_ack+
ratamir: testing_ack+


Attachments
mvimage-delete-error.tar.gz (3.67 KB, application/gzip)
2013-10-08 13:16 EDT, Federico Simoncelli

Description Federico Simoncelli 2013-10-08 13:16:28 EDT
Created attachment 809432
mvimage-delete-error.tar.gz

Description of problem:
If the SPM role is lost after the copy has finished successfully, the subsequent delete command might fail.

Version-Release number of selected component (if applicable):
Encountered upstream on git hash 178258b, but the relevant code was introduced in 421e8ec (core: move image group command) so it should be present since is2.

How reproducible:
No idea how often this could happen in the real world; it's probably rare.
That said, following the timing described in the steps to reproduce triggers it 100% of the time.

Steps to Reproduce:
1. move a disk from a storage domain to another
2. kill -9 vdsm when the copy (successfully) ended (but before the deleteImage command)

Actual results:
DeleteImageGroupVDSCommand fails, leaving the image on the source as well:

2013-10-08 14:15:02,321 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 98da1408-948d-4cab-9a8b-418914be9f07, ignoreFailoverLimit = false, storageDomainId = c8c60dca-3ec8-4ea0-8135-d929070055cb, imageGroupId = c3164bb9-8bdb-4673-8675-a86943bebfe7, postZeros = false, forceDelete = false), log id: 34d2038b
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Failed in DeleteImageGroupVDS method
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Error code StoragePoolUnknown and error message IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,333 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-50) IrsBroker::Failed::DeleteImageGroupVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,337 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) FINISH, DeleteImageGroupVDSCommand, log id: 34d2038b
2013-10-08 14:15:02,337 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-50) Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',) (Failed with error StoragePoolUnknown and code 309)

Expected results:
The source image should be removed (retry when the SPM is up?).

Additional info:
An audit log message is displayed:

2013-Oct-08, 14:15
Possible failure while deleting DiskToMove from the source Storage Domain BlockDomain1 during the move operation. The Storage Domain may be manually cleaned-up from possible leftovers (User:admin@internal).
Comment 1 Allon Mureinik 2014-06-16 08:57:41 EDT
This command will be overhauled in 3.6 anyway...
Comment 2 Yaniv Lavi (Dary) 2015-10-22 04:20:19 EDT
Can we close this issue due to the SPM work?
Comment 3 Liron Aravot 2015-10-25 08:42:04 EDT
The reported issue isn't solved by the SPM removal work.

This bug suggests that when an operation fails (such as the deletion of the source image in this case), we could retry it, since the operation might succeed later on.

Changing the header accordingly.
Comment 4 Liron Aravot 2015-10-25 08:56:38 EDT
Just to clarify: my comment isn't about how the retry mechanism will be implemented (there are multiple options for that). AFAIK, the existing support today is for an unlimited number of retries only.

As part of the SPM removal work (many flows are being rewritten to use the CoCo framework), we'll take the retry issue into consideration.
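
For illustration only, one of the retry options mentioned above (a bounded number of attempts with a fixed delay) could look roughly like the sketch below. This is not ovirt-engine code: RetryDeleteSketch, DeleteAction, PoolNotConnectedException and deleteWithRetry are hypothetical names, and the exception merely stands in for a failure such as StoragePoolUnknown.

```java
public class RetryDeleteSketch {

    /** Hypothetical stand-in for a failure such as StoragePoolUnknown. */
    public static class PoolNotConnectedException extends Exception {
        public PoolNotConnectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical action: deletes the source image, or throws if the pool is not connected. */
    public interface DeleteAction {
        void run() throws PoolNotConnectedException;
    }

    /**
     * Runs the delete up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns the number of attempts used on success, or -1 if every attempt failed
     * or the waiting thread was interrupted.
     */
    public static int deleteWithRetry(DeleteAction delete, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delete.run();
                return attempt;
            } catch (PoolNotConnectedException e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts; the caller decides how to report the leftover
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated SPM outage: the first two attempts fail, the third succeeds.
        DeleteAction flaky = () -> {
            if (++calls[0] < 3) {
                throw new PoolNotConnectedException("Unknown pool id, pool not connected");
            }
        };
        System.out.println("attempts used: " + deleteWithRetry(flaky, 5, 10));
    }
}
```

A bounded policy like this trades completeness for predictability: if the SPM stays down longer than the retry window, the source image is still left behind and the existing audit log message remains the fallback.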
Comment 5 Red Hat Bugzilla Rules Engine 2015-11-30 14:00:04 EST
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Comment 6 Sandro Bonazzola 2016-05-02 06:03:39 EDT
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
Comment 7 Yaniv Lavi (Dary) 2016-05-23 09:18:26 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 8 Yaniv Lavi (Dary) 2016-05-23 09:22:30 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 9 Yaniv Lavi (Dary) 2017-02-06 08:00:00 EST
This could have been resolved by the HSM changes; please test.
Comment 10 Kevin Alon Goldblatt 2017-02-16 10:03:00 EST
Tested with the following code:
-----------------------------------------------
ovirt-engine-4.1.1-0.1.el7.noarch
rhevm-4.1.1-0.1.el7.noarch
vdsm-4.19.5-1.el7ev.x86_64


Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. move a disk from a storage domain to another
2. Stop the vdsm when the copy (successfully) ended (but before the deleteImage command)


The remnants of the image are successfully cleaned up when vdsm comes up again after recovery (fencing).

Moving to VERIFIED!
