Created attachment 809432 [details]
mvimage-delete-error.tar.gz

Description of problem:
If the SPM role is lost after the copy successfully finished then the delete command might fail.

Version-Release number of selected component (if applicable):
Encountered upstream on git hash 178258b, but the relevant code was introduced in 421e8ec (core: move image group command) so it should be present since is2.

How reproducible:
No idea on how often this could happen in the real world, it's probably rare. Anyway respecting the timings described in the steps to reproduce would trigger this 100%.

Steps to Reproduce:
1. move a disk from a storage domain to another
2. kill -9 vdsm when the copy (successfully) ended (but before the deleteImage command)

Actual results:
DeleteImageGroupVDSCommand fails leaving the image also on the source:

2013-10-08 14:15:02,321 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) START, DeleteImageGroupVDSCommand( storagePoolId = 98da1408-948d-4cab-9a8b-418914be9f07, ignoreFailoverLimit = false, storageDomainId = c8c60dca-3ec8-4ea0-8135-d929070055cb, imageGroupId = c3164bb9-8bdb-4673-8675-a86943bebfe7, postZeros = false, forceDelete = false), log id: 34d2038b
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Failed in DeleteImageGroupVDS method
2013-10-08 14:15:02,332 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) Error code StoragePoolUnknown and error message IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,333 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-50) IrsBroker::Failed::DeleteImageGroupVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',)
2013-10-08 14:15:02,337 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-50) FINISH, DeleteImageGroupVDSCommand, log id: 34d2038b
2013-10-08 14:15:02,337 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-50) Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Unknown pool id, pool not connected: ('98da1408-948d-4cab-9a8b-418914be9f07',) (Failed with error StoragePoolUnknown and code 309)

Expected results:
The source image should be removed (retry when the SPM is up?).

Additional info:
An audit log message is displayed:

2013-Oct-08, 14:15 Possible failure while deleting DiskToMove from the source Storage Domain BlockDomain1 during the move operation. The Storage Domain may be manually cleaned-up from possible leftovers (User:admin@internal).
This command will be overhauled in 3.6 anyway...
Can we close this issue due to the SPM work?
The reported issue isn't solved by the SPM removal related work. This bug suggests that when we fail to perform an operation (like the deletion of the source image in this case), we may decide to retry it, as the operation might succeed later on. Changing the header accordingly.
Just to clarify: my comment doesn't refer to how the retry mechanism will be implemented (there are multiple options for that). AFAIK the existing support today is for an unlimited number of retries only. As part of the SPM removal work (as many flows are being rewritten to use the CoCo framework) we'll take the retries issue into consideration.
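For illustration only, one of the possible options is a bounded retry with backoff instead of an unlimited one. The sketch below is not engine or CoCo code: `PoolNotConnected`, `delete_with_retry` and its parameters are hypothetical names, and the exception merely stands in for the StoragePoolUnknown failure from the log in the description.

```python
import time


class PoolNotConnected(Exception):
    """Hypothetical stand-in for the StoragePoolUnknown failure (code 309)."""


def delete_with_retry(delete_image, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call delete_image() until it succeeds or max_attempts is exhausted.

    delete_image is any zero-argument callable that raises PoolNotConnected
    while the pool is unavailable (for example while no SPM is up).
    Returns the number of attempts used; re-raises the last
    PoolNotConnected if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            delete_image()
            return attempt
        except PoolNotConnected:
            if attempt == max_attempts:
                # Give up: the caller can still log the "possible leftovers" audit message.
                raise
            # Exponential backoff: 1x, 2x, 4x ... the base delay.
            sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a delete that fails twice and then succeeds would return 3 after waiting 1 and 2 seconds. Whether a bounded or an unlimited number of retries is the right choice for this flow is left open here.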
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Could have been resolved with the HSM changes, please test.
Tested with the following code:
-----------------------------------------------
ovirt-engine-4.1.1-0.1.el7.noarch
rhevm-4.1.1-0.1.el7.noarch
vdsm-4.19.5-1.el7ev.x86_64

Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. move a disk from a storage domain to another
2. Stop the vdsm when the copy (successfully) ended (but before the deleteImage command)

The remnants of the image are successfully cleaned up when the vdsm comes up again after recovery (fencing).

Moving to VERIFIED!