+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1574346 +++
======================================================================

Description of problem:

During a simultaneous cold move of several disks from SD 968889ce-61e9-4ec8-8d0c-f558e5bafca3 to 2c899223-3517-43fb-a70f-6bdeabeaa696, several images were lost. For example, this is how volume 2a0183b3-4b44-4d78-bd53-7ba656575ff0 was lost:

1. MoveImageGroup starts:

2018-04-24 20:41:59,462-07 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'}), log id: 126e5c36

2. The command gets stuck with no progress; after ~10 minutes it appears to hit the transaction reaper limit:

2018-04-24 20:51:42,071-07 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Command 'MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'})' execution failed: PreparedStatementCallback; uncategorized SQLException for SQL [select * from getupandprioritizedvds(?)]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK
2018-04-24 20:51:42,082-07 WARN [com.arjuna.ats.arjuna] (org.ovirt.thread.pool-6-thread-48) ARJUNA012077: Abort called on already aborted atomic action 0:ffffac12023d:4208cdd5:5ac2a531:4703582

3. The move command finishes with a failure:

2018-04-24 20:51:42,071-07 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] FINISH, MoveImageGroupVDSCommand, log id: 126e5c36
[...]
2018-04-24 20:52:13,108-07 ERROR [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand' with failure.

4. But the MoveImageGroupCommand is considered successful (???):

2018-04-24 20:52:13,205-07 INFO [org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand' successfully.

5. And we go ahead removing the image from the source SD (???):

2018-04-24 20:52:13,266-07 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f

6. The user is told the disk move failed:

2018-04-24 20:52:14,611-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), Correlation ID: f8c1b451-718f-4daf-9d46-2ee4eb718d60, Job ID: d6ddf5af-110a-4794-89c2-7444e94db12a, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: User admin@internal have failed to move disk xxx to domain yyy.

7. And the remove succeeded:

2018-04-24 20:52:13,266-07 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f
2018-04-24 20:52:13,914-07 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] CommandMultiAsyncTasks::attachTask: Attaching task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' to command 'd5c0f99c-d22c-4c16-adec-5f4163dd0271'.
2018-04-24 21:16:22,730-07 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler1) [356504d9] BaseAsyncTask::onTaskEndSuccess: Task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' (Parent Command 'RemoveImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.

NOTES:
* the VDS never received any move/copy command, it only received the delete
* the delete was issued with postZero=u'true'

Version-Release number of selected component (if applicable):
VERSION: rhevm-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown, apparently triggered by several simultaneous moves.

Steps to Reproduce:
Done via SDK

(Originally by Germano Veit Michel)
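The "Steps to Reproduce: Done via SDK" above amounts to submitting several cold moves at once. A minimal sketch of that concurrency shape, with a hypothetical `move_disk` stand-in for the real SDK move call (the disk IDs are placeholders; only the target SD UUID is taken from the logs):

```python
# Hypothetical reproduction shape: several cold moves submitted at once.
# move_disk stands in for the real SDK move call; IDs are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

def move_disk(disk_id, dst_domain_id):
    # In the real reproduction this would issue a disk move through the
    # oVirt SDK; here it only records what would be submitted.
    return (disk_id, dst_domain_id, "submitted")

disk_ids = ["disk-%d" % i for i in range(5)]
dst = "2c899223-3517-43fb-a70f-6bdeabeaa696"  # target SD from the logs

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(move_disk, d, dst) for d in disk_ids]
    results = [f.result() for f in as_completed(futures)]

print(len(results))  # all 5 moves submitted concurrently
```

The key point is that all moves are in flight at the same time, which is the condition suspected of triggering the DB contention below.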
Two problems?
- MoveImageGroupCommand did not finish successfully?
- a deadlock somewhere?

(Originally by Germano Veit Michel)
Freddy, you fixed the same bug 1538840 for 4.2, how complicated will it be to backport to 4.1? (Originally by Tal Nisan)
Folks, I might be missing something, but IMO bug 1538840 has nothing to do with this.

bug 1538840:
- only happens via UI (c#11)
- the user triggers 2 simultaneous moves of the same disk
- the second move deletes the source img
- move/copy commands are sent to the vds

this bug:
- via sdk
- there are no 2 simultaneous moves of the same disk
- there are several simultaneous disk moves, which look deadlocked (DB)
- the first move deletes the source img
- move/copy commands are never sent to the vds

And looking at the logs from both BZ, they are totally different.

https://gerrit.ovirt.org/#/c/87315/ adds a check to not run a move twice; there is no moving twice here that I can see. Each disk was moved once, by SDK.

I would love to be proven wrong, but I am afraid this bug is still in the wild and it can happen again. We have a customer waiting on this to move several hundreds of disks, so we must be 100% sure this is fixed.

Reopening.

(Originally by Germano Veit Michel)
The root cause seems to be the same: the end action of the move command wrongly triggers the delete of the source image. But Freddy, let's double check indeed. (Originally by Tal Nisan)
It looks like a different issue. Benny and I started to look at it. There are two things here to handle:
- The parent command should not be considered successful if the child command failed.
- Investigate the root cause of the timeout in the transaction. (Deadlock?)

(Originally by Fred Rolland)
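The first point above can be sketched as a status-propagation rule: a parent command's end action must consult its children before declaring success. The class and field names below are illustrative stand-ins, not the actual engine's command hierarchy.

```python
# Sketch of the fix Fred describes: a parent command only ends
# successfully when every child command succeeded. Names are
# illustrative, not the real oVirt engine classes.
from enum import Enum

class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"

class Command:
    def __init__(self, name):
        self.name = name
        self.status = Status.SUCCEEDED
        self.children = []

    def end_action(self):
        # The fix: any failed child marks the parent as failed too,
        # so the cleanup path (delete of the source image) never runs.
        if any(c.status is Status.FAILED for c in self.children):
            self.status = Status.FAILED
        return self.status

move = Command("MoveImageGroupCommand")
child = Command("MoveImageGroupVDSCommand")
child.status = Status.FAILED  # the VDS call hit STATUS_ROLLEDBACK
move.children.append(child)

print(move.end_action())  # Status.FAILED
```

Under the buggy behavior in the logs, the equivalent of `end_action` ignored the child's failure, which is why DeleteImageGroup was sent for the source image.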
Germano any chance for a thread dump from the time where the move waited? I'm wondering why the vds call didn't get timeout after 3 minutes (Originally by Roy Golan)
(In reply to Roy Golan from comment #13) > Germano any chance for a thread dump from the time where the move waited? > I'm wondering why the vds call didn't get timeout after 3 minutes Hi Roy, Nope. It's quite rare for us to get a chance to spot these "in the wild" and request a thread dump while it is still happening. Because we are quite unlikely to get thread dumps unless it can be reproduced, maybe there could be a way to generate them automatically when a thread is blocked for more than X time, similar to what the kernel does? (Originally by Germano Veit Michel)
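Germano's suggestion (dump stacks automatically when a thread is blocked for more than X time) is easy to prototype outside the JVM. A minimal Python sketch of the idea follows; a real engine-side version would use JVM facilities such as jstack or ThreadMXBean instead, and the 0.2 s "threshold" here is just a stand-in for the configurable X.

```python
# Sketch: capture a stack dump of every live thread, the way a watchdog
# could once it notices a thread blocked past a threshold.
import sys
import threading
import time
import traceback

def dump_all_stacks():
    """Return a jstack-like dump of every live thread's stack."""
    lines = []
    for tid, frame in sys._current_frames().items():
        lines.append("Thread %d:\n" % tid)
        lines.extend(traceback.format_stack(frame))
    return "".join(lines)

done = threading.Event()

def blocked_worker():
    done.wait()  # simulates a thread stuck on a blocking call

t = threading.Thread(target=blocked_worker, name="stuck", daemon=True)
t.start()
time.sleep(0.2)            # stand-in for "blocked longer than X"
dump = dump_all_stacks()   # the watchdog fires and records the dump
done.set()
print("blocked_worker" in dump)
```

Had something like this been running, the question of why the vds call never timed out could be answered from the captured stacks rather than by hoping to catch the hang live.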
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.4.4-0.1.el7_3.noarch
vdsm-4.20.31-1.el7ev.x86_64

Verified with the following scenario:
--------------------------------------
1. Ran a cold move of an iscsi disk to another iscsi domain, which failed due to error injection >>>> the target is successfully cleaned up and the source is no longer removed as before
2. Ran the LSM again and this time it completed successfully

Moving to VERIFIED!
(In reply to Kevin Alon Goldblatt from comment #21)
> Verified with the following code:
> --------------------------------------
> ovirt-engine-4.2.4.4-0.1.el7_3.noarch
> vdsm-4.20.31-1.el7ev.x86_64
>
> Verified with the following scenario:
> --------------------------------------
> 1. Ran a cold move of iscsi disk to another iscsi domain which failed due to
> error injection >>>> the target is successfully clean up and the source is
> no longer removed as before
> 2. Ran the LSM again and this time is completed successfully

CORRECTION - Ran the COLD MOVE again and this time it completed successfully

> Moving to VERIFIED!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2071