Bug 1585455 - [downstream clone - 4.2.4] Move disk failed but delete was called on source sd, losing all the data
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: x86_64 Linux
Priority/Severity: urgent / urgent
Target Milestone: ovirt-4.2.4
Assigned To: Benny Zlotnik
QA Contact: Kevin Alon Goldblatt
Keywords: Performance, Reopened, ZStream
Depends On: 1574346
Reported: 2018-06-03 03:08 EDT by RHV Bugzilla Automation and Verification Bot
Modified: 2018-08-04 11:06 EDT
CC: 13 users

Fixed In Version: ovirt-engine-4.2.4.1
Clone Of: 1574346
Last Closed: 2018-06-27 06:02:42 EDT
oVirt Team: Storage


External Trackers:
- oVirt gerrit 91068 (master, MERGED): core: mark child commands as failed if parent failed (last updated 2018-06-03 03:10 EDT)
- oVirt gerrit 91657 (ovirt-engine-4.2, MERGED): core: mark child commands as failed if parent failed (last updated 2018-06-03 03:10 EDT)
- oVirt gerrit 91671 (master, MERGED): core: use getTaskGroupSuccess instead of getSucceeded (last updated 2018-06-03 03:10 EDT)
- Red Hat Product Errata RHSA-2018:2071 (last updated 2018-06-27 06:03 EDT)

Description RHV Bugzilla Automation and Verification Bot 2018-06-03 03:08:59 EDT
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1574346 +++
======================================================================

Description of problem:

During a simultaneous cold move of several disks from SD 968889ce-61e9-4ec8-8d0c-f558e5bafca3 to 2c899223-3517-43fb-a70f-6bdeabeaa696, several images were lost.

For example, this is how volume 2a0183b3-4b44-4d78-bd53-7ba656575ff0 was lost:

1. Start MoveImageGroup

2018-04-24 20:41:59,462-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'}), log id: 126e5c36

2. Stuck with no progress for ~10 minutes; it appears to have hit the transaction reaper limit.

2018-04-24 20:51:42,071-07 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Command 'MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'})' execution failed: PreparedStatementCallback; uncategorized SQLException for SQL [select * from  getupandprioritizedvds(?)]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK

2018-04-24 20:51:42,082-07 WARN  [com.arjuna.ats.arjuna] (org.ovirt.thread.pool-6-thread-48) ARJUNA012077: Abort called on already aborted atomic action 0:ffffac12023d:4208cdd5:5ac2a531:4703582

3. The move command finishes with failure

2018-04-24 20:51:42,071-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] FINISH, MoveImageGroupVDSCommand, log id: 126e5c36

[...]

2018-04-24 20:52:13,108-07 ERROR [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand' with failure.

4. But the MoveImageGroupCommand is considered successful (???)

2018-04-24 20:52:13,205-07 INFO  [org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand' successfully.

5. And we proceed to remove the image from the source SD (???)

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f

6. The audit log says the disk move failed:

2018-04-24 20:52:14,611-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), Correlation ID: f8c1b451-718f-4daf-9d46-2ee4eb718d60, Job ID: d6ddf5af-110a-4794-89c2-7444e94db12a, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: User admin@internal have failed to move disk xxx to domain yyy.

7. And the Remove succeeded

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f
2018-04-24 20:52:13,914-07 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] CommandMultiAsyncTasks::attachTask: Attaching task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' to command 'd5c0f99c-d22c-4c16-adec-5f4163dd0271'.
2018-04-24 21:16:22,730-07 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler1) [356504d9] BaseAsyncTask::onTaskEndSuccess: Task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' (Parent Command 'RemoveImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.

NOTES:
* VDS never received any move/copy command; it only received the delete
* delete was with postZero=u'true'

Version-Release number of selected component (if applicable):
VERSION: rhevm-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown, apparently triggered by several simultaneous moves.

Steps to Reproduce:
Done via SDK

(Originally by Germano Veit Michel)
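
The core of the failure above is in steps 4-5: MoveImageGroupCommand's end-action took the success path, and therefore scheduled the DeleteImageGroup on the source, even though the copy was never sent to any VDS. The gerrit patches attached to this bug ("mark child commands as failed if parent failed", "use getTaskGroupSuccess instead of getSucceeded") point at which flag was being consulted. What follows is a minimal illustrative Java sketch of that decision, with assumed accessor names; it is not the actual engine code.

// Illustrative sketch only, not ovirt-engine source: how an end-action can
// branch to "success" (and delete the source image) when it checks a
// command-level flag instead of the async tasks' real outcome.
public abstract class MoveImageGroupEndActionSketch {

    // Assumed accessors; the real engine names may differ.
    protected abstract boolean getSucceeded();         // command flag, never flipped when the VDS call failed
    protected abstract boolean getTaskGroupSuccess();  // true only if the copy tasks actually succeeded
    protected abstract void removeSourceImage();       // success path: DeleteImageGroupVDSCommand
    protected abstract void cleanUpTargetImage();      // failure path: roll back the half-done copy

    // Buggy shape: getSucceeded() is still true, so the source is deleted
    // even though nothing was ever copied (steps 4-5 above).
    public void endActionBuggy() {
        if (getSucceeded()) {
            removeSourceImage();
        } else {
            cleanUpTargetImage();
        }
    }

    // Fixed shape, in the spirit of gerrit 91671: decide on the task group's
    // actual result instead.
    public void endActionFixed() {
        if (getTaskGroupSuccess()) {
            removeSourceImage();
        } else {
            cleanUpTargetImage();
        }
    }
}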
Comment 5 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:09:15 EDT
2 problems?
- MoveImageGroupCommand did not finish successfully?
- a deadlock somewhere?

(Originally by Germano Veit Michel)
Comment 7 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:09:25 EDT
Freddy, you fixed the same bug 1538840 for 4.2; how complicated would it be to backport it to 4.1?

(Originally by Tal Nisan)
Comment 10 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:09:41 EDT
Folks, I might be missing something, but IMO bug 1538840 has nothing to do with this.

bug 1538840:
- only happens via UI (c#11)
- user triggers 2 simultaneous moves of the same disk
- second move deletes the source img
- move/copy commands are sent to the vds

this bug:
- via sdk
- there are no two simultaneous moves of the same disk
- there are several simultaneous disk moves, which look deadlocked (DB)
- first move deletes the source img
- move/copy commands were never sent to the VDS

And looking at the logs from both BZs, they are totally different.

https://gerrit.ovirt.org/#/c/87315/ adds a check to prevent running a move twice; there is no double move here that I can see. Each disk was moved once, via the SDK.

I would love to be proven wrong, but I am afraid this bug is still in the wild and it can happen again.

We have a customer waiting on this to move several hundred disks; we must be 100% sure this is fixed.

Reopening.

(Originally by Germano Veit Michel)
Comment 11 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:09:47 EDT
The root cause seems to be the same: the end action of the move command wrongly triggers the deletion of the source image. But Freddy, let's double-check.

(Originally by Tal Nisan)
Comment 12 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:09:51 EDT
It looks like a different issue.
Benny and I started to look at it.
There are two things here to handle:
- The parent command should not be considered successful if the child command failed.
- Investigate the root cause of the timeout in the transaction. (Deadlock?)

(Originally by Fred Rolland)
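
On the second point, the step 2 logs show a typical JTA reaper sequence: the engine thread sat in a container-managed transaction for about ten minutes, the reaper rolled the transaction back on timeout, and the next JDBC call then failed with "Transaction cannot proceed: STATUS_ROLLEDBACK", followed by the ARJUNA012077 double-abort warning. The sketch below only reproduces that failure shape under assumed names and timeouts; it is not engine code, and it says nothing about what caused the ten-minute stall.

// Minimal sketch of the step 2 failure mode, assuming a JTA TransactionManager
// and a plain JDBC DataSource (not ovirt-engine code). Once the reaper rolls
// the transaction back on timeout, any further statement in it cannot proceed.
import javax.sql.DataSource;
import javax.transaction.Status;
import javax.transaction.TransactionManager;

public class ReaperTimeoutSketch {

    public static void runInTransaction(TransactionManager tm, DataSource ds) throws Exception {
        tm.setTransactionTimeout(600);      // assumed 10-minute timeout, for illustration
        tm.begin();
        try {
            blockedStep();                  // hypothetical step that blocks past the timeout
            // By now the reaper has marked the transaction rolled back, so the
            // next database call cannot proceed.
            if (tm.getStatus() == Status.STATUS_ROLLEDBACK) {
                throw new IllegalStateException("transaction already rolled back by the reaper");
            }
            try (java.sql.Connection c = ds.getConnection()) {
                c.prepareStatement("select * from getupandprioritizedvds(?)").close();
            }
            tm.commit();
        } catch (Exception e) {
            if (tm.getStatus() != Status.STATUS_NO_TRANSACTION) {
                tm.rollback();              // expect something like ARJUNA012077 ("already aborted") here
            }
            throw e;
        }
    }

    private static void blockedStep() { /* placeholder for the stuck MoveImageGroup VDS call */ }
}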
Comment 14 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:10:01 EDT
Germano, any chance of a thread dump from the time when the move was waiting? I'm wondering why the VDS call didn't time out after 3 minutes.

(Originally by Roy Golan)
Comment 15 RHV Bugzilla Automation and Verification Bot 2018-06-03 03:10:06 EDT
(In reply to Roy Golan from comment #13)
> Germano, any chance of a thread dump from the time when the move was waiting?
> I'm wondering why the VDS call didn't time out after 3 minutes.

Hi Roy,

Nope. It's quite rare for us to get a chance to spot these "in the wild" and request a thread dump while it is still happening. 

Because we are quite unlikely to get thread dumps unless it can be reproduced, maybe there could be a way to generate them automatically when a thread is blocked for more than X time, similar to what the kernel does?

(Originally by Germano Veit Michel)
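
On the automatic-dump idea above: something in this spirit can be approximated in-process with ThreadMXBean. The sketch below is a rough illustration under assumptions of my own (threshold, polling interval, logging target), not a feature the engine ships.

// Rough sketch of the idea above, not an existing engine feature: sample
// thread states periodically and log a full dump when a thread has been
// BLOCKED or WAITING longer than a threshold.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockedThreadWatchdog {

    private static final long THRESHOLD_MS = TimeUnit.MINUTES.toMillis(3);  // assumed limit
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    private final Map<Long, Long> stalledSince = new HashMap<>();

    public void start() {
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::check, 30, 30, TimeUnit.SECONDS);
    }

    private void check() {
        long now = System.currentTimeMillis();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            Thread.State state = info.getThreadState();
            boolean stalled = state == Thread.State.BLOCKED || state == Thread.State.WAITING;
            if (!stalled) {
                stalledSince.remove(info.getThreadId());
                continue;
            }
            long since = stalledSince.computeIfAbsent(info.getThreadId(), id -> now);
            if (now - since > THRESHOLD_MS) {
                // Naive: idle pool threads also sit in WAITING, so a real
                // implementation would need to filter those out.
                System.err.println("Thread stalled for > " + THRESHOLD_MS + " ms, dumping all threads:");
                for (ThreadInfo t : threads.dumpAllThreads(true, true)) {
                    System.err.print(t);   // ThreadInfo.toString() prints state plus a (truncated) stack
                }
                stalledSince.remove(info.getThreadId());
            }
        }
    }
}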
Comment 21 Kevin Alon Goldblatt 2018-06-19 11:33:12 EDT
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.4.4-0.1.el7_3.noarch
vdsm-4.20.31-1.el7ev.x86_64


Verified with the following scenario:
--------------------------------------
1. Ran a cold move of an iSCSI disk to another iSCSI domain, which failed due to error injection >>>> the target is successfully cleaned up and the source is no longer removed as before
2. Ran the LSM again and this time it completed successfully



Moving to VERIFIED!
Comment 22 Kevin Alon Goldblatt 2018-06-19 11:36:20 EDT
(In reply to Kevin Alon Goldblatt from comment #21)
> Verified with the following code:
> --------------------------------------
> ovirt-engine-4.2.4.4-0.1.el7_3.noarch
> vdsm-4.20.31-1.el7ev.x86_64
> 
> 
> Verified with the following scenario:
> --------------------------------------
> 1. Ran a cold move of an iSCSI disk to another iSCSI domain, which failed due to
> error injection >>>> the target is successfully cleaned up and the source is
> no longer removed as before
> 2. Ran the LSM again and this time it completed successfully
CORRECTION - Ran the COLD MOVE again and this time it completed successfully
> 
> 
> 
> Moving to VERIFIED!
Comment 24 errata-xmlrpc 2018-06-27 06:02:42 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071
