Bug 1585455

Summary:	[downstream clone - 4.2.4] Move disk failed but delete was called on source sd, losing all the data
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	RHV bug bot <rhv-bugzilla-bot>
Component:	ovirt-engine	Assignee:	Benny Zlotnik <bzlotnik>
Status:	CLOSED ERRATA	QA Contact:	Kevin Alon Goldblatt <kgoldbla>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	4.1.9	CC:	bzlotnik, ebenahar, frolland, gveitmic, lsurette, lveyde, rbalakri, Rhev-m-bugs, srevivo, tnisan, ykaul
Target Milestone:	ovirt-4.2.4	Keywords:	Performance, Reopened, ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	ovirt-engine-4.2.4.1	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1574346	Environment:
Last Closed:	2018-06-27 10:02:42 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1574346
Bug Blocks:

Description RHV bug bot 2018-06-03 07:08:59 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1574346 +++
======================================================================

Description of problem:

During a simultaneous cold move of several disks from SD 968889ce-61e9-4ec8-8d0c-f558e5bafca3 to 2c899223-3517-43fb-a70f-6bdeabeaa696, several images were lost.

For example, this is how volume 2a0183b3-4b44-4d78-bd53-7ba656575ff0 was lost:

1. Start MoveImageGroup

2018-04-24 20:41:59,462-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'}), log id: 126e5c36

2. Stuck, no progress, in ~10min, it seems it hit transaction reaper limit.

2018-04-24 20:51:42,071-07 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Command 'MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'})' execution failed: PreparedStatementCallback; uncategorized SQLException for SQL [select * from  getupandprioritizedvds(?)]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK

2018-04-24 20:51:42,082-07 WARN  [com.arjuna.ats.arjuna] (org.ovirt.thread.pool-6-thread-48) ARJUNA012077: Abort called on already aborted atomic action 0:ffffac12023d:4208cdd5:5ac2a531:4703582

3. The move command finishes with failure

2018-04-24 20:51:42,071-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] FINISH, MoveImageGroupVDSCommand, log id: 126e5c36

[...]

2018-04-24 20:52:13,108-07 ERROR [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand' with failure.

4. But MoveImageGroupCommand command is considered successfull (???)

2018-04-24 20:52:13,205-07 INFO  [org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand' successfully.

5. And we go ahead removing the image from the source SD (???)

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f

6. Say the disk moved failed:

2018-04-24 20:52:14,611-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), Correlation ID: f8c1b451-718f-4daf-9d46-2ee4eb718d60, Job ID: d6ddf5af-110a-4794-89c2-7444e94db12a, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: User admin@internal have failed to move disk xxx to domain yyy.

7. And the Remove succeeded

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f
2018-04-24 20:52:13,914-07 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] CommandMultiAsyncTasks::attachTask: Attaching task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' to command 'd5c0f99c-d22c-4c16-adec-5f4163dd0271'.
2018-04-24 21:16:22,730-07 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler1) [356504d9] BaseAsyncTask::onTaskEndSuccess: Task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' (Parent Command 'RemoveImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.

NOTES:
* VDS never received any move/copy command, it only received the delete
* delete was with postZero=u'true'

Version-Release number of selected component (if applicable):
VERSION: rhevm-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown, apparently triggered by several simultaneous moves.

Steps to Reproduce:
Done via SDK

(Originally by Germano Veit Michel)

Comment 5 RHV bug bot 2018-06-03 07:09:15 UTC

2 problems?
- MoveImageGroupCommand did not finish successfully?
- a deadlock somewhere?

(Originally by Germano Veit Michel)

Comment 7 RHV bug bot 2018-06-03 07:09:25 UTC

Freddy, you fixed the same bug 1538840 for 4.2, how complicated will it be to backport to 4.1?

(Originally by Tal Nisan)

Comment 10 RHV bug bot 2018-06-03 07:09:41 UTC

Folks, I might be missing something buy IMO bug 1538840 has nothing to do with this.

bug 1538840:
- only happens via UI (c#11)
- user triggers 2 simultaneous move for the same disk
- second move deletes the source img
- move/copy commands are sent to the vds

this bug:
- via sdk
- there is no 2 simultaneous move for the same disk
- there are several simultaneous disk moves, which look deadlocked (DB)
- first move deletes the source img
- move/copy commands never sent to the vds

And looking at the logs from both BZ, they are totally different.

https://gerrit.ovirt.org/#/c/87315/ adds a check to not run a move twice, there is no moving twice here that I can see. Each disk was moved once, by SDK.

I would love to be proven wrong, but I am afraid this bug is still in the wild and it can happen again.

We have a customer waiting on this to move several hundreds of disks, we must be 100% sure this is fixed.

Reopening.

(Originally by Germano Veit Michel)

Comment 11 RHV bug bot 2018-06-03 07:09:47 UTC

The root cause seems to be the same, the end action of the move command wrongly triggers the delete of the source image, but Freddy let's double check indeed

(Originally by Tal Nisan)

Comment 12 RHV bug bot 2018-06-03 07:09:51 UTC

It looks like a different issue.
Benny and I started to look at it.
There two things here to handle:
- The parent command should not be considered successful if the child command failed.
- Investigate the root cause of the timeout in the transaction. (Dead lock?)

(Originally by Fred Rolland)

Comment 14 RHV bug bot 2018-06-03 07:10:01 UTC

Germano any chance for a thread dump from the time where the move waited? I'm wondering why the vds call didn't get timeout after 3 minutes

(Originally by Roy Golan)

Comment 15 RHV bug bot 2018-06-03 07:10:06 UTC

(In reply to Roy Golan from comment #13)
> Germano any chance for a thread dump from the time where the move waited?
> I'm wondering why the vds call didn't get timeout after 3 minutes

Hi Roy,

Nope. It's quite rare for us to get a chance to spot these "in the wild" and request a thread dump while it is still happening. 

Because we are quite unlikely to get thread dumps unless it can be reproduced, maybe there could be a way to generate them automatically when a thread is blocked for more than X time, similar to what the kernel does?

(Originally by Germano Veit Michel)

Comment 21 Kevin Alon Goldblatt 2018-06-19 15:33:12 UTC

Verified with the following code:
--------------------------------------
ovirt-engine-4.2.4.4-0.1.el7_3.noarch
vdsm-4.20.31-1.el7ev.x86_64


Verified with the following scenario:
--------------------------------------
1. Ran a cold move of iscsi disk to another iscsi domain which failed due to error injection >>>> the target is successfully clean up and the source is no longer removed as before
2. Ran the LSM again and this time is completed successfully



Moving to VERIFIED!

Comment 22 Kevin Alon Goldblatt 2018-06-19 15:36:20 UTC

(In reply to Kevin Alon Goldblatt from comment #21)
> Verified with the following code:
> --------------------------------------
> ovirt-engine-4.2.4.4-0.1.el7_3.noarch
> vdsm-4.20.31-1.el7ev.x86_64
> 
> 
> Verified with the following scenario:
> --------------------------------------
> 1. Ran a cold move of iscsi disk to another iscsi domain which failed due to
> error injection >>>> the target is successfully clean up and the source is
> no longer removed as before
> 2. Ran the LSM again and this time is completed successfully
CORRECTION - Ran the COLD MOVE again and this time it competed successfully
> 
> 
> 
> Moving to VERIFIED!

Comment 24 errata-xmlrpc 2018-06-27 10:02:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071

Comment 25 Franta Kust 2019-05-16 13:07:39 UTC

BZ<2>Jira Resync