Bug 1574346 - Move disk failed but delete was called on source sd, losing all the data
Summary: Move disk failed but delete was called on source sd, losing all the data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: 4.3.0
Assignee: Benny Zlotnik
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1585455
 
Reported: 2018-05-03 05:54 UTC by Germano Veit Michel
Modified: 2021-09-09 13:56 UTC (History)
CC List: 11 users

Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1585455
Environment:
Last Closed: 2019-05-08 12:37:35 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:37:56 UTC
oVirt gerrit 91068 0 'None' MERGED core: mark child commands as failed if parent failed 2021-01-18 14:41:22 UTC
oVirt gerrit 91657 0 'None' MERGED core: mark child commands as failed if parent failed 2021-01-18 14:40:42 UTC
oVirt gerrit 91671 0 'None' MERGED core: use getTaskGroupSuccess instead of getSucceeded 2021-01-18 14:40:43 UTC

Description Germano Veit Michel 2018-05-03 05:54:06 UTC
Description of problem:

During a simultaneous cold move of several disks from SD 968889ce-61e9-4ec8-8d0c-f558e5bafca3 to 2c899223-3517-43fb-a70f-6bdeabeaa696, several images were lost.

For example, this is how volume 2a0183b3-4b44-4d78-bd53-7ba656575ff0 was lost:

1. Start MoveImageGroup

2018-04-24 20:41:59,462-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'}), log id: 126e5c36

2. Stuck with no progress; after ~10 minutes it seems to have hit the transaction reaper limit.

2018-04-24 20:51:42,071-07 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Command 'MoveImageGroupVDSCommand( MoveImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', dstDomainId='2c899223-3517-43fb-a70f-6bdeabeaa696', vmId='00000000-0000-0000-0000-000000000000', op='Copy', postZero='true', discard='false', force='false'})' execution failed: PreparedStatementCallback; uncategorized SQLException for SQL [select * from  getupandprioritizedvds(?)]; SQL state [null]; error code [0]; IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_ROLLEDBACK

2018-04-24 20:51:42,082-07 WARN  [com.arjuna.ats.arjuna] (org.ovirt.thread.pool-6-thread-48) ARJUNA012077: Abort called on already aborted atomic action 0:ffffac12023d:4208cdd5:5ac2a531:4703582

3. The move command finishes with failure

2018-04-24 20:51:42,071-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-6-thread-48) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] FINISH, MoveImageGroupVDSCommand, log id: 126e5c36

[...]

2018-04-24 20:52:13,108-07 ERROR [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand' with failure.

4. But the MoveImageGroupCommand is considered successful (???)

2018-04-24 20:52:13,205-07 INFO  [org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.MoveImageGroupCommand' successfully.

5. And we go ahead and remove the image from the source SD (???)

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f

6. Says the disk move failed:

2018-04-24 20:52:14,611-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), Correlation ID: f8c1b451-718f-4daf-9d46-2ee4eb718d60, Job ID: d6ddf5af-110a-4794-89c2-7444e94db12a, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: User admin@internal have failed to move disk xxx to domain yyy.

7. And the Remove succeeded

2018-04-24 20:52:13,266-07 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='6c406fcc-ffdd-49b0-8c17-d351174eb09b', ignoreFailoverLimit='false', storageDomainId='968889ce-61e9-4ec8-8d0c-f558e5bafca3', imageGroupId='9d1ed754-e8f4-4169-920e-7706d906241b', postZeros='true', discard='false', forceDelete='false'}), log id: 2c9b813f
2018-04-24 20:52:13,914-07 INFO  [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler8) [f8c1b451-718f-4daf-9d46-2ee4eb718d60] CommandMultiAsyncTasks::attachTask: Attaching task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' to command 'd5c0f99c-d22c-4c16-adec-5f4163dd0271'.
2018-04-24 21:16:22,730-07 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler1) [356504d9] BaseAsyncTask::onTaskEndSuccess: Task 'eb3d98fa-4848-476a-bdf4-b017dd8907ae' (Parent Command 'RemoveImage', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.

NOTES:
* The VDS never received any move/copy command; it only received the delete
* delete was with postZero=u'true'
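
A purely illustrative sketch of the decision implied by steps 4-5 (hypothetical class and method names, not actual ovirt-engine code): if the end-action's "success" check consults only the overall command status rather than the status of the copy task that actually ran, it falls into the success branch and deletes the source image even though nothing was ever copied.

public class MoveImageGroupEndActionSketch {

    interface CopyChildCommand {
        boolean getSucceeded();          // overall return status of the command
        boolean getTaskGroupSuccess();   // status of the SPM copy task that really ran
    }

    void endAction(CopyChildCommand copy, Runnable removeSourceImage, Runnable markMoveFailed) {
        // Checking only getSucceeded() can report "success" even when the VDS
        // never received the copy command (e.g. the transaction was rolled back),
        // which is the failure mode described in this report.
        if (copy.getTaskGroupSuccess()) {
            removeSourceImage.run();     // safe: the copy really completed
        } else {
            markMoveFailed.run();        // leave the source image intact
        }
    }
}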

Version-Release number of selected component (if applicable):
VERSION: rhevm-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown, apparently triggered by several simultaneous moves.

Steps to Reproduce:
Done via SDK

Comment 4 Germano Veit Michel 2018-05-03 06:17:47 UTC
2 problems?
- MoveImageGroupCommand did not finish successfully?
- a deadlock somewhere?

Comment 6 Tal Nisan 2018-05-03 09:17:50 UTC
Freddy, you fixed the same bug 1538840 for 4.2, how complicated will it be to backport to 4.1?

Comment 9 Germano Veit Michel 2018-05-03 22:58:40 UTC
Folks, I might be missing something, but IMO bug 1538840 has nothing to do with this.

bug 1538840:
- only happens via UI (c#11)
- user triggers 2 simultaneous moves of the same disk
- second move deletes the source img
- move/copy commands are sent to the vds

this bug:
- via sdk
- there are no 2 simultaneous moves of the same disk
- there are several simultaneous disk moves, which look deadlocked (DB)
- the first move deletes the source img
- move/copy commands were never sent to the vds

And looking at the logs from both BZs, they are totally different.

https://gerrit.ovirt.org/#/c/87315/ adds a check to avoid running the same move twice, but there is no double move here that I can see. Each disk was moved once, via the SDK.
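
For illustration, the kind of guard that patch is described as adding amounts to something like the sketch below (hypothetical names, not the actual change); it rejects a second concurrent move of the same disk, which is exactly why it cannot help here, since each disk was only moved once.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a "don't run the same move twice" guard of the kind the
// patch above is described as adding (hypothetical names, not the actual code).
public class SingleMoveGuardSketch {

    private static final Set<String> disksBeingMoved = ConcurrentHashMap.newKeySet();

    // Returns false if a move of this disk is already in flight, so the caller
    // can reject the second request instead of starting a concurrent move.
    public static boolean tryStartMove(String imageGroupId) {
        return disksBeingMoved.add(imageGroupId);
    }

    // Called when the move ends (successfully or not) to release the guard.
    public static void finishMove(String imageGroupId) {
        disksBeingMoved.remove(imageGroupId);
    }
}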

I would love to be proven wrong, but I am afraid this bug is still in the wild and it can happen again.

We have a customer waiting on this to move several hundred disks; we must be 100% sure this is fixed.

Reopening.

Comment 10 Tal Nisan 2018-05-07 09:43:59 UTC
The root cause seems to be the same: the end action of the move command wrongly triggers the delete of the source image. But Freddy, let's double-check.

Comment 11 Fred Rolland 2018-05-08 09:28:02 UTC
It looks like a different issue.
Benny and I started to look at it.
There are two things to handle here:
- The parent command should not be considered successful if the child command failed (see the sketch after this list).
- Investigate the root cause of the timeout in the transaction. (Deadlock?)
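
A minimal, purely illustrative sketch of the first point (hypothetical types, not the actual CommandBase/CommandCoordinator code): keep parent and child command statuses consistent, in line with the linked patches ("mark child commands as failed if parent failed"), so the move is never treated as successful when the copy never really ran.

import java.util.ArrayList;
import java.util.List;

public class FailurePropagationSketch {

    enum CommandStatus { ACTIVE, SUCCEEDED, ENDED_WITH_FAILURE }

    static class Command {
        CommandStatus status = CommandStatus.ACTIVE;
        final List<Command> children = new ArrayList<>();
    }

    // When the parent move ends with failure, children that never genuinely
    // succeeded are marked failed too, so a later step such as "remove the
    // source image" never acts on a spuriously successful child.
    static void endWithFailure(Command parent) {
        parent.status = CommandStatus.ENDED_WITH_FAILURE;
        for (Command child : parent.children) {
            if (child.status != CommandStatus.SUCCEEDED) {
                child.status = CommandStatus.ENDED_WITH_FAILURE;
            }
        }
    }
}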

Comment 13 Roy Golan 2018-05-10 13:10:17 UTC
Germano, any chance of a thread dump from the time the move was waiting? I'm wondering why the VDS call didn't time out after 3 minutes.

Comment 14 Germano Veit Michel 2018-05-10 23:07:58 UTC
(In reply to Roy Golan from comment #13)
> Germano, any chance of a thread dump from the time the move was waiting?
> I'm wondering why the VDS call didn't time out after 3 minutes.

Hi Roy,

Nope. It's quite rare for us to get a chance to spot these "in the wild" and request a thread dump while it is still happening. 

Because we are quite unlikely to get thread dumps unless the issue can be reproduced, maybe there could be a way to generate them automatically when a thread has been blocked for longer than some threshold, similar to what the kernel does?
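
For what it's worth, a minimal sketch of such a watchdog inside the JVM (purely hypothetical; nothing like this ships in ovirt-engine): poll the ThreadMXBean periodically and log a stack trace for any thread whose accumulated blocked time crosses a threshold. It is only a rough heuristic, since the JVM exposes cumulative rather than continuous blocked time.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BlockedThreadWatchdog {

    private static final long BLOCKED_THRESHOLD_MS = 3 * 60 * 1000; // e.g. 3 minutes

    public static void start() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        // Contention monitoring is needed for per-thread blocked time accounting.
        if (threadBean.isThreadContentionMonitoringSupported()) {
            threadBean.setThreadContentionMonitoringEnabled(true);
        }

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            ThreadInfo[] infos = threadBean.dumpAllThreads(true, true);
            for (ThreadInfo info : infos) {
                // getBlockedTime() is the thread's cumulative blocked time (ms),
                // or -1 if contention monitoring is disabled.
                if (info.getThreadState() == Thread.State.BLOCKED
                        && info.getBlockedTime() >= BLOCKED_THRESHOLD_MS) {
                    System.err.printf("Thread %s blocked for %d ms on %s%n",
                            info.getThreadName(), info.getBlockedTime(), info.getLockName());
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.err.println("\tat " + frame);
                    }
                }
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}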

Comment 18 Kevin Alon Goldblatt 2018-09-20 14:39:21 UTC
Verified with the following code:
--------------------------------------
ovirt-engine-4.3.0-0.0.master.20180815091554.gitd5455ea.el7.noarch
vdsm-4.30.0-527.gitcec1054.el7.x86_64


Verified with the following scenario:
--------------------------------------
1. Ran a cold move of an iSCSI disk to another iSCSI domain, which failed due to
error injection >>>> the target is successfully cleaned up and the source is
no longer removed as before
2. Ran the cold move again (not LSM, as originally written) and this time it completed successfully



Moving to VERIFIED!

Comment 19 RHV bug bot 2018-12-10 15:13:33 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}']

For more info please contact: rhv-devops

Comment 20 RHV bug bot 2019-01-15 23:36:00 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{'rhevm-4.3-ga': '?'}']

For more info please contact: rhv-devops

Comment 22 errata-xmlrpc 2019-05-08 12:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 23 Franta Kust 2019-05-16 13:06:08 UTC
BZ<2>Jira Resync

