1496399 – [downstream clone - 4.1.7] Shutdown of a vm during snapshot deletion renders the disk invalid

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.1.2
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	ovirt-4.1.7
Target Release:	---
Assignee:	Ala Hino
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:	1467928
Blocks:	1384321

Reported:	2017-09-27 09:53 UTC by rhev-integ
Modified:	2020-08-13 09:44 UTC
CC List:	14 users
Fixed In Version:	ovirt-engine-4.1.7.4
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1467928
Environment:
Last Closed:	2017-11-07 17:27:54 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:3138	normal	SHIPPED_LIVE	Red Hat Virtualization Manager (ovirt-engine) 4.1.7	2017-11-07 22:22:33 UTC
oVirt gerrit	82179	master	MERGED	core: Retry failed live merge commands	2017-09-27 09:55:42 UTC
oVirt gerrit	82282	ovirt-engine-4.1	MERGED	core: Retry failed live merge commands	2017-09-27 14:52:17 UTC
oVirt gerrit	82365	ovirt-engine-4.1	MERGED	core: Add end procedure to DestroyImageCommand and DestroyImageCheckCommand	2017-10-01 07:50:26 UTC
oVirt gerrit	82420	ovirt-engine-4.1	ABANDONED	Revert "core: Retry failed live merge commands"	2017-10-01 11:55:50 UTC
oVirt gerrit	82421	ovirt-engine-4.1	ABANDONED	Revert "core: Add end procedure to DestroyImageCommand and DestroyImageCheckCommand"	2017-10-01 11:56:13 UTC
oVirt gerrit	82426	ovirt-engine-4.1	MERGED	Revert "core: Retry failed live merge commands"	2017-10-01 12:44:56 UTC
oVirt gerrit	82427	ovirt-engine-4.1	MERGED	Revert "core: Add end procedure to DestroyImageCommand and DestroyImageCheckCommand"	2017-10-01 12:45:00 UTC
oVirt gerrit	82814	ovirt-engine-4.1	MERGED	core: Retry failed live merge commands	2017-10-17 07:45:57 UTC

Description rhev-integ 2017-09-27 09:53:29 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1467928 +++
======================================================================

Created attachment 1294645 [details]
engine.log from snapshot deletion

Description of problem:
Taking a snapshot of a vm containing more than one disk and shutting down that vm during live-remove of that snapshot renders at least one disk as invalid.

Version-Release number of selected component (if applicable):
Engine: ovirt-engine-4.1.2.3-0.1.el7.noarch
host:   vdsm-4.19.15-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with at least two disks attached.
2. Create an offline snapshot (may not be relevant).
3. Write changes to both disks, large enough that the deletion takes some time.
4. Delete the snapshot and shut down the VM while the deletion is running (issue "init 0" / "halt" inside the VM).
5. The engine processes the snapshot deletion for some time but finally aborts the tasks, without having deleted the snapshot of both disks.

Actual results:
One or more disks are marked as invalid, depending on the number of additional disks.

Expected results:
Snapshot is removed and disks are not marked as invalid

Additional info:
Starting the VM with an invalid disk is rejected by the qemu process. This renders the VM useless, as the only way to get rid of that status is to remove the disk.

(Originally by Andreas Bleischwitz)

Comment 1 rhev-integ 2017-09-27 09:53:38 UTC

Created attachment 1294646 [details]
vdsm.log from SPM-host

(Originally by Andreas Bleischwitz)

Comment 4 rhev-integ 2017-09-27 09:53:48 UTC

Hi Andreas,

I am trying to reproduce this issue.
I did the following:
1. created a VM with 4 disks
2. created a snapshot
3. copied data to each disk
4. deleted the snapshot
5. powered-off the VM while deleting the snapshot.

The delete operation failed, the snapshot was marked as OK, and the status of each disk was illegal. All as expected, and I am able to start the VM again and to delete the snapshot.

Can you please elaborate on what you mean by disks marked as invalid? Where/how do you see that?

(Originally by Ala Hino)

Comment 5 rhev-integ 2017-09-27 09:53:53 UTC

Hi Ala,

After my deletion of the snapshot was marked as failed, the disk within the snapshot was also marked as failed.

See Virtual Machines -> [vm] -> Snapshots -> [snapshot] -> disks.

Currently I have not been able to remove *any* of my snapshots from that test machine. I also have not been able to start the VM, as one of the disks has been reported as invalid (Bad volume specification).

(Originally by Andreas Bleischwitz)

Comment 11 rhev-integ 2017-09-27 09:54:23 UTC

Thanks Andreas.

It seems that one of the delete operations succeeded, hence 33b202bd-55e7-4a0f-b6a1-b9057aee8099 doesn't exist.

The attached Vdsm log is partial. Can you please upload the full Vdsm log?

(Originally by Ala Hino)

Comment 12 rhev-integ 2017-09-27 09:54:28 UTC

Created attachment 1297436 [details]
vdsm.log part 1

(Originally by Andreas Bleischwitz)

Comment 13 rhev-integ 2017-09-27 09:54:33 UTC

Created attachment 1297437 [details]
vdsm.log part 2

(Originally by Andreas Bleischwitz)

Comment 14 rhev-integ 2017-09-27 09:54:39 UTC

Created attachment 1297438 [details]
vdsm.log part 3

(Originally by Andreas Bleischwitz)

Comment 15 rhev-integ 2017-09-27 09:54:44 UTC

Andreas,

Can you please upload the SPM log as well?

It seems that after the VM was shut down, there was an attempt to delete the snapshot while the VM was down (aka cold merge). Is this correct? If so, I'd like to ask you to try the flow again, but this time without doing a cold merge, and see whether it is possible to start the VM after it was shut down during the live merge.

(Originally by Ala Hino)

Comment 16 rhev-integ 2017-09-27 09:54:48 UTC

The engine log seems partial, can you please send the full log?

(Originally by Ala Hino)

Comment 18 rhev-integ 2017-09-27 09:54:58 UTC

Created attachment 1303777 [details]
engine.log

(Originally by Andreas Bleischwitz)

Comment 20 rhev-integ 2017-09-27 09:55:08 UTC

If you have the SPM log of the last failure and can upload it, that would help with further analysis.

(Originally by Ala Hino)

Comment 21 rhev-integ 2017-09-27 09:55:13 UTC

Hi Ala,

Those logs have been rotated into nirvana, unfortunately. I will attach a complete set of logs after I have had time to reproduce the issue.

Do you need any logs other than engine.log and vdsm.log?

(Originally by Andreas Bleischwitz)

Comment 22 rhev-integ 2017-09-27 09:55:17 UTC

Please remember to upload the logs of the SPM and the host running the VM, in addition to the engine.

It would be very helpful if you could document every step you perform: the number of disks created, the number of snapshots created, and the time at which you perform the shutdown. Is it a specific time or a random time?

Also, the chain info (vdsm and qemu) would be useful - before the merge and after the shutdown.

After the live merge and the shutdown, do you perform a cold merge?
If yes, please try to run the VM *before* and after the cold merge, and send the chain info after the cold merge.
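
The before/after chain comparison requested above can be sketched as follows. This is a minimal illustration, not part of vdsm or the engine. It assumes the qemu side of the chain was captured on the host with `qemu-img info --backing-chain --output=json <volume path>`, which prints a JSON array with one object per layer, top layer first. The file names in the sample data and the helper names `parse_backing_chain` and `removed_layers` are hypothetical.

```python
import json
from typing import List, Optional, Tuple

Layer = Tuple[str, Optional[str]]  # (filename, backing filename or None)


def parse_backing_chain(qemu_img_json: str) -> List[Layer]:
    """Parse JSON from `qemu-img info --backing-chain --output=json`.

    Returns (filename, backing-filename) pairs, top layer first.
    The base layer has no "backing-filename" key, so it maps to None.
    """
    images = json.loads(qemu_img_json)
    return [(img["filename"], img.get("backing-filename")) for img in images]


def removed_layers(before: List[Layer], after: List[Layer]) -> List[str]:
    """List the layers present before the merge that are gone afterwards."""
    remaining = {filename for filename, _ in after}
    return [filename for filename, _ in before if filename not in remaining]


# Hypothetical sample data: leaf -> snap1 -> base before the merge,
# leaf -> base after snap1 has been merged into base.
SAMPLE_BEFORE = json.dumps([
    {"filename": "/images/leaf.qcow2", "format": "qcow2",
     "backing-filename": "/images/snap1.qcow2"},
    {"filename": "/images/snap1.qcow2", "format": "qcow2",
     "backing-filename": "/images/base.raw"},
    {"filename": "/images/base.raw", "format": "raw"},
])
SAMPLE_AFTER = json.dumps([
    {"filename": "/images/leaf.qcow2", "format": "qcow2",
     "backing-filename": "/images/base.raw"},
    {"filename": "/images/base.raw", "format": "raw"},
])

if __name__ == "__main__":
    before = parse_backing_chain(SAMPLE_BEFORE)
    after = parse_backing_chain(SAMPLE_AFTER)
    print("removed:", removed_layers(before, after))
```

A layer that the engine still lists but that is missing from the qemu chain (or the other way round) is the kind of mismatch this comparison is meant to surface.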

(Originally by Ala Hino)

Comment 23 rhev-integ 2017-09-27 09:55:21 UTC

Hi Andreas,

Any news on this?

(Originally by Ala Hino)

Comment 24 rhev-integ 2017-09-27 09:55:27 UTC

Pushing to 4.1.6 until we're able to reproduce.

(Originally by Allon Mureinik)

Comment 27 Allon Mureinik 2017-10-01 13:08:48 UTC

This patch caused OST failures and was reverted. Moving back to ASSIGNED.

Comment 28 Raz Tamir 2017-10-25 07:30:45 UTC

Based on comment #27 - moving back to assigned

Comment 29 Allon Mureinik 2017-10-25 08:58:23 UTC

New patches were merged after the original ones were reverted (check out the block between comment 27 and comment 28).
Moving back to ON_QA

Comment 30 Raz Tamir 2017-10-25 09:19:49 UTC

Thanks Allon

Comment 31 Kevin Alon Goldblatt 2017-10-26 13:53:34 UTC

Verified with the following code:
-----------------------------------------
ovirt-engine-4.1.7.4-0.1.el7.noarch
vdsm-4.19.35-1.el7ev.x86_64


Verified with the following scenario:
-----------------------------------------
Steps to reproduce:
1. On a vm with 2 disks create a snapshot.
2. Add some data
3. Live merge the leaf's parent snapshot and watch the engine logs
4. When the merge step completes, block access from the host to the engine
5. At this point, the engine keeps retrying the merge status check and fails because there is no access to the host
6. Shut down the VM - snapshot deletion completes successfully
7. Start the VM, works fine - no disks are left invalid

Moving to VERIFIED!
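
The retry behaviour exercised in steps 4-6 can be illustrated with a small, self-contained sketch. It is not the ovirt-engine implementation (the merged patch is only known here by its title, "core: Retry failed live merge commands"). Every name below (`MergeStatusPoller`, `HostUnreachable`, `make_flaky_check`) is hypothetical, and the host is simulated by a function that fails a fixed number of times before reporting completion.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


class HostUnreachable(Exception):
    """Raised by the simulated status check when the host cannot be reached."""


@dataclass
class MergeStatusPoller:
    """Retry a merge status check instead of failing on the first error."""
    check_status: Callable[[], str]  # hypothetical: returns "DONE" or "IN_PROGRESS"
    max_attempts: int = 5
    delay_seconds: float = 0.0
    attempts: List[str] = field(default_factory=list)

    def run(self) -> str:
        for _ in range(self.max_attempts):
            try:
                status = self.check_status()
            except HostUnreachable:
                # Record the failure and try again rather than aborting.
                self.attempts.append("unreachable")
                time.sleep(self.delay_seconds)
                continue
            self.attempts.append(status)
            if status == "DONE":
                return "SUCCEEDED"
            time.sleep(self.delay_seconds)
        return "FAILED"


def make_flaky_check(failures_before_done: int) -> Callable[[], str]:
    """Build a simulated check that is unreachable N times, then reports DONE."""
    state = {"calls": 0}

    def check() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_done:
            raise HostUnreachable("host not reachable")
        return "DONE"

    return check


if __name__ == "__main__":
    poller = MergeStatusPoller(check_status=make_flaky_check(3))
    print(poller.run(), poller.attempts)
```

The design point is that an unreachable host counts as a retryable condition, so the command only fails once the attempt budget is exhausted.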

Comment 37 errata-xmlrpc 2017-11-07 17:27:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3138
