Bug 1467928 - Shutdown of a vm during snapshot deletion renders the disk invalid
Summary: Shutdown of a vm during snapshot deletion renders the disk invalid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Ala Hino
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks: 1496399
 
Reported: 2017-07-05 14:28 UTC by Andreas Bleischwitz
Modified: 2020-08-13 09:34 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, live merge sometimes failed during "Merge Status", "Destroy Image", or "Destroy Image Check" commands because of network timeout, leaving the top volume in an illegal state. In the current release, the system calls are repeated until they succeed, so that network timeout does not cause live merge to fail.
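
For illustration only: a minimal Python sketch of the retry behaviour described above. The actual fix lives in ovirt-engine's live merge command handling (see the gerrit links below); the names used here (StepTimeout, run_until_success) are hypothetical and do not exist in the oVirt code base.

import time

class StepTimeout(Exception):
    # Hypothetical: raised when a merge step ("Merge Status", "Destroy Image",
    # "Destroy Image Check") fails because the underlying call timed out.
    pass

def run_until_success(step, poll_interval=10):
    # Repeat the step until it completes, so a transient network timeout
    # does not fail the whole live merge.
    while True:
        try:
            return step()
        except StepTimeout:
            time.sleep(poll_interval)  # wait, then issue the same call again
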
Clone Of:
: 1496399 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:43:21 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine.log from snapshot deletion (53.84 KB, text/plain)
2017-07-05 14:28 UTC, Andreas Bleischwitz
no flags
vdsm.log from SPM-host (2.55 KB, text/plain)
2017-07-05 14:29 UTC, Andreas Bleischwitz
no flags
vdsm.log part 1 (14.82 MB, text/plain)
2017-07-13 06:57 UTC, Andreas Bleischwitz
no flags
vdsm.log part 2 (14.89 MB, text/plain)
2017-07-13 06:58 UTC, Andreas Bleischwitz
no flags
vdsm.log part 3 (14.71 MB, text/plain)
2017-07-13 07:07 UTC, Andreas Bleischwitz
no flags
engine.log (10.52 MB, text/plain)
2017-07-24 17:57 UTC, Andreas Bleischwitz
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1383301 0 high CLOSED Snapshot remove Live-Merge failed, After vm shutdown, start again is not possible 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:45:04 UTC
oVirt gerrit 82179 0 master MERGED core: Retry failed live merge commands 2020-07-20 11:48:00 UTC
oVirt gerrit 82418 0 master MERGED Revert "core: Retry failed live merge commands" 2020-07-20 11:48:00 UTC
oVirt gerrit 82419 0 master MERGED Revert "core: Add end procedure to DestroyImageCommand and DestroyImageCheckCommand" 2020-07-20 11:48:00 UTC
oVirt gerrit 82528 0 master MERGED core: Retry failed live merge commands 2020-07-20 11:48:00 UTC

Internal Links: 1383301

Description Andreas Bleischwitz 2017-07-05 14:28:32 UTC
Created attachment 1294645 [details]
engine.log from snapshot deletion

Description of problem:
Taking a snapshot of a VM that has more than one disk and then shutting the VM down during the live removal of that snapshot renders at least one disk invalid.

Version-Release number of selected component (if applicable):
Engine: ovirt-engine-4.1.2.3-0.1.el7.noarch
host:   vdsm-4.19.15-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with at least two disks attached.
2. Create an offline snapshot (may not be relevant).
3. Write some changes to both disks, large enough for the deletion to take some time (see the sketch after this list).
4. Delete the snapshot and shut down the VM during that operation (issue "init 0" / "halt" in the VM).
5. The engine processes the snapshot deletion for some time, but finally aborts the tasks - without having deleted the snapshot on both disks.
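
A minimal sketch of step 3, run inside the guest. The mount points, file name, and data size are assumptions for illustration only - adjust them to wherever the two disks are mounted in the test VM.

import os

# Dirty both disks so the later snapshot delete (live merge) runs long
# enough to shut the VM down in the middle of it.
MOUNT_POINTS = ["/mnt/disk1", "/mnt/disk2"]   # placeholder mount points
CHUNK = 1024 * 1024                           # 1 MiB per write
TOTAL_MIB = 4096                              # ~4 GiB of changes per disk

for mnt in MOUNT_POINTS:
    path = os.path.join(mnt, "merge-test.bin")
    with open(path, "wb") as f:
        for _ in range(TOTAL_MIB):
            f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())  # make sure the changes actually reach the disk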

Actual results:
One or more disks are marked as invalid, depending on the number of additional disks.

Expected results:
Snapshot is removed and disks are not marked as invalid

Additional info:
Starting the VM with an invalid disk is rejected by the qemu process. This renders the VM useless, as the only way to get rid of that status is to remove the disk.

Comment 1 Andreas Bleischwitz 2017-07-05 14:29:19 UTC
Created attachment 1294646 [details]
vdsm.log from SPM-host

Comment 3 Ala Hino 2017-07-12 11:25:08 UTC
Hi Andreas,

I am trying to reproduce this issue.
I did the following:
1. created a VM with 4 disks
2. created a snapshot
3. copied data to each disk
4. deleted the snapshot
5. powered-off the VM while deleting the snapshot.

The delete operation failed, the snapshot was marked as OK, and the status of each disk was illegal. All as expected, and I am able to start the VM again and to delete the snapshot.

Can you please elaborate on what you mean by disks marked as invalid? Where/how do you see that?

Comment 4 Andreas Bleischwitz 2017-07-12 11:38:42 UTC
Hi Ala,

After the snapshot deletion was marked as failed, the disk within the snapshot was also marked as failed.

See Virtual Machines -> [vm] -> Snapshots -> [snapshot] -> disks.

Currently I have not been able to remove *any* of my snapshots from that test machine. I have also not been able to start the VM, as one of the disks has been reported as invalid (Bad volume specification).

Comment 10 Ala Hino 2017-07-12 20:14:34 UTC
Thanks Andreas.

It seems that one of the delete operations succeeded, hence 33b202bd-55e7-4a0f-b6a1-b9057aee8099 doesn't exist.

The attached Vdsm log is partial. Can you please upload full Vdsm log?

Comment 11 Andreas Bleischwitz 2017-07-13 06:57:40 UTC
Created attachment 1297436 [details]
vdsm.log part 1

Comment 12 Andreas Bleischwitz 2017-07-13 06:58:48 UTC
Created attachment 1297437 [details]
vdsm.log part 2

Comment 13 Andreas Bleischwitz 2017-07-13 07:07:21 UTC
Created attachment 1297438 [details]
vdsm.log part 3

Comment 14 Ala Hino 2017-07-24 13:30:20 UTC
Andreas,

Can you please upload the SPM log as well?

It seems that after the VM was shut down, there was an attempt to delete the snapshot while the VM was down (i.e. a cold merge) - is this correct? If so, I'd like to ask you to try the flow again, but this time without doing a cold merge, and see whether it is possible to start the VM after it was shut down during the live merge.

Comment 15 Ala Hino 2017-07-24 17:39:14 UTC
The engine log seems partial; can you please send the full log?

Comment 17 Andreas Bleischwitz 2017-07-24 17:57:12 UTC
Created attachment 1303777 [details]
engine.log

Comment 19 Ala Hino 2017-07-24 18:34:07 UTC
If you have the SPM log from the last failure and can upload it, that would be helpful for further analysis.

Comment 20 Andreas Bleischwitz 2017-07-28 10:21:42 UTC
Hi Ala,

Unfortunately, those logs have been rotated into nirvana. I will attach a complete set of logs once I have had time to reproduce the issue.

Are there any logs other than engine.log and vdsm.log that you would need?

Comment 21 Ala Hino 2017-07-28 11:18:35 UTC
Please remember to upload the logs of the SPM and the host running the VM, in addition to the engine.

It would be very helpful if you could document every step you perform: the number of disks created, the number of snapshots created, and when you perform the shutdown - is it at a specific time or a random time?

Also, the chain info (vdsm and qemu) would be useful - before the merge and after the shutdown (see the collection sketch at the end of this comment).

After the live merge and the shutdown, do you perform a cold merge?
If yes, please try to run the VM *before* and after the cold merge, and send the chain info after the cold merge.
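
For reference, a rough sketch of how the requested chain info could be collected on the host. It assumes qemu-img info --backing-chain and vdsm-tool dump-volume-chains are available there (they should be on a RHV 4.1 host); the UUID and volume path are placeholders to be filled in.

import subprocess

def vdsm_chain(storage_domain_uuid):
    # vdsm's view of the volume chains in one storage domain (run on a host)
    return subprocess.check_output(
        ["vdsm-tool", "dump-volume-chains", storage_domain_uuid], text=True)

def qemu_chain(volume_path):
    # qemu's view of the backing chain of the VM's active volume
    return subprocess.check_output(
        ["qemu-img", "info", "--backing-chain", volume_path], text=True)

if __name__ == "__main__":
    print(vdsm_chain("<storage-domain-uuid>"))                      # placeholder UUID
    print(qemu_chain("/rhev/data-center/<path-to-active-volume>"))  # placeholder path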

Comment 22 Ala Hino 2017-08-09 12:32:58 UTC
Hi Andreas,

Any news on this?

Comment 23 Allon Mureinik 2017-08-10 12:09:05 UTC
Pushing to 4.1.6 until we're able to reproduce.

Comment 26 Allon Mureinik 2017-09-28 12:06:38 UTC
Ala, can we please have some doctext here?

Comment 27 Allon Mureinik 2017-10-01 11:55:46 UTC
The patches have been reverted, as they seem to cause non-deterministic OST failures.
Moving back to ASSIGNED.

Comment 28 Ala Hino 2017-10-15 09:36:58 UTC
(In reply to Allon Mureinik from comment #26)
> Ala, can we please have some doctext here?

Old patches reverted; will add doctext once the bug resolved

Comment 29 Allon Mureinik 2017-10-16 15:26:55 UTC
(In reply to Ala Hino from comment #28)
> (In reply to Allon Mureinik from comment #26)
> > Ala, can we please have some doctext here?
> 
> Old patches reverted; will add doctext once the bug resolved

Ala, the patch in the master branch is now merged.
Should this be marked as MODIFIED?

Comment 30 Ala Hino 2017-10-16 18:55:43 UTC
(In reply to Allon Mureinik from comment #29)
> (In reply to Ala Hino from comment #28)
> > (In reply to Allon Mureinik from comment #26)
> > > Ala, can we please have some doctext here?
> > 
> > Old patches reverted; will add doctext once the bug resolved
> 
> Ala, the patch in the master branch is now merged.
> Should this be marked as MODIFIED?

Yup + added doctext (though still couldn't tick the doctext + sign)

Comment 34 Kevin Alon Goldblatt 2017-11-28 15:23:30 UTC
Checked with the following code:
----------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-53.gitc3edfc0.el7.centos.x86_64

 
Checked with the following scenario:
----------------------------------
 
Steps to Reproduce:
1. Create a VM with at least two disks attached.
2. Create an offline snapshot (may not be relevant).
3. Write some changes to both disks, large enough for the deletion to take some time.
4. Delete the snapshot and shut down the VM during that operation (issue "init 0" / "halt" in the VM).
5. The delete fails and one disk remains in an illegal state.
6. Start the VM again - works fine - delete the snapshot again - works fine.

Also ran the same scenario again with a small difference:
1. In step 5, wait a while longer (20 seconds or so) before killing the VM - this time the snapshot delete completes successfully.
2. Starting the VM again works fine.

Moving to VERIFIED

Comment 37 errata-xmlrpc 2018-05-15 17:43:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 38 Franta Kust 2019-05-16 13:06:58 UTC
BZ<2>Jira Resync

