Bug 1949475 - If pivot failed during live merge, top volume is left illegal, requires manual fix if vm is stopped
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.60.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.3
Assignee: Roman Bednář
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-14 11:39 UTC by Nir Soffer
Modified: 2022-04-20 06:33 UTC

Fixed In Version: vdsm-4.50.0.3
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-20 06:33:59 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 117023 | 0 | master | MERGED | tests: livemerge: verify imageSyncVolumeChain arguments | 2021-10-20 16:26:33 UTC
oVirt gerrit | 117167 | 0 | master | MERGED | livemerge: recover from failed pivot attempt | 2021-11-13 02:02:14 UTC
oVirt gerrit | 117261 | 0 | master | MERGED | image: allow leaf legality status recovery when syncing chain | 2021-11-13 02:02:11 UTC
oVirt gerrit | 117262 | 0 | master | MERGED | livemerge: add helper for marking leaf volume illegal | 2021-11-02 14:36:09 UTC
oVirt gerrit | 117344 | 0 | master | MERGED | vm: add imageSyncVolumeChain() wrapper | 2021-11-02 14:36:05 UTC
oVirt gerrit | 117345 | 0 | master | MERGED | image: improve log messages for volume chain sync | 2021-11-02 14:36:11 UTC
oVirt gerrit | 117346 | 0 | master | MERGED | tests: add pivot test with unavailable storage | 2021-11-13 02:02:08 UTC

Description Nir Soffer 2021-04-14 11:39:40 UTC
Description of problem:

During live merge, when libvirt reports that the block commit block job is
ready for pivot, vdsm changes the top volume to ILLEGAL before trying to pivot
to the base volume.

Changing the top volume to ILLEGAL is required to avoid data corruption in
case the pivot was successful, but vdsm was killed before it could update
metadata on storage. After successful pivot, the VM is using the base volume
instead of the top volume, and new data may be written to the base volume.
If you then start the VM from the top volume, the filesystem on the top
volume is likely to be corrupted.
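
The ordering argument above can be illustrated with a small simulation. This is only a sketch, not vdsm code: the `Storage` class, the `pivot_then_crash` helper, and the volume names "base" and "top" are hypothetical stand-ins for volume metadata on storage and for a process that is killed right after a successful pivot.

```python
LEGAL = "LEGAL"
ILLEGAL = "ILLEGAL"


class Storage:
    """Hypothetical in-memory stand-in for volume metadata on storage."""

    def __init__(self):
        self.legality = {"base": LEGAL, "top": LEGAL}

    def startable_from(self, volume):
        # In this sketch, a VM may only be started from a LEGAL volume.
        return self.legality[volume] == LEGAL


def pivot_then_crash(storage, mark_illegal_first):
    """Simulate a successful pivot followed by a crash before the
    metadata update that would drop the top volume from the chain."""
    if mark_illegal_first:
        storage.legality["top"] = ILLEGAL
    # The pivot succeeds here: the VM now writes to "base".
    # The process is killed at this point, so no further metadata
    # changes reach storage.
    return storage


safe = pivot_then_crash(Storage(), mark_illegal_first=True)
unsafe = pivot_then_crash(Storage(), mark_illegal_first=False)

# With the ILLEGAL mark in place, the stale top volume cannot be used.
print(safe.startable_from("top"))    # False
# Without it, the VM could be started from the stale top volume.
print(unsafe.startable_from("top"))  # True
```

In the sketch, marking the top volume ILLEGAL before the pivot is what keeps the stale volume unusable after the simulated crash.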

However, if the pivot failed (for example, bug 1945635) and the VM is then
stopped, starting the VM again will fail and will require a manual fix of the
top volume metadata. This is likely to lead to downtime and require support.

When pivot failed, we know that the VM is still using the top volume, so 
there is no reason to keep the top volume as ILLEGAL.

Change pivot flow to restore the top volume legal state.

If pivot failed:
- Get the current chain from libvirt
- If the top volume is still in the chain, set top volume to LEGAL. 

If storage becomes inaccessible at this point, restoring the volume's LEGAL
state will fail. The volume will be fixed on the next pivot attempt.
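
The proposed flow could look roughly like the following sketch. It is not the vdsm implementation: `try_pivot`, `PivotError`, `StorageUnavailable`, and the callables passed in (`get_chain`, `set_legality`, `pivot`) are hypothetical names chosen for illustration, and the chain is modelled as a plain list of volume names.

```python
import logging

log = logging.getLogger("livemerge-sketch")

LEGAL = "LEGAL"
ILLEGAL = "ILLEGAL"


class PivotError(Exception):
    """Hypothetical error raised when the pivot fails."""


class StorageUnavailable(Exception):
    """Hypothetical error raised when metadata cannot be written."""


def try_pivot(top, get_chain, set_legality, pivot):
    """Attempt a pivot; on failure, try to restore the top volume.

    Returns True if the pivot succeeded, False otherwise.
    """
    # Mark the top volume ILLEGAL first, so a crash after a successful
    # pivot cannot leave a stale but startable top volume.
    set_legality(top, ILLEGAL)
    try:
        pivot()
    except PivotError:
        log.warning("Pivot failed, checking current chain")
        # Ask the hypervisor which volumes the VM is still using.
        if top in get_chain():
            try:
                set_legality(top, LEGAL)
            except StorageUnavailable:
                # Leave the volume ILLEGAL; a later pivot attempt
                # goes through this flow again.
                log.warning("Could not restore %s to LEGAL", top)
        return False
    return True


def failing_pivot():
    raise PivotError("simulated pivot failure")


metadata = {"base": LEGAL, "top": LEGAL}
ok = try_pivot(
    "top",
    get_chain=lambda: ["base", "top"],
    set_legality=metadata.__setitem__,
    pivot=failing_pivot,
)
print(ok, metadata["top"])  # False LEGAL
```

The restore step is best-effort in this sketch: a storage failure while restoring is logged rather than raised, which matches the idea that the volume is fixed on the next pivot attempt.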

Comment 1 Evelina Shames 2021-07-19 06:16:37 UTC
Hi Roman/Nir, please provide a clear verification flow.
Thanks.

Comment 2 Nir Soffer 2021-07-26 09:55:16 UTC
I'm not sure we have a way to reproduce this issue. This happened in the
past due to a bug in libvirt, and since the bug was fixed it should never
happen.

Simulating this in a real system requires a way to inject errors in libvirt
or qemu. Peter, do we have such a capability?

Comment 3 Peter Krempa 2021-07-26 10:09:30 UTC
No, in this instance it's not possible to simulate the outcome that occurred due to the bug. The issue was that the job completed properly, but libvirt then emitted the wrong state afterwards. Since the bug is now fixed, our APIs can't simulate that.

Comment 5 sshmulev 2022-03-06 09:25:21 UTC
Verified with the tier1-3 automation regression tests related to live merge.
No failures related to this bug were detected in the live merge tests.

Versions:
vdsm	vdsm-4.50.0.5-1.el8ev.x86_64
ovirt-engine	ovirt-engine-4.5.0-582.gd548206.185.el8ev.noarch
libvirt	libvirt-8.0.0-2.module+el8.6.0+14025+ca131e0a.x86_64

Comment 7 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

