Bug 2012832 - Snapshot recovery reports false result
Summary: Snapshot recovery reports false result
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.10
Target Release: 4.40.100
Assignee: Liran Rotenberg
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-11 12:00 UTC by Liran Rotenberg
Modified: 2022-01-19 07:00 UTC

Fixed In Version: vdsm-4.40.100
Doc Type: Bug Fix
Doc Text:
Previously, if VDSM entered recovery mode while a snapshot was being created, it could report a false result. VDSM now reports the correct result of the snapshot operation, based on the usage of the new volumes.
Clone Of:
Environment:
Last Closed: 2022-01-19 07:00:13 UTC
oVirt Team: Virt
pm-rhel: ovirt-4.4+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHV-43787 | 0 | None | None | None | 2021-10-11 12:01:41 UTC
oVirt gerrit | 117051 | 0 | None | MERGED | snapshot: refactor | 2021-10-11 12:06:14 UTC
oVirt gerrit | 117052 | 0 | master | MERGED | snapshot: fix the recovery mechanism | 2021-10-21 15:50:47 UTC
oVirt gerrit | 117309 | 0 | ovirt-4.4.z | MERGED | snapshot: refactor | 2021-10-26 15:33:58 UTC
oVirt gerrit | 117310 | 0 | ovirt-4.4.z | MERGED | snapshot: fix the recovery mechanism | 2021-10-26 15:34:01 UTC

Description Liran Rotenberg 2021-10-11 12:00:52 UTC
Description of problem:
VDSM may crash while a snapshot, with or without memory, is being created.
For that case we introduced the recovery mechanism. The mechanism was designed with memory snapshots only in mind, and it works by checking whether a libvirt domain job is running. If the domain job has already ended when VDSM comes back up, VDSM may report the snapshot as succeeded. That is a false report, and it leaves unused entities (such as the memory disk) that look usable. It may also lead to merge problems later on.

To summarize:
- If VDSM comes back in the middle of the domain job, recovery catches it, because it checks the domain job. That case is fine.
- If VDSM comes back after the domain job, it treats the snapshot as succeeded (semi-true, since the operation might have failed and we missed it somehow).
- If VDSM comes back before the domain job, it behaves incorrectly and marks the snapshot as succeeded.
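
The three cases above can be illustrated with a small model. This is only a sketch, not VDSM's code: `Phase`, `RecoveryState`, `old_recover` and `fixed_recover` are hypothetical names, and the fixed variant assumes, following the Doc Text, that the outcome is decided by whether the VM is using the new volumes.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Where the snapshot flow was when VDSM came back up (hypothetical model)."""
    BEFORE_JOB = "before the domain job"
    DURING_JOB = "in the middle of the domain job"
    AFTER_JOB = "after the domain job"


@dataclass
class RecoveryState:
    phase: Phase
    # True only if the VM's disks actually point at the new snapshot volumes.
    vm_uses_new_volumes: bool


def old_recover(state: RecoveryState) -> str:
    """Behavior described in this bug: only a running domain job is checked."""
    if state.phase is Phase.DURING_JOB:
        return "wait for job"
    # No running job found: the snapshot is assumed to have succeeded.
    return "succeeded"


def fixed_recover(state: RecoveryState) -> str:
    """Assumed fix: decide the result from the usage of the new volumes."""
    if state.phase is Phase.DURING_JOB:
        return "wait for job"
    return "succeeded" if state.vm_uses_new_volumes else "failed"


if __name__ == "__main__":
    # Compare both variants for a snapshot that never switched to new volumes.
    for phase in Phase:
        state = RecoveryState(phase, vm_uses_new_volumes=False)
        print(f"{phase.value}: old={old_recover(state)}, fixed={fixed_recover(state)}")
```

In this model the old logic reports success in the "before" and "after" cases even when the VM never switched to the new volumes, which is the false report described here.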

Steps to Reproduce:
1. Create a VM with a large amount of memory (at least 10 GB, to leave enough time to kill the vdsmd service on the host).
2. Load the VM memory.
3. Create a snapshot with memory (again, to leave enough time).
4. Kill VDSM at one of these points:
   a. Before it reaches the libvirt call (after the snapshot metadata has been written to the VM - hard to reproduce without code modification).
   b. After the call to libvirt. Without letting VDSM come back up, abort the snapshot (the domain job) in libvirt so that the snapshot creation fails, then start VDSM again.
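
The abort in step 4b can be done with `virsh domjobabort`, or from Python. The sketch below assumes the libvirt-python bindings and a domain object obtained elsewhere (for example through `libvirt.open(...)` and `lookupByName(...)`). `abort_running_job` is a hypothetical helper, not part of VDSM, and the authentication that a libvirt connection on an oVirt host normally requires is not shown.

```python
# Mirrors libvirt.VIR_DOMAIN_JOB_NONE; defined locally so the helper can be
# exercised without the libvirt bindings installed.
VIR_DOMAIN_JOB_NONE = 0


def abort_running_job(dom) -> bool:
    """Abort the active job on a libvirt domain, if there is one.

    `dom` is expected to behave like libvirt.virDomain: jobInfo() returns a
    list whose first item is the job type, and abortJob() cancels the job.
    Returns True if an abort was requested, False if no job was running.
    """
    job_type = dom.jobInfo()[0]
    if job_type == VIR_DOMAIN_JOB_NONE:
        return False
    dom.abortJob()
    return True
```

Returning a boolean makes it easy to tell whether the domain job was still running when the abort was attempted, which matters for telling the "during" and "after" cases apart.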

Actual results:
VDSM reports the snapshot as succeeded, which makes the engine report the same, even though the snapshot failed or was never executed.

Expected results:
VDSM reports a failure when the snapshot failed or was not executed.

Additional info:
It will be much easier to reproduce (and possibly to check non-memory snapshots as well) if you modify the code to add a sleep at the right spot.
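
One way to add such a sleep is a small helper that pauses only when an environment variable is set, so the modified code behaves normally otherwise. This is a hypothetical debugging aid: `debug_pause` and `SNAPSHOT_DEBUG_SLEEP` are made-up names, and where to call it inside the snapshot code is not specified in this report.

```python
import os
import time


def debug_pause(point: str, env_var: str = "SNAPSHOT_DEBUG_SLEEP") -> float:
    """Sleep at a named point if env_var is set to '<point>:<seconds>'.

    Returns the number of seconds slept (0.0 when the pause is not enabled
    for this point).
    """
    spec = os.environ.get(env_var, "")
    name, _, seconds = spec.partition(":")
    if name != point or not seconds:
        return 0.0
    delay = float(seconds)
    time.sleep(delay)
    return delay
```

For example, with the variable set to `before_libvirt_call:60` in the service environment, a call to `debug_pause("before_libvirt_call")` placed just before the libvirt call would leave a 60-second window to kill VDSM for step 4a.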

Comment 2 Nisim Simsolo 2021-12-02 12:11:36 UTC
Verified:
ovirt-engine-4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev.x86_64
libvirt-daemon-7.6.0-6.module+el8.5.0+13051+7ddbe958.x86_64
qemu-kvm-6.0.0-33.module+el8.5.0+13041+05be2dc6.x86_64

Verification scenario:
1. Follow the steps to reproduce from the bug description.

Comment 3 Sandro Bonazzola 2022-01-19 07:00:13 UTC
This bug is included in the oVirt 4.4.10 release, published on January 18th, 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

