Bug 2012832

Summary: Snapshot recovery reports false result
Product: [oVirt] vdsm Reporter: Liran Rotenberg <lrotenbe>
Component: Core Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED CURRENTRELEASE QA Contact: Nisim Simsolo <nsimsolo>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.40.13 CC: ahadas, bugs, nsimsolo
Target Milestone: ovirt-4.4.10 Keywords: ZStream
Target Release: 4.40.100 Flags: pm-rhel: ovirt-4.4+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: vdsm-4.40.100 Doc Type: Bug Fix
Doc Text:
Previously, if VDSM entered recovery mode while a snapshot was being created, it could report a false result. Now, VDSM reports the correct result of the snapshot operation, based on whether the new volumes are in use.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-19 07:00:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Liran Rotenberg 2021-10-11 12:00:52 UTC
Description of problem:
When creating a snapshot, with or without memory, VDSM may crash.
To handle that, we introduced the recovery mechanism. The mechanism was designed with memory snapshots only in mind, and it relies on detecting whether a libvirt domain job is running. If the domain job has ended and VDSM comes back up afterwards, it may report the snapshot as succeeded. This is a false report, and it leaves unused entities (such as the memory disk) that look usable. It may also lead to merge problems later on.

To summarize:
- If recovery happens in the middle of the domain job, we catch it, since we check the domain job, and that's fine.
- If recovery happens after the domain job, it treats the snapshot as succeeded (only partly true, since the operation might have failed and we missed it somehow).
- If recovery happens before the domain job, it acts incorrectly and marks the snapshot as succeeded.
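
The three cases above can be sketched as a small decision function. This is only an illustration under stated assumptions, not VDSM's actual code: the names `RecoveryState`, `naive_recover`, and `recover_snapshot_result` are hypothetical, and the idea of deciding by whether the VM already uses the new volumes is taken from the Doc Text of this bug.

```python
from dataclasses import dataclass
from enum import Enum


class SnapshotResult(Enum):
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class RecoveryState:
    """What a recovering daemon can observe (hypothetical model)."""
    domain_job_running: bool  # a libvirt domain job is still active
    new_volumes_in_use: bool  # the VM's disks already point at the new volumes


def naive_recover(state: RecoveryState) -> SnapshotResult:
    """Behaviour described in this bug: only the domain job is checked."""
    if state.domain_job_running:
        return SnapshotResult.IN_PROGRESS
    # No running job is taken to mean success, which is wrong both
    # before the job started and after a job that failed.
    return SnapshotResult.SUCCEEDED


def recover_snapshot_result(state: RecoveryState) -> SnapshotResult:
    """Assumed fixed behaviour: decide by the usage of the new volumes."""
    if state.domain_job_running:
        return SnapshotResult.IN_PROGRESS
    if state.new_volumes_in_use:
        return SnapshotResult.SUCCEEDED
    return SnapshotResult.FAILED
```

For a crash before the libvirt call, `RecoveryState(domain_job_running=False, new_volumes_in_use=False)` yields a success report from `naive_recover` but a failure report from `recover_snapshot_result`.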

Steps to Reproduce:
1. Create a VM with a large amount of memory (minimum 10 GB, to give enough time to kill the vdsmd service on the host).
2. Load the VM memory.
3. Create a snapshot with memory (again, to have time).
4. Do one of the following:
   a. Kill VDSM before it reaches the libvirt call (after it writes the snapshot metadata to the VM; hard to reproduce without code modification).
   b. Kill VDSM after the call to libvirt (without letting it come back up), abort the snapshot (domain job) in libvirt to make the snapshot creation fail, and then start VDSM again.

Actual results:
VDSM reports the snapshot as succeeded, which makes the engine report likewise, even though the snapshot failed or was never executed.

Expected results:
Report a failure when the snapshot failed or was not executed.

Additional info:
It will be much easier to reproduce (and possibly to check non-memory snapshots as well) if you modify the code to add a sleep at the right spot.
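
One way to add such a sleep is a small debug helper that is called at the points of interest. The sketch below is hypothetical: the pause point names, `parse_pause_spec`, and `debug_pause` are invented for illustration and do not exist in VDSM, and where to call the helper in the snapshot flow is left to whoever modifies the code.

```python
import time

# Hypothetical pause points; illustrative names, not VDSM identifiers.
PAUSE_POINTS = ("before_libvirt_call", "after_libvirt_call")


def parse_pause_spec(spec):
    """Parse a string such as 'before_libvirt_call=60' into {point: seconds}."""
    pauses = {}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        point, _, seconds = item.partition("=")
        if point not in PAUSE_POINTS:
            raise ValueError(f"unknown pause point: {point!r}")
        pauses[point] = float(seconds)
    return pauses


def debug_pause(point, pauses, sleep=time.sleep):
    """Sleep at `point` if a delay is configured; return the delay used."""
    delay = pauses.get(point, 0.0)
    if delay > 0:
        sleep(delay)
    return delay
```

Calling `debug_pause("before_libvirt_call", pauses)` just before the libvirt call would open a window in which the service can be killed, which corresponds to step 4a.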

Comment 2 Nisim Simsolo 2021-12-02 12:11:36 UTC
Verified:
ovirt-engine-4.4.10-0.17.el8ev
vdsm-4.40.100.1-1.el8ev.x86_64
libvirt-daemon-7.6.0-6.module+el8.5.0+13051+7ddbe958.x86_64
qemu-kvm-6.0.0-33.module+el8.5.0+13041+05be2dc6.x86_64

Verification scenario:
1. Follow the steps to reproduce from the bug description.

Comment 3 Sandro Bonazzola 2022-01-19 07:00:13 UTC
This bugzilla is included in oVirt 4.4.10 release, published on January 18th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.