Description of problem:
VDSM may crash while creating a snapshot, with or without memory.
A recovery mechanism was introduced for this case. It was designed with memory snapshots only in mind, and it works by checking whether a libvirt domain job is running. If the domain job has already ended when VDSM comes back up, VDSM may report the snapshot as succeeded. This is a false report, and it leaves unused entities (such as the memory disk) that look usable. It may also lead to merge problems later on.
- If VDSM crashes in the middle of the domain job, recovery catches it because it checks the domain job, and that case is fine.
- If VDSM crashes after the domain job, recovery treats the snapshot as succeeded. This is only partly true, since the operation might have failed without being noticed.
- If VDSM crashes before the domain job, recovery behaves incorrectly and marks the snapshot as succeeded.
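The three cases above can be sketched as a small decision function. This is only an illustrative sketch, not the actual VDSM recovery code or the fix that was merged: the `Phase` markers, the `Outcome` values, and `recover_snapshot` with its parameters are all hypothetical names, and it assumes the last completed phase is persisted somewhere (for example in the snapshot metadata) before the crash.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical markers, assumed to be persisted before a crash.
    METADATA_WRITTEN = "metadata_written"  # before the libvirt call
    JOB_STARTED = "job_started"            # libvirt call was issued


class Outcome(Enum):
    WAIT_FOR_JOB = "wait_for_job"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def recover_snapshot(phase, job_running, snapshot_exists):
    """Decide what to report after VDSM restarts during a snapshot.

    phase           -- last Phase persisted before the crash
    job_running     -- True if libvirt still reports an active domain job
    snapshot_exists -- True if the expected snapshot result was verified
    """
    if job_running:
        # Middle of the domain job: the case the report says is handled.
        return Outcome.WAIT_FOR_JOB
    if phase is Phase.METADATA_WRITTEN:
        # Crashed before the libvirt call: nothing was executed.
        return Outcome.FAILED
    # After the domain job: do not assume success, verify the result.
    return Outcome.SUCCEEDED if snapshot_exists else Outcome.FAILED
```

The point of the sketch is that "no domain job is running" alone is ambiguous: it holds both before and after the job, so recovery needs extra information to tell the cases apart.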
Steps to Reproduce:
1. Create a VM with a large amount of memory (at least 10 GB, to allow enough time to kill the vdsmd service on the host).
2. Load the VM memory.
3. Create a snapshot with memory (again, to allow enough time).
a. Kill VDSM before it reaches the libvirt call (after it writes the snapshot metadata to the VM; hard to reproduce without code modification).
b. Kill VDSM after the call to libvirt (without letting it come back up), abort the snapshot (domain job) in libvirt so that the snapshot creation fails, then start VDSM again.
Actual results:
VDSM reports the snapshot as succeeded, and the engine reports likewise, even though the snapshot failed or was never executed.
Expected results:
A failure is reported when needed.
It is much easier to reproduce (and possibly to check non-memory snapshots as well) if you modify the code to add a sleep at the right spot.
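One way to add such a sleep without hardcoding it is a small helper that pauses only when asked to. This is a rough sketch under assumptions: `debug_delay`, the `SNAPSHOT_DEBUG_DELAY` variable, and the point names are hypothetical and do not exist in VDSM. The helper would have to be called by hand at the spot of interest, for example just before or just after the libvirt snapshot call.

```python
import os
import time


def debug_delay(point, environ=os.environ, sleep=time.sleep):
    """Pause at a named point if SNAPSHOT_DEBUG_DELAY requests it.

    SNAPSHOT_DEBUG_DELAY is a hypothetical variable of the form
    "<point>:<seconds>", e.g. "before_libvirt_call:120".
    Returns the number of seconds slept (0.0 if no delay applied).
    """
    spec = environ.get("SNAPSHOT_DEBUG_DELAY", "")
    name, _, seconds = spec.partition(":")
    if name != point or not seconds:
        # Variable unset, malformed, or aimed at a different point.
        return 0.0
    delay = float(seconds)
    sleep(delay)
    return delay
```

With `SNAPSHOT_DEBUG_DELAY=before_libvirt_call:120` set for the process, a call to `debug_delay("before_libvirt_call")` would hold execution for two minutes, which leaves time to kill vdsmd in the window described in step 3a.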
1. Follow the steps to reproduce from the bug description.
This bugzilla is included in oVirt 4.4.10 release, published on January 18th 2022.
Since the problem described in this bug report should be resolved in oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.