Bug 1984209
| Summary: | VDSM reports failed snapshot to engine, but it succeeded. Then engine deletes the volume and causes data corruption. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Liran Rotenberg <lrotenbe> |
| Status: | CLOSED ERRATA | QA Contact: | Tamir <tamir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.4.6 | CC: | ahadas, bcholler, dfodor, emarcus, eshenitz, lsurette, mzamazal, pagranat, srevivo, ycui |
| Target Milestone: | ovirt-4.4.8 | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | vdsm-4.40.80.4 | Doc Type: | Bug Fix |
| Doc Text: |
Previously, when failing to execute a snapshot and re-executing it later, the second try would fail due to using the previous execution data. In this release, this data will be used only when needed, in recovery mode.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-09-08 14:11:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Germano Veit Michel
2021-07-21 00:29:49 UTC
If we were looking for justification to make changes in this mechanism (bz 1985973), then here we have one.
We can try to reproduce it if needed by modifying the vdsm code to fail preparing the volume(s) on the first attempt and see what happens on the next attempt.
I see a few problems:
The main one is that when the snapshot fails, the metadata stays.
I just added a raise at the same place as the first snapshot failure, and I got this metadata on the VM:
<ovirt-vm:snapshot_job>{"startTime": "1599532.231465163", "timeout": "1800", "abort": true, "completed": false, "jobUUID": "c249b960-0c47-4c79-857e-f33b4e3b79ab", "frozen": false, "memoryParams": {}}</ovirt-vm:snapshot_job>
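The content of that element is plain JSON, so the leftover state is easy to inspect. A minimal sketch, assuming the element text has already been extracted from the VM metadata; `is_stale_abort` is a hypothetical helper name for illustration, not something that exists in vdsm:

```python
import json

# Payload copied from the <ovirt-vm:snapshot_job> element above.
payload = ('{"startTime": "1599532.231465163", "timeout": "1800", '
           '"abort": true, "completed": false, '
           '"jobUUID": "c249b960-0c47-4c79-857e-f33b4e3b79ab", '
           '"frozen": false, "memoryParams": {}}')


def is_stale_abort(raw):
    """Hypothetical helper: True if the job was flagged to abort but never completed."""
    job = json.loads(raw)
    return bool(job.get("abort")) and not job.get("completed")


print(is_stale_abort(payload))
```

Any later snapshot that reads this record inherits an abort flag that belongs to the failed attempt.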
When we start a snapshot, we try to read it in the initialization phase:
self._snapshot_job = read_snapshot_md(self._vm, self._lock)
self._load_metadata()
It makes sense to read it only on recovery.
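A rough sketch of what "only on recovery" could look like. Everything here is an assumption for illustration: `SnapshotJob`, its `recovery` parameter and the `stored_md` dict are stand-ins, not the actual vdsm classes or signatures.

```python
class SnapshotJob:
    """Illustrative stand-in for the snapshot job object; not the vdsm class."""

    def __init__(self, stored_md, recovery=False):
        # Every new snapshot request starts from a clean state.
        self.abort = False
        self.completed = False
        if recovery and stored_md:
            # Only a recovering job resumes from the persisted state.
            self.abort = bool(stored_md.get("abort", False))
            self.completed = bool(stored_md.get("completed", False))


# Metadata left behind by a failed first attempt.
stale = {"abort": True, "completed": False}

fresh = SnapshotJob(stale)                   # new request ignores stale data
resumed = SnapshotJob(stale, recovery=True)  # recovery picks it up
```

With this split, a second snapshot attempt no longer sees the abort flag of the first one, while a recovered job still does.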
The second thing is where we look for the abort flag: currently it is checked only in the later phases of the snapshot operation. We may want to catch it earlier.
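To illustrate the idea of catching the flag earlier, here is a small sketch that checks it before the irreversible step. The names `SnapshotAborted`, `run_snapshot` and the `create_snapshot` callable are hypothetical; the callable only stands in for the libvirt call and does not reflect the real vdsm flow.

```python
class SnapshotAborted(Exception):
    """Hypothetical error raised when a job is already flagged for abort."""


def run_snapshot(job_md, create_snapshot):
    # Check the flag before the irreversible step, not after it.
    if job_md.get("abort"):
        raise SnapshotAborted(job_md.get("jobUUID"))
    create_snapshot()  # stand-in for the libvirt call
    job_md["completed"] = True
    return job_md


calls = []
try:
    run_snapshot({"abort": True, "jobUUID": "example"},
                 lambda: calls.append("pivot"))
except SnapshotAborted:
    pass  # the snapshot call was never reached, so `calls` stays empty
```

The point is only the ordering: if the job is going to be reported as failed, that decision should be made before libvirt has created the snapshot.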
Combining the two: the abort flag is true, we go ahead with the second snapshot and call libvirt (the pivot happens), but then we fail the job while it is running, even though the snapshot itself succeeds.
We improved some of the handling not long ago (4.4.7): https://gerrit.ovirt.org/#/c/ovirt-engine/+/113756/ but I'm not sure it's enough. We must at least clear the metadata, or not read it when we aren't in recovery.
Taking into consideration https://bugzilla.redhat.com/show_bug.cgi?id=1984209#c7, QE can't reproduce the issue. We are relying on the engineers' verification by changing the code.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Virtualization Host security and bug fix update [ovirt-4.4.8]), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3459