Bug 1821164
| Summary: | Failed snapshot creation can cause data corruption of other VMs |
|---|---|
| Product: | Red Hat Enterprise Virtualization Manager |
| Reporter: | Roman Hodain <rhodain> |
| Component: | ovirt-engine |
| Assignee: | Liran Rotenberg <lrotenbe> |
| Status: | CLOSED ERRATA |
| QA Contact: | Shir Fishbain <sfishbai> |
| Severity: | urgent |
| Docs Contact: | |
| Priority: | urgent |
| Version: | unspecified |
| CC: | aefrat, aoconnor, bzlotnik, gveitmic, lrotenbe, lsvaty, michal.skrivanek, mkalinin, mlehrer, pelauter, rdlugyhe, tnisan |
| Target Milestone: | ovirt-4.4.0 |
| Keywords: | ZStream |
| Target Release: | 4.4.1 |
| Flags: | lsvaty: testing_plan_complete- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | While the RHV Manager is creating a virtual machine (VM) snapshot, it can time out and fail while trying to freeze the file system. If this happens, more than one VM can write data to the same logical volume and corrupt the data on it. In the current release, you can prevent this condition by configuring the Manager to freeze the VM's guest filesystems before creating a snapshot. To enable this behavior, run the engine-config tool and set the `LiveSnapshotPerformFreezeInEngine` key-value pair to `true`. |
| Story Points: | --- |
| Clone Of: | |
| | 1842375 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-08-04 13:22:22 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | Virt |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1842375 |
Description
Roman Hodain
2020-04-06 08:32:34 UTC
This issue may be fixed by Bug 1749284 (Change the Snapshot operation to be asynchronous), but there may still be potential for this behaviour if we do not handle all the corner cases properly.
Benny, do you think it will be possible to check the Domain XML dump to figure out if the VM is currently using an image that we are going to roll back, and in that case roll forward?

(In reply to Tal Nisan from comment #9)
> Benny, you think it will be possible to check the Domain XML dump to figure
> out if the VM is currently using an image that we are going to rollback and
> in that case roll forward?

It's strange, because we have this check [1]. I checked the logs and it seems the XML dump didn't contain the new volumes, so from the engine's POV they weren't used (they are part of the dump later on, when live merge runs). I didn't find logs from the host while the VM was running, so I'm not entirely sure what happened.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateSnapshotCommand.java#L216

Well, there's timing to consider between getVolumeChain() and the actual removal. It would be best to have such a check on the vdsm side, perhaps? As a safeguard in case the engine decides to delete an active volume, for whatever reason.

(In reply to Michal Skrivanek from comment #11)
> well, there's timing to consider between getVolumeChain() and actual
> removal. Would be best to have such a check on vdsm side, perhaps? As a
> safeguard in case engine decides to delete an active volume....for whatever
> reason.

Yes. I read the bug over, and if the snapshot creation didn't reach the call to libvirt we'll still see the original chain (in the previous bug the freeze passed fine; the memory dump took too long), so we can't really do this reliably in the engine.

Do we have a way to tell if a volume is used by a VM in vdsm, though? Image removal is an SPM operation. Maybe we can acquire a volume lease and inquire when trying to delete.

Here is one observation.
The snapshot creation continued after we received:
2020-03-30 20:45:13,918+0200 WARN (jsonrpc/0) [virt.vm] (vmId='cdb7c691-41be-4f96-808c-4d4421462a36') Unable to freeze guest filesystems: internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': timeout when try to receive Frozen event from VSS provider: Unspecified error (vm:4262)
This is generated by the qemu guest agent. The agent waits for the fsFreeze event for 10s, but this message was reported minutes after the fsFreeze was initiated, so the guest agent may get stuck even before triggering the freeze. Would it be better not to rely on the agent and simply fail the fsFreeze according to a timeout suitable for the vdsm workflow? We can see that this operation can be blocking.
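The suggestion above, not relying on the guest agent's own timeout, can be illustrated with a small sketch: run the freeze call under a caller-side timeout and treat an overrun as a failure. This assumes libvirt-python and an already looked-up domain object; `freeze_with_timeout()` and the 60-second default are illustrative names and values, not vdsm code or values from this bug.

```python
# Minimal sketch, not vdsm code: bound the guest-filesystem freeze by a
# caller-side timeout instead of trusting the guest agent to time out itself.
import concurrent.futures

import libvirt


def freeze_with_timeout(dom, timeout=60.0):
    """Return True if dom.fsFreeze() completed within `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(dom.fsFreeze)  # blocks until the guest agent answers
    try:
        future.result(timeout=timeout)
        return True
    except concurrent.futures.TimeoutError:
        # The agent call is still blocked; treat the freeze as failed so the
        # snapshot flow can abort (and thaw) instead of waiting indefinitely.
        return False
    except libvirt.libvirtError:
        return False
    finally:
        pool.shutdown(wait=False)  # do not wait for a stuck agent call
```

At a higher level this is the shape of the verified behaviour: with LiveSnapshotPerformFreezeInEngine=true the engine drives the freeze itself and applies its own timeout, as the engine.log excerpt further down shows, where FreezeVDSCommand returns with a timeout and a ThawVDSCommand follows.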
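Earlier in the thread, the question is whether the engine (or vdsm) could inspect the domain XML dump to tell if the VM still uses an image before rolling it back. The sketch below shows the general shape of such a check, assuming libvirt-python; `volume_in_use()` is a hypothetical helper, not the existing check in CreateSnapshotCommand.java.

```python
# Minimal sketch, not engine/vdsm code: walk the live domain XML and report
# whether a given volume path appears anywhere in a disk chain (top layer or
# backing store).
import xml.etree.ElementTree as ET

import libvirt  # noqa: F401  (dom is expected to be a libvirt.virDomain)


def volume_in_use(dom, volume_path):
    """Return True if volume_path shows up in any <source> of the VM's disks."""
    root = ET.fromstring(dom.XMLDesc(0))
    for disk in root.findall("./devices/disk"):
        for source in disk.iter("source"):
            # File-based and block-based disks use different source attributes.
            if volume_path in (source.get("file"), source.get("dev")):
                return True
    return False
```

As the thread points out, the catch is timing: the dump only contains the new volumes once libvirt knows about them, so an engine-side check can race with the snapshot flow, which is why a vdsm-side check or a volume lease is suggested instead.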
The snapshot creation completed successfully and the snapshot is ready to be used.
Verified with the following steps:
1. Add a sleep on the host in /usr/lib/python2.7/site-packages/vdsm/virt/vm.py
2. Restart vdsmd
3. On the engine, run:
   engine-config -s LiveSnapshotPerformFreezeInEngine=true
   engine-config -s LiveSnapshotTimeoutInMinutes=1
4. Restart ovirt-engine service
5. Run the new VM on the host [1]
6. Create a snapshot without memory
** From this point on, LiveSnapshotPerformFreezeInEngine is configured to true by default.
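Step 1 above is fault injection: the host-side freeze is delayed until the engine gives up. The real reproduction added a sleep directly in vdsm's vm.py; the wrapper below is only a sketch of that effect under that assumption and is not the patched vdsm code (`delayed()` and the 180-second delay are illustrative).

```python
# Sketch of the fault injection from step 1: delay the host-side freeze long
# enough that the engine-side timeout expires first.
import time


def delayed(freeze_fn, delay_seconds=180):
    """Wrap a freeze callable so it only runs after delay_seconds."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)  # simulate a guest agent stuck in fsfreeze
        return freeze_fn(*args, **kwargs)
    return wrapper
```

With the delay in place, the FreezeVDSCommand in the engine.log excerpt below returns with a timeout and the engine issues a ThawVDSCommand.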
Versions:
ovirt-engine-4.4.1.1-0.5.el8ev.noarch
vdsm-4.40.18-1.el8ev.x86_64
from engine.log:
2020-06-02 11:42:58,879+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-172) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] START, FreezeVDSCommand(HostName = host_mixed_2, VdsAndVmIDVDSParametersBase:{hostId='562abf2c-fd8d-4280-80bd-454bfbf61328', vmId='d56f0bd6-656f-456a-b181-d85de806621e'}), log id: 4809807d
2020-06-02 11:45:58,982+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-172) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=5022, message=Message timeout which can be caused by communication issues]]'
2020-06-02 11:45:58,983+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-172) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] HostName = host_mixed_2
2020-06-02 11:45:58,984+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-172) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] FINISH, FreezeVDSCommand, return: , log id: 4809807d
2020-06-02 11:46:04,068+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] START, ThawVDSCommand(HostName = host_mixed_2, VdsAndVmIDVDSParametersBase:{hostId='562abf2c-fd8d-4280-80bd-454bfbf61328', vmId='d56f0bd6-656f-456a-b181-d85de806621e'}), log id: 7c683e21
2020-06-02 11:46:08,478+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [4bd62674-8346-4a68-b88e-6e65ae59bdd9] FINISH, ThawVDSCommand, return: , log id: 7c683e21
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247