Description of problem:
The SnapshotVDSCommand can time out in the engine while still succeeding on the VDSM side. This can happen because of a network drop between the manager and the hypervisor after the engine issues the command, or because of bugs like Bug 1687345. However, if the command times out on the engine side, the engine immediately sends a destroy command to remove the newly created volume without checking whether the volume is still used by the VM. As a result, the SPM deletes the new volume even though the VM is still using it. When the VM is running on any host other than the SPM, the dm device created for this volume still exists and the VM keeps writing to the disk even though the LV no longer exists. This can corrupt, or cause an outage of, more than one VM. From LVM's point of view, this LV is already deleted, so when a user creates a new snapshot or disk, the SPM will happily allocate the same disk blocks that belonged to the recently deleted LV, since LVM considers those blocks free. However, a VM on another host is already using these blocks. Both VMs then write to the same blocks, which leads to I/O errors, outages, and corruption.

Version-Release number of selected component (if applicable):
rhvm-4.3.4.3-0.1.el7.noarch
ovirt-engine-4.3.4.3-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
To reproduce, we have to make the SnapshotVDSCommand time out in the engine. We can block connectivity between the manager and the hypervisor immediately after the engine sends the SnapshotVDSCommand. The reproducer steps in bug 1687345 should also work.

Actual results:
When SnapshotVDSCommand times out, the engine deletes the leaf volume without checking whether the volume is still used by the VM.

Expected results:
Before reverting the snapshot operation and deleting the volume, the engine should check the VM's XML to verify whether the volume is still in use.

Additional info:
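The race described above can be modeled in a few lines: the engine-side caller gives up after a timeout, but the host-side snapshot operation still completes, leaving the volume in use. This is a minimal, hypothetical sketch (all names are illustrative, not oVirt/VDSM code):

```python
# Illustrative model of the SnapshotVDSCommand timeout race.
# "engine_call" waits only up to its timeout; "host_snapshot" simulates
# VDSM finishing the snapshot after the engine has already given up.
import threading
import time

host_state = {"snapshot_done": False}

def host_snapshot(delay: float) -> None:
    # Simulates the host-side snapshot completing after a delay.
    time.sleep(delay)
    host_state["snapshot_done"] = True

def engine_call(timeout: float, host_delay: float) -> str:
    worker = threading.Thread(target=host_snapshot, args=(host_delay,))
    worker.start()
    worker.join(timeout)                    # engine waits only this long
    result = "timeout" if worker.is_alive() else "ok"
    worker.join()                           # the host still completes the work
    return result

result = engine_call(timeout=0.05, host_delay=0.2)
print(result, host_state["snapshot_done"])  # timeout True
```

The engine sees "timeout" and treats the snapshot as failed, yet the host-side state says the snapshot succeeded; destroying the new volume at this point is exactly the unsafe step the bug describes.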
Removing "orphan" volumes after a snapshot failure seems to have been introduced by BZ 1497355 (https://gerrit.ovirt.org/#/c/91658/), so IIUC 4.2.4 and later are affected by this bug. These exceptions need to be handled better.
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
(In reply to Evelina Shames from comment #11)
> The same bug was verified for ovirt-engine-4.3.6.5-0.1 and vdsm-4.30.30-1 -
> Bug 1746730, and I'm trying to verify on:
> ovirt-engine-4.4.0-0.13.master.el7.noarch
> vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
>
> With the same steps:
> - On host, change in /usr/lib/python3.6/site-packages/vdsm/API.py:
>     s = vm.snapshot(snapDrives, memoryParams, frozen=frozen)
>     import time
>     time.sleep(190)
>     return s
> - Restart vdsm
> - Power on VM on this host
> - Try to create live snapshot
> - Operation fails -> But in engine log the following doesn't appear:
>   'appears to be in use by VM'
>
> Benny, is it ok?
>
> Engine log is attached
> (2020-01-02 11:33:10,858+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateImageVDSCommand] (EE-ManagedThreadFactory-engine-Thread-187417) [e4a80c85-1e5e-45c6-9c71-87b12238e336] START, CreateImageVDSCommand( CreateImageVDSCommandParameters:{storagePoolId='b3e907a2-cea0-4eda-b1f9-9d451aa5a571', ignoreFailoverLimit='false', storageDomainId='f69faf83-6f96-40f9-86e6-e865df601bbd', imageGroupId='5007efa1-9f41-479a-bb86-315e9eada5ae', imageSizeInBytes='1358954496', volumeFormat='RAW', newImageId='13aa556b-4ffe-4ee7-a7fd-28241448ca94', imageType='Sparse', newImageDescription='{"DiskAlias":"golden_env_mixed_virtio_1_0_snapshot_memory","DiskDescription":"Memory snapshot disk for snapshot 'snp1' of VM 'golden_env_mixed_virtio_1_0' (VM ID: '6d0b5d42-59f6-42d1-9088-180b63c410f6')"}', imageInitialSizeInBytes='0'}), log id: 299cd3d0)

I see it:

2020-01-02 11:36:27,179+02 WARN [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [e4a80c85-1e5e-45c6-9c71-87b12238e336] Image '12e06ed6-5045-47e7-89e7-0ce7efd7465e' appears to be in use by VM '6d0b5d42-59f6-42d1-9088-180b63c410f6', skipping deletion
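The "appears to be in use by VM ... skipping deletion" guard above amounts to checking the VM's domain XML for the volume before destroying it. A minimal sketch of that kind of check (the function name and XML shape are assumptions for illustration, not engine code):

```python
# Hypothetical sketch: skip volume deletion if the volume still appears
# in any <disk><source> element of the VM's libvirt domain XML.
import xml.etree.ElementTree as ET

def volume_in_use(domain_xml: str, volume_id: str) -> bool:
    """Return True if any <source> attribute in the domain XML references volume_id."""
    root = ET.fromstring(domain_xml)
    for source in root.iter("source"):
        for value in source.attrib.values():
            if volume_id in value:
                return True
    return False

# Sample domain XML using the volume IDs from the quoted log, for illustration.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/rhev/data-center/mnt/blockSD/f69faf83-6f96-40f9-86e6-e865df601bbd/images/5007efa1-9f41-479a-bb86-315e9eada5ae/13aa556b-4ffe-4ee7-a7fd-28241448ca94'/>
    </disk>
  </devices>
</domain>
"""

# The new leaf volume is still referenced, so deletion must be skipped.
print(volume_in_use(DOMAIN_XML, "13aa556b-4ffe-4ee7-a7fd-28241448ca94"))   # True
print(volume_in_use(DOMAIN_XML, "deadbeef-0000-0000-0000-000000000000"))   # False
```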
Ohh sorry, missed it. Moving to verified.
ovirt-engine-4.4.0-0.13.master.el7.noarch
vdsm-4.40.0-164.git38a19bb.el8ev.x86_64
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247