+++ This bug is a downstream clone. The original bug is: +++ +++ bug 1737684 +++
======================================================================

Description of problem:

The SnapshotVDSCommand can time out on the engine side yet still succeed on the VDSM side. This can happen because of a network drop between the manager and the hypervisor while the command is in flight, or because of bugs like Bug 1687345. However, when the command times out on the engine side, the engine directly sends a destroy command to remove the newly created volume without checking whether the volume is still used by the VM. As a result, the SPM deletes the new volume even though the VM is still using it.

When the VM is running on any host other than the SPM, the dm device created for this volume still exists, and the VM keeps writing to the disk even though the LV no longer exists. This can corrupt, or cause an outage of, more than one VM.

From LVM's point of view, this LV is already deleted. So when a user creates a new snapshot or a disk, the SPM will happily allocate the same disk blocks that belonged to the recently deleted LV, since LVM considers those blocks free. Meanwhile, on the other host, a VM is already using those blocks. Both VMs now write to the same blocks, which leads to I/O errors, outages, and corruption.

Version-Release number of selected component (if applicable):
rhvm-4.3.4.3-0.1.el7.noarch
ovirt-engine-4.3.4.3-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
To reproduce, we have to make the SnapshotVDSCommand time out in the engine. We can block the connectivity between the manager and the hypervisor immediately after the engine sends the SnapshotVDSCommand. The reproducer steps in bug 1687345 should also work.
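The core of the failure mode above is that an engine-side timeout is not evidence of a VDSM-side failure. A minimal, self-contained Python sketch of this race (not oVirt code; the function and variable names are illustrative only):

```python
import concurrent.futures
import time

def snapshot_on_vdsm(state):
    """Simulates the VDSM side: the snapshot eventually succeeds,
    even though the caller may have stopped waiting for it."""
    time.sleep(0.2)                     # e.g. slow network / slow host
    state["volume_in_use"] = True       # VM now writes to the new leaf
    return "done"

state = {"volume_in_use": False}
engine_saw_timeout = False
with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(snapshot_on_vdsm, state)
    try:
        # Engine-side timeout fires before the remote call completes.
        future.result(timeout=0.05)
    except concurrent.futures.TimeoutError:
        # A naive cleanup here would destroy the new volume,
        # although the VDSM side is about to start using it.
        engine_saw_timeout = True

future.result()  # the VDSM side actually succeeded
print(engine_saw_timeout, state["volume_in_use"])  # True True
```

The engine observed a timeout, yet the volume ended up in use: deleting it on timeout alone is exactly the unsafe step this bug describes.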
Actual results:
The engine deletes the leaf volume when SnapshotVDSCommand times out, without checking whether the volume is still used by the VM.

Expected results:
The engine should check whether the volume is in use by the VM, by inspecting the VM's domain XML, before reverting the snapshot operation and deleting the volume.

Additional info:

(Originally by Nijin Ashok)
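The expected behavior — checking the domain XML before destroying the volume — can be sketched as follows. This is a minimal illustration, not the actual engine fix; the function name, the sample XML, and the volume identifiers are all hypothetical:

```python
import xml.etree.ElementTree as ET

def volume_in_domain_xml(domain_xml: str, volume_id: str) -> bool:
    """Return True if any <disk> source in the libvirt domain XML
    references the given volume ID, i.e. the VM still uses it."""
    root = ET.fromstring(domain_xml)
    for disk in root.iter("disk"):
        source = disk.find("source")
        if source is None:
            continue
        # Block, file, and network disks carry the path in
        # different attributes.
        for attr in ("file", "dev", "name"):
            if volume_id in source.get(attr, ""):
                return True
    return False

SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/rhev/data-center/mnt/blockSD/sd-id/images/img-id/new-vol-id'/>
    </disk>
  </devices>
</domain>
"""

# The new leaf volume is referenced, so it must not be destroyed.
print(volume_in_domain_xml(SAMPLE_XML, "new-vol-id"))  # True
```

Only if the check returns False would it be safe for the engine to proceed with removing the new volume after a timed-out SnapshotVDSCommand.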
Removing "orphan" volumes after snapshot failure seems to have been introduced by BZ1497355 (https://gerrit.ovirt.org/#/c/91658/), so IIUC 4.2.4 and higher are affected by this bug. These exceptions need to be handled better. (Originally by Germano Veit Michel)
sync2jira (Originally by Daniel Gur)
This bug was not merged into rhv-4.3.6-6 (ovirt-engine-4.3.6.4). See: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/heads/ovirt-engine-4.3 As this is the last downstream build in 4.3.6, please retarget to 4.3.7.
Benny, see the last comment as to why this bug will not be verified in 4.3.6. Also, please provide as clear and simple a scenario as possible so we can qa_ack it.
Steps to reproduce are available in the description
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}', ] For more info please contact: rhv-devops
INFO: Bug status (VERIFIED) wasn't changed, but the following should be fixed: [Tag 'ovirt-engine-4.3.5.6' doesn't contain patch 'https://gerrit.ovirt.org/102907'] gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.3.5.6 For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3010
*** Bug 1899578 has been marked as a duplicate of this bug. ***