Description of problem:
The ovirt-engine marks the snapshot status as OK before it sends the "snapshot" command to the VM. If a backup automation tool such as Commvault checks the state of the snapshot by looking at this "status", it will assume that the snapshot operation is complete, since the status returns "OK". The tool then proceeds to the next step, attaching the snapshot disk to the agent VM. Since the status is OK, that step also completes successfully. The result is that the snapshot disk is attached to the backup agent VM before the actual snapshot operation is complete.

Also, the SnapshotVDSCommand can fail (for example, bug 1572801), which results in an "invalid" snapshot disk being attached to the backup agent VM.

Version-Release number of selected component (if applicable):
rhvm-4.2.6.4-0.1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. I added a delay in the vdsm code where it freezes the guest filesystem so that I can replicate a snapshot failure.
2. The snapshot status changes to OK before the engine sends the snapshot command to the VM.

Actual results:
The snapshot status is changed to OK immediately after the engine creates the volume and before it sends the "snapshot" command to the VM.

Expected results:
The snapshot status should be changed to "OK" only after the complete snapshot operation has finished.

Additional info:
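For illustration only, the kind of wait loop a backup tool might run against the snapshot status can be sketched as below. This is not Commvault's actual code. `wait_for_snapshot_ok` and `fetch_status` are hypothetical names, and `fetch_status` stands in for whatever call the tool uses to read the status (for example, via the REST API). The loop is only as trustworthy as the status it reads, which is why an early "OK" lets the tool continue too soon.

```python
import time
from typing import Callable


def wait_for_snapshot_ok(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a snapshot status until it reads "ok" or the timeout expires.

    fetch_status is a hypothetical callable supplied by the caller; it is
    assumed to return the status as a string such as "locked" or "ok".
    Returns True if "ok" was observed, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        # The status is always read at least once, even with a zero timeout.
        if fetch_status().lower() == "ok":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With the behaviour described in this report, such a loop returns True as soon as the volume is created, before the snapshot command has reached the VM.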
Eyal please have a look. Arik, do you have any insights from Virt side?
(In reply to Tal Nisan from comment #1)
> Eyal please have a look.
> Arik, do you have any insights from Virt side?

That looks like a regression caused by the relatively recent changes in the create-snapshot command. The snapshot should indeed remain locked until all tasks are finished.
Benny, Can you please take a look?
(In reply to Arik from comment #3)
> (In reply to Tal Nisan from comment #1)
> > Eyal please have a look.
> > Arik, do you have any insights from Virt side?
>
> That looks like a regression caused by the relatively recent changes in the
> create-snapshot command. The snapshot should indeed remain locked until all
> tasks are finished.

Actually, I was wrong. It seems that we unlocked the snapshot before calling the live-snapshot verb in 4.1 as well [1], before those changes.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.1/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/snapshots/CreateAllSnapshotsFromVmCommand.java#L401-L404
So, is it a Virt issue or Storage?
(In reply to Tal Nisan from comment #6)
> So, is it a Virt issue or Storage?

It can go either way, but I would keep it as Storage since the storage team is the last to introduce a major change to the way this command operates.
*** Bug 1620087 has been marked as a duplicate of this bug. ***
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:
[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops
Tested using:
ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64

- Tested both with and without a 30-second delay in VDSM.
- Tested with both VM states, up and down.
- There were 5 preallocated disks of 20G each.

The snapshot creation process took much more than 10 seconds (~60 seconds). For the whole duration of the operation, until it completed, the snapshot status was locked (checked using the REST API).

Moving to VERIFIED
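The check described above, that the status stays locked for the whole operation, can be expressed as a small helper over recorded samples. This is only a sketch and not the script used for verification. `stayed_locked_until` is a hypothetical name, and the samples are assumed to be (timestamp, status) pairs collected by polling the snapshot status while the operation runs.

```python
from typing import Iterable, Tuple


def stayed_locked_until(
    samples: Iterable[Tuple[float, str]], completed_at: float
) -> bool:
    """Return True if every sample taken before completed_at reads "locked".

    samples: (timestamp, status) pairs; collecting them is left to the
    caller and is an assumption of this sketch.
    completed_at: the time at which the snapshot operation finished.
    """
    return all(
        status.lower() == "locked"
        for timestamp, status in samples
        if timestamp < completed_at
    )
```

A sample reading "ok" before the completion time would make the helper return False, which corresponds to the behaviour originally reported in this bug.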
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:
[Found non-acked flags: '{'rhevm-4.3-ga': '?'}', ]

For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085