Description of problem:
If my analysis is correct, when our backup software (https://storware.eu/products/vprotect/) tries to delete a snapshot via the API, the snapshot is merged and removed from disk but sometimes not from the engine database.

Version-Release number of selected component (if applicable):

How reproducible:
Can't reproduce. It happens randomly.

Steps to Reproduce:
1.
2.
3.

Actual results:
A VM is randomly left in the status described in the summary because a snapshot is merged and removed from disk but not from the engine database.

Expected results:
When a snapshot is merged and removed from disk, the database must be updated to reflect the status on disk.

Additional info:
Created attachment 1771365 [details] vdsm.log
Created attachment 1771366 [details] engine.log
Created attachment 1771373 [details] engine db dump
Here's the relevant output of: vdsm-tool dump-volume-chains 459011cf-ebb6-46ff-831d-8ccfafd82c8a

...

   image:    2fd9b340-14ce-42cd-a880-8f75eb3b72e7

             - 19fa612d-2f93-46a5-93e7-c74afc2937fa
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 45902462976

             - e77e2773-eecb-46a6-87c7-14df14f7d95b
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 19327352832

             - 76321669-ac65-4426-8f2e-31b3e2d8050b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 1073741824

...

Here's the database status:

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='2fd9b340-14ce-42cd-a880-8f75eb3b72e7';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 76321669-ac65-4426-8f2e-31b3e2d8050b | e77e2773-eecb-46a6-87c7-14df14f7d95b |           1 | b8be90ca-a1be-460e-875a-3eda2f31a27e |           2 |             4 | t
 6180c2e8-141a-4126-bcb1-5af5caa98175 | 19fa612d-2f93-46a5-93e7-c74afc2937fa |           4 | 281a4331-89fa-4544-b6ab-22c0e5ac6459 |           2 |             4 | f
 19fa612d-2f93-46a5-93e7-c74afc2937fa | 00000000-0000-0000-0000-000000000000 |           1 | 6488f4c8-efc4-43df-aafc-497fb7d51fbb |           2 |             4 | f
 e77e2773-eecb-46a6-87c7-14df14f7d95b | 6180c2e8-141a-4126-bcb1-5af5caa98175 |           1 | f44957b0-08b4-4133-b067-3ee1d9f86f02 |           2 |             4 | f
(4 rows)
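A rough sketch (not an oVirt tool) that automates this comparison for any disk, printing the volume rows that exist only in the engine database; the parsing of vdsm-tool's human-readable output and the local database access are assumptions:

import re
import subprocess

import psycopg2  # assumption: engine DB reachable with these credentials

SD_ID = '459011cf-ebb6-46ff-831d-8ccfafd82c8a'
IMAGE_GROUP_ID = '2fd9b340-14ce-42cd-a880-8f75eb3b72e7'

# Volumes on storage, as reported by vdsm-tool (this part must run on a
# host attached to the storage domain).
out = subprocess.run(
    ['vdsm-tool', 'dump-volume-chains', SD_ID],
    capture_output=True, text=True, check=True,
).stdout
# Naive parse of the human-readable output: grab the UUID after each "-"
# bullet that is followed by a "status:" line.
on_storage = set(re.findall(r'-\s+([0-9a-f-]{36})\s+status:', out))

# Volumes in the engine database (this part must run where the engine DB
# is reachable).
conn = psycopg2.connect(dbname='engine', user='engine')
with conn, conn.cursor() as cur:
    cur.execute(
        'select image_guid from images where image_group_id = %s',
        (IMAGE_GROUP_ID,),
    )
    in_db = {row[0] for row in cur.fetchall()}

print('in database but not on storage:', sorted(in_db - on_storage))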
Did you try to remove the snapshot from the engine UI or API? The engine should detect that the volume 6180c2e8-141a-4126-bcb1-5af5caa98175 is not in the current chain and mark the snapshot removal as successful. The engine will probably try to remove the volume from storage; since the volume does not exist, it should succeed.

There are 2 issues here:

- Snapshot deletion failed - there may be a valid reason for the failed deletion, such as known issues (e.g. a race between extend and commit, fixed in 4.4.5) or external issues. A snapshot left in an illegal state is not necessarily a bug.

- If snapshot deletion failed, retrying the snapshot deletion should succeed. This is the basic design of this feature. I expect that vProtect will retry the snapshot deletion operation during the same backup window, along the lines of the sketch below.
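A minimal retry sketch with the oVirt Python SDK (ovirtsdk4); the connection details are placeholders, and the VM and snapshot IDs are taken from this report:

import time

import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # assumption: lab setup without CA verification
)

snapshot_service = (
    connection.system_service()
    .vms_service()
    .vm_service('00b50a51-049e-45dd-9cad-d01fb3b8b10e')
    .snapshots_service()
    .snapshot_service('281a4331-89fa-4544-b6ab-22c0e5ac6459')
)

# Per the live-merge design, a failed deletion should succeed when it is
# attempted again, so retry a few times before giving up.
for attempt in range(3):
    try:
        snapshot_service.remove()
        break
    except sdk.Error as e:
        print(f'attempt {attempt + 1} failed: {e}')
        time.sleep(60)

connection.close()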
The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
The operation causing the issue is the removal of a snapshot done via the API by vProtect. In the UI I see the exclamation mark with the warning, and the subsequent attempts to remove the snapshot via the UI or API (by vProtect) fail. I can provide other logs or info if you need them. I'm on 4.4.5.11 on the engine, and the hypervisors are oVirt Node 4.4.5 with no pending updates.
(In reply to Giorgio Biacchi from comment #7)
> The operation causing the issue is the removal of a snapshot done via the
> API by vProtect. In the UI I see the exclamation mark with the warning, and
> the subsequent attempts to remove the snapshot via the UI or API (by
> vProtect) fail.

This is expected: if removing a snapshot fails, the engine marks the snapshot as illegal, since some of the data was merged and the snapshot no longer represents the state of the VM at the time the snapshot was created.

As I explained in comment 5, in this state you can try to delete the snapshot again from the UI or the API, and the operation should succeed. Please try that, and if the operation fails again, provide engine and vdsm logs showing how the operation failed. I want to understand whether this is the expected case of a failed snapshot removal that is fixed by the next attempt, or the unexpected case where a failed snapshot cannot be recovered and requires a manual fix in the database.
I just tried (again) to remove the snapshot from the UI, without success. I filed this bug only after many unsuccessful attempts to remove it via the UI and after an email to the oVirt mailing list. I'm going to attach the requested logs for my last attempt.
Created attachment 1771619 [details] engine.log
Created attachment 1771620 [details] vdsm.log
In the vdsm log we see:

1. Starting merge

2021-04-12 13:22:50,526+0000 INFO (jsonrpc/2) [api.virt] START merge(drive={'poolID': '42135536-980b-4ea9-ab66-b850ed8c1f2b', 'volumeID': 'e77e2773-eecb-46a6-87c7-14df14f7d95b', 'imageID': '2fd9b340-14ce-42cd-a880-8f75eb3b72e7', 'domainID': '459011cf-ebb6-46ff-831d-8ccfafd82c8a'}, baseVolUUID='6180c2e8-141a-4126-bcb1-5af5caa98175', topVolUUID='e77e2773-eecb-46a6-87c7-14df14f7d95b', bandwidth='0', jobUUID='9786d676-978f-4255-ab1e-1c323089f371') from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, vmId=00b50a51-049e-45dd-9cad-d01fb3b8b10e (api:48)

2. Getting volume info fails, since the volume was already removed in a previous merge

2021-04-12 13:22:50,683+0000 INFO (jsonrpc/2) [vdsm.api] FINISH getVolumeInfo error=Volume does not exist: ('6180c2e8-141a-4126-bcb1-5af5caa98175',) from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, task_id=b5f03185-e5ab-4fcc-b6df-74c060fe240d (api:52)

2021-04-12 13:22:50,683+0000 INFO (jsonrpc/2) [storage.TaskManager.Task] (Task='b5f03185-e5ab-4fcc-b6df-74c060fe240d') aborting: Task is aborted: "value=Volume does not exist: ('6180c2e8-141a-4126-bcb1-5af5caa98175',) abortedcode=201" (task:1190)

3. Merge failed

2021-04-12 13:22:50,684+0000 INFO (jsonrpc/2) [api.virt] FINISH merge return={'status': {'code': 52, 'message': "Merge failed: {'top': 'e77e2773-eecb-46a6-87c7-14df14f7d95b', 'base': '6180c2e8-141a-4126-bcb1-5af5caa98175', 'job': '9786d676-978f-4255-ab1e-1c323089f371', 'reason': 'Unable to get volume info for domain 459011cf-ebb6-46ff-831d-8ccfafd82c8a volume 6180c2e8-141a-4126-bcb1-5af5caa98175'}"}} from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, vmId=00b50a51-049e-45dd-9cad-d01fb3b8b10e (api:54)

Vdsm's response is less useful than it could be, since it does not include the reason for the failure, and the engine cannot use this error to complete the merge. However, when the engine gets a MergeFailed error, it should check the current chain and detect that the merge was already completed successfully.
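For reference, the same check can be reproduced from the host with a small script around vdsm-client (a sketch only; the IDs are taken from the log lines above, and error 201 is the "Volume does not exist" code seen there):

import subprocess

# Ask vdsm for the base volume of the failed merge; vdsm-client is the
# CLI shipped with vdsm, so run this on the host serving the VM.
res = subprocess.run(
    ['vdsm-client', 'Volume', 'getInfo',
     'storagepoolID=42135536-980b-4ea9-ab66-b850ed8c1f2b',
     'storagedomainID=459011cf-ebb6-46ff-831d-8ccfafd82c8a',
     'imageID=2fd9b340-14ce-42cd-a880-8f75eb3b72e7',
     'volumeID=6180c2e8-141a-4126-bcb1-5af5caa98175'],
    capture_output=True, text=True,
)
if res.returncode != 0:
    # "Volume does not exist" here means the previous merge already
    # removed this volume from the chain.
    print('volume gone, earlier merge completed:', res.stderr.strip())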
Nir, is there something for us to do here?
(In reply to Eyal Shenitzky from comment #13)
> Nir, is there something for us to do here?

Yes. When the engine starts a merge and the merge fails, the engine should check the actual image chain with vdsm. If the volume to be merged is not in the chain, it means the merge was successful. The engine uses this logic during a merge, but for some reason it is not used when starting the merge, as sketched below.

The basic design of live merge is that after any error we can run the merge again to recover. This bug found a major hole in the current code.
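A sketch of the missing check, written in Python for brevity - the real fix belongs in the engine's Java live-merge flow, and the vdsm wrapper names used here (merge, get_volume_chain) are hypothetical:

class MergeFailed(Exception):
    """Stand-in for vdsm rejecting the merge request."""


def start_merge(vdsm, drive, base_vol, top_vol, job_id):
    """Start a live merge, recovering when a prior attempt already
    completed it."""
    try:
        vdsm.merge(drive, base_vol, top_vol, job_id)
        return 'started'
    except MergeFailed:
        # The same chain check the engine already performs while a merge
        # is running, applied when starting one - the gap this bug found:
        # if the volume the merge would remove is gone from the chain,
        # the previous attempt actually succeeded.
        chain = vdsm.get_volume_chain(drive)
        if top_vol not in chain:
            return 'already-merged'
        raise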
(In reply to Nir Soffer from comment #14)
> This bug found a major hole in the current code.

And we still have that hole a year later. Still worth a fix?
(In reply to Michal Skrivanek from comment #17)
> (In reply to Nir Soffer from comment #14)
>
> > This bug found a major hole in the current code.
>
> And we still have that hole a year later. Still worth a fix?

I think it is. We do something similar in create-snapshot to check whether the created volume is being used; the fix should be similar.
It is worth looking into whether we can also do that in order to prevent the situation of having a volume in the database that no longer exists on the storage.
QE doesn't have the capacity to verify this bug during 4.5.1.
Any news about this issue? The situation has become unsustainable: every day, following backups, we have to manually fix illegal snapshots.

To fix them we have to live migrate the disk with the illegal snapshot to another storage domain (after this operation the snapshot is no longer marked as illegal), then we can remove the snapshot, and finally move the disk back to the original storage domain. You can imagine this operation is complex, especially for big disks.

Another element to consider: the VMs having this issue are always the same (more or less), with no apparent common pattern (Windows and Linux, small and big disks, VMs with one or two disks, running on different hypervisors, ...), and it does not happen every time.
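For the record, the manual workaround described above can be scripted with the oVirt Python SDK (ovirtsdk4). A minimal sketch, assuming placeholder IDs and deliberately naive waiting (moving live disks of this size can take hours):

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)

disk_service = connection.system_service().disks_service().disk_service('DISK-ID')


def move_and_wait(sd_name):
    # Live storage migration of the disk to the given storage domain.
    disk_service.move(storage_domain=types.StorageDomain(name=sd_name))
    while disk_service.get().status != types.DiskStatus.OK:
        time.sleep(30)


move_and_wait('other-storage-domain')     # snapshot becomes legal again

# Remove the (now legal) snapshot and wait for the merge to finish.
snapshot_service = (connection.system_service().vms_service()
                    .vm_service('VM-ID').snapshots_service()
                    .snapshot_service('SNAPSHOT-ID'))
snapshot_service.remove()
while disk_service.get().status != types.DiskStatus.OK:
    time.sleep(30)

move_and_wait('original-storage-domain')  # and move the disk back
connection.close()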
(In reply to Giulio Casella from comment #22)
> Any news about this issue? The situation has become unsustainable: every
> day, following backups, we have to manually fix illegal snapshots.
>
> To fix them we have to live migrate the disk with the illegal snapshot to
> another storage domain (after this operation the snapshot is no longer
> marked as illegal), then we can remove the snapshot, and finally move the
> disk back to the original storage domain. You can imagine this operation
> is complex, especially for big disks.
>
> Another element to consider: the VMs having this issue are always the same
> (more or less), with no apparent common pattern (Windows and Linux, small
> and big disks, VMs with one or two disks, running on different
> hypervisors, ...), and it does not happen every time.

A potential fix is posted and is being reviewed now.
Is this bug supposed to be fixed in ovirt-engine-4.5.2.2? If so, I'm afraid the issue is still present (I'm on ovirt-engine-4.5.2.4-1.el8.noarch).
Hi Giulio, have you tried to remove the illegal snapshot? The operation should succeed now.
Hi Mark, I tried (many times), with no luck. See the attached screenshot.
Created attachment 1908247 [details] Snapshot removal in UI
Please provide logs - engine and vdsm.
Created attachment 1908249 [details] Log for engine
Created attachment 1908250 [details] Log for host host01
Created attachment 1908251 [details] Log for host host04
Attached logs for engine and vdsm (from 2 hypervisors). VMs involved in recent failures are "dylan" (running on host01) and "fileserver" (running on host04).
Hi Giulio,

Looking at the logs, it seems that the issue you are currently facing is not the same issue described above.

2022-08-29 09:17:11,890+02 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [0eced56f-689d-422b-b15c-20b824377b08] Failed to live merge. Top volume f8f84b1c-53ab-4c99-a01d-743ed3d7859b is still in qemu chain [0ea89fbc-d39a-48ff-aa2b-0381d79d7714, 55bb387f-01a6-41b6-b585-4bcaf2ea5e32, f8f84b1c-53ab-4c99-a01d-743ed3d7859b]

This log entry suggests that the volume cannot be merged for some reason. The VDSM logs you have provided do not cover the time when this error occurred. In any case, since it is another issue, I suggest moving this discussion to [ovirt-users].
Or file a new bug
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
I think we ended up in this situation again. We opened another bug (https://bugzilla.redhat.com/show_bug.cgi?id=2122525) for snapshot-related problems and, until today, we were able to recover from illegal snapshot status back to legal by live migrating the disk images to another data domain. Today one of these tasks failed, and now I have this situation:

[root@host03 ~]# vdsm-tool dump-volume-chains fbade149-e448-4f02-b382-afd28c0d9ffd

Images volume chains (base volume first)

   image:    fdfa6a94-35ed-4a9b-b8ef-438d9dfe8ea1

             - 55bb387f-01a6-41b6-b585-4bcaf2ea5e32
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 5234491392

             - ead5265d-dacb-4727-95c4-93ff46a496e6
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 805306368

             - af37524f-c311-4ff2-adc2-232b5c02539b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 8858370048

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='fdfa6a94-35ed-4a9b-b8ef-438d9dfe8ea1';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 9dbb2076-c805-4d89-8209-081ecd46c59c | ead5265d-dacb-4727-95c4-93ff46a496e6 |           1 | 84d87810-b61e-4c07-bffc-cb6e420ca21e |           2 |             4 | f
 af37524f-c311-4ff2-adc2-232b5c02539b | 9dbb2076-c805-4d89-8209-081ecd46c59c |           1 | 8618fd25-b951-4926-aafa-5e93c35e1fdd |           2 |             4 | t
 ead5265d-dacb-4727-95c4-93ff46a496e6 | 55bb387f-01a6-41b6-b585-4bcaf2ea5e32 |           1 | bcf88e2a-15ac-4a47-8d27-96452a8ccce6 |           2 |             4 | f
 55bb387f-01a6-41b6-b585-4bcaf2ea5e32 | 00000000-0000-0000-0000-000000000000 |           1 | 20a89231-643b-465a-bd19-a84a1e266ac1 |           2 |             4 | f
(4 rows)

We tried to remove the snapshots again from the UI, but the task fails. To further complicate things, the VM has two disks, but the situation of the other disk seems OK:

   image:    c4c09af6-976f-40cb-92a1-5e71bcea19ae

             - eb4a433b-eb2c-4a7e-8899-98692c2ae005
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 2257676402688

             - ff3dfbd4-266f-427d-8939-600a5c90ec3d
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 8053063680

             - 93b26585-fb7b-48f0-87f2-9d2eac731e2b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 9529458688

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='c4c09af6-976f-40cb-92a1-5e71bcea19ae';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 93b26585-fb7b-48f0-87f2-9d2eac731e2b | ff3dfbd4-266f-427d-8939-600a5c90ec3d |           1 | 8618fd25-b951-4926-aafa-5e93c35e1fdd |           2 |             4 | t
 ff3dfbd4-266f-427d-8939-600a5c90ec3d | eb4a433b-eb2c-4a7e-8899-98692c2ae005 |           4 | 42ef6853-d7ac-4967-b93f-bb5425963891 |           2 |             4 | f
 eb4a433b-eb2c-4a7e-8899-98692c2ae005 | 00000000-0000-0000-0000-000000000000 |           1 | 20a89231-643b-465a-bd19-a84a1e266ac1 |           2 |             4 | f
(3 rows)

I'm going to attach engine and vdsm logs ASAP.
Created attachment 1909818 [details] engine.log
Created attachment 1909819 [details] vdsm.log
Hi Giorgio, please provide the relevant content of the snapshots table:

select snapshot_id,snapshot_type,status,description from snapshots where vm_id='e058515c-8276-44d7-9f3c-e5429aa593ed';

(I assume the VM id is e058515c-8276-44d7-9f3c-e5429aa593ed, right?)
Hi, here's the query result:

engine=# select snapshot_id,snapshot_type,status,description from snapshots where vm_id='e058515c-8276-44d7-9f3c-e5429aa593ed';
             snapshot_id              | snapshot_type | status |                        description
--------------------------------------+---------------+--------+------------------------------------------------------------
 42ef6853-d7ac-4967-b93f-bb5425963891 | REGULAR       | OK     | fileserver_Disk2 Auto-generated for Live Storage Migration
 8618fd25-b951-4926-aafa-5e93c35e1fdd | ACTIVE        | OK     | Active VM
 bcf88e2a-15ac-4a47-8d27-96452a8ccce6 | REGULAR       | OK     | fileserver_Disk1 Auto-generated for Live Storage Migration
 20a89231-643b-465a-bd19-a84a1e266ac1 | REGULAR       | OK     | vProtect 2022-09-05 12:12:44.587166
(4 rows)
Hi,
I still have the VM for which I reopened this bug, with snapshots in an illegal state. I did not change anything, in case you need other info, but now oVirt 4.5.3 is out and I'd like to know how to proceed to bring the situation back to a normal state.

There's still a weekly snapshot-cleanup task run by vProtect that fails with the following messages:

18 Oct 2022, 12:32:37  Failed to delete snapshot 'vProtect 2022-09-05 12:12:44.587166' for VM 'fileserver'.
18 Oct 2022, 12:31:39  VDSM host03 command MergeVDS failed: Merge failed: {'job': 'cb28c1b3-4084-4cf1-a8e6-bd6c3db958cc', 'reason': "block copy still active: disk 'sdb' already in active block job"}

Can you please tell me if it's safe to proceed with the update to oVirt 4.5.3, or give me the instructions to fix the database?

Thanks
Hi, any news about this bug? I saw Arik removed the 4.5.3 milestone as a target; is it still worth updating to 4.5.3?
(In reply to Giulio Casella from comment #43)
> Hi, any news about this bug? I saw Arik removed the 4.5.3 milestone as a
> target; is it still worth updating to 4.5.3?

The fix is definitely not in 4.5.3, which was already released, but we need to take another look at this.
(In reply to Giorgio Biacchi from comment #37)
> I think we ended up in this situation again. We opened another bug
> (https://bugzilla.redhat.com/show_bug.cgi?id=2122525) for snapshot-related
> problems and, until today, we were able to recover from illegal snapshot
> status back to legal by live migrating the disk images to another data
> domain.

Yeah, that indeed seems different.

In the original report we had three volumes on the storage:

> vdsm-tool dump-volume-chains 459011cf-ebb6-46ff-831d-8ccfafd82c8a
>
> ...
>
> image:    2fd9b340-14ce-42cd-a880-8f75eb3b72e7
>
>           - 19fa612d-2f93-46a5-93e7-c74afc2937fa
>             status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 45902462976
>
>           - e77e2773-eecb-46a6-87c7-14df14f7d95b
>             status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 19327352832
>
>           - 76321669-ac65-4426-8f2e-31b3e2d8050b
>             status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 1073741824

and four volumes in the database:

> engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='2fd9b340-14ce-42cd-a880-8f75eb3b72e7';
>               image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
> --------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
>  76321669-ac65-4426-8f2e-31b3e2d8050b | e77e2773-eecb-46a6-87c7-14df14f7d95b |           1 | b8be90ca-a1be-460e-875a-3eda2f31a27e |           2 |             4 | t
>  6180c2e8-141a-4126-bcb1-5af5caa98175 | 19fa612d-2f93-46a5-93e7-c74afc2937fa |           4 | 281a4331-89fa-4544-b6ab-22c0e5ac6459 |           2 |             4 | f
>  19fa612d-2f93-46a5-93e7-c74afc2937fa | 00000000-0000-0000-0000-000000000000 |           1 | 6488f4c8-efc4-43df-aafc-497fb7d51fbb |           2 |             4 | f
>  e77e2773-eecb-46a6-87c7-14df14f7d95b | 6180c2e8-141a-4126-bcb1-5af5caa98175 |           1 | f44957b0-08b4-4133-b067-3ee1d9f86f02 |           2 |             4 | f
> (4 rows)

and all the snapshots that these volumes are correlated with existed in the database:

db=# select snapshot_id, vm_id, snapshot_type,description from snapshots where vm_id='00b50a51-049e-45dd-9cad-d01fb3b8b10e';
             snapshot_id              |                vm_id                 | snapshot_type |             description
--------------------------------------+--------------------------------------+---------------+-------------------------------------
 f44957b0-08b4-4133-b067-3ee1d9f86f02 | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-12 15:24:21.870915
 281a4331-89fa-4544-b6ab-22c0e5ac6459 | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-09 15:00:13.983112
 b8be90ca-a1be-460e-875a-3eda2f31a27e | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | ACTIVE        | Active VM
 6488f4c8-efc4-43df-aafc-497fb7d51fbb | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-09 09:00:13.482308
(4 rows)

So we could recover from that state, as suggested in comment 14. But the scenario described in comment 37 is different, since the VM snapshot 84d87810-b61e-4c07-bffc-cb6e420ca21e, with which the volume we want to remove is correlated, doesn't exist.
Therefore I'm closing this bug again, and I've changed its title to better reflect what was actually addressed.

As for the issue described in comment 37 - we really need to know more about that VM snapshot 84d87810-b61e-4c07-bffc-cb6e420ca21e, which is not covered at all by the provided logs, in order to understand what is going on here. If we can get more information on that, please file a github issue: https://github.com/oVirt/ovirt-engine/issues

(In reply to Giorgio Biacchi from comment #42)
> Can you please tell me if it's safe to proceed with the update to
> oVirt 4.5.3, or give me the instructions to fix the database?

I don't see why it should block you from upgrading to 4.5.3.