Bug 1948599 - Fail to remove snapshots with a disk in illegal status whose volume doesn't exist on SD
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.4.5.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: ---
Assignee: Mark Kemel
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-12 14:48 UTC by Giorgio Biacchi
Modified: 2022-10-30 17:09 UTC
CC List: 9 users

Fixed In Version: ovirt-engine-4.5.2.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-30 17:09:50 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
vdsm.log (17.83 KB, text/plain) - 2021-04-12 14:48 UTC, Giorgio Biacchi
engine.log (38.15 KB, text/plain) - 2021-04-12 14:49 UTC, Giorgio Biacchi
engine db dump (6.32 MB, application/x-bzip) - 2021-04-12 14:52 UTC, Giorgio Biacchi
engine.log (39.74 KB, text/plain) - 2021-04-13 13:13 UTC, Giorgio Biacchi
vdsm.log (17.83 KB, text/plain) - 2021-04-13 13:13 UTC, Giorgio Biacchi
Snapshot removal in UI (67.96 KB, image/png) - 2022-08-29 07:26 UTC, Giulio Casella
Log for engine (220.55 KB, application/x-bzip) - 2022-08-29 08:09 UTC, Giulio Casella
Log for host host01 (14.58 KB, application/x-bzip) - 2022-08-29 08:10 UTC, Giulio Casella
Log for host host04 (352.90 KB, application/x-bzip) - 2022-08-29 08:10 UTC, Giulio Casella
engine.log (359.85 KB, application/x-bzip) - 2022-09-06 14:39 UTC, Giorgio Biacchi
vdsm.log (534.76 KB, application/x-bzip) - 2022-09-06 14:39 UTC, Giorgio Biacchi


Links
GitHub: oVirt ovirt-engine pull 524 (open) - fix snapshot removal when image already merged - last updated 2022-07-24 12:15:26 UTC

Description Giorgio Biacchi 2021-04-12 14:48:07 UTC
Description of problem:

If my analysis is correct, when our backup software (https://storware.eu/products/vprotect/) tries to delete a snapshot via the API, the snapshot is merged and removed from disk but sometimes not from the engine database.

Version-Release number of selected component (if applicable):

ovirt-engine 4.4.5.11; the hosts run oVirt Node 4.4.5 (see comment 7)

How reproducible:

Can't reproduce. It happens randomly.

Actual results:

A VM is randomly left in the status described in the summary because a snapshot is merged and removed from disk but not from the engine database.

Expected results:

When a snapshot is merged and removed from disk, the database must be updated to reflect the status on disk.


Additional info:

Comment 1 Giorgio Biacchi 2021-04-12 14:48:52 UTC
Created attachment 1771365 [details]
vdsm.log

Comment 2 Giorgio Biacchi 2021-04-12 14:49:29 UTC
Created attachment 1771366 [details]
engine.log

Comment 3 Giorgio Biacchi 2021-04-12 14:52:14 UTC
Created attachment 1771373 [details]
engine db dump

Comment 4 Giorgio Biacchi 2021-04-12 14:56:59 UTC
Here's the relevant output of:

vdsm-tool dump-volume-chains 459011cf-ebb6-46ff-831d-8ccfafd82c8a

...

 image:    2fd9b340-14ce-42cd-a880-8f75eb3b72e7

             - 19fa612d-2f93-46a5-93e7-c74afc2937fa
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 45902462976

             - e77e2773-eecb-46a6-87c7-14df14f7d95b
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 19327352832

             - 76321669-ac65-4426-8f2e-31b3e2d8050b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 1073741824

...

Here's the database status:

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='2fd9b340-14ce-42cd-a880-8f75eb3b72e7';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active 
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 76321669-ac65-4426-8f2e-31b3e2d8050b | e77e2773-eecb-46a6-87c7-14df14f7d95b |           1 | b8be90ca-a1be-460e-875a-3eda2f31a27e |           2 |             4 | t
 6180c2e8-141a-4126-bcb1-5af5caa98175 | 19fa612d-2f93-46a5-93e7-c74afc2937fa |           4 | 281a4331-89fa-4544-b6ab-22c0e5ac6459 |           2 |             4 | f
 19fa612d-2f93-46a5-93e7-c74afc2937fa | 00000000-0000-0000-0000-000000000000 |           1 | 6488f4c8-efc4-43df-aafc-497fb7d51fbb |           2 |             4 | f
 e77e2773-eecb-46a6-87c7-14df14f7d95b | 6180c2e8-141a-4126-bcb1-5af5caa98175 |           1 | f44957b0-08b4-4133-b067-3ee1d9f86f02 |           2 |             4 | f
(4 rows)
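
To make the mismatch explicit: storage reports three volumes while the database lists four. A minimal Python sketch (editorial illustration; the volume IDs are copied from the two outputs above) that computes the difference:

# Volume chain reported by vdsm-tool dump-volume-chains (storage view)
storage_chain = {
    "19fa612d-2f93-46a5-93e7-c74afc2937fa",
    "e77e2773-eecb-46a6-87c7-14df14f7d95b",
    "76321669-ac65-4426-8f2e-31b3e2d8050b",
}

# image_guid values from the engine 'images' table (database view)
db_volumes = {
    "76321669-ac65-4426-8f2e-31b3e2d8050b",
    "6180c2e8-141a-4126-bcb1-5af5caa98175",
    "19fa612d-2f93-46a5-93e7-c74afc2937fa",
    "e77e2773-eecb-46a6-87c7-14df14f7d95b",
}

# Volumes the engine still tracks that no longer exist on storage:
print(db_volumes - storage_chain)
# -> {'6180c2e8-141a-4126-bcb1-5af5caa98175'}, the row with imagestatus=4 (ILLEGAL)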

Comment 5 Nir Soffer 2021-04-12 16:18:59 UTC
Did you try to remove the snapshot from the engine UI or the API?

Engine should detect that the volume 6180c2e8-141a-4126-bcb1-5af5caa98175
is not in the current chain, and mark the snapshot removal as successful.
Engine will probably try to remove the volume from storage. Since the 
volume does not exist, engine should succeed.

There are 2 issues here:

- Snapshot deletion failed - there may be a valid reason for the failed deletion:
  known issues (e.g. a race between extend and commit, fixed in 4.4.5) or
  external issues. A snapshot left in an illegal state is not necessarily a bug.

- If snapshot deletion failed, retrying the snapshot deletion should
  succeed. This is the basic design of this feature. I expect that vprotect
  will retry the snapshot deletion operation during the same backup window.
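
For reference, retrying the removal through the Python SDK looks roughly like this; a minimal sketch, assuming ovirtsdk4 is installed, with placeholder connection details (the VM and snapshot IDs below are the ones from this report):

import ovirtsdk4 as sdk

# Placeholder connection details -- adjust for your engine.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)

snap_service = (
    connection.system_service()
    .vms_service()
    .vm_service("00b50a51-049e-45dd-9cad-d01fb3b8b10e")        # VM id from the logs
    .snapshots_service()
    .snapshot_service("281a4331-89fa-4544-b6ab-22c0e5ac6459")  # the ILLEGAL snapshot
)

# Per the design described above, repeating remove() after a failed
# merge should converge once the volume is gone from the chain.
snap_service.remove()

connection.close()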

Comment 6 RHEL Program Management 2021-04-12 16:18:59 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 7 Giorgio Biacchi 2021-04-13 08:53:57 UTC
The operation causing the issue is the removal of a snapshot done via the API by vprotect. In the UI I see the exclamation mark with the warning, and the subsequent attempts to remove the snapshot via the UI or the API (by vprotect) fail.

I can provide other logs or info if you need them. I'm on 4.4.5.11 on the engine and the hypervisors are oVirt Node 4.4.5 with no pending updates.

Comment 8 Nir Soffer 2021-04-13 12:28:54 UTC
(In reply to Giorgio Biacchi from comment #7)
> The operation causing the issue is the removal of a snapshot done via API by
> vprotect. In the UI I see the exclamation mark with the warning and the
> subsequent attempts to remove the snapshot via UI or API (by vprotect) fail.

This is expected: if removing a snapshot fails, engine marks the snapshot as illegal,
since some of the data was merged and the snapshot no longer represents the state
of the VM at the time the snapshot was created.

As I explained in comment 5, in this state you can try to delete the snapshot again
from the UI or the API, and the operation should succeed. Please try that, and if the
operation fails again, provide engine and vdsm logs showing how the operation failed.

I want to understand if this is the expected case of a failed snapshot removal that
is fixed by the next attempt, or an unexpected case where a failed snapshot removal
cannot be recovered and requires a manual fix in the database.

Comment 9 Giorgio Biacchi 2021-04-13 13:12:15 UTC
I just tried (again) to remove the snapshot from the UI without success. I filed this bug only after many unsuccessful attempts to remove it via the UI and after an email to the oVirt mailing list.

I'm going to attach the requested logs for my last attempt.

Comment 10 Giorgio Biacchi 2021-04-13 13:13:08 UTC
Created attachment 1771619 [details]
engine.log

Comment 11 Giorgio Biacchi 2021-04-13 13:13:53 UTC
Created attachment 1771620 [details]
vdsm.log

Comment 12 Nir Soffer 2021-04-13 14:42:22 UTC
In vdsm log we see:

1. Starting merge

2021-04-12 13:22:50,526+0000 INFO  (jsonrpc/2) [api.virt] START merge(drive={'poolID': '42135536-980b-4ea9-ab66-b850ed8c1f2b', 'volumeID': 'e77e2773-eecb-46a6-87c7-14df14f7d95b', 'imageID': '2fd9b340-14ce-42cd-a880-8f75eb3b72e7', 'domainID': '459011cf-ebb6-46ff-831d-8ccfafd82c8a'}, baseVolUUID='6180c2e8-141a-4126-bcb1-5af5caa98175', topVolUUID='e77e2773-eecb-46a6-87c7-14df14f7d95b', bandwidth='0', jobUUID='9786d676-978f-4255-ab1e-1c323089f371') from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, vmId=00b50a51-049e-45dd-9cad-d01fb3b8b10e (api:48)

2. Getting volume info fails, since the volume was already removed in previous merge

2021-04-12 13:22:50,683+0000 INFO  (jsonrpc/2) [vdsm.api] FINISH getVolumeInfo error=Volume does not exist: ('6180c2e8-141a-4126-bcb1-5af5caa98175',) from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, task_id=b5f03185-e5ab-4fcc-b6df-74c060fe240d (api:52)
2021-04-12 13:22:50,683+0000 INFO  (jsonrpc/2) [storage.TaskManager.Task] (Task='b5f03185-e5ab-4fcc-b6df-74c060fe240d') aborting: Task is aborted: "value=Volume does not exist: ('6180c2e8-141a-4126-bcb1-5af5caa98175',) abortedcode=201" (task:1190)

3. Merge failed

2021-04-12 13:22:50,684+0000 INFO  (jsonrpc/2) [api.virt] FINISH merge return={'status': {'code': 52, 'message': "Merge failed: {'top': 'e77e2773-eecb-46a6-87c7-14df14f7d95b', 'base': '6180c2e8-141a-4126-bcb1-5af5caa98175', 'job': '9786d676-978f-4255-ab1e-1c323089f371', 'reason': 'Unable to get volume info for domain 459011cf-ebb6-46ff-831d-8ccfafd82c8a volume 6180c2e8-141a-4126-bcb1-5af5caa98175'}"}} from=::ffff:159.149.129.244,47040, flow_id=23f3187a-9e4a-4bad-94cb-22c69e115f04, vmId=00b50a51-049e-45dd-9cad-d01fb3b8b10e (api:54)

The vdsm response is less useful than it could be, since it does not include the reason
for the failure, so engine cannot use this error to complete the merge. However, when
engine gets a MergeFailed error, it should check the current chain and detect that
the merge was completed successfully.
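
The recovery logic described here boils down to: on a MergeFailed error, re-read the actual chain and treat the merge as done if the volume it was supposed to remove has already dropped out. A sketch of that rule (illustration only, not the engine's actual Java code):

def check_merge_outcome(live_chain, volume_to_remove):
    # If the volume that the failed merge was supposed to remove is no
    # longer in the live qemu chain, the merge in fact completed earlier;
    # engine should report success and clean up its database row instead
    # of marking the snapshot ILLEGAL.
    if volume_to_remove not in live_chain:
        return "merge-already-completed"
    return "merge-really-failed"

# With the values from comment 4 and the merge call above:
chain = ["19fa612d-2f93-46a5-93e7-c74afc2937fa",
         "e77e2773-eecb-46a6-87c7-14df14f7d95b",
         "76321669-ac65-4426-8f2e-31b3e2d8050b"]
print(check_merge_outcome(chain, "6180c2e8-141a-4126-bcb1-5af5caa98175"))
# -> merge-already-completed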

Comment 13 Eyal Shenitzky 2021-04-26 10:25:43 UTC
Nir, is there something for us to do here?

Comment 14 Nir Soffer 2021-04-26 10:32:34 UTC
(In reply to Eyal Shenitzky from comment #13)
> Nir, is there something for us to do here?

Yes, when engine is starting a merge and the merge fails, engine should check
the actual image chain with vdsm. If the volume to be merged is not in the chain
it means the merge was successful.

Engine uses this logic during merge, but for some reason it is not used when
starting the merge.

The basic design of live merge is that after any error we can run merge again to 
recover. This bug found a major hole in current code.

Comment 17 Michal Skrivanek 2022-04-20 12:07:47 UTC
(In reply to Nir Soffer from comment #14)

> This bug found a major hole in current code.

And we still have that hole a year later. Still worth a fix?

Comment 18 Arik 2022-05-11 13:59:10 UTC
(In reply to Michal Skrivanek from comment #17)
> (In reply to Nir Soffer from comment #14)
> 
> > This bug found a major hole in current code.
> 
> And we still have that hole a year later. Still worth a fix?

I think it is
We do something similar in create-snapshot to check whether the created volume is being used; this case should be similar.

Comment 19 Arik 2022-05-11 14:01:01 UTC
It is worth looking into whether we can do that here too, to prevent the situation of having a volume in the database that no longer exists on the storage.

Comment 21 Shir Fishbain 2022-05-30 20:02:42 UTC
QE doesn't have the capacity to verify this bug during 4.5.1.

Comment 22 Giulio Casella 2022-07-21 08:52:25 UTC
Any news about this issue? The situation has become unsustainable: every day, following backups, we have to manually fix illegal snapshots.
To fix it we have to live migrate the disk with the illegal snapshot to another storage domain (after this operation the snapshot is no longer marked as illegal), then we can remove the snapshot, and finally move the disk back to the original storage domain. You can imagine this operation is complex, especially for big disks (a scripted sketch of this workaround follows this comment).
Another element to consider: the VMs having this issue are always the same (more or less), with no apparent common pattern (Windows and Linux, small and big disks, VMs with one or two disks, running on different hypervisors, ...), and it does not happen every time.
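
A minimal sketch of that workaround with the Python SDK (ovirtsdk4), using placeholder credentials, disk ID, and target domain name:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details -- adjust for your engine.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)

# Placeholder disk ID: the disk carrying the illegal snapshot.
disk_service = connection.system_service().disks_service().disk_service("DISK-ID")

# Live storage migration to another domain; after it completes the
# snapshot is reportedly no longer marked illegal and can be removed,
# after which the disk can be moved back the same way.
disk_service.move(storage_domain=types.StorageDomain(name="other_domain"))

connection.close()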

Comment 23 Arik 2022-07-24 12:15:27 UTC
(In reply to Giulio Casella from comment #22)
> Any news about this issue? The situation has become unsustainable [...]

a potential fix is posted and being reviewed now

Comment 25 Giulio Casella 2022-08-26 10:26:00 UTC
Should this bug be fixed in ovirt-engine-4.5.2.2? If so, I'm afraid the issue is still present (I'm on ovirt-engine-4.5.2.4-1.el8.noarch).

Comment 26 Mark Kemel 2022-08-28 05:50:36 UTC
Hi Giulio, have you tried to remove the illegal snapshot? The operation should succeed now.

Comment 27 Giulio Casella 2022-08-29 07:25:22 UTC
Hi Mark, tried (many times), with no luck. See attached screenshot.

Comment 28 Giulio Casella 2022-08-29 07:26:06 UTC
Created attachment 1908247 [details]
Snapshot removal in UI

Comment 29 Mark Kemel 2022-08-29 07:31:51 UTC
Please provide logs - engine and vdsm.

Comment 30 Giulio Casella 2022-08-29 08:09:41 UTC
Created attachment 1908249 [details]
Log for engine

Comment 31 Giulio Casella 2022-08-29 08:10:03 UTC
Created attachment 1908250 [details]
Log for host host01

Comment 32 Giulio Casella 2022-08-29 08:10:24 UTC
Created attachment 1908251 [details]
Log for host host04

Comment 33 Giulio Casella 2022-08-29 08:12:29 UTC
Attached logs for engine and vdsm (from 2 hypervisors). VMs involved in recent failures are "dylan" (running on host01) and "fileserver" (running on host04).

Comment 34 Mark Kemel 2022-08-30 06:25:36 UTC
Hi Giulio,
Looking at the logs, it seems that the issue you are facing currently is not the same issue described above.

2022-08-29 09:17:11,890+02 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [0eced56f-689d-422b-b15c-20b824377b08] Failed to live merge. Top volume f8f84b1c-53ab-4c99-a01d-743ed3d7859b is still in qemu chain [0ea89fbc-d39a-48ff-aa2b-0381d79d7714, 55bb387f-01a6-41b6-b585-4bcaf2ea5e32, f8f84b1c-53ab-4c99-a01d-743ed3d7859b]

This log entry suggests that the volume cannot be merged for some reason. The VDSM logs you have provided do not cover the time when this error occurred.
In any case, since it is another issue, I suggest moving this discussion to [ovirt-users].

Comment 35 Mark Kemel 2022-08-30 07:43:12 UTC
Or file a new bug

Comment 36 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

Comment 37 Giorgio Biacchi 2022-09-06 14:37:57 UTC
I think we have ended up in this situation again. We opened another bug (https://bugzilla.redhat.com/show_bug.cgi?id=2122525) for snapshot-related problems and, until today, we were able to recover disks from illegal snapshot status back to legal by live migrating the disk image(s) to another data domain. Today one of these tasks failed and now I have this situation:

[root@host03 ~]# vdsm-tool dump-volume-chains fbade149-e448-4f02-b382-afd28c0d9ffd

Images volume chains (base volume first)

   image:    fdfa6a94-35ed-4a9b-b8ef-438d9dfe8ea1

             - 55bb387f-01a6-41b6-b585-4bcaf2ea5e32
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 5234491392

             - ead5265d-dacb-4727-95c4-93ff46a496e6
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 805306368

             - af37524f-c311-4ff2-adc2-232b5c02539b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 8858370048

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='fdfa6a94-35ed-4a9b-b8ef-438d9dfe8ea1';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 9dbb2076-c805-4d89-8209-081ecd46c59c | ead5265d-dacb-4727-95c4-93ff46a496e6 |           1 | 84d87810-b61e-4c07-bffc-cb6e420ca21e |           2 |             4 | f
 af37524f-c311-4ff2-adc2-232b5c02539b | 9dbb2076-c805-4d89-8209-081ecd46c59c |           1 | 8618fd25-b951-4926-aafa-5e93c35e1fdd |           2 |             4 | t
 ead5265d-dacb-4727-95c4-93ff46a496e6 | 55bb387f-01a6-41b6-b585-4bcaf2ea5e32 |           1 | bcf88e2a-15ac-4a47-8d27-96452a8ccce6 |           2 |             4 | f
 55bb387f-01a6-41b6-b585-4bcaf2ea5e32 | 00000000-0000-0000-0000-000000000000 |           1 | 20a89231-643b-465a-bd19-a84a1e266ac1 |           2 |             4 | f

We tried again to remove the snapshots from the UI but the task fails. To further complicate things, the VM has two disks, but the situation of the other disk seems OK:

   image:    c4c09af6-976f-40cb-92a1-5e71bcea19ae

             - eb4a433b-eb2c-4a7e-8899-98692c2ae005
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 2257676402688

             - ff3dfbd4-266f-427d-8939-600a5c90ec3d
               status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 8053063680

             - 93b26585-fb7b-48f0-87f2-9d2eac731e2b
               status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 2684354560000, truesize: 9529458688

engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='c4c09af6-976f-40cb-92a1-5e71bcea19ae';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 93b26585-fb7b-48f0-87f2-9d2eac731e2b | ff3dfbd4-266f-427d-8939-600a5c90ec3d |           1 | 8618fd25-b951-4926-aafa-5e93c35e1fdd |           2 |             4 | t
 ff3dfbd4-266f-427d-8939-600a5c90ec3d | eb4a433b-eb2c-4a7e-8899-98692c2ae005 |           4 | 42ef6853-d7ac-4967-b93f-bb5425963891 |           2 |             4 | f
 eb4a433b-eb2c-4a7e-8899-98692c2ae005 | 00000000-0000-0000-0000-000000000000 |           1 | 20a89231-643b-465a-bd19-a84a1e266ac1 |           2 |             4 | f
(3 rows)

I'm going to attach engine and vdsm logs asap.

Comment 38 Giorgio Biacchi 2022-09-06 14:39:01 UTC
Created attachment 1909818 [details]
engine.log

Comment 39 Giorgio Biacchi 2022-09-06 14:39:31 UTC
Created attachment 1909819 [details]
vdsm.log

Comment 40 Mark Kemel 2022-09-07 15:40:58 UTC
Hi Giorgio, please provide the relevant content of the snapshots table

select snapshot_id,snapshot_type,status,description from snapshots where vm_id='e058515c-8276-44d7-9f3c-e5429aa593ed';

(I assume the VM id is e058515c-8276-44d7-9f3c-e5429aa593ed, right?)

Comment 41 Giorgio Biacchi 2022-09-07 16:05:17 UTC
Hi,
here's the query result:

engine=# select snapshot_id,snapshot_type,status,description from snapshots where vm_id='e058515c-8276-44d7-9f3c-e5429aa593ed';
             snapshot_id              | snapshot_type | status |                        description
--------------------------------------+---------------+--------+------------------------------------------------------------
 42ef6853-d7ac-4967-b93f-bb5425963891 | REGULAR       | OK     | fileserver_Disk2 Auto-generated for Live Storage Migration
 8618fd25-b951-4926-aafa-5e93c35e1fdd | ACTIVE        | OK     | Active VM
 bcf88e2a-15ac-4a47-8d27-96452a8ccce6 | REGULAR       | OK     | fileserver_Disk1 Auto-generated for Live Storage Migration
 20a89231-643b-465a-bd19-a84a1e266ac1 | REGULAR       | OK     | vProtect 2022-09-05 12:12:44.587166
(4 rows)
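
A quick way to spot the inconsistency that comment 46 later identifies (an images row pointing at a snapshot that no longer exists) is to join the two tables. A sketch using psycopg2, assuming local access to the engine database (credentials are placeholders; on a typical setup they live in /etc/ovirt-engine/engine.conf.d/10-setup-database.conf):

import psycopg2

# Placeholder connection settings -- adjust for your engine host.
conn = psycopg2.connect(dbname="engine", user="engine",
                        password="secret", host="localhost")
cur = conn.cursor()

# Image rows whose vm_snapshot_id has no matching snapshots row:
cur.execute("""
    SELECT i.image_guid, i.vm_snapshot_id
      FROM images i
      LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
     WHERE s.snapshot_id IS NULL
""")
for image_guid, vm_snapshot_id in cur.fetchall():
    print(image_guid, "->", vm_snapshot_id)   # orphaned reference

conn.close()

Run against the data above, this would flag 9dbb2076-c805-4d89-8209-081ecd46c59c, whose vm_snapshot_id 84d87810-b61e-4c07-bffc-cb6e420ca21e does not appear in the snapshots table.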

Comment 42 Giorgio Biacchi 2022-10-19 10:15:45 UTC
Hi,
I still have the VM for which I reopened this bug, with snapshots in an illegal state. I have not changed anything, in case you need other info, but now oVirt 4.5.3 is out and I'd like to know how to proceed to bring the situation back to a normal state.

There's still a weekly task running for snapshot cleanup by vProtect that fails with the following messages:  

18 Oct 2022, 12:32:37
Failed to delete snapshot 'vProtect 2022-09-05 12:12:44.587166' for VM 'fileserver'.

18 Oct 2022, 12:31:39
VDSM host03 command MergeVDS failed: Merge failed: {'job': 'cb28c1b3-4084-4cf1-a8e6-bd6c3db958cc', 'reason': "block copy still active: disk 'sdb' already in active block job"}

Can you please tell me if it's safe to proceed with the update to ovirt-4.5.3 or give me the instructions to fix the database?

Thanks
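
To check whether 'sdb' really has a lingering block job on the host (the MergeVDS failure quoted in comment 42 above), libvirt can be queried directly. A read-only sketch with libvirt-python, run on the hypervisor hosting the VM (the domain name 'fileserver' is taken from comment 33):

import libvirt

# Read-only connection to the local libvirt daemon.
conn = libvirt.openReadOnly("qemu:///system")
dom = conn.lookupByName("fileserver")   # VM name from comment 33

# An empty dict means no active block job on the disk; a non-empty one
# would explain the "already in active block job" error above.
info = dom.blockJobInfo("sdb", 0)
print(info or "no active block job on sdb")

conn.close()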

Comment 43 Giulio Casella 2022-10-24 09:45:32 UTC
Hi,
Any news about this bug? I saw Arik removed the 4.5.3 milestone as a target; is it still worth updating to 4.5.3?

Comment 45 Arik 2022-10-24 14:20:09 UTC
(In reply to Giulio Casella from comment #43)
> Hi,
> any news about this bug? I saw Arik removed 4.5.3 milestone as a target,
> still worth updating to 4.5.3?

The fix will definitely not make it into 4.5.3, which was already released, but we need to take another look at this.

Comment 46 Arik 2022-10-30 17:09:50 UTC
(In reply to Giorgio Biacchi from comment #37)
> I think we have ended up in this situation again. We opened another bug
> (https://bugzilla.redhat.com/show_bug.cgi?id=2122525) for snapshot-related
> problems and, until today, we were able to recover disks from illegal
> snapshot status back to legal by live migrating the disk image(s) to
> another data domain.

yeah, that indeed seems different
in the original report we had three volumes on the storage:

> vdsm-tool dump-volume-chains 459011cf-ebb6-46ff-831d-8ccfafd82c8a
> 
> ...
> 
>  image:    2fd9b340-14ce-42cd-a880-8f75eb3b72e7
> 
>              - 19fa612d-2f93-46a5-93e7-c74afc2937fa
>                status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 45902462976
> 
>              - e77e2773-eecb-46a6-87c7-14df14f7d95b
>                status: OK, voltype: INTERNAL, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 19327352832
> 
>              - 76321669-ac65-4426-8f2e-31b3e2d8050b
>                status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE, capacity: 53687091200, truesize: 1073741824

and four volumes in the database:

> engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='2fd9b340-14ce-42cd-a880-8f75eb3b72e7';
>               image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active 
> --------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
>  76321669-ac65-4426-8f2e-31b3e2d8050b | e77e2773-eecb-46a6-87c7-14df14f7d95b |           1 | b8be90ca-a1be-460e-875a-3eda2f31a27e |           2 |             4 | t
>  6180c2e8-141a-4126-bcb1-5af5caa98175 | 19fa612d-2f93-46a5-93e7-c74afc2937fa |           4 | 281a4331-89fa-4544-b6ab-22c0e5ac6459 |           2 |             4 | f
>  19fa612d-2f93-46a5-93e7-c74afc2937fa | 00000000-0000-0000-0000-000000000000 |           1 | 6488f4c8-efc4-43df-aafc-497fb7d51fbb |           2 |             4 | f
>  e77e2773-eecb-46a6-87c7-14df14f7d95b | 6180c2e8-141a-4126-bcb1-5af5caa98175 |           1 | f44957b0-08b4-4133-b067-3ee1d9f86f02 |           2 |             4 | f
> (4 rows)

and all the snapshots that these volumes are correlated with existed in the database:

db=# select snapshot_id, vm_id, snapshot_type,description from snapshots where vm_id='00b50a51-049e-45dd-9cad-d01fb3b8b10e';
             snapshot_id              |                vm_id                 | snapshot_type |             description             
--------------------------------------+--------------------------------------+---------------+-------------------------------------
 f44957b0-08b4-4133-b067-3ee1d9f86f02 | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-12 15:24:21.870915
 281a4331-89fa-4544-b6ab-22c0e5ac6459 | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-09 15:00:13.983112
 b8be90ca-a1be-460e-875a-3eda2f31a27e | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | ACTIVE        | Active VM
 6488f4c8-efc4-43df-aafc-497fb7d51fbb | 00b50a51-049e-45dd-9cad-d01fb3b8b10e | REGULAR       | vProtect 2021-04-09 09:00:13.482308

So we could recover from that state as suggested in comment 14, but the scenario described in comment 37 is different: the VM snapshot 84d87810-b61e-4c07-bffc-cb6e420ca21e that the volume we want to remove is correlated with does not exist in the snapshots table.

Therefore I'm closing this bug again, and I've changed its title to better reflect what was actually addressed.

As for the issue described in comment 37: in order to understand what is going on, we really need to know more about that VM snapshot 84d87810-b61e-4c07-bffc-cb6e420ca21e, which is not covered at all by the provided logs. If you can get more information on it, please file a GitHub issue: https://github.com/oVirt/ovirt-engine/issues

(In reply to Giorgio Biacchi from comment #42)
> Can you please tell me if it's safe to proceed with the update to
> ovirt-4.5.3 or give me the instructions to fix the database?

I don't see why it should block you from upgrading to 4.5.3

