Bug 1980428 - [CBT] Starting backup fails, disks duplicated in UI/API when multiple completed image transfers exist for the same disk
Summary: [CBT] Starting backup fails, disks duplicated in UI/API when multiple completed image transfers exist for the same disk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.7-1
Assignee: Pavel Bar
QA Contact: Amit Sharir
URL:
Whiteboard:
Depends On:
Blocks: 1981297
 
Reported: 2021-07-08 16:09 UTC by Amit Sharir
Modified: 2021-07-22 15:14 UTC
CC: 7 users

Fixed In Version: ovirt-engine-4.4.7.7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-22 15:14:21 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine.log (4.21 MB, text/plain)
2021-07-08 16:09 UTC, Amit Sharir
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:2868 0 None None None 2021-07-22 15:14:29 UTC
oVirt gerrit 115692 0 master MERGED core: fix the 'images_storage_domain_view' to show disks only once 2021-07-19 12:20:23 UTC
oVirt gerrit 115772 0 ovirt-engine-4.4.7.z MERGED core: fix the 'images_storage_domain_view' to show disks only once 2021-07-19 15:55:19 UTC

Description Amit Sharir 2021-07-08 16:09:11 UTC
Created attachment 1799704 [details]
engine.log

Description of problem:

When trying to run a full backup multiple times using the SDK example script backup_vm.py, the backup fails with an error.

Version-Release number of selected component (if applicable):
vdsm-4.40.70.6-1.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run the following command more than two times:
python3 ./backup_vm.py -c engine full <vm-uuid>


Actual results:
[root@<host> examples]# python3 ./backup_vm.py -c engine full 037936b9-2604-4569-80b8-97ccbcaf0043
[   0.0 ] Starting full backup for VM '037936b9-2604-4569-80b8-97ccbcaf0043'
Traceback (most recent call last):
  File "./backup_vm.py", line 493, in <module>
    main()
  File "./backup_vm.py", line 176, in main
    args.command(args)
  File "./backup_vm.py", line 190, in cmd_full
    backup = start_backup(connection, args)
  File "./backup_vm.py", line 319, in start_backup
    from_checkpoint_id=args.from_checkpoint_uuid
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 34139, in add
    return self._internal_add(backup, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Internal Engine Error]". HTTP response code is 400.




Expected results:
[root@<host> examples]# python3 ./backup_vm.py -c engine full 037936b9-2604-4569-80b8-97ccbcaf0043
[   0.0 ] Starting full backup for VM '037936b9-2604-4569-80b8-97ccbcaf0043'
[   2.8 ] Waiting until backup '933c820d-4527-48a3-8941-a7bfccf85485' is ready
[   3.9 ] Created checkpoint '27076bb1-e351-449f-ad66-f67f9e15478f' (to use in --from-checkpoint-uuid for the next incremental backup)
[   4.0 ] Creating image transfer for disk '94678a53-850e-4cad-bca7-5d78712e5ccd'
[   6.3 ] Image transfer '05225022-989d-4af7-b898-1e2788daa5f9' is ready
[ 100.00% ] 10.00 GiB, 15.91 seconds, 643.77 MiB/s                             
[  22.2 ] Finalizing image transfer
[  29.4 ] Finalizing backup
[  29.5 ] Waiting until backup is being finalized
[  29.5 ] Full backup completed successfully

Additional info:

Comment 4 Benny Zlotnik 2021-07-08 16:15:50 UTC
This seems to be the issue

2021-07-08 18:50:08,636+03 ERROR [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-6) [f02fd0a2-df55-4f20-9c73-ec856c74650a] Command 'org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand' failed: CallableStatementCallback; SQL [{call insertvmbackupdiskmap(?, ?)}ERROR: duplicate key value violates unique constraint "vm_backup_disk_map_pkey"
  Detail: Key (backup_id, disk_id)=(1f754e89-8003-4e27-9289-8e8a22648ab9, 94678a53-850e-4cad-bca7-5d78712e5ccd) already exists.
  Where: SQL statement "INSERT INTO vm_backup_disk_map (
            backup_id,
            disk_id
            )
        VALUES (
            v_backup_id,
            v_disk_id
            )"
PL/pgSQL function insertvmbackupdiskmap(uuid,uuid) line 4 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "vm_backup_disk_map_pkey"
  Detail: Key (backup_id, disk_id)=(1f754e89-8003-4e27-9289-8e8a22648ab9, 94678a53-850e-4cad-bca7-5d78712e5ccd) already exists.
  Where: SQL statement "INSERT INTO vm_backup_disk_map (
            backup_id,
            disk_id
            )
        VALUES (
            v_backup_id,
            v_disk_id
            )"
PL/pgSQL function insertvmbackupdiskmap(uuid,uuid) line 4 at SQL statement

Comment 6 Nir Soffer 2021-07-08 23:09:19 UTC
I reproduced it locally.

I created a full backup, then one incremental backup; both completed successfully.
The third backup failed with the same error as in comment 4.

What I see in the API:

<backups>
  <backup href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5dc722c8-47e6-405e-b193-efe8c8bc922f" id="5dc722c8-47e6-405e-b193-efe8c8bc922f">
    <actions>
      <link href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5dc722c8-47e6-405e-b193-efe8c8bc922f/finalize" rel="finalize"/>
    </actions>
    <link href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5dc722c8-47e6-405e-b193-efe8c8bc922f/disks" rel="disks"/>
    <creation_date>2021-07-08T21:26:21.544+03:00</creation_date>
    <phase>succeeded</phase>
    <to_checkpoint_id>069bfd97-76b1-4b9b-af67-0a7fe938c382</to_checkpoint_id>
    <host href="/ovirt-engine/api/hosts/0de92f8d-a32f-4058-a741-8cfbabd48e13" id="0de92f8d-a32f-4058-a741-8cfbabd48e13"/>
    <vm href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab" id="9c4050b1-8070-429f-ae36-9eae0935ccab"/>
  </backup>
  <backup href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5da2718a-e168-43fe-8b0e-fd8227cfb6dd" id="5da2718a-e168-43fe-8b0e-fd8227cfb6dd">
    <actions>
      <link href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5da2718a-e168-43fe-8b0e-fd8227cfb6dd/finalize" rel="finalize"/>
    </actions>
    <link href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5da2718a-e168-43fe-8b0e-fd8227cfb6dd/disks" rel="disks"/>
    <creation_date>2021-07-08T21:27:31.385+03:00</creation_date>
    <from_checkpoint_id>069bfd97-76b1-4b9b-af67-0a7fe938c382</from_checkpoint_id>
    <phase>succeeded</phase>
    <to_checkpoint_id>cf107646-30eb-4f58-bafb-bcaadaa0fd96</to_checkpoint_id>
    <host href="/ovirt-engine/api/hosts/0de92f8d-a32f-4058-a741-8cfbabd48e13" id="0de92f8d-a32f-4058-a741-8cfbabd48e13"/>
    <vm href="/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab" id="9c4050b1-8070-429f-ae36-9eae0935ccab"/>
  </backup>
</backups>

2 backups, as expected.

Looking at the disks of the first backup:

    https://engine-dev/ovirt-engine/api/vms/9c4050b1-8070-429f-ae36-9eae0935ccab/backups/5dc722c8-47e6-405e-b193-efe8c8bc922f/disks

I see the same disk twice:

  <disks>
    <disk id="3649d84b-6f35-4314-900a-5e8024e3905c">
    ...
    </disk>
    <disk id="3649d84b-6f35-4314-900a-5e8024e3905c">
    ...
    </disk>
  </disks>

This vm has one disk!

Looking at Compute > Virtual Machines > <vm name> > Disks, I see the same disk twice again!
(see attachment 1799832 [details])

Backups fail consistently with the same error until the backups are deleted
by the cleanup thread.

Once the backups are deleted, we can start new backups successfully.

So I think what's going on is that keeping the backups in the db exposed an old
issue with the way we list vm disks. When we have multiple backups, listing the
vm disks returns the same disk once for every backup.

When we try to start a new backup, the engine lists the same disk twice again, and
the insert into the vm_backup_disk_map table then violates its sql unique constraint.
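
For reference, here is a minimal sketch of why the insert fails, using only the
table and key names visible in the error in comment 4 (illustrative only, not the
engine's actual code path):

  -- vm_backup_disk_map_pkey is a unique key on (backup_id, disk_id), so inserting
  -- the same pair twice - which is what happens when the disk list contains
  -- duplicates - is rejected.
  INSERT INTO vm_backup_disk_map (backup_id, disk_id)
  VALUES ('1f754e89-8003-4e27-9289-8e8a22648ab9', '94678a53-850e-4cad-bca7-5d78712e5ccd');
  -- Running the identical INSERT again fails with:
  -- ERROR: duplicate key value violates unique constraint "vm_backup_disk_map_pkey"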

This bug may break other flows; we cannot keep backups until we understand this
issue.

Comment 8 Nir Soffer 2021-07-11 20:20:44 UTC
We found the reason for this bug - keeping image transfers after they
finish, introduced in https://gerrit.ovirt.org/c/ovirt-engine/+/115350
for bug 1849861.

It turns out the system assumes that there is exactly zero or one
image transfer entity for every disk.

In images_storage_domain_view we select fields from image_transfers
table:

 SELECT images.image_guid,
    storage_domain_static.storage_name,
    storage_pool_iso_map.storage_pool_id,
    storage_domain_static.storage_type,
    images.creation_date,
    images.size,
    images.it_guid,
    snap.description,
    snap.creation_date AS snapshot_creation_date,
    images.parentid,
    images.lastmodified,
    snap.app_list,
    image_storage_domain_map.storage_domain_id AS storage_id,
    images.vm_snapshot_id,
    images.volume_type,
    images.volume_format,
    images.qcow_compat,
    images.imagestatus,
    images.image_group_id,
    images.active,
    images.volume_classification,
    vms_for_disk_view.entity_type,
    array_to_string(vms_for_disk_view.array_vm_names, ','::text) AS vm_names,
    COALESCE(array_upper(vms_for_disk_view.array_vm_names, 1), 0) AS number_of_vms,
    array_to_string(vms_for_disk_view.array_template_version_name, ','::text) AS template_version_names,
    base_disks.disk_id,
    base_disks.disk_alias,
    base_disks.disk_description,
    base_disks.shareable,
    base_disks.wipe_after_delete,
    base_disks.propagate_errors,
    base_disks.sgio,
    base_disks.disk_content_type,
    base_disks.backup,
    base_disks.backup_mode,
    image_storage_domain_map.quota_id,
    quota.quota_name,
    storage_pool.quota_enforcement_type,
    image_storage_domain_map.disk_profile_id,
    disk_profiles.name AS disk_profile_name,
    disk_image_dynamic.actual_size,
    disk_image_dynamic.read_rate,
    disk_image_dynamic.read_ops,
    disk_image_dynamic.write_rate,
    disk_image_dynamic.write_ops,
    disk_image_dynamic.read_latency_seconds,
    disk_image_dynamic.write_latency_seconds,
    disk_image_dynamic.flush_latency_seconds,
    base_disks.disk_storage_type,
    base_disks.cinder_volume_type,
    image_transfers.phase AS image_transfer_phase,
    image_transfers.type AS image_transfer_type,
    image_transfers.bytes_sent AS image_transfer_bytes_sent,
    image_transfers.bytes_total AS image_transfer_bytes_total,
    entity_step_progress.progress
   FROM images
     LEFT JOIN disk_image_dynamic ON images.image_guid = disk_image_dynamic.image_id
     LEFT JOIN base_disks ON images.image_group_id = base_disks.disk_id
     LEFT JOIN vms_for_disk_view ON vms_for_disk_view.device_id = images.image_group_id
     LEFT JOIN image_storage_domain_map ON image_storage_domain_map.image_id = images.image_guid
     LEFT JOIN storage_domain_static ON image_storage_domain_map.storage_domain_id = storage_domain_static.id
     LEFT JOIN snapshots snap ON images.vm_snapshot_id = snap.snapshot_id
     LEFT JOIN quota ON image_storage_domain_map.quota_id = quota.id
     LEFT JOIN disk_profiles ON image_storage_domain_map.disk_profile_id = disk_profiles.id
     LEFT JOIN storage_pool_iso_map ON storage_pool_iso_map.storage_id = storage_domain_static.id
     LEFT JOIN storage_pool ON storage_pool.id = storage_pool_iso_map.storage_pool_id
     LEFT JOIN image_transfers ON images.image_group_id = image_transfers.disk_id
     LEFT JOIN entity_step_progress ON images.image_group_id = entity_step_progress.entity_id
  WHERE images.image_guid <> '00000000-0000-0000-0000-000000000000'::uuid;

In the past there was zero or one image transfer per disk, so we got
at most one entry per disk in the view. Now we may have multiple
image transfers per disk since image transfers are kept for 15 minutes
after they complete, so we get one entry per image transfer.
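
To make the join behavior concrete, here is a self-contained illustration using
throwaway temp tables (demo_disks and demo_transfers are hypothetical names, not
the engine schema): a LEFT JOIN emits one output row per matching right-side row,
so two kept transfers for one disk yield two rows for that disk.

  CREATE TEMP TABLE demo_disks (disk_id int);
  CREATE TEMP TABLE demo_transfers (disk_id int, phase text);
  INSERT INTO demo_disks VALUES (1);
  -- two completed transfers kept for the same disk
  INSERT INTO demo_transfers VALUES (1, 'finished'), (1, 'finished');
  SELECT d.disk_id, t.phase
  FROM demo_disks d
  LEFT JOIN demo_transfers t ON t.disk_id = d.disk_id;
  -- returns 2 rows for the single disk, one per kept transfer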

When the system has more than one image transfer per disk, disks
are duplicated in all the queries that use images_storage_domain_view, such as:

- all_disks_including_snapshots
- all_disks_including_snapshots_and_memory
- all_disks_for_vms (uses all_disks_including_snapshots_and_memory)

This causes the same disk to appear multiple times in:
- the Storage > Disks tab
- the Compute > Virtual Machines > Disks tab
- the Compute > Virtual Machines > Snapshots tab (disks list)

Here is a simple example that duplicates one disk 10 times:

$ for i in $(seq 10); do
    ./download_disk.py -c engine-dev 3649d84b-6f35-4314-900a-5e8024e3905c /dev/shm/tmp.qcow2
done

# select image_group_id from all_disks_for_vms where vm_id = '9c4050b1-8070-429f-ae36-9eae0935ccab';
            image_group_id            
--------------------------------------
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
 3649d84b-6f35-4314-900a-5e8024e3905c
(11 rows)


See attached screenshots - the disk appears 10 times everywhere.

Once we have more than one image transfer, trying to back up the vm
will fail:

$ ./backup_vm.py -c engine-dev full 9c4050b1-8070-429f-ae36-9eae0935ccab
[   0.0 ] Starting full backup for VM '9c4050b1-8070-429f-ae36-9eae0935ccab'
Traceback (most recent call last):
  File "./backup_vm.py", line 509, in <module>
    main()
  File "./backup_vm.py", line 176, in main
    args.command(args)
  File "./backup_vm.py", line 190, in cmd_full
    backup = start_backup(connection, args)
  File "./backup_vm.py", line 317, in start_backup
    backup = backups_service.add(
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 34139, in add
    return self._internal_add(backup, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Internal Engine Error]". HTTP response code is 400.

In engine log we see:

20-4840-9be2-8f4a3dbda4c9] Creating VmBackup entity for VM '9c4050b1-8070-429f-ae36-9eae0935ccab'
2021-07-11 23:10:45,977+03 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (default task-1) [cc356315-dd20-4840-9be2-8f4a3dbda4c9] transaction rolled back
2021-07-11 23:10:45,977+03 ERROR [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-1) [cc356315-dd20-4840-9be2-8f4a3dbda4c9] Command 'org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand' failed: CallableStatementCallback; SQL [{call insertvmbackupdiskmap(?, ?)}ERROR: duplicate key value violates unique constraint "vm_backup_disk_map_pkey"
  Detail: Key (backup_id, disk_id)=(9562af1c-d2cb-41b2-943d-93dafdbd2754, 3649d84b-6f35-4314-900a-5e8024e3905c) already exists.
  Where: SQL statement "INSERT INTO vm_backup_disk_map (
            backup_id,
            disk_id
            )
        VALUES (
            v_backup_id,
            v_disk_id
            )"


The backup gets multiple entries for disk 3649d84b-6f35-4314-900a-5e8024e3905c
and tries to add duplicate pairs to vm_backup_disk_map, which is not allowed.

I think the way to fix this is to consider only active image transfers
in images_storage_domain_view. We cannot have more than one active
transfer for the same disk due to locking.
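
Here is a minimal sketch of that idea, using the same kind of throwaway temp tables
as above (hypothetical names and phase values; the actual fix is the one merged via
gerrit 115692 and may differ): filtering the transfer side inside the LEFT JOIN
condition keeps the disk row while ignoring finished transfers.

  CREATE TEMP TABLE demo_disks2 (disk_id int);
  CREATE TEMP TABLE demo_transfers2 (disk_id int, phase text);
  INSERT INTO demo_disks2 VALUES (1);
  INSERT INTO demo_transfers2 VALUES (1, 'finished'), (1, 'finished');
  SELECT d.disk_id, t.phase
  FROM demo_disks2 d
  LEFT JOIN demo_transfers2 t
         ON t.disk_id = d.disk_id
        AND t.phase <> 'finished';   -- join only non-finished (active) transfers
  -- returns a single row (disk_id = 1, phase = NULL) instead of two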

The UI is also designed to support one active image transfer per disk,
since we show the transfer status in the disk status column.

Comment 12 Amit Sharir 2021-07-12 07:50:52 UTC
I ran a double full backup scenario on the relevant version mentioned in the description of the bug.
Version:

vdsm-4.40.70.6-1.el8ev.x86_64
ovirt-engine-4.4.7.6-0.11.el8ev.noarch
rhv-release-4.4.7-6-001.noarch

 
After 2 sequential successful backups, I started receiving errors in my terminal.

I also saw 2 appearances of the same disk in the UI (as mentioned in comment #6, attachment https://bugzilla.redhat.com/attachment.cgi?id=1799832).


Terminal responses for the sequential backups:

[root@<env>-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 4e5a3871-42ac-4ffb-bb8e-0210dea35ffe
[   0.0 ] Starting full backup for VM '4e5a3871-42ac-4ffb-bb8e-0210dea35ffe'
[   1.7 ] Waiting until backup 'd5e96ca8-5b18-4c61-bfd6-c9212be27b2a' is ready
[   2.8 ] Created checkpoint '890468d7-dd71-486a-aa68-3182e89068ae' (to use in --from-checkpoint-uuid for the next incremental backup)
[   4.2 ] Creating image transfer for disk '5c0ae12a-063d-4937-a48b-c46674f196bd'
[   9.2 ] Image transfer 'a3878d0b-094e-4643-8450-22e58e3645a6' is ready
[ 100.00% ] 10.00 GiB, 11.53 seconds, 887.94 MiB/s                             
[  20.8 ] Finalizing image transfer
[  32.0 ] Finalizing backup
[  32.1 ] Waiting until backup is being finalized
[  32.2 ] Full backup completed successfully
[root@<env>-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 4e5a3871-42ac-4ffb-bb8e-0210dea35ffe
[   0.0 ] Starting full backup for VM '4e5a3871-42ac-4ffb-bb8e-0210dea35ffe'
[   1.8 ] Waiting until backup '4ece78a5-f3c7-43e2-89b6-eb5c20e600eb' is ready
[   2.8 ] Created checkpoint '3a383fb9-caf2-4549-beed-1624fdd07c8c' (to use in --from-checkpoint-uuid for the next incremental backup)
[   2.9 ] Creating image transfer for disk '5c0ae12a-063d-4937-a48b-c46674f196bd'
[   4.9 ] Image transfer 'a0efada5-be91-42e8-8f18-70a0d725bc3a' is ready
[ 100.00% ] 10.00 GiB, 12.90 seconds, 793.49 MiB/s                             
[  17.8 ] Finalizing image transfer
[  18.8 ] Finalizing backup
[  18.8 ] Waiting until backup is being finalized
[  18.9 ] Full backup completed successfully
[root@<env>-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 4e5a3871-42ac-4ffb-bb8e-0210dea35ffe
[   0.0 ] Starting full backup for VM '4e5a3871-42ac-4ffb-bb8e-0210dea35ffe'
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 493, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 176, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 190, in cmd_full
    backup = start_backup(connection, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 319, in start_backup
    from_checkpoint_id=args.from_checkpoint_uuid
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 34139, in add
    return self._internal_add(backup, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Internal Engine Error]". HTTP response code is 400.
[root@<env>-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 4e5a3871-42ac-4ffb-bb8e-0210dea35ffe
[   0.0 ] Starting full backup for VM '4e5a3871-42ac-4ffb-bb8e-0210dea35ffe'
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 493, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 176, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 190, in cmd_full
    backup = start_backup(connection, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 319, in start_backup
    from_checkpoint_id=args.from_checkpoint_uuid
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 34139, in add
    return self._internal_add(backup, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Internal Engine Error]". HTTP response code is 400.






After ~20 minutes, the duplicate disk entries vanished and I was left with only 1 disk entry in the UI.
In addition, I was able to complete a full backup of the same VM again.

Response from the additional backup:

[root@<env>-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 4e5a3871-42ac-4ffb-bb8e-0210dea35ffe
[   0.0 ] Starting full backup for VM '4e5a3871-42ac-4ffb-bb8e-0210dea35ffe'
[   1.2 ] Waiting until backup '38c03df6-913e-49b3-a26c-e8624eff460a' is ready
[   2.2 ] Created checkpoint 'b9cdad7d-9f01-482d-ad5f-1b4c163ea9d9' (to use in --from-checkpoint-uuid for the next incremental backup)
[   2.3 ] Creating image transfer for disk '5c0ae12a-063d-4937-a48b-c46674f196bd'
[   4.1 ] Image transfer 'f32d9624-aa8e-406d-b0b8-a598d77542ad' is ready
[ 100.00% ] 10.00 GiB, 12.11 seconds, 845.48 MiB/s                             
[  16.2 ] Finalizing image transfer
[  17.2 ] Finalizing backup
[  17.4 ] Waiting until backup is being finalized
[  17.5 ] Full backup completed successfully


Adding the relevant logs in the following comments.

Comment 17 Nir Soffer 2021-07-12 16:35:54 UTC
Benny, I don't think we can release 4.4.7 at this broken state.

Any operation consuming disks from the database can be affected
by the duplicate entries for many minutes.

We can still fix for 4.4.7 async.

Comment 18 Amit Sharir 2021-07-13 11:41:56 UTC
Nir, we are not sure we understand what your last comment means. 
What exactly is the impact of what you mentioned?
Can you mention a few flows that might be affected?

Comment 19 Pavel Bar 2021-07-13 14:23:23 UTC
Another scenario that fails and needs to be tested after the bug is fixed:
1) Start the backup (can be a cold backup, 1 disk, VM Guid is "a1b9af74-bfe3-417b-9508-7efe5e068952"), run from CLI:
./backup_vm.py -c engine1 start a1b9af74-bfe3-417b-9508-7efe5e068952
2) Download the disk (backup Guid is "32c6b9b8-5e7c-418e-9a75-83c5f2b0e36f"), run from CLI:
./backup_vm.py -c engine1 download --backup-dir /tmp/pavel --backup-uuid 32c6b9b8-5e7c-418e-9a75-83c5f2b0e36f a1b9af74-bfe3-417b-9508-7efe5e068952
3) Run the command from step 2 again and again from CLI.
Every time, the number of disks shown in the UI (Compute --> Virtual Machines --> <VM name> --> Snapshots tab --> Disks subtab) will increase by a factor of 2: 1 disk --> 2 disks --> 4 disks --> 8 disks --> 16 disks --> 32 disks ...
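
For reference, a quick way to watch this from the database is to count how many
times the disk appears in the all_disks_for_vms view queried in comment 8
(illustrative query; the VM id is the one used in the steps above):

  SELECT image_group_id, count(*) AS appearances
  FROM all_disks_for_vms
  WHERE vm_id = 'a1b9af74-bfe3-417b-9508-7efe5e068952'
  GROUP BY image_group_id;
  -- per step 3 above, the count doubles after each repeated download while the
  -- completed image transfers are still kept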

Comment 20 Avihai 2021-07-18 11:48:36 UTC
(In reply to Nir Soffer from comment #17)
> Benny, I don't think we can release 4.4.7 at this broken state.
> 
> Any operation consuming disks from the database can be affected
> by the duplicate entries for many minutes.
> 
> We can still fix for 4.4.7 async.

This sounds very risky and may also need a 4.4.7 async/backport.
Eyal, how close are we to a fix? If we are far from one or it's risky, we should consider reverting the change that caused this issue.

Comment 21 Nir Soffer 2021-07-19 12:30:55 UTC
(In reply to Amit Sharir from comment #18)
> Nir, we are not sure we understand what your last comment means. 
> What exactly is the impact of what you mentioned?
> Can you mention a few flows that might be affected?

Pavel is checking which flows are affected, and will update the bug later.

Comment 23 Evelina Shames 2021-07-21 12:42:21 UTC
Tested the following flows on rhv-4.4.7-7:

1. Multiple backups as mentioned in this bug - works as expected

2. Flow mentioned in comment 19 - works as expected.
(In reply to Pavel Bar from comment #19)
> Another scenario that fails that need to be tested after the bug is fixed:
> 1) Start the backup (can be a cold backup, 1 disk, VM Guid is
> "a1b9af74-bfe3-417b-9508-7efe5e068952"), run from CLI:
> ./backup_vm.py -c engine1 start a1b9af74-bfe3-417b-9508-7efe5e068952
> 2) Download the disk (backup Guid is
> "32c6b9b8-5e7c-418e-9a75-83c5f2b0e36f"), run from CLI:
> ./backup_vm.py -c engine1 download --backup-dir /tmp/pavel --backup-uuid
> 32c6b9b8-5e7c-418e-9a75-83c5f2b0e36f a1b9af74-bfe3-417b-9508-7efe5e068952
> 3) Run the command from step 2 again and again from CLI.
> Every time the number of disks shown in UI (Compute --> Virtual Machines -->
> <VM name> --> Snapshots tab --> Disks subtab) will increase by the factor of
> 2: 1 disk --> 2 disks --> 4 disks --> 8 disks --> 16 disks --> 32 disks ...


Additional flows:

3. Create/Remove Snapshot
Scenario (run once with running VM and once with non-running VM):
- Create a VM and attach a 1G disk on iscsi
- Create 2 snapshots: s1, s2
- Run twice image transfer (from CLI): (shutdown the VM if it was running)
   ./download_disk.py -c engine1 --debug 589a966d-05ef-4613-8109-c620b3d05bf4 /tmp/pavel/download.qcow2
- Delete s1 (from UI)
> Before the fix - a pop-up message: “VM1: General command validation failure”, errors in the logs, the disk stays in “Locked” status, and duplicate disk entries in the UI.
> After the fix - operation succeeded, no errors in the UI/logs.

4. Move/Copy (with snapshots) 
Scenario (run once with running VM and once with non-running VM):
- Create a VM and attach a 1G disk on iscsi
- Run the VM if needed
- Create snapshot: s1
- Run twice image transfer (from CLI): (shutdown the VM if it was running)
   ./download_disk.py -c engine1 --debug 589a966d-05ef-4613-8109-c620b3d05bf4 /tmp/pavel/download.qcow2
> Before the fix - the disk stays in “Locked” status, duplicate disk entries in the UI, errors in the logs.
> After the fix - one entry in the UI, no errors in the logs.

Move/Copy that disk to another SD (from UI):
> Before the fix - duplicate disk entries in the UI, errors in the logs, and a pop-up message that looks successful although the operation failed: “Finished Moving Disk VM1_Disk1 from NFS_SD1 to NFS_SD3.”
> After the fix:
For the MOVE operation - the operation succeeded, no errors in the logs, one disk entry.
For the COPY operation - the src disk is locked forever - known bug 1983414

5. Snapshot preview / commit / restore  
Scenario (run once with running VM and once with non-running VM):
- Create a VM and attach a 1G disk on iscsi
- Run the VM if needed
- Run twice image transfer (from CLI): (shutdown the VM if it was running)
   ./download_disk.py -c engine1 --debug 589a966d-05ef-4613-8109-c620b3d05bf4 /tmp/pavel/download.qcow2
- Create 2 snapshots: s1, s2
- Preview s1:
> Before the fix - duplicate disk entries in the UI, errors in the logs, the ‘Preview’ button doesn't respond.
> After the fix - one entry in the UI, no errors in the logs; the Preview button works, and commit & preview work as well.

6. Export VM to Data Domain (with snapshots, with/out collapse) 
Scenario (run with non-running VM):
- Create a VM and attach a disk on iscsi.
- Create a snapshot (from UI)
- Run twice image transfer (from CLI):
  ./download_disk.py -c engine1 --debug 52c8dd88-b140-4259-a3ec-c9c17e3dca27
 /tmp/pavel/download.qcow2
> Before the fix - duplicate disk entries in the UI
> After the fix - one entry in the UI

- Export the VM to a data domain once with collapse and once without:
> Before the fix - the operation breaks, with errors in the logs
> After the fix - everything looks OK and works as expected


Now, all that is left is to verify that there are no regressions.
Our regression automation tests are running - I'll update once we're done analyzing them.

Eyal/Benny/Pavel - pls ACK/NACK if these flows are enough or there is anything else to test.
Thanks

Comment 25 Evelina Shames 2021-07-22 10:38:11 UTC
(In reply to Evelina Shames from comment #23)
> Tested the following flows on rhv-4.4.7-7:
> [...]
> Now, all that is left is to verify that there are no regressions.
> Our regression automation tests are running - I'll update once we're done
> analyzing them.
We are done analyzing tier2+3 - no new issues/regressions were found.

Comment 26 Benny Zlotnik 2021-07-22 10:39:35 UTC
(In reply to Evelina Shames from comment #23)
> Tested the following flows on rhv-4.4.7-7:
> [...]
> Eyal/Benny/Pavel - pls ACK/NACK if these flows are enough or there is
> anything else to test.
> Thanks

acked

Comment 27 Evelina Shames 2021-07-22 10:40:52 UTC
Moving to VERIFIED.

Comment 31 errata-xmlrpc 2021-07-22 15:14:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (0-day RHV Manager (ovirt-engine) [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2868

