Created attachment 1836585 [details]
logs

Description of problem:
During verification of bug 2015121 a new issue was found: during a full backup taken while the guest is writing to a thin disk (so the disk is being extended), finalizing the backup fails.

Versions:
ovirt-engine-4.4.9.3-0.3.el8ev.noarch
vdsm-4.40.90.3-1.el8ev.x86_64
libvirt-7.6.0-5.module+el8.5.0+12933+58cb48a1

Steps to reproduce:
1. Start a VM with a 20g thin disk (incremental backup enabled)
2. In the VM, start a dd process writing 20g to the thin disk
   # dd if=/dev/zero bs=1M count=20480 of=/dev/sda oflag=direct conv=fsync status=progress
3. While dd is writing, start a full backup
   ./backup_vm.py -c engine full dc411543-5a9c-440e-9567-12b83214ea28

Expected results:
Finalizing the backup should complete successfully.

Actual results:
./backup_vm.py -c engine full b5621374-29b7-4d67-8928-9a8237699648
[ 0.0 ] Starting full backup for VM 'b5621374-29b7-4d67-8928-9a8237699648'
[ 1.7 ] Waiting until backup 'b7135713-df89-4325-8438-ec2e6bee7ec4' is ready
[ 29.3 ] Created checkpoint 'cc2580d6-b905-47e4-8b68-f202ba2e9472' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 29.4 ] Downloading full backup for disk '684cb233-c78c-4225-8356-1f106719694c'
[ 30.7 ] Finalizing backup
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 207, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 407, in download_backup
    incremental=has_incremental)
  File "./backup_vm.py", line 425, in download_disk
    timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 235, in create_transfer
    .format(transfer.id, transfer.phase))
RuntimeError: Unexpected transfer 284a6ee2-0c19-4a0a-8799-a23f1564c9e0 phase finalizing_failure
The reproduction of this issue is not entirely clear. I assume it may reproduce when "enable incremental backup" is checked for the created disk: I could reproduce it 3 times with "enable incremental backup" marked, and when I disabled it, it did not happen (I am not sure this is really the trigger).
Also reproduces in 4.4.8:

# ./backup_vm.py -c engine full 36653930-1a0a-4ecf-97af-da3f36c00e8d
[ 0.0 ] Starting full backup for VM '36653930-1a0a-4ecf-97af-da3f36c00e8d'
[ 1.3 ] Waiting until backup '71929277-32e3-469f-a111-f3545b3218ca' is ready
[ 25.8 ] Created checkpoint '844c83fe-c987-4787-8e0b-41c593ed1355' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 25.8 ] Downloading full backup for disk '53c73291-ff86-43c9-8d0a-df6690630dd6'
[ 27.3 ] Finalizing backup
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 207, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 407, in download_backup
    incremental=has_incremental)
  File "./backup_vm.py", line 425, in download_disk
    timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 235, in create_transfer
    .format(transfer.id, transfer.phase))
RuntimeError: Unexpected transfer a9bef242-484a-4f04-b771-1f1055b492a1 phase finalizing_failure

Versions:
4.4.8.6-0.1.el8ev
vdsm-4.40.80.6-1.el8ev.x86_64
libvirt-7.0.0-14.4.module+el8.4.0+12413+d23780e6
2021-10-21 16:59:36,412+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] START, AddImageTicketVDSCommand(HostName = host_mixed_2, AddImageTicketVDSCommandParameters:{hostId='eefe7565-0aa2-4e75-bc10-7e99817aaef9', ticketId='e5dcc834-aa06-4c84-aa0f-dc36433dad53', timeout='300', operations='[read]', size='21474836480', url='null', filename='null', sparse='false', transferId='284a6ee2-0c19-4a0a-8799-a23f1564c9e0', dirty='false'}), log id: 27770105
2021-10-21 16:59:36,428+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] Failed in 'AddImageTicketVDS' method
2021-10-21 16:59:36,434+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command AddImageTicketVDS failed: Image daemon request failed: "status=400, reason=Bad Request, error=Invalid ticket: Invalid value for 'url': None: expecting a <class 'str'> value\n"

The problem is that the url was empty. This happened because the VM was running but got paused, so StartVmBackup performed a cold backup, while TransferDiskImage expects a live backup because the VM is not down [1]:

    private boolean isLiveBackup() {
        return isBackup() && getBackupVm() != null && !getBackupVm().isDown();
    }

[1] https://github.com/oVirt/ovirt-engine/blob/e921ac72c181872004c02db674ca835470149894/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/disk/image/TransferDiskImageCommand.java#L1123
(In reply to sshmulev from comment #0)
...
> Steps to reproduce:
> 1. Start a VM with 20g thin disk (enable incremental)
> 2. In the VM, starts dd process, writing 20g to the think disk
...
> 3. during the writing started backup
>
> ./backup_vm.py -c engine full dc411543-5a9c-440e-9567-12b83214ea28
...
> Actual results:
> ./backup_vm.py -c engine full b5621374-29b7-4d67-8928-9a8237699648
> [ 0.0 ] Starting full backup for VM 'b5621374-29b7-4d67-8928-9a8237699648'
> [ 1.7 ] Waiting until backup 'b7135713-df89-4325-8438-ec2e6bee7ec4' is
> ready
> [ 29.3 ] Created checkpoint 'cc2580d6-b905-47e4-8b68-f202ba2e9472' (to use
> in --from-checkpoint-uuid for the next incremental backup)
> [ 29.4 ] Downloading full backup for disk
> '684cb233-c78c-4225-8356-1f106719694c'
> [ 30.7 ] Finalizing backup
> Traceback (most recent call last):
...
> File "./backup_vm.py", line 407, in download_backup
> incremental=has_incremental)
> File "./backup_vm.py", line 425, in download_disk
> timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
> File
> "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py",
> line 235, in create_transfer
> .format(transfer.id, transfer.phase))
> RuntimeError: Unexpected transfer 284a6ee2-0c19-4a0a-8799-a23f1564c9e0 phase
> finalizing_failure

This is not an error in finalizing the backup, but an error in downloading the disk 684cb233-c78c-4225-8356-1f106719694c. This is a high severity issue.

I'm missing more info on the system state after this failure (a way to check is sketched below):
- Did the transfer finish?
- Did the backup finish?

If a transfer or backup gets stuck, the disk and VM remain locked; these are urgent issues that should never happen.
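A minimal sketch (not part of the original report) of how both questions can be checked with the Python SDK; the engine URL, credentials, and VM id below are placeholders:

import ovirtsdk4 as sdk

# Placeholder connection details - adjust to the environment under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
system_service = connection.system_service()

# Any transfer still listed with a non-final phase did not finish cleanly.
for transfer in system_service.image_transfers_service().list():
    print('transfer', transfer.id, 'phase', transfer.phase)

# List the VM's backups and their phases (an empty list here means no backup is still running).
vm_service = system_service.vms_service().vm_service('b5621374-29b7-4d67-8928-9a8237699648')
for backup in vm_service.backups_service().list():
    print('backup', backup.id, 'phase', backup.phase)

connection.close()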
(In reply to Benny Zlotnik from comment #3)
> The problem is that url was empty, this happened because the VM was running
> but was paused, so StartVmBackup performed cold backup but TransferDiskImage
> expects it to be a live backup because the VM is not down[1]
>
>     private boolean isLiveBackup() {
>         return isBackup() && getBackupVm() != null &&
> !getBackupVm().isDown();
>     }

This is a very bad bug that can cause data corruption. A cold backup should be used only when the VM is DOWN, never when the VM is PAUSED or in any other state that is not DOWN.

We have 3 states (a client-side illustration follows below):
- live backup: the VM is UP
- cold backup: the VM is DOWN
- backup not possible: any other state
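A client-side illustration of that three-state policy, as a hedged sketch only (this is not the engine fix; the connection details and VM id are placeholders):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

# Placeholder VM id.
vm = connection.system_service().vms_service().vm_service('VM-UUID').get()

if vm.status == types.VmStatus.UP:
    mode = 'live'    # live backup of the running VM
elif vm.status == types.VmStatus.DOWN:
    mode = 'cold'    # cold backup of the stopped VM
else:
    raise RuntimeError('Backup not possible while the VM is {}'.format(vm.status))

print('Would run a {} backup'.format(mode))
connection.close()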
(In reply to Nir Soffer from comment #4)
> (In reply to sshmulev from comment #0)
> ...
> > Steps to reproduce:
> > 1. Start a VM with 20g thin disk (enable incremental)
> > 2. In the VM, starts dd process, writing 20g to the think disk
> ...
> > 3. during the writing started backup
> >
> > ./backup_vm.py -c engine full dc411543-5a9c-440e-9567-12b83214ea28
> ...
> > Actual results:
> > ./backup_vm.py -c engine full b5621374-29b7-4d67-8928-9a8237699648
> > [ 0.0 ] Starting full backup for VM 'b5621374-29b7-4d67-8928-9a8237699648'
> > [ 1.7 ] Waiting until backup 'b7135713-df89-4325-8438-ec2e6bee7ec4' is
> > ready
> > [ 29.3 ] Created checkpoint 'cc2580d6-b905-47e4-8b68-f202ba2e9472' (to use
> > in --from-checkpoint-uuid for the next incremental backup)
> > [ 29.4 ] Downloading full backup for disk
> > '684cb233-c78c-4225-8356-1f106719694c'
> > [ 30.7 ] Finalizing backup
> > Traceback (most recent call last):
> ...
> > File "./backup_vm.py", line 407, in download_backup
> > incremental=has_incremental)
> > File "./backup_vm.py", line 425, in download_disk
> > timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
> > File
> > "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py",
> > line 235, in create_transfer
> > .format(transfer.id, transfer.phase))
> > RuntimeError: Unexpected transfer 284a6ee2-0c19-4a0a-8799-a23f1564c9e0 phase
> > finalizing_failure
>
> This is not an error in finalizing the backup, but an error in downloading
> the disk 684cb233-c78c-4225-8356-1f106719694c. This is a high severity
> issue.
>
> I'm missing more info on the system state after this failure:
> - Did the transfer finish?
> - Did the backup finish?
>
> If a transfer or backup get stuck, the disk and vm remain locked, these are
> urgent issues that should never happen.

@Sophie, can you please check this and answer Nir?
Adding more clarifications for reproducing the bug:

1) The "enable incremental backup" setting of the disk does not influence it - the issue can also happen when it is disabled.

2) When it fails in the transfer state, the disk's transfer stays in the "finalizing_cleanup" state:

/api/imagetransfers

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<image_transfers>
    <image_transfer href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9" id="27ee03f2-9da7-4635-a2dc-c6e0dd4525a9">
        <actions>
            <link href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9/resume" rel="resume"/>
            <link href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9/cancel" rel="cancel"/>
            <link href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9/pause" rel="pause"/>
            <link href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9/extend" rel="extend"/>
            <link href="/ovirt-engine/api/imagetransfers/27ee03f2-9da7-4635-a2dc-c6e0dd4525a9/finalize" rel="finalize"/>
        </actions>
        <active>false</active>
        <direction>download</direction>
        <format>raw</format>
        <inactivity_timeout>60</inactivity_timeout>
        <phase>finalizing_cleanup</phase>
        <shallow>false</shallow>
        <timeout_policy>cancel</timeout_policy>
        <transferred>0</transferred>
        <image id="5878643b-20d4-48aa-807a-1001418ae031"/>
    </image_transfer>
</image_transfers>

3) The backup finished - checked via the REST API: /ovirt-engine/api/vms/vm-id/backups was empty of backups.

4) The disk status is OK in the REST API, only the transfer is marked "finalizing_cleanup" - meaning the disks are not locked.

So moving the severity back to high.
So we have a transfer stuck in a non-final state.

This can break code expecting a single image transfer per disk at the same time.

In 4.4.7 we fixed a horrible bug where the same disk was reported multiple times in both the UI and the API, because we had finished transfers and the system was not filtering out finished transfers when querying the database.

Please test:

1. We can perform more backups for this VM when we have a transfer stuck in this phase.
2. We can download that disk (after stopping the VM).
3. If we can download the disk, do we report the right status for the disk during the download, or does it report "finalizing_cleanup"?
4. If we can download the disk, do we see duplicate disks in the UI/API? (A sketch for this check follows below.)
   UI: vms > vm name -> disks
   API: /api/vms/vm-id/diskattachments/
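A minimal sketch (not from the original thread) for check 4 on the API side, listing the VM's disk attachments and flagging duplicate disk ids; the connection details and VM id are placeholders:

from collections import Counter

import ovirtsdk4 as sdk

# Placeholder connection details and VM id.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

attachments_service = (
    connection.system_service()
    .vms_service()
    .vm_service('VM-UUID')
    .disk_attachments_service()
)

# /api/vms/vm-id/diskattachments/ - each attachment references one disk id.
disk_ids = [attachment.disk.id for attachment in attachments_service.list()]
duplicates = [disk_id for disk_id, count in Counter(disk_ids).items() if count > 1]
print('duplicate disk ids:', duplicates or 'none')

connection.close()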
(In reply to Nir Soffer from comment #8)
> So we have a transfer stuck in non final state.
>
> This can break code expecting single image transfer per disk at the same
> time.
>
> In 4.4.7 we fixed a horrible bug when same disk was reported multiple times
> both in UI/API, because we had finished transfers, and the system was not
> filtering out finished transfers in when querying the database.
>
> Please test:
>
> 1. We can perform more backups for this vm when we have a transfer stuck
>    in this phase

Seems like a problem: I repeatedly get this error when trying to create another backup while the disk is in this state:

# ./backup_vm.py -c engine full 2cc0f3d4-8328-4949-813a-07e1a6261576
[ 0.0 ] Starting full backup for VM '2cc0f3d4-8328-4949-813a-07e1a6261576'
[ 1.4 ] Waiting until backup '8293c90f-4fbd-42d3-be51-fc832e92c219' is ready
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 205, in cmd_full
    backup = start_backup(connection, args)
  File "./backup_vm.py", line 347, in start_backup
    backup = get_backup(connection, backup_service, backup.id)
  File "./backup_vm.py", line 519, in get_backup
    .format(backup_uuid, last_event))
RuntimeError: Backup 8293c90f-4fbd-42d3-be51-fc832e92c219 has failed, last reported event: {'code': 10792, 'description': 'Backup 8293c90f-4fbd-42d3-be51-fc832e92c219 for VM VM_2016913 failed (User: admin@internal-authz).'}

> 2. We can download that disk (after stopping the VM)

Passed:

[root@storage-ge3-vdsm3 examples]# python3.6 download_disk.py -c engine 5878643b-20d4-48aa-807a-1001418ae031 /tmp/bla
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 2.9 ] Transfer ID: 99a7c2f1-a5fe-4a40-a238-f907671b908f
[ 2.9 ] Transfer host name: host_mixed_3
[ 2.9 ] Downloading disk...
[ 100.00% ] 20.00 GiB, 51.01 seconds, 401.50 MiB/s
[ 53.9 ] Finalizing image transfer...
[ 57.9 ] Download completed successfully
[root@storage-ge3-vdsm3 examples]#

> 3. If we can download the disk, do we report the right status for
>    the disk during the download, or it reports "finalizing_cleanup"?

We can download the disk, although another instance of this disk appears during the download (and disappears once the download is done). The status is reported correctly for the new, temporary one, but the original transfer still keeps the "finalizing_cleanup" state - it seems the disk is stuck in that state.

> 4. If we can download the disk, do we see duplicate disks in the UI/API?
>    UI: vms > vm name -> disks
>    API: /api/vms/vm-id/diskattachments/

The duplicate disks appear only during the download process itself, and then they are gone.
Bumping severity based on comment 9.
(In reply to Benny Zlotnik from comment #3)
> 2021-10-21 16:59:36,412+03 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand]
> (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] START,
> AddImageTicketVDSCommand(HostName = host_mixed_2,
> AddImageTicketVDSCommandParameters:{hostId='eefe7565-0aa2-4e75-bc10-
> 7e99817aaef9', ticketId='e5dcc834-aa06-4c84-aa0f-dc36433dad53',
> timeout='300', operations='[read]', size='21474836480', url='null',
> filename='null', sparse='false',
> transferId='284a6ee2-0c19-4a0a-8799-a23f1564c9e0', dirty='false'}), log id:
> 27770105
> 2021-10-21 16:59:36,428+03 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand]
> (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] Failed in
> 'AddImageTicketVDS' method
> 2021-10-21 16:59:36,434+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (default task-5) [3424710f-6a2a-4d41-8a03-5ae96996053a] EVENT_ID:
> VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command
> AddImageTicketVDS failed: Image daemon request failed: "status=400,
> reason=Bad Request, error=Invalid ticket: Invalid value for 'url': None:
> expecting a <class 'str'> value\n"
>
> The problem is that url was empty, this happened because the VM was running
> but was paused, so StartVmBackup performed cold backup but TransferDiskImage
> expects it to be a live backup because the VM is not down[1]
>
>     private boolean isLiveBackup() {
>         return isBackup() && getBackupVm() != null &&
> !getBackupVm().isDown();
>     }
>
> [1]
> https://github.com/oVirt/ovirt-engine/blob/
> e921ac72c181872004c02db674ca835470149894/backend/manager/modules/bll/src/
> main/java/org/ovirt/engine/core/bll/storage/disk/image/
> TransferDiskImageCommand.java#L1123

Thanks Benny, you're right here.

The backup started when the VM was in the 'UP' state, and the VM moved to 'PAUSED' exactly when the scratch disks for the live backup were created, just before the backup operation started.

Scratch disk creation ended (for live backup only):

2021-10-21 16:59:33,171+03 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-72) [d0b2e1ad-d59e-4b33-aa71-8f3cd2f34fa5] Scratch disks prepared for the backup

The VM moved to the 'PAUSED' state:

2021-10-21 16:59:34,746+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [7858da35] VM 'b5621374-29b7-4d67-8928-9a8237699648'(New_VM) moved from 'Up' --> 'Paused'
2021-10-21 16:59:34,762+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [7858da35] EVENT_ID: VM_PAUSED(1,025), VM New_VM has been paused.
2021-10-21 16:59:34,773+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [7858da35] EVENT_ID: VM_PAUSED_ENOSPC(138), VM New_VM has been paused due to no Storage space error.

So the engine checked again whether the backup is live or cold and proceeded accordingly; in this case the VM was 'PAUSED' and was treated as a cold backup, so a new bitmap was added by the engine instead of using libvirt:

2021-10-21 16:59:35,248+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.AddVolumeBitmapCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-85) [d0b2e1ad-d59e-4b33-aa71-8f3cd2f34fa5] Running command: AddVolumeBitmapCommand internal: true.
Now, the VM got back to the 'UP' state before the image transfer started:

2021-10-21 16:59:35,864+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-27) [7858da35] VM 'b5621374-29b7-4d67-8928-9a8237699648'(New_VM) moved from 'Paused' --> 'Up'
2021-10-21 16:59:35,879+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-27) [7858da35] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM New_VM has recovered from paused back to up.

So the image transfer operation tried to download the backup as a live backup and not a cold backup, and the disk's URL was missing.

The fix should handle all the ways a VM can enter the 'PAUSED' state during the live backup operation and continue the live backup operation as it would for a running VM. Should be a fairly simple fix.

Also, this flow can happen only if the VM paused right before the backup started, so it is a specific corner case.
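From the client's point of view, the race above falls inside the window where backup_vm.py waits for the backup to become ready. A hedged sketch (not from the original thread; connection details, VM id and backup id are placeholders) of observing both the backup phase and the VM status during that window:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
system_service = connection.system_service()
vm_service = system_service.vms_service().vm_service('VM-UUID')                 # placeholder
backup_service = vm_service.backups_service().backup_service('BACKUP-UUID')     # placeholder

# Poll until the backup is ready, logging the VM status on the way so a
# transient Up -> Paused -> Up transition (as in the logs above) is visible.
while True:
    backup = backup_service.get()
    vm = vm_service.get()
    print('backup phase:', backup.phase, 'vm status:', vm.status)
    if backup.phase == types.BackupPhase.READY:
        break
    time.sleep(1)

connection.close()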
This can happen also with incremental backup:

./backup_vm.py -c engine incremental --from-checkpoint-uuid 76e329b7-0cf0-4811-afa4-d36431bf4f75 3e5f62b4-134a-42c4-bf39-46eb34042a16
[ 0.0 ] Starting incremental backup for VM '3e5f62b4-134a-42c4-bf39-46eb34042a16'
[ 1.6 ] Waiting until backup '6676d238-2f5e-4659-897e-9684c63fdfc8' is ready
[ 18.8 ] Created checkpoint 'f4967aba-9534-4968-ae4e-e7a12ab334a5' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 18.9 ] Downloading incremental backup for disk 'cc45e439-5a0c-4fda-abc3-681743a6d0b7'
[ 20.0 ] Finalizing backup
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 225, in cmd_incremental
    download_backup(connection, backup.id, args, incremental=True)
  File "./backup_vm.py", line 407, in download_backup
    incremental=has_incremental)
  File "./backup_vm.py", line 425, in download_disk
    timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 235, in create_transfer
    .format(transfer.id, transfer.phase))
RuntimeError: Unexpected transfer 6cfd393e-8c53-4f9e-9bb1-522927ce27ec phase finalizing_failure

The same thing happens - the disk is stuck in the "finalizing_cleanup" state and it's impossible to do another incremental backup or full backup:

[ 0.0 ] Starting incremental backup for VM '3e5f62b4-134a-42c4-bf39-46eb34042a16'
[ 1.4 ] Waiting until backup 'e0231ac6-f27b-4bb6-8163-e38449088b1a' is ready
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 223, in cmd_incremental
    backup = start_backup(connection, args)
  File "./backup_vm.py", line 347, in start_backup
    backup = get_backup(connection, backup_service, backup.id)
  File "./backup_vm.py", line 519, in get_backup
    .format(backup_uuid, last_event))
RuntimeError: Backup e0231ac6-f27b-4bb6-8163-e38449088b1a has failed, last reported event: {'code': 10792, 'description': 'Backup e0231ac6-f27b-4bb6-8163-e38449088b1a for VM vm1 failed (User: admin@internal-authz).'}

./backup_vm.py -c engine full 3e5f62b4-134a-42c4-bf39-46eb34042a16
[ 0.0 ] Starting full backup for VM '3e5f62b4-134a-42c4-bf39-46eb34042a16'
[ 1.3 ] Waiting until backup '195e3422-ef92-47d6-9a2c-e5c53addd818' is ready
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 205, in cmd_full
    backup = start_backup(connection, args)
  File "./backup_vm.py", line 347, in start_backup
    backup = get_backup(connection, backup_service, backup.id)
  File "./backup_vm.py", line 519, in get_backup
    .format(backup_uuid, last_event))
RuntimeError: Backup 195e3422-ef92-47d6-9a2c-e5c53addd818 has failed, last reported event: {'code': 10792, 'description': 'Backup 195e3422-ef92-47d6-9a2c-e5c53addd818 for VM vm1 failed (User: admin@internal-authz).'}
Verified.

Versions:
vdsm-4.40.90.4-1.el8ev.x86_64
ovirt-engine-4.4.9.4-0.1.el8ev.noarch

Steps to reproduce (tried 5 times - all succeeded):
1. Start a VM with a 20g thin disk
2. In the VM, start a dd process writing 20g to the thin disk
   # dd if=/dev/zero bs=1M count=20480 of=/dev/sda oflag=direct conv=fsync status=progress
3. While dd is writing, start a backup
   ./backup_vm.py -c engine full <VM_ID>
4. ./backup_vm.py -c engine incremental --from-checkpoint-uuid <CHECKPOINT_ID> <VM_ID>

Engine logs:

2021-10-31 10:54:15,455+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-20) [bf6ca867-4e43-464a-900d-7dd525cf4853] Scratch disks created for the backup
2021-10-31 10:54:29,347+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-2) [bf6ca867-4e43-464a-900d-7dd525cf4853] Scratch disks prepared for the backup
2021-10-31 10:54:18,574+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-31) [3812da67] VM 'c453af58-10c8-4d7d-851a-605ec72fee4a'(vm3) moved from 'Up' --> 'Paused'
2021-10-31 10:54:19,450+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-17) [3812da67] VM 'c453af58-10c8-4d7d-851a-605ec72fee4a'(vm3) moved from 'Paused' --> 'Up'
2021-10-31 10:54:31,843+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Creating ImageTransfer entity for command '238b39c2-d277-4c8b-a62f-2ffd30970805', proxyEnabled: true
2021-10-31 10:54:31,853+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Successfully added Download disk 'latest-rhel-guest-image-8.5-infra' (disk id: '2b4ce1b9-c3e5-44d2-a8cb-133cd53349b0', image id: '090d0bda-4d08-4d5a-8e3b-fe6c820d8cc0') for image transfer command '238b39c2-d277-4c8b-a62f-2ffd30970805'
2021-10-31 10:54:31,887+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] START, AddImageTicketVDSCommand(HostName = host_mixed_2, AddImageTicketVDSCommandParameters:{hostId='281e0d9c-c7fd-4a19-9a4c-dd49a3379326', ticketId='9ef86998-500e-4c73-85ca-9e392aba8aa7', timeout='300', operations='[read]', size='10737418240', url='nbd:unix:/run/vdsm/backup/7b8f5974-0745-4856-84b6-50ce13061b12:exportname=vda', filename='null', sparse='false', transferId='238b39c2-d277-4c8b-a62f-2ffd30970805', dirty='false'}), log id: 342b3b4c
2021-10-31 10:54:31,898+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] FINISH, AddImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 342b3b4c
2021-10-31 10:54:31,898+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Started transfer session with ticket id 9ef86998-500e-4c73-85ca-9e392aba8aa7, timeout 300 seconds
2021-10-31 10:54:31,898+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Adding proxy image ticket, id 9ef86998-500e-4c73-85ca-9e392aba8aa7
2021-10-31 10:54:31,902+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26)
[91924898-e812-4f54-8253-1c7132bfaead] Started transfer session with transfer id 238b39c2-d277-4c8b-a62f-2ffd30970805, timeout 300 seconds
2021-10-31 10:54:31,911+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Updating image transfer 238b39c2-d277-4c8b-a62f-2ffd30970805 (image 2b4ce1b9-c3e5-44d2-a8cb-133cd53349b0) phase to Transferring
2021-10-31 10:54:31,912+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] Returning from proceedCommandExecution after starting transfer session for image transfer command '238b39c2-d277-4c8b-a62f-2ffd30970805'
2021-10-31 10:54:31,922+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-26) [91924898-e812-4f54-8253-1c7132bfaead] EVENT_ID: TRANSFER_IMAGE_INITIATED(1,031), Image Download with disk latest-rhel-guest-image-8.5-infra was initiated by admin@internal-authz.
This bugzilla is included in the async update of oVirt engine (4.4.9.4) for the oVirt 4.4.9 release, published on October 28th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.