Bug 1900594 - [CBT][incremental backup] Engine fail to stop backup if VM was destroyed during backup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4
Assignee: Eyal Shenitzky
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-23 12:31 UTC by Nir Soffer
Modified: 2021-01-12 16:23 UTC (History)
4 users

Fixed In Version: ovirt-engine-4.4.4.5
Clone Of:
Environment:
Last Closed: 2021-01-12 16:23:59 UTC
oVirt Team: Storage
Embargoed:
tnisan: exception?


Attachments (Terms of Use)
Logs from 3 backups showing this issue (52.36 KB, application/gzip)
2020-11-23 12:31 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 112485 0 master MERGED core: persist backup host UUID 2020-12-27 13:35:19 UTC
oVirt gerrit 112486 0 master MERGED Backup: add the host that was used for the backup 2020-12-27 13:34:43 UTC
oVirt gerrit 112605 0 master MERGED Upgrade to model 4.4.22 2020-12-27 13:34:45 UTC

Description Nir Soffer 2020-11-23 12:31:59 UTC
Created attachment 1732487 [details]
Logs from 3 backups showing this issue

Description of problem:

If a VM was destroyed during backup, the engine fails to stop the backup with
this error:

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 b7de98d7-914c-4c24-a9dd-a3202505e511
[   0.0 ] Finalizing backup b7de98d7-914c-4c24-a9dd-a3202505e511
Traceback (most recent call last):
  File "./backup_vm.py", line 431, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 247, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "./backup_vm.py", line 329, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.

In engine log we see:

2020-11-22 18:08:36,409+02 ERROR [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [59e594ba-825c-41c6-9b9e-0259ca331393] Command 'org.ovirt.engine.core.bll.StopVmBackupCommand' failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2020-11-22 18:08:36,409+02 ERROR [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [59e594ba-825c-41c6-9b9e-0259ca331393] Exception: java.lang.RuntimeException: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
90%

Steps to Reproduce:
1. Start a full backup
2. Wait until download starts
3. In the guest, poweroff
4. In most cases qemu hangs and the VM moves to an unknown state. In some cases qemu exits
   and the system detects that the VM is not running before trying to stop the backup.
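
Step 2 above ("wait until download starts") amounts to polling the backup phase until it reaches READY. The helper below is a hedged sketch: the polling pattern mirrors what the ovirtsdk4 backup service supports, but it works against any object exposing `.get().phase`, so it can be exercised without a live engine; the timeout and interval values are illustrative assumptions.

```python
import time

def wait_for_phase(backup_service, wanted, timeout=60, interval=1.0):
    """Poll backup_service.get().phase until it equals `wanted`.

    `backup_service` is any object with the shape of an ovirtsdk4
    VmBackupService: .get() returns an object with a .phase attribute.
    Raises TimeoutError if the phase is not reached in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = backup_service.get().phase
        if phase == wanted:
            return phase
        time.sleep(interval)
    raise TimeoutError(f"backup did not reach phase {wanted!r}")
```

With a real connection, `backup_service` would come from the SDK's service locators and `wanted` would be `types.BackupPhase.READY`.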

Actual results:
Stopping the backup fails with a "Host not found" error.
The scratch disks are left on the host, which may eventually fill the host's disk.

Expected results:
The engine succeeds in stopping the backup even if the VM is not running,
and the scratch disks are removed.

Additional info:

The issue is that we assume a VM is always running during backup, so
storing the VM id in the vm_backups table is enough to locate the host
when stopping the backup. But this assumption breaks when a VM is destroyed
in the middle of a backup, for example when a user powers off the VM from
inside the guest. The system detects that the VM was terminated, so trying
to access the VM's vds id returns an empty id.

engine=# select * from vm_backups;
              backup_id               | from_checkpoint_id | to_checkpoint_id |                vm_id                 | phase |        _create_date
--------------------------------------+--------------------+------------------+--------------------------------------+-------+---------------------------
 4ab7aec5-cc29-4a15-921b-d78180473b14 |                    |                  | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2020-11-22 17:20:06.204+02

We need to add the host id to the vm_backups table, so stopping a backup does not
depend on having a running VM.
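
The proposed fix boils down to a host-resolution fallback: prefer the host the VM currently runs on, and fall back to the host id persisted with the backup when the VM is gone and its vds id comes back empty. This is a minimal Python illustration of that idea, not the engine's actual Java code; the function and parameter names are hypothetical.

```python
# The engine represents a missing host as the all-zero UUID
# (as seen in the RESOURCE_MANAGER_VDS_NOT_FOUND error above).
EMPTY_UUID = "00000000-0000-0000-0000-000000000000"

def resolve_backup_host(vm_run_on_vds, backup_host_id):
    """Return the host id to use when stopping a backup.

    Prefer the VM's current vds id; fall back to the host id persisted
    in vm_backups when the VM was destroyed mid-backup.
    """
    if vm_run_on_vds and vm_run_on_vds != EMPTY_UUID:
        return vm_run_on_vds
    if backup_host_id and backup_host_id != EMPTY_UUID:
        return backup_host_id
    raise LookupError("Host not found for backup")
```

On the affected versions only the first lookup existed, which is why a destroyed VM produced "Vds with id: 00000000-0000-0000-0000-000000000000 was not found".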

This must be fixed to make incremental backup fully supported.

Comment 1 Ilan Zuckerman 2020-12-29 07:07:02 UTC
Verified on rhv-4.4.4-6

> Steps to reproduce with backup_vm.py:
> 
> 1. Create a VM with a disk enabled for incremental backup.
> 2. Start the VM.
> 3. Start a backup for the VM using - backup_vm.py -c engine start <vm-id>

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine start 89502d53-710a-4cd2-85ab-35a934f5f306
[   0.0 ] Starting full backup for VM '89502d53-710a-4cd2-85ab-35a934f5f306'
[   2.7 ] Waiting until backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c is ready
[   5.8 ] Created checkpoint 'd7523bd4-dd15-45ce-9cef-7430693344b8' (to use in --from-checkpoint-uuid for the next incremental backup)
[   6.1 ] Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c is ready


> 4. Power off the VM.
> 5. Stop the VM backup using - backup_vm.py -c engine stop <vm-id> <backup-uuid>

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine stop 89502d53-710a-4cd2-85ab-35a934f5f306 bfbb9494-dc76-47d6-ac02-25a619bc6f8c
[   0.0 ] Finalizing backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c
[   0.4 ] Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c has finalized
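
Under the hood, the stop command locates the backup service for the VM and calls finalize() on it. The sketch below uses the public ovirtsdk4 service-locator names seen in the traceback above; the wrapper function itself is hypothetical, and it accepts any connection-shaped object so the call chain can be verified without a live engine.

```python
def stop_backup(connection, vm_id, backup_id):
    """Finalize (stop) a VM backup via the oVirt SDK service locators."""
    backup_service = (connection.system_service()
                      .vms_service()
                      .vm_service(vm_id)
                      .backups_service()
                      .backup_service(backup_id))
    # With the fix, the engine resolves the host from the persisted
    # backup record, so this succeeds even if the VM is down.
    backup_service.finalize()
    return backup_service
```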



Actual result (as expected):

The engine succeeds in stopping the backup even if the VM is not running:
engine=# select * from vm_backups;
(0 rows)



Engine reports:

EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM test finalized 
.
.
2020-12-16 16:31:27,190+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-4) [d3716982-6cf4-44c6-9503-ebfea5323545] EVENT_ID: VM_BACKUP_SUCCEEDED(10,793), Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c for VM test completed successfully (User: admin@internal-authz).

Comment 2 Sandro Bonazzola 2021-01-12 16:23:59 UTC
This bugzilla is included in the oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

