Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1900518

Summary: [CBT][incremental backup] VM.stop_backup should succeed if VM does not exist
Product: [oVirt] vdsm
Component: Core
Version: 4.40.37
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: ovirt-4.4.9
Target Release: ---
Reporter: Nir Soffer <nsoffer>
Assignee: Eyal Shenitzky <eshenitz>
QA Contact: Amit Sharir <asharir>
Docs Contact:
CC: aefrat, asharir, bugs, eshames, eshenitz, michal.skrivanek
Keywords: ZStream
Flags: pm-rhel: ovirt-4.4+, asharir: testing_plan_complete+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: log showing typical failure (flags: none)

Description Nir Soffer 2020-11-23 09:23:34 UTC
Created attachment 1732400 [details]
log showing typical failure

Description of problem:

If the VM was destroyed during a backup, for example powered off from the
guest or the qemu process killed, the backup job is aborted and there is no
backup to stop.

In this case vdsm fails with:

2020-11-22 18:37:25,033+0200 ERROR (jsonrpc/5) [api] FINISH stop_backup error=VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet or was undefined (api:134)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 671, in stop_backup
    return self.vm.stop_backup(backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 70, in wrapper
    return f(*a, **kw)
  File "<decorator-gen-265>", line 2, in stop_backup
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4006, in stop_backup
    return backup.stop_backup(self, dom, backup_id=backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 211, in stop_backup
    _get_backup_xml(vm.id, dom, backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 372, in _get_backup_xml
    backup_xml = dom.backupGetXMLDesc()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet or was undefined

This special condition, when the VM is not defined, should be detected and
handled as success.

VM.stop_backup should log a message about the non-existing vm, and do the
normal cleanup (e.g. removing scratch disks).

With the current behavior the engine can never clean up after the backup,
and the scratch disks are left in
/var/lib/vdsm/storage/transientdisk/vm-id/backup-id. These files are never
deleted and will eventually fill up the disk.
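The proposed handling can be sketched roughly as follows. This is a minimal illustration, not vdsm's actual code: NotConnectedError stands in for vdsm.virt.virdomain.NotConnectedError from the traceback above, and remove_scratch_disks is a hypothetical helper representing the cleanup described here.

```python
import logging

log = logging.getLogger("backup")

class NotConnectedError(Exception):
    """Stand-in for vdsm.virt.virdomain.NotConnectedError."""

def remove_scratch_disks(vm_id, backup_id):
    # Hypothetical cleanup helper: in vdsm the scratch disks live under
    # /var/lib/vdsm/storage/transientdisk/<vm-id>/<backup-id>.
    log.info("Removing scratch disks for backup %s of VM %s", backup_id, vm_id)

def stop_backup(vm, backup_id):
    try:
        vm.dom.backupGetXMLDesc()  # raises if the VM is no longer defined
        # ... the normal backup-stop flow would run here ...
    except NotConnectedError:
        # The VM was destroyed during the backup, so there is no backup
        # to stop: log it and fall through to cleanup instead of failing.
        log.info("VM %s is not defined, no backup to stop", vm.id)
    remove_scratch_disks(vm.id, backup_id)
    return {"status": {"code": 0, "message": "Done"}}
```

The key point is that NotConnectedError is swallowed rather than propagated, so the engine always gets a success response and can finish its own cleanup.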

Version-Release number of selected component (if applicable):
vdsm-4.40.35.1-202010291125.gita43f00858.el8.x86_64
(any version supporting backup has this issue)

How reproducible:
Very hard to reproduce (1 of 8 attempts).
Usually this flow leads to a hang in qemu.

Steps to Reproduce:
1. Start full backup
2. In the guest, run poweroff

Actual results:
The engine fails to stop the backup since the VM was destroyed.

Expected results:
Stopping the backup succeeds since there is no VM (and thus no backup),
and the scratch disks are deleted.

Additional info:
The current code assumes that the VM must exist during a backup. This is
true in general, but not in all cases.

Comment 1 Nir Soffer 2020-11-23 10:15:09 UTC
This blocks a Veeam bug; let's try to fix this for 4.4.5.

Comment 3 Amit Sharir 2021-10-04 07:11:21 UTC
Following my conversation with Eyal yesterday, adding the relevant flow I tried to do for verification.

Version:
vdsm-4.40.90.1-1.el8ev.x86_64
ovirt-engine-4.4.9-0.6.el8ev.noarch

Steps:
1. Create a VM
2. Add disk to the VM
3. Start "full" backup via SDK script (python3 backup_vm.py -c engine full <vm_id>)
4. During the initial stage of the backup - "poweroff" the VM (via guest terminal)

Actual results:
The backup got stuck in the waiting stage with the following message:


[   0.0 ] Starting full backup for VM '737e2228-e86c-4527-bc44-a626d01a77be'
[   2.6 ] Waiting until backup '6cac5b94-0afa-4b1a-a174-e92bffe624db' is ready

When trying to finalize the stuck backup (via the POST API call "https://<env>/ovirt-engine/api/vms/<vm_id>/backups/<backup_id>/finalize"), I got an error response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
    <detail>[Cannot stop VM backup. The VM backup is not in READY phase, backup phase is PREPARING_SCRATCH_DISK.]</detail>
    <reason>Operation Failed</reason>
</fault>

Comment 4 Amit Sharir 2021-10-04 12:06:06 UTC
(In reply to Amit Sharir from comment #3)
> Following my conversation with Eyal yesterday, adding the relevant flow I
> tried to do for verification.
...

Following this comment - returning this bug to "assigned", since it failed QA.

Comment 5 Amit Sharir 2021-10-11 06:10:56 UTC
Version: 
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow: 
I did the same flow mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1900518#c3

Verification Conclusions:
The actual output matched the expected output.
The full flow completed without errors: I was able to finalize the backup (via API call) after powering off the VM from the guest terminal.


Bug verified.

Comment 6 Sandro Bonazzola 2021-10-21 07:27:14 UTC
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 7 Nir Soffer 2021-12-20 16:55:35 UTC
The fix for this bug causes a regression (bug 1985746) when a
backup is stopped after it was already finished.

The current backup code is not ready to stop a backup if the backup
is not in the READY phase.

Comment 8 Nir Soffer 2021-12-20 17:05:38 UTC
(In reply to Amit Sharir from comment #3)
> Following my conversation with Eyal yesterday, adding the relevant flow I
> tried to do for verification.
...
> Steps:
> 1. Create a VM
> 2. Add disk to the VM
> 3. Start "full" backup via SDK script (python3 backup_vm.py -c engine full
> <vm_id>)
> 4. During the initial stage of the backup - "poweroff" the VM (via guest
> terminal)

This is not the relevant flow reported for this bug.

The issue this bug tried to resolve is not stopping a backup that is not
in the READY state, but stopping a backup after the VM was powered off
during a backup.

Stopping a backup that is still in the initializing phase is another gap
that we need to resolve, but the code is not ready for this yet.

To reproduce, you need to do:

1. Start a full backup using:

    ./backup_vm.py -c myengine start vm-id

This will start a backup and exit when the backup is ready.

2. Power off the guest from inside the VM

3. Wait until oVirt detects that the VM is not running

4. Stop the backup

    ./backup_vm.py -c myengine stop vm-id backup-id

Comment 9 Amit Sharir 2021-12-22 10:16:52 UTC
Hi Nir, 

I did the following flow on version: vdsm-4.40.100.1-1.el8ev.x86_64 / ovirt-engine-4.4.10-0.17.el8ev.noarch

1. Create a VM from a template and add a 5 GB disk to it.
2. Start the VM.
3. Start a full backup via VDSM using: ./backup_vm.py -c engine start vm-id
4. After the backup was ready, validate that the scratch disks were created successfully.
5. Power off the guest from inside the VM.
6. Wait until oVirt detects that the VM is not running.
7. Stop the backup via VDSM using: ./backup_vm.py -c engine stop vm-id backup-id


Results: 

On one hand, the scratch disks were removed and the backup failed.
On the other hand, I got the following error:

[root@storage-ge2-vdsm1 examples]# ./backup_vm.py -c engine stop 9577471c-52e1-4db6-bc39-f1f9c8488857 e6ecdb08-7973-4d17-9407-730f520f0f87
[   0.0 ] Finalizing backup 'e6ecdb08-7973-4d17-9407-730f520f0f87'
Traceback (most recent call last):
  File "./backup_vm.py", line 525, in <module>
    main()
  File "./backup_vm.py", line 191, in main
    args.command(args)
  File "./backup_vm.py", line 278, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "./backup_vm.py", line 363, in stop_backup
    backup_service.finalize()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33881, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.



Should I reopen this bug?

Comment 10 Nir Soffer 2022-03-25 09:59:56 UTC
Avihai, if you reproduce the issue described in comment 8, please
file another bug.