Bug 1900518
| Summary: | [CBT][incremental backup] VM.stop_backup should succeed if VM does not exist | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Nir Soffer <nsoffer> |
| Component: | Core | Assignee: | Eyal Shenitzky <eshenitz> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Amit Sharir <asharir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.40.37 | CC: | aefrat, asharir, bugs, eshames, eshenitz, michal.skrivanek |
| Target Milestone: | ovirt-4.4.9 | Keywords: | ZStream |
| Target Release: | --- | Flags: | pm-rhel: ovirt-4.4+, asharir: testing_plan_complete+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-21 07:27:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Blocks a Veeam bug; let's try to fix this for 4.4.5.

Following my conversation with Eyal yesterday, adding the relevant flow I tried to do for verification.
Version:
vdsm-4.40.90.1-1.el8ev.x86_64
ovirt-engine-4.4.9-0.6.el8ev.noarch
Steps:
1. Create a VM
2. Add disk to the VM
3. Start "full" backup via SDK script (python3 backup_vm.py -c engine full <vm_id>)
4. During the initial stage of the backup - "poweroff" the VM (via guest terminal)
Actual results:
The backup got stuck at the waiting stage with the following message:
[ 0.0 ] Starting full backup for VM '737e2228-e86c-4527-bc44-a626d01a77be'
[ 2.6 ] Waiting until backup '6cac5b94-0afa-4b1a-a174-e92bffe624db' is ready
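The waiting stage above corresponds to a poll loop along these lines (a simplified sketch, not the actual backup_vm.py code; `get_phase` is a hypothetical callable that returns the backup phase reported by the engine):

```python
import time

class Timeout(Exception):
    pass

def wait_for_backup_ready(get_phase, timeout=300, interval=5):
    # Poll the backup phase until it reaches READY. If the VM is powered
    # off while the backup is still PREPARING_SCRATCH_DISK, the phase
    # never becomes READY and the wait gives up with a Timeout.
    phase = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "READY":
            return phase
        time.sleep(interval)
    raise Timeout(f"backup not ready, last phase: {phase}")
```

In the stuck run above the phase stayed at PREPARING_SCRATCH_DISK, so a loop without such a timeout would wait forever.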
When trying to finalize the stuck backup (via a POST API call to "https://<env>/ovirt-engine/api/vms/<vm_id>/backups/<backup_id>/finalize"), I got an error response:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fault>
<detail>[Cannot stop VM backup. The VM backup is not in READY phase, backup phase is PREPARING_SCRATCH_DISK.]</detail>
<reason>Operation Failed</reason>
</fault>
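The fault above comes from an engine-side validation: finalize is accepted only once the backup has reached the READY phase. A minimal sketch of that check (the phase names and error text are taken from this report; the function itself is illustrative, not actual engine code):

```python
def validate_finalize(phase):
    # The engine rejects finalize for any phase other than READY,
    # e.g. PREPARING_SCRATCH_DISK, producing the fault shown above.
    if phase != "READY":
        raise RuntimeError(
            "Cannot stop VM backup. The VM backup is not in READY phase, "
            f"backup phase is {phase}.")
```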
(In reply to Amit Sharir from comment #3)
> Following my conversation with Eyal yesterday, adding the relevant flow I
> tried to do for verification.
> ...
> When trying to finalize the stuck backup I got an error response:
> <detail>[Cannot stop VM backup. The VM backup is not in READY phase,
> backup phase is PREPARING_SCRATCH_DISK.]</detail>

Following this comment, returning this bug to "assigned", since it failed QA.

Version:
vdsm-4.40.90.2-1.el8ev.x86_64
ovirt-engine-4.4.9.1-0.13.el8ev.noarch

Verification Flow:
I did the same flow mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1900518#c3

Verification Conclusions:
The expected output matched the actual output. The full flow completed with no errors; I was able to finalize the backup (with an API call) after I powered off the VM (via the guest terminal) without any issues.

Bug verified.

This bugzilla is included in the oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in the oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
The fix for this bug causes a regression, bug 1985746, when a backup is stopped after it has already finished. The current backup code is not ready to stop a backup if the backup is not in the READY phase.

(In reply to Amit Sharir from comment #3)
> Steps:
> 1. Create a VM
> 2. Add disk to the VM
> 3. Start "full" backup via SDK script (python3 backup_vm.py -c engine full <vm_id>)
> 4. During the initial stage of the backup - "poweroff" the VM (via guest terminal)

This is not the relevant flow reported for this bug. The issue this bug tried to resolve is not stopping a backup that is not in the READY state, but stopping a backup after a VM was powered off during a backup. Stopping a backup in the initializing phase is another gap that we need to resolve, but the code is not ready for this yet.

To reproduce, you need to do:

1. Start a full backup using:
   ./backup_vm.py -c myengine start vm-id
   This will start a backup and exit when the backup is ready.
2. In the guest, power off the guest.
3. Wait until oVirt detects that the VM is not running.
4. Stop the backup:
   ./backup_vm.py -c myengine stop vm-id backup-id

Hi Nir,
I did the following flow on version vdsm-4.40.100.1-1.el8ev.x86_64 / ovirt-engine-4.4.10-0.17.el8ev.noarch:
1. Create VM from template + add disk of 5GB to the VM.
2. Start the VM
3. Start a full backup via the SDK example script using: ./backup_vm.py -c engine start vm-id
4. After the backup was ready, I validated that the scratch disks were created successfully.
5. Power off the VM from within the guest.
6. Wait until oVirt detects that the VM is not running.
7. Stop the backup via the SDK example script using: ./backup_vm.py -c engine stop vm-id backup-id
Results:
On the one hand, the scratch disks were removed and the backup failed.
On the other hand, I got the following errors:
[root@storage-ge2-vdsm1 examples]# ./backup_vm.py -c engine stop 9577471c-52e1-4db6-bc39-f1f9c8488857 e6ecdb08-7973-4d17-9407-730f520f0f87
[ 0.0 ] Finalizing backup 'e6ecdb08-7973-4d17-9407-730f520f0f87'
Traceback (most recent call last):
File "./backup_vm.py", line 525, in <module>
main()
File "./backup_vm.py", line 191, in main
args.command(args)
File "./backup_vm.py", line 278, in cmd_stop
stop_backup(connection, args.backup_uuid, args)
File "./backup_vm.py", line 363, in stop_backup
backup_service.finalize()
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33881, in finalize
return self._internal_action(action, 'finalize', None, headers, query, wait)
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
return future.wait() if wait else future
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
return self._code(response)
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback
self._check_fault(response)
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
self._raise_error(response, body)
File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.
Should I reopen this bug?
Avihai, if you can reproduce the issue described in comment 8, please file another bug.
Created attachment 1732400 [details]
log showing typical failure

Description of problem:

If a VM was destroyed during backup, for example powered off from the guest, or the qemu process was killed, the backup job was aborted and there is no backup to stop. In this case vdsm fails with:

2020-11-22 18:37:25,033+0200 ERROR (jsonrpc/5) [api] FINISH stop_backup error=VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet or was undefined (api:134)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
ret = func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 671, in stop_backup
return self.vm.stop_backup(backup_id)
File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 70, in wrapper
return f(*a, **kw)
File "<decorator-gen-265>", line 2, in stop_backup
File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4006, in stop_backup
return backup.stop_backup(self, dom, backup_id=backup_id)
File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 211, in stop_backup
_get_backup_xml(vm.id, dom, backup_id)
File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 372, in _get_backup_xml
backup_xml = dom.backupGetXMLDesc()
File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
return getattr(self._vm._dom, name)(*a, **kw)
File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
% self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet or was undefined

This special condition, when the VM is not defined, should be detected and handled as success. VM.stop_backup should log a message about the non-existing VM and do the normal cleanup (e.g. removing the scratch disks).
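The requested behavior could be sketched roughly as follows (this is not vdsm's actual code; `NotConnectedError` here is a stand-in for vdsm.virt.virdomain.NotConnectedError, and `abort_backup` / `remove_scratch_disks` are hypothetical stand-ins for the real operations):

```python
import logging

class NotConnectedError(Exception):
    # Stand-in for vdsm.virt.virdomain.NotConnectedError, raised when
    # the libvirt domain is not defined (e.g. the VM was powered off).
    pass

def stop_backup(dom, vm_id, backup_id, abort_backup, remove_scratch_disks):
    # If the VM is gone, the backup job was already aborted with it, so
    # treat the stop as success instead of failing the whole call.
    try:
        abort_backup(dom, backup_id)
    except NotConnectedError:
        logging.info("VM %s is not defined, backup %s already aborted",
                     vm_id, backup_id)
    # Cleanup must run whether or not the VM still exists, otherwise
    # scratch disks accumulate under the transientdisk directory.
    remove_scratch_disks(vm_id, backup_id)
    return {"status": "OK"}
```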
With the current behavior the engine can never clean up after the backup, and the scratch disks are left in /var/lib/vdsm/storage/transientdisk/vm-id/backup-id. These files are never deleted and will eventually fill up the disk.

Version-Release number of selected component (if applicable):
vdsm-4.40.35.1-202010291125.gita43f00858.el8.x86_64
(any version supporting backup has this issue)

How reproducible:
Very hard to reproduce (1 of 8 attempts). Usually this flow leads to a hang in qemu.

Steps to Reproduce:
1. Start full backup
2. In the guest, poweroff

Actual results:
Engine fails to stop the backup since the VM was destroyed.

Expected results:
Stopping the backup succeeds since there is no VM (so no backup), and the scratch disks are deleted.

Additional info:
The current code assumes that the VM must exist during a backup. This is true in general, but not in all cases.
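Until such a fix lands, leftover scratch disks could be removed with a small helper like this (a sketch; the transientdisk layout is taken from this report, while the helper itself is illustrative and should only be run when no backup is active):

```python
import os
import shutil

def remove_leftover_scratch_disks(transient_root, vm_id, backup_id):
    # Remove scratch disks left under <transient_root>/<vm_id>/<backup_id>,
    # e.g. /var/lib/vdsm/storage/transientdisk/vm-id/backup-id.
    backup_dir = os.path.join(transient_root, vm_id, backup_id)
    if os.path.isdir(backup_dir):
        shutil.rmtree(backup_dir)
    # Drop the per-VM directory too once it is empty.
    vm_dir = os.path.join(transient_root, vm_id)
    if os.path.isdir(vm_dir) and not os.listdir(vm_dir):
        os.rmdir(vm_dir)
```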