Created attachment 1732461
logs from 3 backups showing this issue

Description of problem:

If a VM hangs in the middle of a backup (e.g. bug 1892672) and the backup application asks to finalize the backup, stopping the backup fails because qemu is hung (see bug 1900505).

The only way to recover is to shut down or power off the VM, but this is blocked by the backend, so the operation fails with:

    Error while executing action: backup-raw: Cannot shutdown VM. The VM is during a backup operation.

The result is that the only way to recover this VM is to kill qemu manually.

There are other cases where a user may want to shut down a VM during a backup without waiting for the backup to complete, which can take hours with a huge VM.

In the UI, users should see a warning that shutting down or powering off a VM will abort the current backup. If users want to perform the operation, they can confirm it in the same way they confirm that a host was rebooted, and have the backup terminated.

In the SDK, users should be able to change the VM state during a backup by providing some kind of force= flag.

Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
50%

Steps to Reproduce:
1. Start backup
2. Wait until the download starts
3. In the guest, poweroff

Actual results:
qemu hangs, and the VM is left in unknown status forever. The only way to recover is to kill the qemu process.

Expected results:
The user can shut down or power off the VM to recover the hung VM.
Hi Nir, Eyal,

My reproduction of this kind of behavior on rhv-release-4.4.5-2 is a little different.

Steps I did:

1. Create a VM from a template and start it.

2. Start a full backup:

    [root@storage-ge13-vdsm3 examples]# python3 backup_vm.py -c engine start 02fa277b-4b7f-46b4-9618-57ea1c69c77a
    [ 0.0 ] Starting full backup for VM '02fa277b-4b7f-46b4-9618-57ea1c69c77a'
    [ 1.3 ] Waiting until backup e331442b-02c8-43b3-b94b-40bf92e322f4 is ready
    [ 2.3 ] Backup e331442b-02c8-43b3-b94b-40bf92e322f4 is ready

3. Download the disks, and as soon as the download starts, power off from within the guest:

    [root@storage-ge13-vdsm3 examples]# python3 backup_vm.py -c engine download 02fa277b-4b7f-46b4-9618-57ea1c69c77a --backup-uuid e331442b-02c8-43b3-b94b-40bf92e322f4
    [ 0.0 ] Downloading VM 02fa277b-4b7f-46b4-9618-57ea1c69c77a disks
    [ 0.1 ] Creating image transfer for disk 7cc001bb-0ad1-4d3f-bfac-1d145ee50433
    [ 1.3 ] Image transfer f1e2e141-a4d6-4ec9-be38-bbdcb2932b29 is ready
    [ 83.02% ] 8.30 GiB, 118.72 seconds, 71.61 MiB/s
    [ 120.0 ] Finalizing image transfer
    Traceback (most recent call last):
      File "backup_vm.py", line 428, in <module>
        main()
      File "backup_vm.py", line 161, in main
        args.command(args)
      File "backup_vm.py", line 232, in cmd_download
        connection, args.backup_uuid, args, incremental=args.incremental)
      File "backup_vm.py", line 354, in download_backup
        download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
      File "backup_vm.py", line 397, in download_disk
        **extra_args)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
        name="download")
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
        log.debug("Executor failed")
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
        self.stop()
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
        raise self._errors[0]
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
        handler.copy(req)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
        self._src.write_to(self._dst, req.length, self._buf)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 215, in write_to
        .format(length, length - todo))

It took about 2 minutes for the VM terminal to be terminated after the 'poweroff' command.

After the VM was shut down, I couldn't start it from the engine UI, getting the same kind of error that you reported:

    "Cannot run VM. The VM is during a backup operation."

I was able to remove the VM without a problem.

Regarding the hung qemu process: it was not found after the VM shut down. The process was killed as the VM powered off.

See the video of the shutdown process here:
https://drive.google.com/file/d/1OL9stxiburm6nWNUtlSh-pHkg3PjrKhf/view?usp=sharing

So it's not quite the same as what you reported in the description. Please review my steps. Did I miss something? Could this be considered a reproduction?
There is a much simpler way to verify this bug. The fix here adds an option to power off the VM even if a backup is running for it.

So the steps are:

1. Run a VM with a disk.
2. Start a backup for it.
3. While the backup is running, try to power off the VM via the UI. This should fail with a proper error about the running backup.
4. Try to power off, shut down, or reboot the VM from the REST API using the following 'force' flag in the request:

    POST /ovirt-engine/api/vms/123/(shutdown/stop/reboot)

    <action>
      <force>true</force>
    </action>

5. The VM should be powered off, shut down, or rebooted.
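For anyone scripting step 4, here is a minimal sketch of building that request with the Python standard library. It only constructs the request and does not send it. The helper name build_force_action_request and the engine URL are made up for illustration; authentication and TLS settings are left out and would need to be added for a real engine.

```python
import urllib.request
import xml.etree.ElementTree as ET

# VM actions named in the verification steps above.
VM_ACTIONS = ("shutdown", "stop", "reboot")


def build_force_action_request(engine_url, vm_id, action, force=True):
    """Build (but do not send) a POST request for a VM action.

    Hypothetical helper for illustration only. engine_url is the base
    URL of the engine, e.g. "https://engine.example.com" (assumed).
    """
    if action not in VM_ACTIONS:
        raise ValueError("unsupported action: {!r}".format(action))

    # Body: <action><force>true</force></action>
    root = ET.Element("action")
    ET.SubElement(root, "force").text = "true" if force else "false"
    body = ET.tostring(root)

    url = "{}/ovirt-engine/api/vms/{}/{}".format(
        engine_url.rstrip("/"), vm_id, action)

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/xml",
            "Accept": "application/xml",
        },
    )


if __name__ == "__main__":
    req = build_force_action_request(
        "https://engine.example.com", "123", "stop")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

The returned object can be passed to urllib.request.urlopen() once credentials are configured. Keeping construction separate from sending makes it easy to inspect the URL and body before touching a VM that is in the middle of a backup.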
Verified on rhv-4.4.5-5 according to the steps in comment #2.

In addition, checked the backup state of the VM after each state change, and checked finalizing the backup when the VM is Down.
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.