1900564 – [CBT][incremental backup] Engine cannot stop backup because VM is hang, cannot destroy VM because backup is running

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Storage
Sub Component:
Version:	4.4.4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.4.5
Target Release:	---
Assignee:	Eyal Shenitzky
QA Contact:	Ilan Zuckerman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-11-23 11:45 UTC by Nir Soffer
Modified:	2021-03-22 12:55 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-03-18 15:12:42 UTC
oVirt Team:	Storage
Embargoed:
Dependent Products:

Attachments
logs from 3 backups showing this issue (73.83 KB, application/gzip) 2020-11-23 11:45 UTC, Nir Soffer

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	112819	master	MERGED	core: add force flag to stop\shutdown VM operations	2021-02-14 05:47:30 UTC
oVirt gerrit	112820	master	MERGED	core: add force flag to reboot VM operation	2021-02-14 05:47:30 UTC
oVirt gerrit	112821	master	MERGED	VmService: add force flag to stop\shutdown VM operations	2021-02-14 05:47:30 UTC
oVirt gerrit	112822	master	MERGED	VmService: add force flag to reboot VM operation	2021-02-14 05:47:30 UTC
oVirt gerrit	113163	master	MERGED	Update to model 4.4.23, metamodel 1.3.4	2021-02-14 05:47:30 UTC

Description Nir Soffer 2020-11-23 11:45:00 UTC

Created attachment 1732461
logs from 3 backups showing this issue

Description of problem:

If a VM hangs in the middle of a backup (e.g. bug 1892672) and the backup
application asks to finalize the backup, stopping the backup fails because
qemu is hung (see bug 1900505). The only way to recover is to shut down or
power off the VM, but this is blocked by the backend, so the operation fails
with:

   Error while executing action:

   backup-raw:
   Cannot shutdown VM. The VM is during a backup operation.

The result is that the only way to recover this VM is to kill qemu manually.

There are other cases where a user may want to shut down a VM during a backup
without waiting for the backup to complete, which can take hours with a huge VM.

In the UI, users should see a warning that shutting down or powering off a VM
will abort the current backup. If a user wants to perform the operation,
they can confirm it in the same way they confirm that a host
was rebooted, and have the backup terminated.

In the SDK, users should be able to change VM state during backup by
providing some kind of force= flag. 
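
A minimal sketch of the requested behavior, for illustration only. It is not
engine code: the function name validate_vm_state_change and its parameters are
hypothetical. The refusal message follows the error quoted above, and the
warning text is made up:

```python
def validate_vm_state_change(action, backup_running, force=False):
    """Return (allowed, message) for a requested VM state change.

    Hypothetical model of the proposed rule: a running backup blocks
    the action unless the caller explicitly passes force=True.
    """
    if not backup_running:
        # No backup in progress, nothing to protect.
        return True, ""
    if force:
        # The caller accepted that the running backup will be aborted.
        return True, f"Warning: {action} will abort the running backup."
    # Current behavior: the backend refuses the operation.
    return False, f"Cannot {action} VM. The VM is during a backup operation."


if __name__ == "__main__":
    print(validate_vm_state_change("shutdown", backup_running=True))
    print(validate_vm_state_change("shutdown", backup_running=True, force=True))
```

With such a rule, a hung VM could still be stopped during a backup, and the
default (no force) would keep protecting running backups.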

Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
50%

Steps to Reproduce:
1. Start backup
2. Wait until the download starts
3. In the guest, poweroff

Actual results:
qemu hangs and the VM is left in unknown status forever. The only way to
recover is to kill the qemu process.

Expected results:
The user can shut down or power off the VM to recover the hung VM.

Comment 1 Ilan Zuckerman 2021-01-25 15:25:54 UTC

Hi Nir, Eyal, my reproduction of that kind of behavior in rhv-release-4.4.5-2 is a little bit different:

Steps I took:
1. Create a VM from a template and start it.

2. Start a full backup:
[root@storage-ge13-vdsm3 examples]# python3 backup_vm.py -c engine start 02fa277b-4b7f-46b4-9618-57ea1c69c77a
[   0.0 ] Starting full backup for VM '02fa277b-4b7f-46b4-9618-57ea1c69c77a'
[   1.3 ] Waiting until backup e331442b-02c8-43b3-b94b-40bf92e322f4 is ready
[   2.3 ] Backup e331442b-02c8-43b3-b94b-40bf92e322f4 is ready


3. Download the disks and, as soon as the download starts, power off from within the guest:
[root@storage-ge13-vdsm3 examples]# python3 backup_vm.py -c engine download 02fa277b-4b7f-46b4-9618-57ea1c69c77a  --backup-uuid e331442b-02c8-43b3-b94b-40bf92e322f4
[   0.0 ] Downloading VM 02fa277b-4b7f-46b4-9618-57ea1c69c77a disks
[   0.1 ] Creating image transfer for disk 7cc001bb-0ad1-4d3f-bfac-1d145ee50433
[   1.3 ] Image transfer f1e2e141-a4d6-4ec9-be38-bbdcb2932b29 is ready
[  83.02% ] 8.30 GiB, 118.72 seconds, 71.61 MiB/s                              
[ 120.0 ] Finalizing image transfer
Traceback (most recent call last):
  File "backup_vm.py", line 428, in <module>
    main()
  File "backup_vm.py", line 161, in main
    args.command(args)
  File "backup_vm.py", line 232, in cmd_download
    connection, args.backup_uuid, args, incremental=args.incremental)
  File "backup_vm.py", line 354, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "backup_vm.py", line 397, in download_disk
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
    name="download")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 215, in write_to
    .format(length, length - todo))


It took about 2 minutes for the VM's terminal session to terminate after the 'poweroff' command.
After the VM was shut down, I couldn't start it from the engine UI, and got the same kind of error that you reported:
"Cannot run VM. The VM is during a backup operation."

I was able to remove the VM without problems.

Regarding the hung qemu process: it was not found after the VM shut down. The process was killed as the VM powered off.

See the video of shutdown process here:
https://drive.google.com/file/d/1OL9stxiburm6nWNUtlSh-pHkg3PjrKhf/view?usp=sharing

So it's not quite the same as what you reported in the description.
Please review my steps; did I miss something?
Could this be considered a reproduction?

Comment 2 Eyal Shenitzky 2021-01-26 12:07:04 UTC

There is a much simpler way to verify this bug.

The fix adds an option to power off the VM even if a backup is running for it.
So the steps are:

1. Run a VM with a disk.
2. Start a backup for it.
3. While the backup is running, try to power off the VM via the UI. This should fail with a proper error about the running backup.
4. Try to power off, shut down, or reboot the VM from the REST API using the following 'force' flag in the request:

POST /ovirt-engine/api/vms/123/(shutdown/stop/reboot)

<action>
    <force>true</force>
</action>

5. The VM should power off, shut down, or reboot accordingly.
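
The request in step 4 could be built roughly like this using only the Python
standard library. This is a sketch, not a tested client: the helper name
build_force_action_request and the host engine.example.com are assumptions,
the VM id 123 is the placeholder used above, and authentication and TLS
settings are left out. The request is built but not sent:

```python
import urllib.request
import xml.etree.ElementTree as ET

# The three VM actions named in step 4.
VM_ACTIONS = ("shutdown", "stop", "reboot")


def build_force_action_request(engine_url, vm_id, action, force=True):
    """Build (but do not send) a POST request for a VM action with <force>."""
    if action not in VM_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    # Body: <action><force>true</force></action>
    root = ET.Element("action")
    ET.SubElement(root, "force").text = "true" if force else "false"
    body = ET.tostring(root)

    url = f"{engine_url.rstrip('/')}/ovirt-engine/api/vms/{vm_id}/{action}"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    )


if __name__ == "__main__":
    req = build_force_action_request("https://engine.example.com", "123", "stop")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Sending the request would additionally need credentials and the engine CA
certificate, which depend on the environment.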

Comment 3 Ilan Zuckerman 2021-02-17 06:49:57 UTC

Verified on rhv-4.4.5-5 according to the steps in comment #2.
In addition, checked the backup state of the VM after each state change, as well as finalizing the backup while the VM is Down.

Comment 4 Sandro Bonazzola 2021-03-18 15:12:42 UTC

This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
