Bug 1880251 - VM stuck in "reboot in progress" ("virtual machine XXX should be running in a host but it isn't.").
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.4.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: 4.4.4
Assignee: Arik
QA Contact: Qin Yuan
URL:
Whiteboard:
Duplicates: 1892416 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-18 05:18 UTC by Gilboa Davara
Modified: 2021-01-12 16:23 UTC (History)
CC: 4 users (show)

Fixed In Version: ovirt-engine-4.4.4
Doc Type: Bug Fix
Doc Text:
Previously, rebooting VMs that were powering up and set with a next-run configuration resulted in these VMs getting stuck in the "reboot in progress" status. This is now prevented: when a powering-up VM with a next-run configuration is rebooted, the VM switches to the "reboot in progress" status and is then restarted.
Clone Of:
Environment:
Last Closed: 2021-01-12 16:23:46 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)
Engine log. (486.61 KB, application/gzip)
2020-09-18 05:18 UTC, Gilboa Davara
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 111880 0 master MERGED core: prevent status=RebootInProgress and run_on_vds=null 2021-02-12 16:39:17 UTC
oVirt gerrit 111881 0 master MERGED core: initiate down vms handling before restarting vms 2021-02-12 16:39:17 UTC
oVirt gerrit 111882 0 master MERGED core: prevent potential state of a running vm with run_on_vds = null 2021-02-12 16:39:17 UTC

Description Gilboa Davara 2020-09-18 05:18:05 UTC
Created attachment 1715300 [details]
Engine log.

Description of problem:
Self hosted engine, single node, NFS.
Attempted to install CentOS over an existing Fedora VM with one host device attached (a USB printer).
Reboot failed, trying to boot from a non-existent CDROM.
Tried shutting the VM down, failed.
Tried powering off the VM, failed.
Dropped cluster to global maintenance, reboot host + engine (was planning to upgrade it anyhow...), VM still stuck.

When trying to power off the VM, the following message can be found in the engine.log:
2020-09-18 07:58:51,439+03 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-42) [7bc4ac71-f0b2-4af7-b081-100dc99b6123] Running command: StopVmCommand internal: false. Entities affected :  ID: b411e573-bcda-4689-b61f-1811c6f03ad5 Type: VMAction group STOP_VM with role type USER
2020-09-18 07:58:51,441+03 WARN  [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-42) [7bc4ac71-f0b2-4af7-b081-100dc99b6123] Strange, according to the status 'RebootInProgress' virtual machine 'b411e573-bcda-4689-b61f-1811c6f03ad5' should be running in a host but it isn't.
2020-09-18 07:58:51,594+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-42) [7bc4ac71-f0b2-4af7-b081-100dc99b6123] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM kids-home-srv (Host: <UNKNOWN>, User: gilboa@internal-authz).

Version-Release number of selected component (if applicable):
4.4.2.6

How reproducible:
N/A.

Steps to Reproduce (doubt it..):
1. Create Fedora 32 VM with one USB device attached.
2. Attempt to install CentOS 8 on it.

Actual results:
Zombie VM.

Expected results:
Either a working VM, or a manual method to "kill it" (most likely I simply need to zap a row in the DB...)
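
For reference, a minimal read-only sketch (in Python) of how the row in question could be inspected before touching anything. The table names (vm_static, vm_dynamic) and the connection details are assumptions to verify against the actual engine schema; only the run_on_vds and status fields come from the discussion later in this bug.

# Read-only inspection of the stuck VM's row in the engine database.
# NOTE: table/column names and connection parameters are assumptions,
# not verified against this installation.
import psycopg2

conn = psycopg2.connect(
    dbname="engine",                  # default engine DB name; adjust if customized
    user="engine",
    password="<engine-db-password>",  # placeholder
    host="localhost",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT s.vm_name, d.status, d.run_on_vds
        FROM vm_static s
        JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
        WHERE s.vm_name = %s
        """,
        ("kids-home-srv",),
    )
    for vm_name, status, run_on_vds in cur.fetchall():
        # A VM stuck as described in this bug shows a non-Down status
        # together with run_on_vds = NULL.
        print(vm_name, status, run_on_vds)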

Additional info:
N/A

Comment 1 RHEL Program Management 2020-09-19 10:48:48 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Arik 2020-09-19 16:33:24 UTC
The VM was in the powering-on state at the moment 'reboot' was triggered - in that case, we don't initiate a call to shut down the VM but rather power it off. It seems that the VM got to the Down state with a reason that doesn't indicate it's part of the reboot process, and due to what seems to be a bug on the engine side, we set the VM status to RebootInProgress after the run_on_vds field was cleared.

Milan, could it be that when we call power-off (graceful=false) we won't get an exit code/reason that indicates it's part of a reboot process? In other cases I see "Exit message: Down as a part of the reboot process", which I don't see in this case.

Comment 3 Milan Zamazal 2020-09-24 16:59:20 UTC
"Down as a part of the reboot process" is set only when a reboot is requested and the VM gets powered off (rather than rebooted) due to destroy-on-reboot flag (and even then only if libvirt sends a reboot event).

When VM.shutdown(reboot=True) is called from Engine, Vdsm sets Vm._guestEvent to REBOOT_IN_PROGRESS. When the reboot doesn't complete and Engine then calls VM.destroy, the exit reason is set to ADMIN_SHUTDOWN. Vdsm doesn't try to associate the two calls, although in theory it could do so; nevertheless, Engine should already be aware of the sequence itself and, as the "smarter" component, it shouldn't require Vdsm's assistance.
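
To make that sequence concrete, here is a small illustrative model in plain Python; it is not the actual Vdsm code and the names are simplified, it only mirrors the two calls described above.

# Simplified model of the sequence described above -- not actual Vdsm code,
# just an illustration of why the exit reason ends up as ADMIN_SHUTDOWN.
from enum import Enum, auto

class ExitReason(Enum):
    DESTROYED_ON_REBOOT = auto()   # set when libvirt reports a reboot event
    ADMIN_SHUTDOWN = auto()        # set on a plain destroy from the engine

class ModelVm:
    def __init__(self):
        self.guest_event = None
        self.exit_reason = None

    def shutdown(self, reboot=False):
        # Engine-initiated reboot: the VM is only *marked* as rebooting.
        if reboot:
            self.guest_event = "REBOOT_IN_PROGRESS"

    def destroy(self):
        # A later destroy is not correlated with the earlier reboot request,
        # so the exit reason does not say "down as part of the reboot process".
        self.exit_reason = ExitReason.ADMIN_SHUTDOWN

vm = ModelVm()
vm.shutdown(reboot=True)   # reboot requested while the VM is powering up
vm.destroy()               # reboot does not complete, engine destroys the VM
print(vm.exit_reason)      # ExitReason.ADMIN_SHUTDOWN, not DESTROYED_ON_REBOOT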

Comment 4 Arik 2020-09-24 19:35:54 UTC
Right, so here we are talking only about the case where reboot is requested and the VM gets powered off due to destroy-on-reboot.

From the user's perspective, it happened as follows:
1. Start a VM
2. While the VM is powering on, edit the VM in a way that creates a next-run configuration
3. While the VM is still powering on, request a reboot

What happens behind the scenes:
1. The VM starts
2. The engine creates next-run snapshots and sets destroy-on-reboot (to apply the next-run configuration)
3. The engine gets to the 'reboot' command, which detects that there's a next-run configuration and therefore calls 'shutdown'
4. Normally, 'shutdown' calls destroy with graceful=true, which I assume results in VM.shutdown(reboot=True) in VDSM, but as the status of the VM is not Up (but PoweringUp) we call 'destroy' instead
5. The destroy call probably triggers an event that makes the engine update the VM to Down so quickly that it happens before the VM is set to status=RebootInProgress (I managed to reproduce this)

The major issue here is that it leads to having a VM with run_on_vds=null (a result of processing the Down event) and status=RebootInProgress.
This can be avoided on the engine side by preventing the status from changing when run_on_vds is null, but then it would be really difficult to make the engine start the VM again.
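
A minimal sketch of that guard idea, in illustrative Python rather than the actual Java engine code (all names are made up):

# Illustrative guard: don't move a VM into RebootInProgress once run_on_vds
# has been cleared by the Down event, because that combination is exactly the
# stuck state reported in this bug. Names are made up, not real engine code.
def try_set_reboot_in_progress(vm):
    if vm.get("run_on_vds") is None:
        # The Down event was already processed and cleared the host. Refusing
        # the status change avoids the stuck state, but the caller still has
        # to arrange for the VM to be started again (the hard part noted above).
        return False
    vm["status"] = "RebootInProgress"
    return True

vm = {"status": "PoweringUp", "run_on_vds": None}     # host already cleared
print(try_set_reboot_in_progress(vm), vm["status"])   # False PoweringUp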

The engine currently relies on the exit status of the VM to prevent the following:
1. The user asks to reboot from the webadmin/portal
2. At about the same time, another user shuts down the VM from within the guest
3. The VM goes down due to the request from within the guest
4. The engine sees the VM is down and starts it because of the 'reboot' process

Comment 5 Arik 2020-09-24 19:39:44 UTC
(In reply to Arik from comment #4)
> The engine currently relies on the exit status of the VM to prevent the
> following:
> 1. The user asks to reboot from the webadmin/portal
> 2. At about the same time, another user shuts down the VM from within the
> guest
> 3. The VM goes down due to the request from within the guest
> 4. The engine sees the VM is down and starts it because of the 'reboot'
> process

Actually no, I think the flow that should have been avoided was:
1. The user asks to shutdown from within the guest
2. It takes time for the engine to get updated and while it still shows the VM as UP, the user asks to reboot from the webadmin/portal
3. The VM goes down due to the request from within the guest
4. The engine sees the VM is down and starts it because of the 'reboot' process

Comment 7 Milan Zamazal 2020-10-22 12:25:08 UTC
(In reply to Arik from comment #5)

> Actually no, I think the flow that should have been avoided was:
> 1. The user asks to shutdown from within the guest
> 2. It takes time for the engine to get updated and while it still shows the
> VM as UP, the user asks to reboot from the webadmin/portal
> 3. The VM goes down due to the request from within the guest
> 4. The engine sees the VM is down and starts it because of the 'reboot'
> process

It depends on whether the VM is already down when the reboot is called. If the user initiates shutdown in the guest and then reboot is called before the VM switches off, it can still result in a VM reboot. I'm not sure whether it's a bug or a feature; it's probably not important. However, a related question is whether Engine really must rely on the exit reason provided by Vdsm.

We've discussed offline a possibility of handling this bug in Vdsm, by changing Vdsm to report reboot exit reason in this scenario. But let's consider this:

1. A VM is started.
2. Its configuration is changed for next run.
3. While the VM is still powering up, the user requests either reboot or power off in the Web UI.

The problem is, if I understand and observe it correctly, that in step 3, Engine calls the destroy command (rather than shutdown) on both reboot and power off. That means Vdsm can't distinguish between reboot and power off and thus can't provide the corresponding exit reason. So it looks like Vdsm can't assist Engine, and we'll have to fix Engine in some way.

Arik, what do you think about it?

Comment 8 Arik 2020-10-22 21:09:52 UTC
Yes, that makes sense. I misinterpreted the meaning of the destroy-on-reboot flag (I probably reviewed the code when it was introduced but managed to forget :) )

So OK, I understand that this flag is intended for a reboot from within the guest: if there's a next-run configuration involved, then we ask VDSM to convert that reboot to a destroy, and the VM is then reported with exit-reason=reboot so the engine would know that it needs to restart the VM (as opposed to when shutdown is called from within the guest). Did I get this right?
I wonder then why, in the case of rebooting a VM that is set with a next-run configuration from the engine, the engine bothers to call shutdown without reboot=true - can't we call shutdown with reboot=True and expect VDSM to call 'reboot' and do the same as the above?

But as for this bug, indeed - VDSM misses the context of whether or not the 'destroy' call is part of a reboot process.
The engine is actually prepared for that case - it sets the cold-reboot flag in VmManager but I guess that now that we use events (that were introduced after the 'reboot' feature), we tend to get the Down event before that flag is set and before changing the status of the VM to RebootInProgress - and that seems like an easy fix on the engine side.
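
A small illustrative sketch of that ordering, in plain Python with made-up names (not the actual engine code): if the cold-reboot intent is recorded before the destroy that produces the Down event, the event handler can recognize the reboot instead of leaving the VM stranded.

# Illustrative sketch of the ordering issue described above (made-up names,
# not the actual engine code). If the Down event is handled before the reboot
# command records its intent, the restart logic never runs.
class VmManagerModel:
    def __init__(self):
        self.cold_reboot = False
        self.status = "PoweringUp"
        self.run_on_vds = "host-1"

    def reboot_command(self):
        # Fix direction from the comment above: record the cold-reboot intent
        # and the RebootInProgress status *before* issuing the destroy that
        # will produce the Down event.
        self.cold_reboot = True
        self.status = "RebootInProgress"
        # ... destroy is sent to the host here ...

    def on_down_event(self):
        self.run_on_vds = None
        if self.cold_reboot:
            # The Down is part of a reboot: mark the VM Down so it can be
            # started again.
            self.status = "Down"
        # Without the flag, the VM would keep its previous status while
        # run_on_vds is already null -- the stuck state from this bug.

vm = VmManagerModel()
vm.reboot_command()
vm.on_down_event()
print(vm.status, vm.run_on_vds)   # Down None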

Comment 9 Milan Zamazal 2020-10-23 11:21:10 UTC
(In reply to Arik from comment #8)

> So OK, I understand that this flag is intended for a reboot from within the
> guest: if there's a next-run configuration involved, then we ask VDSM to
> convert that reboot to a destroy, and the VM is then reported with
> exit-reason=reboot so the engine would know that it needs to restart the VM
> (as opposed to when shutdown is called from within the guest). Did I get
> this right?

Yes, if the VM happens to get rebooted, Vdsm destroys it with DestroyedOnReboot exit reason.

> I wonder then why, in the case of rebooting a VM that is set with a next-run
> configuration from the engine, the engine bothers to call shutdown without
> reboot=true - can't we call shutdown with reboot=True and expect VDSM to
> call 'reboot' and do the same as the above?

OTOH, Engine is the primary place to implement the logic; why bother with a reboot and rely on its handling in Vdsm when Engine already knows that a restart is the right thing?

> But as for this bug, indeed - VDSM misses the context of whether or not the
> 'destroy' call is part of a reboot process.
> The engine is actually prepared for that case - it sets the cold-reboot flag
> in VmManager but I guess that now that we use events (that were introduced
> after the 'reboot' feature), we tend to get the Down event before that flag
> is set and before changing the status of the VM to RebootInProgress - and
> that seems like an easy fix on the engine side.

OK.

Comment 10 Arik 2020-11-03 17:20:40 UTC
*** Bug 1892416 has been marked as a duplicate of this bug. ***

Comment 11 Casper (RHV QE bot) 2020-11-13 11:02:18 UTC
This bug has a low overall severity and passed the automated regression suite, so it is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.

Comment 12 Sandro Bonazzola 2020-11-13 15:52:57 UTC
I see that severity and priority for this bug have not been set. Can you please check if this bug deserves more attention from QE?

Comment 13 Arik 2020-11-15 08:08:03 UTC
The following scenario needs to be tested:
1. Start a VM that does not have qemu-guest-agent installed (so it will take the VM 60 seconds to reach the UP state)
2. Right after starting the VM, edit the VM to produce a next-run configuration (e.g., add/remove ballooning)
3. Before the VM reaches the UP state, reboot the VM
Previously this often resulted in the VM being stuck in the RebootInProgress state with the host it runs on being null (empty in the UI); now that should no longer happen.
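
A rough scripted version of these steps using the oVirt Python SDK (ovirtsdk4) is sketched below; the connection details are placeholders, and the update(..., next_run=True) call and the chosen property (ballooning) are assumptions to check against the SDK documentation.

# Rough automation of the test steps above, using ovirtsdk4.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="<password>",
    insecure=True,               # use ca_file=... in a real environment
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=test-vm")[0]
vm_service = vms_service.vm_service(vm.id)

# 1. Start a VM that has no qemu-guest-agent installed.
vm_service.start()

# 2. While it is powering up, edit it so a next-run configuration is created
#    (here: toggle memory ballooning).
vm_service.update(
    types.Vm(memory_policy=types.MemoryPolicy(ballooning=False)),
    next_run=True,
)

# 3. Reboot before the VM reaches UP.
vm_service.reboot()

# Expected with the fix: the VM passes through REBOOT_IN_PROGRESS and ends up
# DOWN (and startable), instead of being stuck with no host.
for _ in range(30):
    status = vm_service.get().status
    print(status)
    if status == types.VmStatus.DOWN:
        break
    time.sleep(10)

connection.close()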

Comment 14 Qin Yuan 2021-01-05 15:43:56 UTC
Verified with:
ovirt-engine-4.4.4.6-0.1.el8ev.noarch

Steps:
As per steps in comment #13

Results:
1. The VM changes to the RebootInProgress state first, then goes to the Down status
2. The Down VM can be started manually

The bug fix works: the VM is not stuck in the RebootInProgress state and remains operational.
As for the issue that the VM goes to the Down status but does not start again, there is another bug tracking it; see bug 1912911.

Comment 15 Sandro Bonazzola 2021-01-12 16:23:46 UTC
This bugzilla is included in the oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

