2040766 – A crashed Windows VM cannot be restarted with virtctl or the UI

Summary: A crashed Windows VM cannot be restarted with virtctl or the UI

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.8.8
Hardware:	All
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Prita Narayan
QA Contact:	zhe peng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-14 16:29 UTC by pmoses
Modified:	2023-11-13 08:16 UTC
CC List:	8 users
Fixed In Version:	hco-bundle-registry-container-v4.11.0-491
Doc Type:	Known Issue
Doc Text:	KubeVirt prevents a VM stop request from being processed multiple times. As a consequence, if a VM hangs during shutdown, then it is not possible to issue a new request for immediate shutdown, for example, by using the "--force --grace-period 0" flags. A VM stuck in terminating state cannot be easily stopped from the UI. However, it is possible to directly delete the virt-launcher pod.
Clone Of:
Environment:
Last Closed:	2022-09-14 19:28:30 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
launcher pod log (152.68 KB, text/plain) 2022-01-18 13:48 UTC, pmoses
UI details (234.02 KB, image/png) 2022-01-18 13:56 UTC, pmoses

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 7494	None	open	VM with RunStrategyHalted now accepts manual stop request...	2022-05-19 12:45:50 UTC
Github	kubevirt kubevirt pull 7860	None	open	[release-0.53] VM with RunStrategyHalted now accepts manual stop request...	2022-06-07 11:42:20 UTC
Github	openshift openshift-docs pull 42530	None	open	CNV13829: Adding Comprehensive 4.10 Release Notes	2022-03-15 22:21:04 UTC
Red Hat Issue Tracker	CNV-15848	None	None	None	2023-11-13 08:16:38 UTC
Red Hat Knowledge Base (Solution)	6740601	None	None	None	2022-02-17 01:40:49 UTC
Red Hat Product Errata	RHSA-2022:6526	None	None	None	2022-09-14 19:28:56 UTC

Description pmoses 2022-01-14 16:29:54 UTC

Description of problem:
If a Windows VM crashes or becomes unresponsive before a host agent is responding, there is no apparent way to stop the VM. virtctl responds with "Halted does not support manual restart requests".


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Blue screen a Windows VM
2. Attempt to stop VM
3. The VM stays up (it can still be viewed in the console); neither the UI nor virtctl properly halts the machine.

Actual results:
There is no manageable way for end users to restart Windows VMs that have crashed without the host agent reporting back to the platform.


Expected results:
A manual or forced power-off of the VM without deleting it.

Additional info:

Comment 1 sgott 2022-01-17 21:22:38 UTC

There are flags for virtctl (--grace-period 0 --force) that should halt the machine. Did you try them?

Comment 2 pmoses 2022-01-18 13:48:10 UTC

Created attachment 1851600
launcher pod log

Comment 3 pmoses 2022-01-18 13:56:47 UTC

Created attachment 1851601
UI details

Comment 4 pmoses 2022-01-18 13:59:03 UTC

Yes. It seems the --force and --grace-period flags are only valid with restart. Either way, the results are the same:

[pmo@pmo-rhel ~]$ virtctl version
Client Version: version.Info{GitVersion:"v0.30.7", GitCommit:"af8ac92fbb1fc4c1c4fda6a2d6ddb04eaded797e", GitTreeState:"clean", BuildDate:"2021-06-07T10:07:04Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

[pmo@pmo-rhel ~]$ virtctl restart win10 --force --grace-period=0
Error restarting VirtualMachine, Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual restart requests

[pmo@pmo-rhel ~]$ virtctl stop win10 --grace-period=0 --force
unknown flag: --grace-period

[pmo@pmo-rhel ~]$ virtctl stop win10
Error stopping VirtualMachine Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual stop requests
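
For context, "Halted" in these errors is the name of a KubeVirt run strategy. The following is a minimal sketch for checking which run policy a VM carries, assuming the oc client is logged in to the cluster. The function name vm_run_policy and the default namespace are illustrative only, and the remark that spec.running: false is treated like Halted is an assumption about KubeVirt, not something shown in this report.

```shell
#!/bin/sh
# Sketch: print the run policy fields of a VirtualMachine.
# Assumption: a VM sets either spec.runStrategy or spec.running, and
# spec.running=false is handled like the "Halted" strategy.
vm_run_policy() {
    vm_name="$1"
    namespace="${2:-default}"   # hypothetical default; pass the real namespace
    oc get vm "$vm_name" -n "$namespace" \
        -o jsonpath='runStrategy={.spec.runStrategy} running={.spec.running}{"\n"}'
}
```

Usage: vm_run_policy win10 my-namespace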

Comment 5 sgott 2022-01-19 13:38:36 UTC

Raising the severity of this because it's hard to recover from once it's been triggered. Recovery is possible, but it requires deleting the pod.
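
A minimal sketch of that pod deletion, assuming the virt-launcher pod carries the labels kubevirt.io=virt-launcher and vm.kubevirt.io/name=<vm>. Verify the labels with "oc get pods --show-labels" before relying on them. The function name delete_launcher_pod and the default namespace are illustrative only.

```shell
#!/bin/sh
# Sketch: force-delete the virt-launcher pod backing a stuck VM.
# Assumption: the pod is labeled kubevirt.io=virt-launcher and
# vm.kubevirt.io/name=<vm name>; label names may differ by version.
delete_launcher_pod() {
    vm_name="$1"
    namespace="${2:-default}"   # hypothetical default; pass the real namespace
    oc delete pod -n "$namespace" \
        -l "kubevirt.io=virt-launcher,vm.kubevirt.io/name=${vm_name}" \
        --grace-period=0 --force
}
```

Usage: delete_launcher_pod win10 my-namespace. Note that comment 7 in this report describes a case where the pod went away but the VMI was not cleaned up, so this may not be sufficient on every version.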

The real bug here is that KubeVirt should honor a second halt request if the user issues a newer shorter timeout.
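
The behavior asked for here can be modeled as a small decision rule. This is an illustrative sketch only, not KubeVirt's code: the function name accept_stop_request and the convention that an empty first argument means "no stop pending" are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of the desired rule: accept a stop request when none is pending,
# or when the new request carries a shorter grace period (in seconds).
# accept_stop_request CURRENT_GRACE NEW_GRACE
#   CURRENT_GRACE is empty when no stop is pending.
# Returns 0 (accept) or 1 (reject).
accept_stop_request() {
    current="$1"
    new="$2"
    if [ -z "$current" ]; then
        return 0                      # no pending stop: always accept
    fi
    [ "$new" -lt "$current" ]         # pending stop: accept only a shorter timeout
}
```

Under this rule, a pending stop with a 180-second grace period would be superseded by a later request with --grace-period=0, while a repeated request with the same or a longer timeout would still be rejected.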

Comment 7 Germano Veit Michel 2022-02-17 03:41:18 UTC

(In reply to sgott from comment #5)
> The real bug here is that KubeVirt should honor a second halt request if the
> user issues a newer shorter timeout.

One interesting thing: if the VM is stuck on boot (i.e. paused on SeaBIOS), the second halt request returns the same error in the CLI, but the VM is actually shut down immediately.
This is on 4.9.21 with 4.9.2, Windows VM.

Unfortunately, deleting the virt-launcher pod does not work: the pod is gone but the VMI is still there.

# oc get vmi
NAME                    AGE   PHASE     IP            NODENAME                          READY
win2k16-happy-pelican   11m   Running   10.129.2.37   worker-1.lab-cluster.toca.local   False
# oc get pods | grep virt-launcher
#

That VMI stays there and is not cleaned up. Force-deleting it does not work either; the command hangs forever without doing anything.

# oc delete vmi win2k16-happy-pelican --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
virtualmachineinstance.kubevirt.io "win2k16-happy-pelican" force deleted
^C

The only thing I can find that really works and makes the cleanup happen is to finish the job that was initially started: kill the qemu process on the node.

Comment 9 ctomasko 2022-03-15 22:21:04 UTC

Added Release note > known issue

You cannot attempt to stop a VM multiple times because KubeVirt prevents multiple stop attempts. If a VM crashes during shutdown, then you cannot issue a new stop attempt and you cannot easily remove the VM from the UI. (BZ#2040766)

https://github.com/openshift/openshift-docs/pull/42530
https://deploy-preview-42530--osdocs.netlify.app/openshift-enterprise/latest/virt/virt-4-10-release-notes#virt-4-10-known-issues

Future link: After the OpenShift Virtualization 4.10 releases, you can find the release notes here: https://docs.openshift.com/container-platform/4.10/virt/virt-4-10-release-notes.html
or on the portal,
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10

Comment 13 zhe peng 2022-06-28 07:51:33 UTC

Verified with build:
Server Version: 4.11.0-fc.3
$ virtctl version
Client Version: version.Info{GitVersion:"v0.53.2-16-gd3854bb91", GitCommit:"d3854bb91a447946d3ef626f243e001c4766d5a4", GitTreeState:"clean", BuildDate:"2022-06-19T10:27:57Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.53.2-37-gd8a6ac7e7", GitCommit:"d8a6ac7e78042ed77d99601fce197cae58d16f5a", GitTreeState:"clean", BuildDate:"2022-06-26T10:19:51Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}

Steps:
1. Create a Windows VM.
2. Start the VM; inside the VM, run the command "TASKKILL /IM svchost.exe /F" to trigger a Windows BSoD.
3. Use virtctl to stop or restart the VM.
stop-1:
$ virtctl stop vm-win10 --grace-period=0 --force
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   31m   Stopped   False
stop-2:
$ virtctl stop vm-win10
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   33m   Stopped   False
restart:
$ virtctl restart vm-win10 --force --grace-period=0
VM vm-win10 was scheduled to restart
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   27m   Running   True

Also tested VMs with the RunStrategy setting:
"Manual" and "Halted" both worked as expected.
Moving to verified.
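
The manual "oc get vm" checks above could be scripted as a polling loop. This is a hedged sketch, assuming the STATUS column shown by "oc get vm" is backed by the .status.printableStatus field. The function name wait_for_vm_status and its default retry count and delay are illustrative only.

```shell
#!/bin/sh
# Sketch: poll a VM until it reports the wanted status or the retries run out.
# Assumption: .status.printableStatus holds the value shown in the STATUS column.
# wait_for_vm_status VM_NAME WANTED_STATUS [TRIES] [DELAY_SECONDS]
wait_for_vm_status() {
    vm_name="$1"
    want="$2"
    tries="${3:-30}"
    delay="${4:-2}"
    status=""
    i=0
    while [ "$i" -lt "$tries" ]; do
        status="$(oc get vm "$vm_name" -o jsonpath='{.status.printableStatus}')"
        if [ "$status" = "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $vm_name to reach $want (last: $status)" >&2
    return 1
}
```

Usage: virtctl stop vm-win10 --grace-period=0 --force && wait_for_vm_status vm-win10 Stopped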

Comment 15 errata-xmlrpc 2022-09-14 19:28:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526
