Bug 2040766 - A crashed Windows VM cannot be restarted with virtctl or the UI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.8.8
Hardware: All
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Prita Narayan
QA Contact: zhe peng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-14 16:29 UTC by pmoses
Modified: 2023-11-13 08:16 UTC

Fixed In Version: hco-bundle-registry-container-v4.11.0-491
Doc Type: Known Issue
Doc Text:
KubeVirt prevents a VM stop request from being processed multiple times. As a consequence, if a VM hangs during shutdown, then it is not possible to issue a new request for immediate shutdown, for example, by using the "--force --grace-period 0" flags. A VM stuck in terminating state cannot be easily stopped from the UI. However, it is possible to directly delete the virt-launcher pod.
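The workaround named in the Doc Text (deleting the virt-launcher pod) could look roughly like the sketch below. It assumes the launcher pod follows the usual virt-launcher-<vm-name>-<suffix> naming and carries the kubevirt.io=virt-launcher label. The function names and the DRY_RUN variable are hypothetical and not part of oc or virtctl. Comment 7 reports a case where deleting the pod left the VMI behind, so this may not be sufficient on its own.

```shell
#!/bin/sh
# Hypothetical helper: filter a list of pod names down to the launcher
# pod(s) of one VM. Reads "pod/<name>" or "<name>" lines on stdin.
launcher_pods_for_vm() {
    vm="$1"
    sed 's|^pod/||' | grep -E "^virt-launcher-${vm}-[a-z0-9]+$"
}

# Hypothetical helper: delete the launcher pod(s) of a VM in a namespace.
# With DRY_RUN=1 the delete command is printed instead of executed.
delete_launcher_pod() {
    vm="$1"; ns="$2"
    # Assumption: launcher pods are labelled kubevirt.io=virt-launcher.
    oc get pods -n "$ns" -l kubevirt.io=virt-launcher -o name |
        launcher_pods_for_vm "$vm" |
        while read -r pod; do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "oc delete pod -n $ns $pod"
            else
                oc delete pod -n "$ns" "$pod"
            fi
        done
}
```

A cautious first run would be DRY_RUN=1 delete_launcher_pod win10 default, which only prints the command it would execute.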
Clone Of:
Environment:
Last Closed: 2022-09-14 19:28:30 UTC
Target Upstream Version:
Embargoed:


Attachments
launcher pod log (152.68 KB, text/plain), 2022-01-18 13:48 UTC, pmoses
UI details (234.02 KB, image/png), 2022-01-18 13:56 UTC, pmoses


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 7494 | 0 | None | open | VM with RunStrategyHalted now accepts manual stop request... | 2022-05-19 12:45:50 UTC
Github | kubevirt kubevirt pull 7860 | 0 | None | open | [release-0.53] VM with RunStrategyHalted now accepts manual stop request... | 2022-06-07 11:42:20 UTC
Github | openshift openshift-docs pull 42530 | 0 | None | open | CNV13829: Adding Comprehensive 4.10 Release Notes | 2022-03-15 22:21:04 UTC
Red Hat Issue Tracker | CNV-15848 | 0 | None | None | None | 2023-11-13 08:16:38 UTC
Red Hat Knowledge Base (Solution) | 6740601 | 0 | None | None | None | 2022-02-17 01:40:49 UTC
Red Hat Product Errata | RHSA-2022:6526 | 0 | None | None | None | 2022-09-14 19:28:56 UTC

Description pmoses 2022-01-14 16:29:54 UTC
Description of problem:
If a Windows VM crashes or becomes unresponsive before a host agent is responding, there is no apparent way to stop the VM. virtctl responds with "Halted does not support manual restart requests".


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Blue screen a Windows VM
2. Attempt to stop VM
3. VM stays up (visible in the console); neither the UI nor virtctl properly halts the machine.

Actual results:
There is no manageable way for end users to restart a crashed Windows VM when the host agent is not reporting back to the platform.


Expected results:
A manual or forced power-off of the VM without deleting it.

Additional info:

Comment 1 sgott 2022-01-17 21:22:38 UTC
virtctl has flags (--grace-period 0 --force) that should halt the machine. Did you try them?

Comment 2 pmoses 2022-01-18 13:48:10 UTC
Created attachment 1851600 [details]
launcher pod log

Comment 3 pmoses 2022-01-18 13:56:47 UTC
Created attachment 1851601 [details]
UI details

Comment 4 pmoses 2022-01-18 13:59:03 UTC
Yes. It seems the --force and --grace-period flags are only valid with restart. Either way, the results are the same:

[pmo@pmo-rhel ~]$ virtctl version
Client Version: version.Info{GitVersion:"v0.30.7", GitCommit:"af8ac92fbb1fc4c1c4fda6a2d6ddb04eaded797e", GitTreeState:"clean", BuildDate:"2021-06-07T10:07:04Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

[pmo@pmo-rhel ~]$ virtctl restart win10 --force --grace-period=0
Error restarting VirtualMachine, Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual restart requests

[pmo@pmo-rhel ~]$ virtctl stop win10 --grace-period=0 --force
unknown flag: --grace-period

[pmo@pmo-rhel ~]$ virtctl stop win10
Error stopping VirtualMachine Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual stop requests

Comment 5 sgott 2022-01-19 13:38:36 UTC
Raising the severity because this is hard to work around once it has been triggered. A workaround exists, but it requires deleting the pod.

The real bug here is that KubeVirt does not honor a second halt request when the user issues a newer, shorter timeout. It should.
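
The behavior asked for here can be pictured with a small decision rule. This is only an illustration of the idea, not KubeVirt's implementation: the function name and the "none" sentinel are hypothetical, and grace periods are assumed to be whole seconds.

```shell
#!/bin/sh
# Hypothetical rule: accept a repeated stop request only when it asks for a
# strictly shorter grace period than the stop that is already pending.
# Usage: should_accept_stop <pending_seconds|none> <requested_seconds>
should_accept_stop() {
    pending="$1"; requested="$2"
    if [ "$pending" = "none" ]; then
        return 0    # no stop is pending, so the request is always accepted
    fi
    # Exit status 0 (accept) only if the new timeout is shorter.
    [ "$requested" -lt "$pending" ]
}
```

Under this rule a pending 180-second stop followed by a --grace-period=0 request is accepted, while an identical repeat is still rejected.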

Comment 7 Germano Veit Michel 2022-02-17 03:41:18 UTC
(In reply to sgott from comment #5)
> The real bug here is that KubeVirt should honor a second halt request if the
> user issues a newer shorter timeout.

One interesting thing: if the VM is stuck on boot (e.g. paused at SeaBIOS), the second halt request returns the same error in the CLI, but the VM is actually shut down immediately.
This is on 4.9.21 with 4.9.2, windows vm.

Unfortunately, deleting the virt-launcher pod does not work: the pod is gone but the VMI is still there.

# oc get vmi
NAME                    AGE   PHASE     IP            NODENAME                          READY
win2k16-happy-pelican   11m   Running   10.129.2.37   worker-1.lab-cluster.toca.local   False
# oc get pods | grep virt-launcher
#

That VMI stays there and is never cleaned up. Force-deleting it does not work either; the command hangs forever without doing anything.

# oc delete vmi win2k16-happy-pelican --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
virtualmachineinstance.kubevirt.io "win2k16-happy-pelican" force deleted
^C

The only thing I can find that really works and makes the cleanup happen is to finish the job that was initially started: kill the qemu process on the node.
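
A sketch of how the qemu process might be located on the node before killing it. It assumes the QEMU command line contains guest=<namespace>_<vmi-name> and that oc debug access to the node is available. The helper name is hypothetical, and it only prints the lookup command rather than running or killing anything.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a command that lists the QEMU
# process of one VMI on a node.
# Assumption: the QEMU command line includes "guest=<namespace>_<vmi-name>".
qemu_lookup_command() {
    node="$1"; ns="$2"; vmi="$3"
    pattern="guest=${ns}_${vmi}"
    echo "oc debug node/${node} -- chroot /host pgrep -af '${pattern}'"
}
```

For the VMI shown above, qemu_lookup_command worker-1.lab-cluster.toca.local default win2k16-happy-pelican prints the lookup command (the default namespace is a guess); the PID it reports is what would then be killed by hand.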

Comment 9 ctomasko 2022-03-15 22:21:04 UTC
Added Release note > known issue

You cannot attempt to stop a VM multiple times because KubeVirt prevents multiple stop attempts. If a VM crashes during shutdown, then you cannot issue a new stop attempt and you cannot easily remove the VM from the UI. (BZ#2040766)

https://github.com/openshift/openshift-docs/pull/42530
https://deploy-preview-42530--osdocs.netlify.app/openshift-enterprise/latest/virt/virt-4-10-release-notes#virt-4-10-known-issues

Future link: After OpenShift Virtualization 4.10 is released, you can find the release notes here: https://docs.openshift.com/container-platform/4.10/virt/virt-4-10-release-notes.html
or on the portal,
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10

Comment 13 zhe peng 2022-06-28 07:51:33 UTC
Verified with build:
Server Version: 4.11.0-fc.3
$ virtctl version
Client Version: version.Info{GitVersion:"v0.53.2-16-gd3854bb91", GitCommit:"d3854bb91a447946d3ef626f243e001c4766d5a4", GitTreeState:"clean", BuildDate:"2022-06-19T10:27:57Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.53.2-37-gd8a6ac7e7", GitCommit:"d8a6ac7e78042ed77d99601fce197cae58d16f5a", GitTreeState:"clean", BuildDate:"2022-06-26T10:19:51Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}

Steps:
1. Create a Windows VM.
2. Start the VM; inside the VM, run the command "TASKKILL /IM svchost.exe /F" to trigger a Windows BSOD.
3. Use virtctl to stop or restart the VM.
stop-1:
$ virtctl stop vm-win10 --grace-period=0 --force
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   31m   Stopped   False
stop-2:
$ virtctl stop vm-win10
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   33m   Stopped   False
restart:
$ virtctl restart vm-win10 --force --grace-period=0
VM vm-win10 was scheduled to restart
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   27m   Running   True

Also tested VMs with the RunStrategy setting: "Manual" and "Halted" both worked as expected.
Moving to verified.
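
The manual check above (issue a stop, then look at oc get vm) could be scripted along these lines. This is a sketch: wait_for_status and vm_status are hypothetical helper names, and it assumes the STATUS column is backed by .status.printableStatus on the VirtualMachine object.

```shell
#!/bin/sh
# Poll a status-producing command until it prints the wanted status.
# Usage: wait_for_status <wanted> <attempts> <sleep_seconds> <command...>
wait_for_status() {
    wanted="$1"; attempts="$2"; delay="$3"
    shift 3
    i=0
    status=""
    while [ "$i" -lt "$attempts" ]; do
        status="$("$@" 2>/dev/null)"
        if [ "$status" = "$wanted" ]; then
            echo "$status"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out; last status: ${status:-unknown}" >&2
    return 1
}

# Hypothetical wrapper around oc.
# Assumption: printableStatus holds the value shown in the STATUS column.
vm_status() {
    oc get vm "$1" -o jsonpath='{.status.printableStatus}'
}
```

For example: virtctl stop vm-win10 --grace-period=0 --force && wait_for_status Stopped 30 2 vm_status vm-win10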

Comment 15 errata-xmlrpc 2022-09-14 19:28:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526

