Bug 2040766
| Summary: | A crashed Windows VM cannot be restarted with virtctl or the UI | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | pmoses |
| Component: | Virtualization | Assignee: | Prita Narayan <prnaraya> |
| Status: | CLOSED ERRATA | QA Contact: | zhe peng <zpeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.8.8 | CC: | acardace, cnv-qe-bugs, ctomasko, fdeutsch, gveitmic, kbidarka, sgott, ycui |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | All | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry-container-v4.11.0-491 | Doc Type: | Known Issue |
| Doc Text: | KubeVirt prevents a VM stop request from being processed multiple times. As a consequence, if a VM hangs during shutdown, it is not possible to issue a new request for immediate shutdown, for example, by using the `--force --grace-period 0` flags. A VM stuck in the terminating state cannot be easily stopped from the UI. However, it is possible to directly delete the virt-launcher pod. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-14 19:28:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (pmoses, 2022-01-14 16:29:54 UTC)

Created attachment 1851600 [details]: launcher pod log

Created attachment 1851601 [details]: UI details

Comment: There exist flags for virtctl (`--grace-period 0 --force`) that should halt the machine. Did you try that?
Yes. It seems the `--force` and `--grace-period` flags are only valid with `restart`. Either way, the results are the same:

```
[pmo@pmo-rhel ~]$ virtctl version
Client Version: version.Info{GitVersion:"v0.30.7", GitCommit:"af8ac92fbb1fc4c1c4fda6a2d6ddb04eaded797e", GitTreeState:"clean", BuildDate:"2021-06-07T10:07:04Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
[pmo@pmo-rhel ~]$ virtctl restart win10 --force --grace-period=0
Error restarting VirtualMachine, Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual restart requests
[pmo@pmo-rhel ~]$ virtctl stop win10 --grace-period=0 --force
unknown flag: --grace-period
[pmo@pmo-rhel ~]$ virtctl stop win10
Error stopping VirtualMachine Operation cannot be fulfilled on virtualmachine.kubevirt.io "win10": Halted does not support manual stop requests
```
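As discussed later in this bug, the usual workaround for a VM stuck in this state is to delete the virt-launcher pod directly. A minimal sketch, assuming a VM named `win10`; the generated pod name shown here is illustrative, and the `kubevirt.io/domain` label is the conventional way to find the launcher pod for a given VM:

```shell
# Find the virt-launcher pod backing the VM. Pod names are generated,
# so select by the kubevirt.io/domain label instead of guessing the name.
oc get pods -l kubevirt.io/domain=win10

# Force-delete the launcher pod. This usually tears down the guest,
# though as reported later in this bug, the VMI object can be left behind.
oc delete pod virt-launcher-win10-abcde --grace-period=0 --force
```

Note that this bypasses KubeVirt's normal shutdown flow, so it should be treated as a last resort rather than a supported stop path.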
Raising the severity of this because it's hard to avoid once it has been triggered. It can be done, but that requires deleting the pod. The real bug here is that KubeVirt should honor a second halt request if the user issues a newer, shorter timeout.

(In reply to sgott from comment #5)
> The real bug here is that KubeVirt should honor a second halt request if the user issues a newer shorter timeout.

One interesting thing: if the VM is stuck on boot (i.e. paused in SeaBIOS), the second halt request returns the same error in the CLI, but the VM is actually shut down immediately.

This is on 4.9.21 with 4.9.2, Windows VM. Unfortunately, deleting the virt-launcher pod does not work here: the pod is gone, but the VMI is still there.

```
# oc get vmi
NAME                    AGE   PHASE     IP            NODENAME                          READY
win2k16-happy-pelican   11m   Running   10.129.2.37   worker-1.lab-cluster.toca.local   False
# oc get pods | grep virt-launcher
#
```

That VMI stays there and is not cleaned up. Force-deleting it does not work either; it hangs forever without doing anything.

```
# oc delete vmi win2k16-happy-pelican --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
virtualmachineinstance.kubevirt.io "win2k16-happy-pelican" force deleted
^C
```

The only thing I can find that really works and makes the cleanup happen is to finish the job that was initially started: kill the qemu process on the node.

Added release note > known issue:

You cannot attempt to stop a VM multiple times because KubeVirt prevents multiple stop attempts. If a VM crashes during shutdown, then you cannot issue a new stop attempt, and you cannot easily remove the VM from the UI. (BZ#2040766)

https://github.com/openshift/openshift-docs/pull/42530
https://deploy-preview-42530--osdocs.netlify.app/openshift-enterprise/latest/virt/virt-4-10-release-notes#virt-4-10-known-issues

Future link: after OpenShift Virtualization 4.10 releases, you can find the release notes at https://docs.openshift.com/container-platform/4.10/virt/virt-4-10-release-notes.html or on the portal at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10

Verified with build:
```
Server Version: 4.11.0-fc.3

$ virtctl version
Client Version: version.Info{GitVersion:"v0.53.2-16-gd3854bb91", GitCommit:"d3854bb91a447946d3ef626f243e001c4766d5a4", GitTreeState:"clean", BuildDate:"2022-06-19T10:27:57Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.53.2-37-gd8a6ac7e7", GitCommit:"d8a6ac7e78042ed77d99601fce197cae58d16f5a", GitTreeState:"clean", BuildDate:"2022-06-26T10:19:51Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
```
Steps:
1. Create a Windows VM.
2. Start the VM; within the VM, run `TASKKILL /IM svchost.exe /F` to trigger a Windows BSoD.
3. Use virtctl to stop or restart the VM.
stop-1:

```
$ virtctl stop vm-win10 --grace-period=0 --force
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   31m   Stopped   False
```

stop-2:

```
$ virtctl stop vm-win10
VM vm-win10 was scheduled to stop
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   33m   Stopped   False
```

restart:

```
$ virtctl restart vm-win10 --force --grace-period=0
VM vm-win10 was scheduled to restart
$ oc get vm
NAME       AGE   STATUS    READY
vm-win10   27m   Running   True
```
Also tested a VM with the RunStrategy setting: "Manual" and "Halted" both worked as expected.

Moving to verified.
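The RunStrategy cases mentioned above are set via `spec.runStrategy` on the VirtualMachine object. A minimal fragment as a sketch, assuming a VM named `vm-win10` (the template body is elided); note that `runStrategy` and the legacy `spec.running` field are mutually exclusive:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-win10
spec:
  # Valid values include Always, RerunOnFailure, Manual, and Halted.
  # Manual: the VM starts and stops only on explicit user requests.
  # Halted: the VM is kept off; per the errors earlier in this bug,
  # it rejects manual stop/restart requests.
  runStrategy: Manual
  template:
    # VM template (domain, devices, volumes) omitted for brevity
```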
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526