Bug 2027959

Summary: [RFE] virt-launcher pod of Windows VM stuck in terminating state, no button in the UI to force power off
Product: Container Native Virtualization (CNV)
Reporter: Germano Veit Michel <gveitmic>
Component: User Experience
Assignee: Ugo Palatucci <upalatuc>
Status: CLOSED ERRATA
QA Contact: Guohua Ouyang <gouyang>
Severity: high
Priority: high
Version: 4.10.0
CC: acardace, alitke, aos-bugs, fdeutsch, gouyang, mschatzm, phbailey, sgott, tnisan, upalatuc, ycui, yfrimanm, yzamir
Target Milestone: ---
Keywords: Reopened
Target Release: 4.14.0
Flags: gouyang: needinfo-
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2023-08-01 14:51:33 UTC
Type: Bug
Bug Depends On: 2182333

Attachments:
- The text in the popover should be reviewed by the docs team (flags: none)
- The text in the popover should be reviewed by the docs team (flags: none)
- Updated (flags: none)
- No need for the units dropdown (flags: none)

Description Germano Veit Michel 2021-12-01 07:06:40 UTC
Description of problem:

It seems we have a very high terminationGracePeriodSeconds on Windows VMs templates.

For example:

# oc get template windows10-desktop-medium -n openshift -o yaml | grep -i grace
        terminationGracePeriodSeconds: 3600

# oc get template windows2k19-highperformance-large -n openshift -o yaml | grep -i grace
        terminationGracePeriodSeconds: 3600

While RHEL has a more reasonable 180:

# oc get template rhel8-server-medium -n openshift -o yaml | grep -i grace
        terminationGracePeriodSeconds: 180

This is annoying if the Virtual Machine fails to shut down gracefully: the user needs to intervene manually or wait 1h.

Version-Release number of selected component (if applicable):
4.9.0

How reproducible:
100%

Steps to Reproduce:
1. Create a Windows Virtual Machine using random bytes, a corrupted disk file, or any other unbootable image, so it gets stuck in SeaBIOS and won't handle the shutdown signal.
2. Start VM
3. Stop or Delete the VM

Actual results:
* Pod stuck in terminating for 1h

Expected results:
* A more reasonable timeout, so the user does not need to intervene.

Comment 1 sgott 2021-12-01 13:21:12 UTC
The default for Windows is deliberately set to a high value. The rationale is that Windows tends to apply updates during shutdown, and setting a lower value by default can lead to data corruption.

If a user prefers a lower value, a custom template can be created.
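For example, a minimal sketch of lowering the value on an existing VM (win10 is a hypothetical VM name; a custom template would set the same field in its VM object):

$ oc patch vm win10 --type=merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":180}}}}'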

Additionally, it's possible to pass the --grace-period argument when deleting a resource.
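For example (hypothetical VM name):

$ oc delete vm win10 --grace-period=60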

I'm closing this BZ based on the above. If you think this is in error, please feel free to re-open.

Comment 2 Germano Veit Michel 2021-12-01 21:59:05 UTC
CNV could do something a little more elaborate, like RHV does.

If the VM is not considered Up (regardless of OS), RHV kills it immediately, without waiting.

The question is then what is an Up VM:

    private boolean canShutdownVm() {
        // A graceful shutdown is only attempted when the VM is Up (or is
        // migrating and was Up before the migration) AND it can actually
        // receive the request, i.e. ACPI is enabled or the guest agent is up.
        return (getVm().getStatus() == VMStatus.Up
                || getVm().getStatus().isMigrating() && getVmManager().getLastStatusBeforeMigration() == VMStatus.Up)
                && (Boolean.TRUE.equals(getVm().getAcpiEnable()) || getVm().getHasAgent());
    }

After starting a VM, it is considered 'Up' once the guest agent connects or, for VMs without a guest agent, once a timeout (GUEST_WAIT_TIMEOUT) is reached.

So if a VM fails to boot, I can quickly power it off again via the UI without waiting 1h, running manual commands, or eventually having a customer contact support just to shut down a VM that is stuck powering off for 1h.

Comment 3 Germano Veit Michel 2021-12-20 00:47:07 UTC
Stuart, sorry to re-open this, but it's highly annoying.

I've just hit it again on a failed Windows install: trying to shut the VM down during a Windows install from ISO (I forgot to set a specific setting I wanted) hits the same problem, as the OS won't shut down by itself.

Also, support frequently helps customers when an upgrade makes the guest fail to boot (both Windows and Linux) and it gets stuck on the boot loader or similar: the guest must be powered off and a CD used for recovery. Waiting 1h for Windows is not nice, and even the 3m for Linux is not great.

Could you please take a look at comment #2? Or maybe add a "Power Off" button to the console plugin.

Thank you!

Comment 4 sgott 2021-12-20 13:53:09 UTC
Tal,

Given that a mechanism to shut down immediately already exists in the CLI, is this something the UI could incorporate, per comment #3?

Comment 5 sgott 2021-12-21 13:09:30 UTC
Per the conversation so far, it appears the request here is to create a UI workflow, thus re-assigning to the UX component. Please feel free to re-assign if this action appears to be in error.

Comment 6 Yifat Menchik 2022-01-18 17:08:35 UTC
Created attachment 1851667 [details]
The text in the popover should be reviewed by the docs team

Comment 7 Yifat Menchik 2022-01-18 17:09:53 UTC
Created attachment 1851668 [details]
The text in the popover should be reviewed by the docs team

Comment 8 Yifat Menchik 2022-01-19 08:57:26 UTC
Created attachment 1851815 [details]
Updated

Comment 9 Yifat Menchik 2022-01-19 08:58:06 UTC
Created attachment 1851816 [details]
No need for the units dropdown

Comment 11 Yifat Menchik 2022-01-25 09:05:27 UTC
@sgott We would like to implement a "force delete" action for a terminating VM and wondered whether it is possible to send a new delete (API call) to a VM that is stuck in the delete process.
Thanks.

Comment 12 Guohua Ouyang 2022-01-25 10:09:13 UTC
Please note it's not the VM that is stuck in "Terminating"; it's the pod and PVC.
Tested the posted PR and still see the same thing: the pod and PVC stay in "Terminating".

Comment 13 sgott 2022-01-25 13:12:00 UTC
Hi Yifat, we're tracking the request in Comment #11 here:

https://bugzilla.redhat.com/show_bug.cgi?id=2040766

The basic issue here is that the API rejects deletion requests if a deletion has already been requested. When the user is attempting to force deletion of a VM stuck in terminating, the request should be honored.

Comment 14 Yifat Menchik 2022-01-25 14:08:53 UTC
Thank you Stu. As the UI depends on the solution for https://bugzilla.redhat.com/show_bug.cgi?id=2040766, we would like to know which release it is targeted for.

Comment 15 Guohua Ouyang 2022-01-26 11:13:11 UTC
Just tested: if the pod is stuck in "Terminating", deleting the virt-launcher pod from the CLI does not work either.

$ oc delete pod virt-launcher-win10-israeli-lamprey-m2z8g --grace-period=30
The above command does not delete the pod after 30s.

So this is a virt bug, as '--grace-period' is not working.

Reproduce steps:
1. create a Windows VM from the UI, check "This is a CD-ROM boot source" on the boot source step
2. wait until the VM is running
3. delete the VM
4. the VM is deleted, but the virt-launcher pod is stuck in "terminating"
5. delete the pod from the CLI, specifying a grace period
6. the pod is not deleted after the grace period
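For the record, the usual last-resort workaround for a pod stuck in Terminating is a force delete, which removes the object from the API without waiting for the kubelet to confirm termination:

$ oc delete pod virt-launcher-win10-israeli-lamprey-m2z8g --grace-period=0 --force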

Comment 16 Yaacov Zamir 2022-01-26 12:54:24 UTC
moving to new, because we need a fix from virt team

Comment 17 sgott 2022-01-26 14:42:24 UTC
I'm confused by the re-assignment of this BZ. We'll be fixing the --grace-period issue here https://bugzilla.redhat.com/show_bug.cgi?id=2040766

If this BZ is now being considered the same thing (per comment #15), then is it a duplicate?

Comment 18 Yaacov Zamir 2022-01-26 22:39:28 UTC
Hi

So we have 2 issues that prevented us from fixing this on the UI side:
a - a VM that is forcefully killed (e.g. when --grace-period is reached) does not clean up its resources (PVCs and pods).
b - once a delete VM request has been sent, a user can't send a force kill even if the VM is stuck while trying to delete.

AFAIU:
a -
This bug is open to track cleaning up after a force kill of the VM:
Actual results:
Pods and PVCs persist after the VM is forcefully shut down.
Expected result:
Pods and PVCs are garbage collected after the VM is forcefully shut down.

b -
We also have another bug to track sending a "please adjust grace-period" request to a VM that is currently being shut down.

c -
I think https://bugzilla.redhat.com/show_bug.cgi?id=2040766 is related, but a slightly different issue.

Comment 19 sgott 2022-02-02 13:14:56 UTC
Yaacov,

You mentioned that pods do not clean up their resources. We expect they do. Do you have a concrete example showing PVCs/pods sticking around after a VM is forcefully killed?

Comment 20 sgott 2022-02-02 13:19:17 UTC
Prita,

We suspect this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2040766, so I'm assigning it to you. I expect this will likely be closed as a duplicate or marked as TestOnly. The only outstanding question is per comment #19.

Comment 21 Yaacov Zamir 2022-02-02 18:12:48 UTC
Hi,
I never tested it myself; I was referring to step 6 in https://bugzilla.redhat.com/show_bug.cgi?id=2027959#c15

```
Reproduce steps:
1. create a Windows VM from the UI, check "This is a CD-ROM boot source" on the boot source step
2. wait until the VM is running
3. delete the VM
4. the VM is deleted, but the virt-launcher pod is stuck in "terminating"
5. delete the pod from the CLI, specifying a grace period
6. the pod is not deleted after the grace period
```

Comment 22 Germano Veit Michel 2022-02-14 23:44:51 UTC
As a result of this BZ, or of whichever bug it is closed as a duplicate of, I'd like at least an easy way (via the UI) for users to power off VMs regardless of their state, or ideally for KubeVirt to handle these VMs automatically by powering them off immediately when no OS is running or the OS is frozen, without waiting for huge timeouts.

Comment 25 Antonio Cardace 2022-04-29 15:53:59 UTC
Per comment #22 moving this to the UI component.

Comment 28 Fabian Deutsch 2022-07-13 08:46:17 UTC
@tnisan @sgott does this depend on any backend functionality? If it does, where does it stand?

Comment 30 sgott 2022-10-27 17:56:55 UTC
Backend functionality was provided in https://github.com/kubevirt/kubevirt/pull/7494

This behavior is present in 4.11.0 and newer: KubeVirt now accepts subsequent stop requests with a shorter terminationGracePeriodSeconds.
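A sketch of what this enables from the CLI, assuming virtctl from a matching release and a hypothetical VM named testvm:

$ virtctl stop testvm                            # first request; uses the template's grace period
$ virtctl stop testvm --force --grace-period=0   # subsequent request; overrides it and stops immediately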

Comment 31 Guohua Ouyang 2022-11-16 07:35:08 UTC
Hi Ugo,
I think this option should be added to "Stop" as well, not only to "Delete". What do you think?

Comment 32 Ugo Palatucci 2022-11-16 08:45:20 UTC
Yes, good idea :-) 
We don't have a stop modal. I'll create it.

Comment 35 Guohua Ouyang 2023-03-09 01:25:29 UTC
The issue still exists in 4.13. Steps to reproduce:
1. create a windows vm from http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/windows-images/raw_images/win11_validationos_amd64.iso
2. wait for the VM to reach Running
3. delete the VM and set the grace period to 60 seconds in the delete modal
4. the VM is not deleted after 60 seconds

Comment 36 Ugo Palatucci 2023-03-29 13:49:50 UTC
@gouyang It seems the delete request with gracePeriodSeconds is sent correctly. Maybe the backend team can reproduce.

{
    "kind": "DeleteOptions",
    "apiVersion": "v1",
    "gracePeriodSeconds": 10
}
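For comparison, the same DeleteOptions body should be produced from the CLI with something like (hypothetical VM name):

$ oc delete vm win10 --grace-period=10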

Comment 37 Guohua Ouyang 2023-03-29 14:23:54 UTC
The bug https://bugzilla.redhat.com/show_bug.cgi?id=2182333 is for the backend; if the backend does not fix it, it makes no sense to add the option in the UI, as it won't work.

If the backend issue can be fixed, then we need to add the same functionality to the Stop action.

Comment 38 Ugo Palatucci 2023-03-30 09:39:33 UTC
@gouyang do we need to add the stop action grace period functionality on this bug?

Comment 39 Guohua Ouyang 2023-03-30 10:34:46 UTC
(In reply to Ugo Palatucci from comment #38)
> @gouyang do we need to add the stop action grace period
> functionality on this bug?

Let's add it anyway; it aligns with the delete modal.

Comment 40 Ugo Palatucci 2023-04-11 14:37:41 UTC
Hi @gouyang, we need to investigate a little more before adding the functionality to the Stop modal.

This is because the parameters for Delete are well defined in DeleteOptions, but Stop has no parameters.

What we are thinking of doing is deleting the VMI using the grace period defined in DeleteOptions, but I have to discuss it with someone from the backend team to see if it's a good strategy.

Comment 41 Ugo Palatucci 2023-04-12 09:12:17 UTC
I was wrong. Stop has parameters, and I've just created a PR to add a grace period to Stop.

Comment 43 Guohua Ouyang 2023-04-24 00:55:59 UTC
Verified on kubevirt-console-plugin-rhel9-container-v4.14.0-1011.

Comment 51 errata-xmlrpc 2023-08-01 14:51:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.5 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:4421