Bug 1933043 - Delete VM just after it turns into "running" is very likely to hit grace period end
Summary: Delete VM just after it turns into "running" is very likely to hit grace peri...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.6.0
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.9.0
Assignee: sgott
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-25 14:47 UTC by Guohua Ouyang
Modified: 2021-11-02 15:57 UTC (History)
6 users (show)

Fixed In Version: virt-operator-container-v4.9.0-25 hco-bundle-registry-container-v4.9.0-89
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-02 15:57:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
example vm yaml (6.05 KB, text/plain)
2021-02-26 00:32 UTC, Guohua Ouyang
no flags Details
virt-launcher.log (53.53 KB, text/plain)
2021-03-11 09:13 UTC, Guohua Ouyang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 5691/files 0 None None None 2021-05-21 11:00:28 UTC
Red Hat Product Errata RHSA-2021:4104 0 None None None 2021-11-02 15:57:41 UTC

Description Guohua Ouyang 2021-02-25 14:47:18 UTC
Description of problem:
Delete VM just after it turns into "running" is very likely to hit grace period timeout.

Version-Release number of selected component (if applicable):
CNV 2.6.0

How reproducible:
very frequently

Steps to Reproduce:
1. Create a VM
2. Delete it just after it become "Running"
3. The VM is not deleted after the terminationGracePeriodSeconds is end, for windows, it's 3600s.

Actual results:
VM is deleted after terminationGracePeriodSeconds.

Expected results:
VM can be deleted very quickly.

Additional info:

Comment 1 Igor Bezukh 2021-02-25 15:40:03 UTC
Sorry I can' understand the issue here. Does the termination takes longer then the graceful termination period?
How would you expect the VM to be deleted quickly while it has termination period configured on it? 

TIA
Igor

Comment 2 sgott 2021-02-25 18:31:47 UTC
Please provide a manifest or example VM that triggers this behavior. Without any further details, this BZ will be very hard to isolate/diagnose.

Comment 3 Guohua Ouyang 2021-02-26 00:32:16 UTC
Created attachment 1759399 [details]
example vm yaml

Comment 4 Guohua Ouyang 2021-02-26 00:44:26 UTC
(In reply to Igor Bezukh from comment #1)
> Sorry I can' understand the issue here. Does the termination takes longer
> then the graceful termination period?

No, the VM is deleted when it reaches the terminationGracePeriodSeconds.

> How would you expect the VM to be deleted quickly while it has termination
> period configured on it? 

If the VM is full up, like VM login prompt shows up, delete the VM only takes a few seconds.
If delete the VM just after it turns running, it takes very long time, for windows VM, it takes an hour to get the VM deleted. 

Every VM created from the common template has the termination period configured, does it make sense to take an hour to delete a running VM?

> 
> TIA
> Igor

Comment 5 sgott 2021-02-28 21:56:12 UTC
I'm leaning toward this being expected behavior. In order for a guest to be shut down gracefully, ACPI events are used. In this workflow, the user deliberately attempts to shut the guest down before it's even running (read: won't/can't respond to ACPI events). Because the guest can't respond to ACPI when the signal is sent--it will time out. This is exactly why a timeout is implemented--in case something went wrong while trying to shut the guest down.

Lowering the severity to low as there's no danger of data loss and this doesn't really sound like a realistic workflow most users will engage in. It's unfortunate that the timeout is a full hour for Windows guests, but the system is literally doing exactly what we asked it to--wait an entire hour before killing the guest. Because Windows applies updates during shutdown, it's not really a good (default) strategy to lower this timeout by the way.

Comment 6 Igor Bezukh 2021-03-02 15:14:55 UTC
(In reply to Guohua Ouyang from comment #4)

Thank you for the clarification

> (In reply to Igor Bezukh from comment #1)
> > Sorry I can' understand the issue here. Does the termination takes longer
> > then the graceful termination period?
> 
> No, the VM is deleted when it reaches the terminationGracePeriodSeconds.
> 

Got you.

> > How would you expect the VM to be deleted quickly while it has termination
> > period configured on it? 
> 
> If the VM is full up, like VM login prompt shows up, delete the VM only
> takes a few seconds.
> If delete the VM just after it turns running, it takes very long time, for
> windows VM, it takes an hour to get the VM deleted. 
> 
> Every VM created from the common template has the termination period
> configured, does it make sense to take an hour to delete a running VM?

I can give you a simple example - you can create a VM without bootable media and run it.
What will happen is that you will see the running status on the VM, but when you try to delete it or trigger the virtctl shutdown command
you will always hit the graceful timeout period. When there is no one to catch the ACPI shutdown event, you will always hit the graceful period.
So if you want us to support a force shutdown command, please open RFE, otherwise I can't see a bug in the scenario you've described here. 

> 
> > 
> > TIA
> > Igor

Comment 7 Ruth Netser 2021-03-02 16:57:04 UTC
@Guohua in your flow, does the VM end up running and you can login via console?
If so, did you see shutdown retries - "Signaled graceful shutdown for <vm name>" in virt-handler logs?

Comment 8 Guohua Ouyang 2021-03-03 01:22:19 UTC
(In reply to Ruth Netser from comment #7)
> @Guohua in your flow, does the VM end up running and you can login via
> console?
The VMI end up in running state and VNC/Serial console is available for login.

> If so, did you see shutdown retries - "Signaled graceful shutdown for <vm
> name>" in virt-hadler logs?
Yes, It keeps reporting this message in virt-handler logs.
"{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1781","timestamp":"2021-03-03T01:05:18.933466Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
"

Comment 9 Guohua Ouyang 2021-03-03 01:26:06 UTC
it keeps loop below message in virt-handler logs for one hour.

{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1307","timestamp":"2021-03-03T01:18:29.887671Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1312","timestamp":"2021-03-03T01:18:29.887716Z","uid":""}
{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1781","timestamp":"2021-03-03T01:18:29.891856Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
{"component":"virt-handler","level":"info","msg":"Processing event default/win10-sore-sheep","pos":"vm.go:1305","timestamp":"2021-03-03T01:18:34.892799Z"}

Comment 10 Ruth Netser 2021-03-03 07:35:29 UTC
@Igor, based on Guohua's infomration, I would expect the gyest to shut down once it has booted and received the ACPI event.
Could it be that in this specific flow something causes the event not to reach the OS?

Comment 11 Igor Bezukh 2021-03-11 08:18:56 UTC
Thank you for the elaboration of the issue. 
@Guohua is there a chance you can reproduce this issue, but this time add a label to the VM resource called "DebugLogs: true" and when it happens can you please s take virt-launcher logs?

Thank you!

Comment 12 Guohua Ouyang 2021-03-11 09:13:13 UTC
Created attachment 1762568 [details]
virt-launcher.log

Reproduced and provides virt-launcher logs

Comment 13 aschuett 2021-06-18 08:24:56 UTC
I think this bug is solved by the two PRs that also solved https://bugzilla.redhat.com/show_bug.cgi?id=1900631 

We have changed it so that multiple calls to graceful shutdown can take place, as well as adding a `virtctl stop --force --grace-period 0`

Comment 14 sgott 2021-07-12 17:52:18 UTC
Targetting this to the next release as we believe this is now fixed. Per Comment #13.

Comment 15 sgott 2021-08-10 14:24:11 UTC
To verify:

Steps to Reproduce:
1. Create a VM
2. Delete it just after it become "Running"
3. VM should be deleted after terminationGracePeriodSeconds.

Comment 16 zhe peng 2021-08-30 05:39:01 UTC
verify with build:
HCO:[v4.9.0-144]

step :
1. create vm
2. start vm 
3. stop vm just after it become "Running"
$virtctl stop vm-win10 --force --grace-period=0
VM vm-win10 was scheduled to stop
4. vm stopped after terminationGracePeriodSeconds
wait 1 sec, the vm change to Stopped, and the pod is removed.
$oc get vm
vm-win10    3h29m   Stopped   False

test both rhel and windows vm. 
also test delete vm

$ oc delete vm vm-win10 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
virtualmachine.kubevirt.io "vm-win10" force deleted

check the vm, is been removed.

move to verified.

Comment 20 errata-xmlrpc 2021-11-02 15:57:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104


Note You need to log in before you can comment on or make changes to this bug.