Description of problem:
Deleting a VM just after it turns "Running" is very likely to hit the grace period timeout.

Version-Release number of selected component (if applicable):
CNV 2.6.0

How reproducible:
Very frequently

Steps to Reproduce:
1. Create a VM
2. Delete it just after it becomes "Running"
3. The VM is not deleted until terminationGracePeriodSeconds ends; for Windows it is 3600s

Actual results:
The VM is only deleted after terminationGracePeriodSeconds elapses.

Expected results:
The VM can be deleted very quickly.

Additional info:
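For context, the field in question sits on the VM template spec. Below is only a sketch of where it lives (the real manifest is in the attached example; the name and everything except the 3600s Windows default mentioned in this report are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-win10            # illustrative name
spec:
  running: true
  template:
    spec:
      # Deletion waits up to this long for the guest to answer the
      # ACPI shutdown signal before the domain is killed.
      terminationGracePeriodSeconds: 3600   # Windows common-template default per this report
      domain:
        devices: {}
```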
Sorry, I can't understand the issue here. Does the termination take longer than the graceful termination period?

How would you expect the VM to be deleted quickly while it has a termination period configured on it?

TIA,
Igor
Please provide a manifest or example VM that triggers this behavior. Without any further details, this BZ will be very hard to isolate/diagnose.
Created attachment 1759399 [details] example vm yaml
(In reply to Igor Bezukh from comment #1)
> Sorry, I can't understand the issue here. Does the termination take longer
> than the graceful termination period?

No, the VM is deleted when it reaches terminationGracePeriodSeconds.

> How would you expect the VM to be deleted quickly while it has a termination
> period configured on it?

If the VM is fully up, e.g. the VM login prompt shows, deleting the VM only takes a few seconds. If the VM is deleted just after it turns Running, it takes a very long time; for a Windows VM, it takes an hour to get the VM deleted.

Every VM created from the common template has the termination period configured; does it make sense to take an hour to delete a running VM?
I'm leaning toward this being expected behavior. In order for a guest to be shut down gracefully, ACPI events are used. In this workflow, the user deliberately attempts to shut the guest down before it's even running (read: it won't/can't respond to ACPI events). Because the guest can't respond to ACPI when the signal is sent, the shutdown times out. This is exactly why a timeout is implemented: in case something goes wrong while trying to shut the guest down.

Lowering the severity to low, as there's no danger of data loss and this doesn't really sound like a realistic workflow most users will engage in. It's unfortunate that the timeout is a full hour for Windows guests, but the system is doing exactly what we asked it to: wait an entire hour before killing the guest. Because Windows applies updates during shutdown, lowering this default timeout is not really a good strategy, by the way.
(In reply to Guohua Ouyang from comment #4)

Thank you for the clarification.

> > Sorry, I can't understand the issue here. Does the termination take longer
> > than the graceful termination period?
>
> No, the VM is deleted when it reaches terminationGracePeriodSeconds.

Got you.

> > How would you expect the VM to be deleted quickly while it has a termination
> > period configured on it?
>
> If the VM is fully up, e.g. the VM login prompt shows, deleting the VM only
> takes a few seconds. If the VM is deleted just after it turns Running, it
> takes a very long time; for a Windows VM, it takes an hour to get the VM
> deleted.
>
> Every VM created from the common template has the termination period
> configured; does it make sense to take an hour to delete a running VM?

I can give you a simple example: you can create a VM without bootable media and run it. What will happen is that you will see the Running status on the VM, but when you try to delete it or trigger the virtctl shutdown command you will always hit the graceful timeout period. When there is no one to catch the ACPI shutdown event, you will always hit the graceful period.

So if you want us to support a force shutdown command, please open an RFE; otherwise I can't see a bug in the scenario you've described here.
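The no-bootable-media example above can be sketched as a manifest like the following (entirely illustrative, assuming a guest that never boots an OS and therefore never answers the ACPI shutdown signal):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-no-boot-media    # hypothetical name
spec:
  running: true
  template:
    spec:
      terminationGracePeriodSeconds: 3600
      domain:
        devices:
          disks: []         # no bootable media: the VMI shows Running,
                            # but nothing in the guest handles ACPI
        resources:
          requests:
            memory: 1Gi
```

Deleting such a VM will always wait out the full grace period, which is the same symptom described in this report.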
@Guohua, in your flow, does the VM end up running, and can you log in via the console? If so, did you see shutdown retries ("Signaled graceful shutdown for <vm name>") in the virt-handler logs?
(In reply to Ruth Netser from comment #7)
> @Guohua, in your flow, does the VM end up running, and can you log in via
> the console?

The VMI ends up in Running state and the VNC/serial console is available for login.

> If so, did you see shutdown retries ("Signaled graceful shutdown for <vm
> name>") in the virt-handler logs?

Yes, it keeps reporting this message in the virt-handler logs:

{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1781","timestamp":"2021-03-03T01:05:18.933466Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
It keeps looping the messages below in the virt-handler logs for one hour:

{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1307","timestamp":"2021-03-03T01:18:29.887671Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1312","timestamp":"2021-03-03T01:18:29.887716Z","uid":""}
{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep","namespace":"default","pos":"vm.go:1781","timestamp":"2021-03-03T01:18:29.891856Z","uid":"312d804c-6e16-4df6-b9bc-c3a3897c6181"}
{"component":"virt-handler","level":"info","msg":"Processing event default/win10-sore-sheep","pos":"vm.go:1305","timestamp":"2021-03-03T01:18:34.892799Z"}
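A quick way to confirm the retry loop from a saved copy of the virt-handler log is to count the graceful-shutdown signal lines. The sample lines below are abbreviated stand-ins for the entries pasted above; in practice you would feed the real log file instead:

```shell
# Write a small sample log (abbreviated versions of the entries above).
cat > /tmp/virt-handler-sample.log <<'EOF'
{"component":"virt-handler","level":"info","msg":"VMI is in phase: Running\n","name":"win10-sore-sheep"}
{"component":"virt-handler","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep"}
{"component":"virt-handler","level":"info","msg":"Processing event default/win10-sore-sheep"}
{"component":"virt-handler","level":"info","msg":"Signaled graceful shutdown for win10-sore-sheep","name":"win10-sore-sheep"}
EOF

# Count how many times the graceful-shutdown signal was (re)sent:
grep -c 'Signaled graceful shutdown' /tmp/virt-handler-sample.log   # prints 2
```

If that count keeps growing for the duration of the grace period, the handler is retrying the ACPI signal without the guest ever acknowledging it.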
@Igor, based on Guohua's information, I would expect the guest to shut down once it has booted and received the ACPI event. Could it be that in this specific flow something causes the event not to reach the OS?
Thank you for elaborating on the issue. @Guohua, is there a chance you can reproduce this issue, but this time add a label called "DebugLogs: true" to the VM resource, and when it happens, please take the virt-launcher logs? Thank you!
Created attachment 1762568 [details] virt-launcher.log

Reproduced and provided the virt-launcher logs.
I think this bug is solved by the two PRs that also solved https://bugzilla.redhat.com/show_bug.cgi?id=1900631. We have changed it so that multiple calls to graceful shutdown can take place, as well as added a `virtctl stop --force --grace-period 0` option.
Targeting this to the next release, as we believe this is now fixed per comment #13.
To verify:

Steps to Reproduce:
1. Create a VM
2. Delete it just after it becomes "Running"
3. VM should be deleted after terminationGracePeriodSeconds
Verified with build HCO v4.9.0-144.

Steps:
1. Create a VM
2. Start the VM
3. Stop the VM just after it becomes "Running":

$ virtctl stop vm-win10 --force --grace-period=0
VM vm-win10 was scheduled to stop

4. The VM stopped after the (zero) terminationGracePeriodSeconds: within about 1 second the VM changed to Stopped and the pod was removed.

$ oc get vm vm-win10
vm-win10   3h29m   Stopped   False

Tested both RHEL and Windows VMs.

Also tested deleting the VM:

$ oc delete vm vm-win10 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
virtualmachine.kubevirt.io "vm-win10" force deleted

Checked the VM; it has been removed.

Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104