Description of problem:
In the scenario where the control plane is not accessible or is lost, triggering a soft shutdown of the OCP worker nodes does not trigger a soft shutdown of the VMs running in OpenShift Virtualization, causing corruption of the VMs.

Version-Release number of selected component (if applicable):
Tested in OCP 4.7.13 and OpenShift Virtualization 2.6.4

How reproducible:
Every time

Steps to Reproduce:
1. Run a VM via OpenShift Virtualization
2. Trigger a soft shutdown of the worker node via the command line or the server's IPMI interface
3. Observe that no soft shutdown is triggered on the VMs

Actual results:
VMs are turned off instantly instead of being shut down gracefully

Expected results:
A soft shutdown of the VMs should be triggered

Additional info:
It looks like the API for this might still be in beta. https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ We can start working toward this, but we are deferring it to a future release for now.
I would expect VMs to be migrated away from the node which is shutting down. Non-migratable VMs should be shut down. Am I correct?
According to the case https://access.redhat.com/support/cases/#/case/03040834/discussion?commentId=a0a2K00000d90rOQAQ the customer has terminationGracePeriodSeconds: 0 in their VM YAML. We recommend (in the templates we ship) setting this to 100 for Linux VMs and 3600 for Windows VMs. They also have the label os.template.kubevirt.io/rhel8.2: 'true', which probably means that they started with 100 and intentionally changed it to 0. We should find out why, and suggest that they use a higher value to give RHEL a better opportunity to shut down cleanly.
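For reference, this is roughly where the field sits in a VirtualMachine manifest. This is a minimal illustrative sketch, not the customer's actual YAML (the name and resource values are made up):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel8-example
  labels:
    os.template.kubevirt.io/rhel8.2: 'true'
spec:
  running: true
  template:
    spec:
      # Time the guest is given to shut down cleanly before the launcher
      # is killed; our templates ship 100 for Linux and 3600 for Windows.
      terminationGracePeriodSeconds: 100
      domain:
        devices: {}
        resources:
          requests:
            memory: 1Gi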
@aamirian reports that increasing the grace period does not help - the guest is never requested to shut down cleanly. Can we tell why?
Hey @gveitmic,

Thank you very much for your insights. Let me also share some info that may be interesting.

I created a VM and exec'ed bash in it (via k exec -it virt-launcher-vmi-fedora-4h8zx -- bash). I then executed ps to see the processes' stats:

bash-4.4# ps -A -o pid,stat,comm
  PID STAT COMMAND
    1 Ssl  virt-launcher
   15 Sl   virt-launcher
   24 Sl   libvirtd
   25 S    virtlogd
   81 Sl   qemu-kvm
  284 Ss   bash
  291 R+   ps

We can see that virt-launcher is PID 1 and its stat includes "s", which stands for session leader. Looks good.

Furthermore, I SSH'ed into the node that runs the VM and searched for systemd units:

[root@node02 vagrant]# systemctl | grep virt-
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dhandler\x2dtoken\x2d6kp6q.mount  loaded active mounted  /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-handler-token-6kp6q
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-virt-handler-certs
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dserver\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-virt-handler-server-certs
var-lib-kubelet-pods-9fdd0da5\x2d463f\x2d4db5\x2d8db2\x2d8c845c47c4a2-volumes-kubernetes.io\x7esecret-kubevirt\x2dcontroller\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/9fdd0da5-463f-4db5-8db2-8c845c47c4a2/volumes/kubernetes.io~secret/kubevirt-controller-certs
var-lib-kubelet-pods-9fdd0da5\x2d463f\x2d4db5\x2d8db2\x2d8c845c47c4a2-volumes-kubernetes.io\x7esecret-kubevirt\x2dcontroller\x2dtoken\x2dfqt49.mount  loaded active mounted  /var/lib/kubelet/pods/9fdd0da5-463f-4db5-8db2-8c845c47c4a2/volumes/kubernetes.io~secret/kubevirt-controller-token-fqt49
var-lib-kubelet-pods-a0bfbe80\x2d3e3f\x2d46ad\x2d80ac\x2d132f8e89b078-volumes-kubernetes.io\x7esecret-kubevirt\x2doperator\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/a0bfbe80-3e3f-46ad-80ac-132f8e89b078/volumes/kubernetes.io~secret/kubevirt-operator-certs
var-lib-kubelet-pods-a0bfbe80\x2d3e3f\x2d46ad\x2d80ac\x2d132f8e89b078-volumes-kubernetes.io\x7esecret-kubevirt\x2doperator\x2dtoken\x2d6cvjh.mount  loaded active mounted  /var/lib/kubelet/pods/a0bfbe80-3e3f-46ad-80ac-132f8e89b078/volumes/kubernetes.io~secret/kubevirt-operator-token-6cvjh
var-lib-kubelet-pods-b9c4728e\x2df13b\x2d4422\x2db76f\x2d35714873e4aa-volumes-kubernetes.io\x7esecret-kubevirt\x2dtesting\x2dtoken\x2ddx9bg.mount  loaded active mounted  /var/lib/kubelet/pods/b9c4728e-f13b-4422-b76f-35714873e4aa/volumes/kubernetes.io~secret/kubevirt-testing-token-dx9bg
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dapiserver\x2dtoken\x2d7cvpk.mount  loaded active mounted  /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-apiserver-token-7cvpk
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dapi\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-virt-api-certs
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dcerts.mount  loaded active mounted  /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-virt-handler-certs

We can see a lot of KubeVirt components here, but I can't see virt-launcher. Could it be that the problem is that virt-launcher isn't considered a systemd unit for some reason?
(In reply to Itamar Holder from comment #26)
> We can see a lot of KubeVirt components here, but I can't see virt-launcher.
> Could it be that the problem is that virt-launcher isn't considered a systemd unit for some reason?

I don't really know. My understanding is that systemd on the host (not in the container) handles the shutdown of the CRI-O systemd scope of the virtual machine by sending signals to the processes inside it. And since it is a scope, there are some limitations on the options available in the scope definition file for how the scope is killed. We need some systemd help, as you mentioned in the Jira. I don't mind doing the tests again to collect some more data from the systemd side (i.e. with some debugging enabled), but I need to know what to turn on.
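If it helps, one thing we could capture on the node is how the scope that backs the virt-launcher container is configured to be killed. This assumes the systemd cgroup driver is in use, and the crio-<container-id>.scope name is a placeholder:

# List the scopes CRI-O created for running containers
systemctl list-units --type=scope | grep crio
# Show how systemd kills the scope and how long it waits before SIGKILL
systemctl show crio-<container-id>.scope -p KillMode -p KillSignal -p TimeoutStopUSec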
Itamar, IIUIC then https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ tells us that the problem described here (controlled node shutdown, randomly killed containers) is expected behavior up to today. But 1.21 introduces graceful node shutdown as a mitigation to allow containers to shut down cleanly. Thus: did you try how KubeVirt behaves with the feature enabled in 1.21?
The Node team is working on enabling the graceful node shutdown feature. Note that we also added a new feature that performs the shutdown based on pod priorities: https://github.com/kubernetes/enhancements/issues/2712. It just got merged in 1.23 and is still alpha. We want to enable this new feature on OpenShift, and the team is working through testing and fixing issues.
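For context, the priority-based variant is configured through the kubelet configuration roughly like this; the priority values and durations below are purely illustrative, not a recommendation:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdownBasedOnPodPriority: true
shutdownGracePeriodByPodPriority:
  # Give higher-priority pods (e.g. the pods backing VMs) more time to stop
  - priority: 100000
    shutdownGracePeriodSeconds: 300
  # Everything else gets a shorter window
  - priority: 0
    shutdownGracePeriodSeconds: 60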
@fdeutsch Thank you for that! Very interesting! I've tested this with 1.21 and 1.22. It didn't work, but I think the link you posted gives an explanation for that. They say that the kubelet needs to be configured, and that "by default, both configuration options described above, ShutdownGracePeriod and ShutdownGracePeriodCriticalPods are set to zero". This basically means that it is outside of KubeVirt's control upstream. But it certainly means that downstream we should be able to support it. @mpatel Cool! Very happy to hear about it. Thanks for your response! @danken How do you suggest we continue?
Itamar,

> I've tested this with 1.21 and 1.22. It didn't work, but I think the link you posted gives an explanation for that.

Right. And: yes, it is outside of KubeVirt's scope, but if we know that it has an effect, then we know the long-term solution and do not need to invest in any workaround.

Let me clarify my ask:

> Thus: did you try how KubeVirt behaves with the feature enabled in 1.21?

Itamar, please try to enable the feature (graceful node shutdown) by whatever means are needed (I guess using a k8s feature flag and setting ShutdownGracePeriod and ShutdownGracePeriodCriticalPods), and then check how KubeVirt behaves.
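For reference, enabling it should roughly amount to a kubelet configuration along these lines (a sketch only; the durations are illustrative, not a recommendation):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Beta and on by default since 1.21, listed here for explicitness
  GracefulNodeShutdown: true
# Total time the kubelet delays node shutdown to terminate pods
shutdownGracePeriod: 150s
# Portion of that time reserved for critical pods
shutdownGracePeriodCriticalPods: 30s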
Hey @fdeutsch, I've tested this on k8s 1.22 with the GracefulNodeShutdown feature gate explicitly enabled (although it should be enabled by default) and passed the ShutdownGracePeriod and ShutdownGracePeriodCriticalPods flags with a value of 150 seconds. Unfortunately it didn't help. Then I reached out to @rphillips from the Node team, who said that they "have found that the graceful shutdown feature has some shortcomings and does not work as intended. We will need to fix the issues in it". Since it seems there isn't much to do here from CNV's side, can we pass this bug to the Node team?
Thanks for the update, Itamar. It is a precondition that the feature gets fixed, but once it is: Do we know which process inside the container will get the relevant signal? Will one process or all of them get it? What signal will it be? Will qemu get it before the launcher? … Thus from my perspective it looks like there is still some lack of clarity, and if there is, then it is too early to push this off and rely completely on the Node side.
Hey @fdeutsch,

> Do we know which process inside the container will get the relevant signal?

Inside the virt-launcher pod, the virt-launcher process has PID 1 and is marked as the session leader, so it should receive the SIGTERM. When SIGTERM is sent directly to it alone, a graceful shutdown has been verified to occur.

> Will one process or all of them get it?

Currently all of them get the signal, which is unexpected, and this is what the Node team is working to fix (and it is outside of KubeVirt's hands at the moment).

> What signal will it be?

SIGTERM. After some grace period (which depends on the OS/systemd configuration), if the processes have not shut themselves down, a SIGKILL is sent.

> Thus from my perspective it looks like there is still some lack of clarity, and if there is, then it is too early to push this off and rely completely on the Node side.

Sure, got it. Maybe we can add another BZ to the Node team, just so we can keep track of everything?
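For completeness, this is roughly how the "SIGTERM only to virt-launcher" check can be exercised (the pod name is the illustrative one from comment #26):

# Ask only PID 1 (virt-launcher) to terminate, from inside its own pod
kubectl exec -it virt-launcher-vmi-fedora-4h8zx -- bash -c 'kill -s TERM 1'
# Watch the VMI go through a clean shutdown instead of being killed
kubectl get vmi -w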
All good answers, Itamar. We should at least keep this bug open to validate it once the fix has landed. Marking this bug therefore as TestOnly for now. Itamar, is there an OCP Node bug which we can add as a blocker for this bug, mainly to ensure we keep track of the node work?
Fabian - that's a good question. Perhaps @mpatel or @rphillip from the Node team can help answer it?
Moving this past the 4.10.0 time horizon based on Comment #32 through Comment #35
Adjusting priority, as we depend on the consumption of this k8s functionality into OCP.
Deferring this to 4.11 because we're still missing a Kubernetes feature to make this possible.
OCPNODE-549 has not moved due to capacity. Therefore it got deferred, also because there are no known gaps on the KubeVirt side ("In theory it works").
Deferring this to 4.15 as https://issues.redhat.com/browse/OCPNODE-1211 is still in 'TODO'.