Bug 2006571

Summary: Soft shutdown of OCP Node does not trigger soft shutdown of VMs running on the node.
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 2.6.4
Hardware: x86_64
OS: Linux
Status: ASSIGNED
Severity: medium
Priority: medium
Type: Bug
Keywords: TestOnly
Target Milestone: ---
Target Release: 4.15.0
Reporter: Arvin Amirian <aamirian>
Assignee: Itamar Holder <iholder>
QA Contact: Kedar Bidarkar <kbidarka>
CC: acardace, ctomasko, danken, dholler, fdeutsch, germano, gveitmic, iholder, kmajcher, lpivarc, mpatel, mtessun, pelauter, rphillips, sgott, ycui
Flags: iholder: needinfo? (rphillips)

Description Arvin Amirian 2021-09-21 22:29:14 UTC
Description of problem:
In the scenario where the control plane is not accessible or has been lost, triggering a soft shutdown of an OCP worker node does not trigger a soft shutdown of the VMs running in OpenShift Virtualization on that node, which can lead to VM corruption.


Version-Release number of selected component (if applicable):
Tested in OCP 4.7.13 and OpenShift Virtualization 2.6.4



How reproducible:
Every time


Steps to Reproduce:
1. Run a VM via OpenShift Virtualization
2. Trigger a soft shutdown of the node via the command line or the server's IPMI interface (see the sketch below)
3. Observe that no soft shutdown is triggered on the VMs
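
A soft shutdown of the node can be triggered, for example, like this (a sketch; the BMC address and credentials are placeholders):

  # On the node itself (soft shutdown from the command line):
  systemctl poweroff

  # Or out-of-band via the server's BMC over IPMI:
  ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> power soft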

Actual results:
VMs are powered off instantly instead of being shut down gracefully

Expected results:
A soft shutdown of the VMs should be triggered


Additional info:

Comment 1 sgott 2021-09-22 12:20:32 UTC
It looks like the API for this might still be in beta.

https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/

We can start working toward this, but we are deferring it to a future release for now.

Comment 3 Dan Kenigsberg 2021-10-18 13:26:44 UTC
I would expect VMs to be migrated away from the node which is shutting down. Non-migratable VMs should be shut down. Am I correct?

Comment 11 Dan Kenigsberg 2021-10-21 15:22:36 UTC
According to the case https://access.redhat.com/support/cases/#/case/03040834/discussion?commentId=a0a2K00000d90rOQAQ the customer has

      terminationGracePeriodSeconds: 0

in their VM yaml. We recommend (in the templates we ship) setting this to 100 for Linux VMs and 3600 for Windows VMs.

They also have

     labels:
        os.template.kubevirt.io/rhel8.2: 'true'

which probably means that they started with 100 and intentionally changed it to 0. We should find out why, and suggest that they use a higher value to give RHEL a better chance to shut down cleanly.
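
For reference, a minimal sketch of where this setting lives in a VirtualMachine spec (other fields elided; the value shown follows the shipped Linux templates):

  apiVersion: kubevirt.io/v1
  kind: VirtualMachine
  spec:
    template:
      spec:
        # Time the guest is given to shut down cleanly before it is killed.
        # Shipped templates use 100 for Linux VMs and 3600 for Windows VMs.
        terminationGracePeriodSeconds: 100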

Comment 12 Dan Kenigsberg 2021-10-21 15:35:02 UTC
@aamirian reports that increasing the grace period does not help - the guest is never requested to shut down cleanly. Can we tell why?

Comment 26 Itamar Holder 2021-11-22 11:58:36 UTC
Hey @gveitmic,

Thank you very much for your insights.

Let me also share some info that may be interesting.
I created a VM and execed into it with bash (via k exec -it virt-launcher-vmi-fedora-4h8zx -- bash). I then ran ps to look at the processes' stats:

bash-4.4# ps -A -o pid,stat,comm
    PID STAT COMMAND
      1 Ssl  virt-launcher
     15 Sl   virt-launcher
     24 Sl   libvirtd
     25 S    virtlogd
     81 Sl   qemu-kvm
    284 Ss   bash
    291 R+   ps

We can see that virt-launcher is PID 1 and that its STAT field includes "s", which stands for session leader. Looks good.

Furthermore, I SSHed into the node that runs the VM and searched for systemd units:

[root@node02 vagrant]# systemctl | grep virt-
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dhandler\x2dtoken\x2d6kp6q.mount                        loaded active mounted   /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-handler-token-6kp6q              
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dcerts.mount                         loaded active mounted   /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-virt-handler-certs               
var-lib-kubelet-pods-9154ac24\x2dee42\x2d4057\x2daef9\x2df1fa00570e5c-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dserver\x2dcerts.mount               loaded active mounted   /var/lib/kubelet/pods/9154ac24-ee42-4057-aef9-f1fa00570e5c/volumes/kubernetes.io~secret/kubevirt-virt-handler-server-certs        
var-lib-kubelet-pods-9fdd0da5\x2d463f\x2d4db5\x2d8db2\x2d8c845c47c4a2-volumes-kubernetes.io\x7esecret-kubevirt\x2dcontroller\x2dcerts.mount                              loaded active mounted   /var/lib/kubelet/pods/9fdd0da5-463f-4db5-8db2-8c845c47c4a2/volumes/kubernetes.io~secret/kubevirt-controller-certs                 
var-lib-kubelet-pods-9fdd0da5\x2d463f\x2d4db5\x2d8db2\x2d8c845c47c4a2-volumes-kubernetes.io\x7esecret-kubevirt\x2dcontroller\x2dtoken\x2dfqt49.mount                     loaded active mounted   /var/lib/kubelet/pods/9fdd0da5-463f-4db5-8db2-8c845c47c4a2/volumes/kubernetes.io~secret/kubevirt-controller-token-fqt49           
var-lib-kubelet-pods-a0bfbe80\x2d3e3f\x2d46ad\x2d80ac\x2d132f8e89b078-volumes-kubernetes.io\x7esecret-kubevirt\x2doperator\x2dcerts.mount                                loaded active mounted   /var/lib/kubelet/pods/a0bfbe80-3e3f-46ad-80ac-132f8e89b078/volumes/kubernetes.io~secret/kubevirt-operator-certs                   
var-lib-kubelet-pods-a0bfbe80\x2d3e3f\x2d46ad\x2d80ac\x2d132f8e89b078-volumes-kubernetes.io\x7esecret-kubevirt\x2doperator\x2dtoken\x2d6cvjh.mount                       loaded active mounted   /var/lib/kubelet/pods/a0bfbe80-3e3f-46ad-80ac-132f8e89b078/volumes/kubernetes.io~secret/kubevirt-operator-token-6cvjh             
var-lib-kubelet-pods-b9c4728e\x2df13b\x2d4422\x2db76f\x2d35714873e4aa-volumes-kubernetes.io\x7esecret-kubevirt\x2dtesting\x2dtoken\x2ddx9bg.mount                        loaded active mounted   /var/lib/kubelet/pods/b9c4728e-f13b-4422-b76f-35714873e4aa/volumes/kubernetes.io~secret/kubevirt-testing-token-dx9bg              
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dapiserver\x2dtoken\x2d7cvpk.mount                      loaded active mounted   /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-apiserver-token-7cvpk            
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dapi\x2dcerts.mount                             loaded active mounted   /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-virt-api-certs                   
var-lib-kubelet-pods-e361187b\x2d4da2\x2d4abc\x2da319\x2d5a58a7e19c3b-volumes-kubernetes.io\x7esecret-kubevirt\x2dvirt\x2dhandler\x2dcerts.mount                         loaded active mounted   /var/lib/kubelet/pods/e361187b-4da2-4abc-a319-5a58a7e19c3b/volumes/kubernetes.io~secret/kubevirt-virt-handler-certs      


We can see a lot of kubevirt components here, but I can't see virt-launcher. Could it be that the problem is that virt-launcher isn't considered a systemd unit for some reason?

Comment 27 Germano Veit Michel 2021-11-22 21:43:08 UTC
(In reply to Itamar Holder from comment #26)
> We can see a lot of kubevirt components here, but I can't see virt-launcher.
> Could it be that the problem is that virt-launcher isn't considered a
> systemd unit for some reason?

I don't really know. My understanding is that systemd on the host (not in the container) handles the shutdown of the virtual machine's CRI-O systemd scope by sending signals to the processes inside it. And given that it's a scope, there are some limitations on the options available in the scope definition for how the scope is killed.

We need some systemd help, as you mentioned in the Jira. I don't mind doing the tests again to collect some more data from the systemd side (i.e. some debug output), but I need to know what debugging to turn on.
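
For example, something along these lines on the host could show which scope the launcher ends up in and how systemd is configured to kill it (a sketch; the scope name is illustrative):

  # Find the CRI-O scope that contains the virt-launcher processes
  systemd-cgls --no-pager | grep -B 2 virt-launcher

  # Inspect how systemd will kill that scope on shutdown
  systemctl show crio-<container-id>.scope -p KillMode -p TimeoutStopUSec

  # Raise systemd log verbosity before repeating the test
  systemd-analyze set-log-level debug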

Comment 28 Fabian Deutsch 2021-11-30 14:30:14 UTC
Itamar, IIUIC, https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ tells us that the problem described here (containers getting killed abruptly during a controlled node shutdown) has been expected behavior up to now.
But 1.21 introduces graceful node shutdown as a mitigation to allow containers to shut down cleanly.

Thus: did you check how KubeVirt behaves with the feature enabled in 1.21?

Comment 29 Mrunal Patel 2021-12-01 21:47:38 UTC
The Node team is working on enabling the graceful node shutdown feature. Note that we added a new feature that performs the shutdown based on pod priorities (https://github.com/kubernetes/enhancements/issues/2712). It just got merged in 1.23 and is still alpha. We want to enable this new feature in OpenShift, and the team is working through testing and fixing issues.
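
For context, the priority-based variant is configured through a new kubelet field, roughly like this (a sketch; the priority values are illustrative):

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  featureGates:
    GracefulNodeShutdownBasedOnPodPriority: true   # alpha in 1.23
  shutdownGracePeriodByPodPriority:
    - priority: 100000                 # e.g. a priority class assigned to VM pods
      shutdownGracePeriodSeconds: 300
    - priority: 0                      # everything else
      shutdownGracePeriodSeconds: 60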

Comment 30 Itamar Holder 2021-12-02 08:59:55 UTC
@fdeutsch Thank you for that! Very Interesting!

I've tested this with 1.21 and 1.22. It didn't work, but I think the link you posted explains why: kubelet needs to be configured for it, and "by default, both configuration options described above, ShutdownGracePeriod and ShutdownGracePeriodCriticalPods are set to zero". This basically means that it is outside of KubeVirt's control in the upstream case, but it certainly means that downstream we should be able to support it.

@mpatel Cool! Very happy to hear about it. Thanks for your response!

@danken How do you suggest we continue?

Comment 31 Fabian Deutsch 2021-12-02 10:24:26 UTC
Itamar,

> I've tested this with 1.21 and 1.22. It didn't work, but I think the link you posted gives an explanation for that.

Right.
And: yes, it is outside of KubeVirt's scope, but if we know that it has an effect, then we know the long-term solution and do not need to invest in any workaround.

Let me clarify my ask:

> Thus: DId you try how KubeVirt behaves with the feature enabled in 1.21?

Itamar, please try to enable the graceful node shutdown feature (by whatever means are needed; I guess using a K8s feature gate and setting ShutdownGracePeriod and ShutdownGracePeriodCriticalPods), and then check how KubeVirt behaves.

Comment 32 Itamar Holder 2021-12-06 08:45:52 UTC
Hey @fdeutsch,

I've tested this on k8s 1.22 with the GracefulNodeShutdown feature gate explicitly enabled (although it should be enabled by default) and with the ShutdownGracePeriod and ShutdownGracePeriodCriticalPods options set to 150 seconds. Unfortunately, it didn't help.
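
For reference, the kubelet configuration was roughly along these lines (a minimal sketch; field names follow the upstream KubeletConfiguration type):

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  featureGates:
    GracefulNodeShutdown: true             # explicitly enabled, though it defaults to on in 1.21+
  shutdownGracePeriod: 150s                # total time the node delays shutdown for pod termination
  shutdownGracePeriodCriticalPods: 150s    # portion of the above reserved for critical pods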

Then I reached out to @rphillips from the Node team, who said that they "have found that the graceful shutdown feature has some short comings and does not work as intended. We will need to fix the issues in it".

Since it seems there isn't much to do here from CNV's side, can we pass this bug to the Node team?

Comment 33 Fabian Deutsch 2021-12-06 10:52:21 UTC
Thanks for the update Itamar.

It is a precondition that the feature gets fixed, but once it is: do we know which process inside the container will get the relevant signal? Will one process or all of them get it? What signal will it be? Will qemu get it before the launcher? …
Thus, from my perspective there are still some open questions, and if so, it is too early to push this off and rely completely on Node.

Comment 34 Itamar Holder 2021-12-06 11:00:59 UTC
Hey @fdeutsch,

> Do we know which process inside the container will get the relevant signal?
Inside the virt-launcher pod, the virt-launcher process has PID 1 and is marked as the session leader, so it should receive the SIGTERM. When SIGTERM is sent directly to it alone, a graceful shutdown has been verified to occur.
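
For illustration, that manual test looks roughly like this (the pod name is made up):

  # Send SIGTERM only to virt-launcher, which is PID 1 inside the pod
  kubectl exec virt-launcher-vmi-fedora-4h8zx -- bash -c 'kill -TERM 1'

  # Watch the VMI go through a graceful shutdown instead of being killed
  kubectl get vmi -w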

> Will one process or all of them get it?
Currently all of them get the signal, which is unexpected, and this is what the Node team is working to fix (it is outside of KubeVirt's hands ATM).

> What signal will it be?
SIGTERM. After some grace period (which depends on the OS/systemd configuration), if the processes have not shut themselves down, a SIGKILL is sent.

> Thus from my perspective it looks like there still is some unclarity, and if there is, then it is to early to push it off and completely rely on node.
Sure, got it.

Maybe we can open another BZ against Node, just so we can keep track of everything?

Comment 35 Fabian Deutsch 2021-12-06 11:21:48 UTC
All good answers, Itamar.

We should at least keep this bug open so that we can validate it once the fix has landed. Therefore I am marking it as TestOnly for now.

Itamar, is there an OCP Node bug that we can add as a blocker for this one, mainly to ensure we keep track of the node work?

Comment 36 Itamar Holder 2021-12-06 13:22:37 UTC
Fabian - that's a good question.

Perhaps @mpatel or @rphillips from the Node team can help answer it?

Comment 37 sgott 2021-12-14 22:28:42 UTC
Moving this past the 4.10.0 time horizon based on Comment #32 through Comment #35

Comment 39 Dan Kenigsberg 2021-12-22 09:24:40 UTC
Adjusting priority, as we depend on the k8s functionality being consumed into OCP.

Comment 40 sgott 2022-01-26 20:54:42 UTC
Deferring this to 4.11 because we're still missing a Kubernetes feature to make this possible.

Comment 45 Fabian Deutsch 2022-05-11 12:33:49 UTC
OCPNODE-549 has not moved due to capacity, therefore it got deferred; also because there are no known gaps on the KubeVirt side ("in theory it works").

Comment 49 Antonio Cardace 2023-07-11 12:28:39 UTC
Deferring this to 4.15 as https://issues.redhat.com/browse/OCPNODE-1211 is still in 'TODO'.