Description of problem:

When containers exit on SIGTERM, or when the previous pod of an application deployment is not terminated gracefully, the node is set to NotReady status for a brief period.

Version-Release number of selected component (if applicable):

oc v3.11.82
kubernetes v1.11.0+d4cacc0
openshift v3.11.82
kubernetes v1.11.0+d4cacc0
crio version 1.11.11-1.rhaos3.11.git474f73d.el7

How reproducible:

Steps to Reproduce:

1. Example deployment: 1 x pod running two rhel7 containers. The first container runs a test script that traps SIGTERM and exits; the second just runs the sleep command. "terminationGracePeriodSeconds" is set to 300 seconds.

2. The pod deploys successfully.

[user@hostoca01 ~]$ oc create -f 2-rhel-no-mount.yaml
deploymentconfig.apps.openshift.io/rhel7-test created

NAME                 READY     STATUS    RESTARTS   AGE       IP              NODE                   NOMINATED NODE
rhel7-test-1-phkhv   2/2       Running   0          58s       10.220.11.144   hostname.example.com   <none>

Containers running on the node:

[root@localhost ~]# crictl ps | grep rhel7-test
209ba1ae69e85   5042491e9587be2e75205376bd1b9ed125b55b4699420f6bbc56e2aa84bd4762   24 minutes ago   Running   rhel7-test-2   0
ff5799acd799c   5042491e9587be2e75205376bd1b9ed125b55b4699420f6bbc56e2aa84bd4762   24 minutes ago   Running   rhel7-test     0

Perform a deployment rollout:

[user@hostoca01 ~]$ oc rollout latest dc/rhel7-test
deploymentconfig.apps.openshift.io/rhel7-test rolled out

The new pod deploys and the old pod goes into Terminating (SIGTERM is sent to its containers):

[user@hostoca01 ~]$ oc get pods -o wide
NAME                  READY     STATUS        RESTARTS   AGE       IP              NODE                   NOMINATED NODE
mailrelay-4-p7hh5     1/1       Running       1          2h        10.220.11.139   hostname.example.com   <none>
mailrelay-4-p9gqb     1/1       Running       1          2h        10.220.11.138   hostname.example.com   <none>
rhel7-test-1-phkhv    2/2       Terminating   0          26m       10.220.11.144   hostname.example.com   <none>
rhel7-test-2-deploy   1/1       Running       0          22s       10.220.11.145   hostname.example.com   <none>
rhel7-test-2-ql2c4    2/2       Running       0          20s       10.220.11.146   hostname.example.com   <none>

3. Run a new pod deployment of another rhel7 application:

[user@hostoca01 ~]$ oc create -f 2-2-rhel-no-mount.yaml
deploymentconfig.apps.openshift.io/rhel7-test-7 created

The deployment and its pod hang in the ContainerCreating state:

[user@hostoca01 ~]$ oc get pods
NAME                    READY     STATUS              RESTARTS   AGE
mailrelay-4-p7hh5       1/1       Running             1          2h
mailrelay-4-p9gqb       1/1       Running             1          2h
rhel7-test-1-phkhv      2/2       Terminating         0          28m
rhel7-test-2-deploy     1/1       Running             0          2m
rhel7-test-2-ql2c4      2/2       Running             0          2m
rhel7-test-7-1-5z88c    0/2       ContainerCreating   0          3s
rhel7-test-7-1-deploy   0/1       ContainerCreating   0          4s

After 3 minutes PLEG reports not healthy and the node is put into a "NotReady" state:

Nov 14 08:56:59 localhost atomic-openshift-node[6231]: I1114 08:56:59.244148 6231 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.842605295s ago; threshold is 3m0s]
Nov 14 08:57:00 localhost atomic-openshift-node[6231]: I1114 08:57:00.426327 6231 kubelet_node_status.go:441] Recording NodeNotReady event message for node hostname.example.com

After terminationGracePeriodSeconds expires, a SIGKILL is sent and the pod is deleted and removed.
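While the old pod is still Terminating (i.e. before the SIGKILL), the node condition and PLEG messages can be watched with something like the following. This is only a rough sketch; the node name and systemd unit name are taken from this reproduction and may differ in other environments.

# From a host with cluster access: watch the node flip to NotReady
oc get nodes hostname.example.com -w

# On the affected node: follow kubelet/PLEG messages and the terminating containers
journalctl -u atomic-openshift-node -f | grep -E 'PLEG|NodeNotReady'
crictl ps -a | grep rhel7-test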
Kubelet log entries when the old pod is finally deleted:

Nov 14 08:59:09 localhost atomic-openshift-node[6231]: I1114 08:59:09.432090 6231 kubelet.go:1836] SyncLoop (DELETE, "api"): "rhel7-test-1-phkhv_unix(bd44db9a-064b-11ea-bb8d-005056bf0498)"
Nov 14 08:59:09 localhost atomic-openshift-node[6231]: I1114 08:59:09.437819 6231 kubelet.go:1830] SyncLoop (REMOVE, "api"): "rhel7-test-1-phkhv_unix(bd44db9a-064b-11ea-bb8d-005056bf0498)"

The rollout and the new deployment then complete successfully, but only after waiting for the old container to be killed. This is not acceptable, since the node is unusable for new workloads during that time.

[user@hostoca01 ~]$ oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
rhel7-test-2-ql2c4     2/2       Running   0          9m
rhel7-test-7-1-5z88c   2/2       Running   0          6m

Actual results:

SIGTERM is not handled properly by cri-o.

Expected results:

New deployments started while a previous pod's containers are still terminating should complete successfully.

Additional info:

The dc of the test application and the journal logs from the time of the reproduction will be attached.
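For reference, the actual test script is part of the attached dc; a minimal stand-in for the first container's entrypoint, as described in step 1, could look like the following (hypothetical, for illustration only):

#!/bin/bash
# Hypothetical stand-in for the first container's test script:
# trap SIGTERM, log it, and exit cleanly instead of waiting for SIGKILL.
term_handler() {
    echo "SIGTERM received, exiting"
    exit 0
}
trap term_handler TERM

# Stay in the foreground until a signal arrives.
while true; do
    sleep 1
done

The second container runs a plain sleep, which as PID 1 in the container has no SIGTERM handler and so ignores the signal; this is presumably what holds the old pod in Terminating for the full 300-second grace period.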
This should be fixed in the attached version.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990