Bug 1773406

Summary: cri-o is unable to handle termination of containers and moves the node to NotReady
Product: OpenShift Container Platform
Reporter: Venkata Tadimarri <ktadimar>
Component: Node
Assignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.11.0
CC: aos-bugs, jokerman, mpatel, nagrawal, rkshirsa, rphillips
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: Hello Team, We are yet to receive an update from engineering on the information provided. We will update the case as soon as we get any new update. Regards, Krishna
Fixed In Version: cri-o-1.11.16-0.10.dev.rhaos3.11.git1eee681.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-27 13:49:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Venkata Tadimarri 2019-11-18 04:21:51 UTC
Description of problem:

If containers exit on a SIGTERM, or during application deployments when the previous version's containers are not terminated gracefully, the node is set to NotReady status for a brief period.

Version-Release number of selected component (if applicable):

oc v3.11.82
kubernetes v1.11.0+d4cacc0
openshift v3.11.82
kubernetes v1.11.0+d4cacc0
crio version 1.11.11-1.rhaos3.11.git474f73d.el7

How reproducible:


Steps to Reproduce:

1. Example: Deployment of 1 x pod which runs two rhel7 containers. The first container runs a test script that traps SIGTERM and exits; the second just runs the sleep command. "terminationGracePeriodSeconds" is set to 300 seconds (a sketch of such a script is shown below).
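
The test script itself is not included in this report; a minimal bash sketch matching that description could look like the following (the handler name and the keep-alive loop are assumptions, not the attached script):

#!/bin/bash
# Sketch of a container entrypoint that traps SIGTERM and exits cleanly,
# matching the description in step 1 (details are assumptions).

term_handler() {
    echo "SIGTERM received, exiting"
    exit 0
}
trap term_handler SIGTERM

# Keep the container alive; run sleep in the background and wait on it
# so the trap fires as soon as SIGTERM arrives.
while true; do
    sleep 1 &
    wait $!
done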

2. The pod deploys successfully.

[user@hostoca01 ~]$ oc create -f 2-rhel-no-mount.yaml
deploymentconfig.apps.openshift.io/rhel7-test created
[user@hostoca01 ~]$ oc get pods -o wide
NAME                 READY     STATUS    RESTARTS   AGE       IP              NODE                         NOMINATED NODE
rhel7-test-1-phkhv   2/2       Running   0          58s       10.220.11.144   hostname.example.com   <none>

Containers running on the node.

[root@localhost ~]# crictl ps |grep rhel7-test
209ba1ae69e85       5042491e9587be2e75205376bd1b9ed125b55b4699420f6bbc56e2aa84bd4762 24 minutes ago      Running             rhel7-test-2                 0
ff5799acd799c       5042491e9587be2e75205376bd1b9ed125b55b4699420f6bbc56e2aa84bd4762 24 minutes ago      Running             rhel7-test                   0


Perform deployment rollout.

[user@hostoca01 ~]$ oc rollout latest dc/rhel7-test
deploymentconfig.apps.openshift.io/rhel7-test rolled out

The new pod deploys and the old pod is in Terminating (SIGTERM has been sent to its containers).

[user@hostoca01 ~]$ oc get pods -o wide
NAME                  READY     STATUS        RESTARTS   AGE       IP              NODE                         NOMINATED NODE
mailrelay-4-p7hh5     1/1       Running       1          2h        10.220.11.139   hostname.example.com   <none>
mailrelay-4-p9gqb     1/1       Running       1          2h        10.220.11.138   hostname.example.com   <none>
rhel7-test-1-phkhv    2/2       Terminating   0          26m       10.220.11.144   hostname.example.com   <none>
rhel7-test-2-deploy   1/1       Running       0          22s       10.220.11.145   hostname.example.com   <none>
rhel7-test-2-ql2c4    2/2       Running       0          20s       10.220.11.146   hostname.example.com   <none>

3. Run a new pod deployment of another rhel7 application

[user@hostoca01 ~]$ oc create -f 2-2-rhel-no-mount.yaml
deploymentconfig.apps.openshift.io/rhel7-test-7 created


The deployment and pod hang in a ContainerCreating state.

[user@hostoca01 ~]$ oc get pods
NAME                    READY     STATUS              RESTARTS   AGE
mailrelay-4-p7hh5       1/1       Running             1          2h
mailrelay-4-p9gqb       1/1       Running             1          2h
rhel7-test-1-phkhv      2/2       Terminating         0          28m
rhel7-test-2-deploy     1/1       Running             0          2m
rhel7-test-2-ql2c4      2/2       Running             0          2m
rhel7-test-7-1-5z88c    0/2       ContainerCreating   0          3s
rhel7-test-7-1-deploy   0/1       ContainerCreating   0          4s
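
Events on the stuck pods can be inspected with oc describe (pod name taken from the listing above); this check is a suggestion for diagnosis and its output was not captured in this report:

[user@hostoca01 ~]$ oc describe pod rhel7-test-7-1-5z88c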


PLEG is reported as unhealthy after 3 minutes and the node is put into a "NotReady" state.

Nov 14 08:56:59 localhost atomic-openshift-node[6231]: I1114 08:56:59.244148    6231 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.842605295s ago; threshold is 3m0s]
Nov 14 08:57:00 localhost atomic-openshift-node[6231]: I1114 08:57:00.426327    6231 kubelet_node_status.go:441] Recording NodeNotReady event message for node hostname.example.com


After the terminationGracePeriodSeconds expires, a SIGKILL is sent and the pod is deleted and removed.

Nov 14 08:59:09 localhost atomic-openshift-node[6231]: I1114 08:59:09.432090    6231 kubelet.go:1836] SyncLoop (DELETE, "api"): "rhel7-test-1-phkhv_unix(bd44db9a-064b-11ea-bb8d-005056bf0498)"
Nov 14 08:59:09 localhost atomic-openshift-node[6231]: I1114 08:59:09.437819    6231 kubelet.go:1830] SyncLoop (REMOVE, "api"): "rhel7-test-1-phkhv_unix(bd44db9a-064b-11ea-bb8d-005056bf0498)"
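
For reference, the same SIGTERM, wait-for-grace-period, then SIGKILL sequence can be exercised directly against a container through crictl: crictl stop sends SIGTERM, waits up to the given timeout in seconds, then sends SIGKILL if the container is still running. The container ID below is the rhel7-test container listed earlier and the command is illustrative only:

[root@localhost ~]# crictl stop --timeout 300 ff5799acd799c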


The rollout and the new deployment then complete successfully, but only after waiting for the old container to be killed (not good, as the node is unusable for new workloads in the meantime).

[user@hostoca01 ~]$ oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
rhel7-test-2-ql2c4     2/2       Running   0          9m
rhel7-test-7-1-5z88c   2/2       Running   0          6m


Actual results:

SIGTERM is not handled properly by cri-o: while the old containers wait out their termination grace period, new pod creation on the node hangs and PLEG reports the node NotReady.

Expected results:

Any new deployments started while the previous containers are terminating should complete successfully, without the node going NotReady.

Additional info:

The dc (DeploymentConfig) of the test application and the journal logs from the time of the issue will be attached.

Comment 13 Peter Hunt 2020-06-17 20:05:55 UTC
Should be fixed in the attached version, cri-o-1.11.16-0.10.dev.rhaos3.11.git1eee681.el7.

Comment 21 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990