Bug 1710502

Summary:	node-ca and image-registry incorrectly handle graceful shutdown
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Image Registry	Assignee:	Ricardo Maraschini <rmarasch>
Status:	CLOSED ERRATA	QA Contact:	Wenjing Zheng <wzheng>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.0	CC:	adam.kaplan, aos-bugs, rmarasch
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Clones:	1710506		Environment:
Last Closed:	2019-10-16 06:28:53 UTC	Type:	Bug
Bug Blocks:	1710506

Description Clayton Coleman 2019-05-15 16:18:11 UTC

When the kubelet terminates a container gracefully (with TERM), every container that is "RestartAlways" must behave as follows:

1. Exit promptly (gracefully shutting down)
2. Return exit code 0 if no error was encountered during graceful shutdown

From an e2e run this was reported:

May 14 21:19:13.355 E ns/openshift-image-registry pod/image-registry-8fd5866b-nn6gr node/ip-10-0-135-167.ec2.internal container=registry container exited with code 137 (Error): 
May 14 21:19:13.396 E ns/openshift-image-registry pod/node-ca-sw986 node/ip-10-0-135-167.ec2.internal container=node-ca container exited with code 137 (Error): 
May 14 21:58:52.847 E ns/openshift-image-registry pod/node-ca-42h5w node/ip-10-0-135-167.ec2.internal container=node-ca container exited with code 137 (Error): 

when the pods were evicted off the node.
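
For context, exit code 137 is 128 + 9: the process was ended by SIGKILL rather than exiting on its own after TERM. The following is an illustrative local sketch only (it is not taken from the e2e run) showing how a shell reports that status for a child killed with SIGKILL:

```shell
# Start a long-running child, kill it with SIGKILL, and capture its exit status.
sleep 60 &
pid=$!
kill -KILL "$pid"

status=0
wait "$pid" || status=$?   # wait returns 128 + signal number for a signalled child

echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)
```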

Two separate problems:

1. node-ca needs to follow the "handle TERM gracefully" pattern for bash in a container:

```
trap 'jobs -p | xargs -r kill; exit 0' TERM
```

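at the top of the script, with `sleep 60 & wait` used in place of a plain `sleep 60`. Bash runs a trap only after the current foreground command finishes, but `wait` returns as soon as a trapped signal arrives, so the handler runs immediately when the pod is terminated.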

This is sufficient for node-ca to satisfy the requirements.
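
A minimal bash sketch of how the trap and the `sleep 60 & wait` loop fit together. The `sync_certs` and `node_ca_loop` names and the interval argument are hypothetical and only illustrate the pattern; this is not the actual node-ca script:

```shell
# sync_certs is a hypothetical stand-in for the work node-ca does each cycle.
sync_certs() {
  :
}

# node_ca_loop runs sync_certs every $1 seconds (default 60) until TERM arrives.
node_ca_loop() {
  interval="${1:-60}"
  # On TERM: stop background jobs (the sleep), then exit 0 for a clean shutdown.
  trap 'jobs -p | xargs -r kill; exit 0' TERM
  while true; do
    sync_certs
    # Background the sleep and wait on it: wait is interrupted by the trapped
    # signal, whereas a foreground sleep would delay the handler.
    sleep "$interval" & wait
  done
}
```

A container entrypoint would end with a call to `node_ca_loop`. Sending TERM to that process should then make it exit with status 0 instead of being killed once the grace period expires.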

2. image-registry must return exit code 0, and SHOULD perform some level of graceful shutdown (graceful shutdown is more of a card; I can accept a card being spawned and prioritized separately for it, but the exit code must be fixed).

This can be fixed in the 4.1.z release, not GA blocking.

Comment 2 Wenjing Zheng 2019-09-03 09:21:40 UTC

Verified on 4.2.0-0.nightly-2019-09-02-172410:

$ oc delete pods/node-ca-52w56
pod "node-ca-52w56" deleted
$ oc logs pods/node-ca-52w56 --follow
image-registry.openshift-image-registry.svc:5000
shutting down node-ca
rpc error: code = Unknown desc = specified container not found: 9895ae3367ae68113eeed98aafecb95ecf8cb4c87d527adca0b960174b68f6e3

$ oc delete pods/image-registry-6fff5879b9-bd4wf
pod "image-registry-6fff5879b9-bd4wf" deleted
$ oc logs pods/image-registry-6fff5879b9-bd4wf --follow
time="2019-09-03T09:18:30.279260632Z" level=info msg="shutting down image registry server" go.version=go1.11.13
time="2019-09-03T09:18:30.27966984Z" level=info msg="server shutdown, bye." go.version=go1.11.13
rpc error: code = Unknown desc = container with ID starting with b3cc56003867a527f63502a7167644167a1cc2776c27537eec809e34757bd165 not found: ID does not exist

Comment 3 errata-xmlrpc 2019-10-16 06:28:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922