1710506 – [4.1.z] node-ca and image-registry incorrectly handle graceful shutdown

Summary: [4.1.z] node-ca and image-registry incorrectly handle graceful shutdown

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.z
Assignee:	Ricardo Maraschini
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Depends On:	1710502
Blocks:

Reported:	2019-05-15 16:21 UTC by Adam Kaplan
Modified:	2019-09-27 00:34 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1710502
Environment:
Last Closed:	2019-09-27 00:33:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-image-registry-operator pull 381	None	closed	[release-4.1] Bug 1710506: Using trap to deal with signaling on node-ca.	2020-10-01 10:27:53 UTC
Github	openshift image-registry pull 193	None	closed	[release-4.1] Bug 1710506: Implementing graceful shutdown.	2020-10-01 10:27:53 UTC
Red Hat Bugzilla	1710502	unspecified	CLOSED	node-ca and image-registry incorrectly handle graceful shutdown	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2019:2856	None	None	None	2019-09-27 00:34:00 UTC

Description Adam Kaplan 2019-05-15 16:21:54 UTC

+++ This bug was initially created as a clone of Bug #1710502 +++

When the kubelet gracefully terminates a container (by sending TERM), all containers that are "RestartAlways" must do two things:

1. Exit promptly (gracefully shutting down)
2. Return exit code 0 if no error was encountered during graceful shutdown

From an e2e run this was reported:

May 14 21:19:13.355 E ns/openshift-image-registry pod/image-registry-8fd5866b-nn6gr node/ip-10-0-135-167.ec2.internal container=registry container exited with code 137 (Error): 
May 14 21:19:13.396 E ns/openshift-image-registry pod/node-ca-sw986 node/ip-10-0-135-167.ec2.internal container=node-ca container exited with code 137 (Error): 
May 14 21:58:52.847 E ns/openshift-image-registry pod/node-ca-42h5w node/ip-10-0-135-167.ec2.internal container=node-ca container exited with code 137 (Error): 

when the pods were evicted off the node.
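
For context on the log lines above: by shell convention an exit status above 128 means the process was ended by signal number (status - 128). Code 137 is therefore 128 + 9, i.e. KILL, which is consistent with a container that did not exit on TERM and was killed once the termination grace period ran out. A minimal sketch of that decoding follows; `describe_exit_code` is a hypothetical helper written for illustration, not part of any OpenShift tooling.

```shell
#!/bin/sh
# Decode a container exit code. By shell convention, a status above 128
# means the process was terminated by signal number (status - 128).
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    signum=$((code - 128))
    # `kill -l N` prints the name of signal N without the SIG prefix.
    echo "exit $code: killed by signal $signum (SIG$(kill -l "$signum"))"
  elif [ "$code" -eq 0 ]; then
    echo "exit 0: clean shutdown"
  else
    echo "exit $code: application error"
  fi
}

describe_exit_code 137   # the code reported in the e2e run
describe_exit_code 143   # what an unhandled TERM would produce
describe_exit_code 0     # what the kubelet expects after a graceful stop
```

A container that lets TERM kill it with the default action would show 143 instead; only a handler that exits explicitly with 0 satisfies the second requirement.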

Two separate problems:

1. node-ca needs to follow the "handle TERM gracefully" pattern for bash in a container:

```
trap 'jobs -p | xargs -r kill; exit 0' TERM
```

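at the top of the script, with `sleep 60 & wait` used in place of a plain `sleep 60`. Bash does not run a trap handler until the current foreground command finishes, but `wait` returns as soon as a trapped signal arrives, so the sleep can be interrupted when the pod is terminated.

This is sufficient for node-ca to satisfy the requirements.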

2. image-registry must return exit code 0, and SHOULD perform some level of graceful shutdown. The graceful shutdown is more of a card: I can accept a card being spawned and prioritized separately for it, but the exit code must be fixed.

This can be fixed in the 4.1.z release; it is not GA blocking.
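
The node-ca pattern from item 1 can be sketched as a complete loop. This is an illustration under stated assumptions, not the actual node-ca script: `sync_once` and `run_loop` are hypothetical names, and the sketch tracks the sleep's PID through `$!` rather than using `jobs -p | xargs -r kill`, with the same intent of stopping the pending sleep before exiting 0. The last part runs the loop in the background and sends it TERM so that the exit status can be observed.

```shell
#!/bin/sh
# sync_once is a hypothetical stand-in for the work node-ca does each cycle.
sync_once() {
  echo "syncing"
}

run_loop() {
  sleep_pid=
  # On TERM: stop the pending sleep (if any), then report success.
  trap '[ -n "$sleep_pid" ] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
  while true; do
    sync_once
    # Background the sleep and block in `wait`, which a trapped signal interrupts.
    sleep 60 &
    sleep_pid=$!
    wait "$sleep_pid"
  done
}

# Demonstration: run the loop in the background, send TERM, check the result.
start=$(date +%s)
run_loop &
loop_pid=$!
sleep 1                      # give the loop time to install its trap
kill -TERM "$loop_pid"
wait "$loop_pid"
status=$?
elapsed=$(( $(date +%s) - start ))
echo "loop exited with status $status after ${elapsed}s"
```

With a foreground `sleep 60` instead of `sleep 60 & wait`, the handler would not run until the sleep finished, which is longer than the default 30-second termination grace period and would end in the KILL seen in the e2e run.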

Comment 2 Wenjing Zheng 2019-09-23 03:39:34 UTC

Verified in 4.1.17:
$ oc delete pods/image-registry-7bd9f684b9-chg56
time="2019-09-23T03:25:40.657038016Z" level=info msg="shutting down image registry server" go.version=go1.10.8
time="2019-09-23T03:25:40.65735496Z" level=info msg="server shutdown, bye." go.version=go1.10.8
rpc error: code = Unknown desc = container with ID starting with c5079dfe8bf8480eb2f9430619c8bb088bb765bb0a0a43e11c4d8d1dee487e98 not found: ID does not exist
$ echo $?
0

$ oc delete pods/node-ca-97ltl
image-registry.openshift-image-registry.svc:5000
rpc error: code = Unknown desc = container with ID starting with 3d1225d8c692e7e19b64c741d3b42020a5c5dd1aa9556f820f93543d1f4e1770 not found: ID does not exist
$ echo $?
0

Comment 4 errata-xmlrpc 2019-09-27 00:33:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2856
