Bug 1729979 - drain takes 600s to evict the image-registry pod and container is still running afterwards
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks: 1737379 1744948
 
Reported: 2019-07-15 13:26 UTC by Antonio Murdaca
Modified: 2019-10-16 06:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the image registry was run from sh, and signals were not propagated to the registry process. Consequence: the registry was unable to receive SIGTERM, so terminating the registry pod took several minutes. Fix: replace sh with the registry process using exec. Result: the registry runs as PID 1 and is able to receive SIGTERM.
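The effect of `exec` described above can be sketched with plain `sh` (illustrative only, not the actual image-registry entrypoint): a command started normally is a forked child with a new PID, leaving the shell as PID 1 in the container to absorb SIGTERM, while a command started with `exec` replaces the shell in place and keeps its PID.

```shell
#!/bin/sh
# Without exec: the inner sh is a forked child, so it gets a new PID
# and the outer shell stays in place, not forwarding signals.
echo "-- without exec --"
sh -c 'echo "shell: $$"; sh -c "echo child: \$\$"'

# With exec: the inner sh replaces the outer one in the same process,
# so the PIDs match -- this is why, after the fix, the registry becomes
# PID 1 and receives SIGTERM directly.
echo "-- with exec --"
sh -c 'echo "shell: $$"; exec sh -c "echo replaced: \$\$"'
```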
Clone Of:
Clones: 1737379 1744948 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:29:43 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift image-registry pull 183 None None None 2019-07-18 14:39:29 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:29:58 UTC

Description Antonio Murdaca 2019-07-15 13:26:13 UTC
Description of problem:

The MCO calls into the drain library to evict pods from nodes. It ignores daemonsets and correctly drains the whole node.

There's a bug in the drain library being fixed in https://github.com/openshift/machine-config-operator/pull/962, which exposed odd behavior from the image-registry pod (it wasn't visible before that PR, since the drain library wasn't correctly waiting on pod evictions).

With the PR linked above, a drain operation on the node where the image-registry pod lives takes 600s+, and even after eviction you can still see the image-registry container lying around.
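The kubelet's termination sequence is roughly: send SIGTERM, wait up to the grace period, then SIGKILL. A process that never sees the SIGTERM burns the entire grace period, which is what the 600 s delay here looks like. A minimal sketch in plain sh (a toy "server", not the registry itself) of a process that does handle SIGTERM and exits promptly:

```shell
#!/bin/sh
# Toy "server": traps SIGTERM and exits right away. The `sleep 30 & wait`
# pattern keeps the shell in the interruptible `wait` builtin so the trap
# can fire immediately instead of after the sleep finishes.
sh -c 'trap "echo got SIGTERM; exit 0" TERM; sleep 30 & wait' &
server=$!

sleep 1                 # give the trap time to be installed
kill -TERM "$server"    # what the kubelet sends on eviction
wait "$server"          # returns almost immediately, not after 30 s
echo "exit status: $?"
```

A process that ignores SIGTERM (or never receives it, as here where sh swallowed it) would instead sit until the grace period expires and the kubelet falls back to SIGKILL.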


Version-Release number of selected component (if applicable):

4.2 for this bug - will open for 4.1 as well later


How reproducible:

Always; a pure kubectl drain reproducer was found as well (see below).


Steps to Reproduce:

$ node=$(oc get pod -o go-template --template '{{.spec.nodeName}}' -n openshift-image-registry $(oc get pods --all-namespaces -l docker-registry=default -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | head -1))
$ kubectl drain --delete-local-data=true --force=true --grace-period=600 --ignore-daemonsets=true $node

Actual results:

The drain takes 600+ s to complete; other pods are evicted just fine.


Expected results:

Drain completes without waiting so long on the image-registry pod.


Additional info:

Comment 2 Antonio Murdaca 2019-07-31 12:58:12 UTC
Has this been cloned and targeted for 4.1? The MCO needs a 4.1 fix for https://github.com/openshift/machine-config-operator/pull/1023

Comment 5 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

