
Bug 1729979

Summary:	drain takes 600s to evict the image-registry pod and container is still running afterwards
Product:	OpenShift Container Platform	Reporter:	Antonio Murdaca <amurdaca>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED ERRATA	QA Contact:	Wenjing Zheng <wzheng>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	4.1.0	CC:	aos-bugs
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The image registry was started from sh, and signals were not propagated to the registry process. Consequence: The registry could not receive SIGTERM, and it took a few minutes to terminate the registry pod. Fix: Replace sh with the registry process using exec. Result: The registry runs as PID 1 and can receive SIGTERM.	Story Points:	---
Clone Of:
Clones:	1737379 1744948		Environment:
Last Closed:	2019-10-16 06:29:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1737379, 1744948

Description Antonio Murdaca 2019-07-15 13:26:13 UTC

Description of problem:

The MCO calls into the drain library to evict pods from nodes. It ignores daemonsets and correctly drains the whole node.

There's a bug in the drain library, being fixed in https://github.com/openshift/machine-config-operator/pull/962, which exposed odd behavior from the image-registry pod. This did not show up before that PR because the drain library wasn't correctly waiting on pod evictions.

With the PR linked above, a drain operation on the node where the image-registry pod lives takes 600s+, and even after the eviction you can still see the image-registry container lying around.


Version-Release number of selected component (if applicable):

4.2 for this bug - will open for 4.1 as well later


How reproducible:

Always. I found a pure kubectl drain reproducer as well.


Steps to Reproduce:

$ node=$(oc get pod -o go-template --template '{{.spec.nodeName}}' -n openshift-image-registry $(oc get pods --all-namespaces -l docker-registry=default -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | head -1))
$ kubectl drain --delete-local-data=true --force=true --grace-period=600 --ignore-daemonsets=true $node

Actual results:

The drain takes 600s+ to complete, while the other pods are evicted just fine.
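
To put a number on the drain duration, the drain command can be wrapped in a small timing helper. This is only a sketch: `timed` is a hypothetical helper, not part of oc or kubectl, and `sleep 2` stands in for the drain so the snippet runs without a cluster.

```shell
#!/bin/sh
# timed: hypothetical helper (an assumption, not an oc/kubectl feature).
# Runs the given command and stores the wall-clock seconds in $elapsed.
timed() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    echo "elapsed: ${elapsed}s (exit status ${rc})"
    return "$rc"
}

# Intended use against a cluster, with $node set as in the steps above:
#   timed kubectl drain --delete-local-data=true --force=true \
#       --grace-period=600 --ignore-daemonsets=true "$node"

# Stand-in command so the sketch runs without a cluster.
timed sleep 2
```

An elapsed time close to the value passed to --grace-period would suggest that a pod was only stopped once the grace period ran out.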


Expected results:

The drain completes without waiting so long on the image-registry pod.


Additional info:
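
The cause given in the Doc Text field (the registry started from sh, so SIGTERM never reaches it) can be illustrated outside a cluster. The following is a minimal sketch of the general mechanism, not the registry image's actual entrypoint, which is not shown in this report. `sleep 30` stands in for the registry process, and the pid file is only a device for finding the child from the outer script.

```shell
#!/bin/sh
# Illustrative only: "sleep 30" stands in for the registry process.
pidfile=$(mktemp)

# Case 1: the shell stays alive as the parent of the workload.
# The inner shell records the workload's PID so it can be checked below.
sh -c 'sleep 30 & echo $! > "$1"; wait' sh "$pidfile" &
wrapper=$!
sleep 1
child=$(cat "$pidfile")
kill -TERM "$wrapper"          # the signal goes to the shell only
wait "$wrapper"
if kill -0 "$child" 2>/dev/null; then
    without_exec="still running"
    kill -KILL "$child"        # clean up the orphaned workload
else
    without_exec="terminated"
fi
echo "without exec: workload ${without_exec} after SIGTERM"

# Case 2: exec replaces the shell, so the workload owns the signalled PID.
sh -c 'exec sleep 30' &
pid=$!
sleep 1
kill -TERM "$pid"
wait "$pid"
status=$?
if [ "$status" -gt 128 ]; then
    with_exec="terminated"
else
    with_exec="not terminated"
fi
echo "with exec: workload ${with_exec} after SIGTERM (wait status ${status})"

rm -f "$pidfile"
```

Outside a container the wrapper shell itself dies on SIGTERM and leaves the workload behind. Inside a container the wrapper is PID 1, and Linux does not deliver SIGTERM to a PID 1 process that has no handler installed for it, so neither the shell nor the workload stops until the grace period expires and the container is killed. That would be consistent with the 600s drain reported here.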

Comment 2 Antonio Murdaca 2019-07-31 12:58:12 UTC

Has this been cloned and targeted for 4.1? The MCO needs a 4.1 fix for https://github.com/openshift/machine-config-operator/pull/1023

Comment 5 errata-xmlrpc 2019-10-16 06:29:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922