Bug 1729979

Summary: drain takes 600s to evict the image-registry pod and container is still running afterwards
Product: OpenShift Container Platform Reporter: Antonio Murdaca <amurdaca>
Component: Image Registry Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA QA Contact: Wenjing Zheng <wzheng>
Severity: unspecified Docs Contact:
Priority: high    
Version: 4.1.0 CC: aos-bugs
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The image registry was started from sh, and signals were not propagated to the registry process.
Consequence: The registry did not receive SIGTERM, so the registry pod took a few minutes to terminate.
Fix: The registry process now replaces sh by being started with exec.
Result: The registry is the PID 1 process and is able to receive SIGTERM.
Story Points: ---
Clone Of:
: 1737379 1744948 Environment:
Last Closed: 2019-10-16 06:29:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1737379, 1744948    

Description Antonio Murdaca 2019-07-15 13:26:13 UTC
Description of problem:

MCO calls into drain to evict pods on nodes. It ignores daemonsets and correctly drains the whole node.

There's a bug in the drain library, being fixed in https://github.com/openshift/machine-config-operator/pull/962, and that fix exposed odd behavior from the image-registry pod. (It wasn't happening before that PR because the drain library wasn't correctly waiting on pod evictions.)

With the PR linked above, a drain operation on the node where the image-registry pod lives takes 600s+, and even after the eviction you can still see the image-registry container lying around.

Version-Release number of selected component (if applicable):

4.2 for this bug - will open for 4.1 as well later

How reproducible:

always, found a pure kubectl drain reproducer as well

Steps to Reproduce:

$ node=$(oc get pod -o go-template --template '{{.spec.nodeName}}' -n openshift-image-registry $(oc get pods --all-namespaces -l docker-registry=default -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | head -1))
$ kubectl drain --delete-local-data=true --force=true --grace-period=600 --ignore-daemonsets=true $node

Actual results:

drain takes 600+s to complete; other pods are evicted just fine

Expected results:

drain completes without waiting so long on the image-registry pod

Additional info:
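
The Doc Text for this bug attributes the slow eviction to the registry being started from sh without exec, so SIGTERM never reached it. The following is a minimal sketch of that effect, not taken from the registry image: fake_registry.sh, entry_noexec.sh, entry_exec.sh, and run_case are hypothetical names used only for illustration. Note one difference from the real pod: outside a container the wrapper shell is itself killed by SIGTERM (without forwarding it), whereas a shell running as PID 1 in a PID namespace with no handler installed is not terminated by it at all, which would leave the pod waiting out the grace period.

```shell
#!/bin/sh
# Sketch: does SIGTERM sent to the entrypoint's PID reach the server process?
# fake_registry.sh is a hypothetical stand-in, not the real registry binary.
workdir=$(mktemp -d)

cat > "$workdir/fake_registry.sh" <<'EOF'
#!/bin/sh
marker=$1
# On SIGTERM, record that the signal arrived and exit cleanly.
trap 'echo terminated > "$marker"; exit 0' TERM
echo $$ > "$marker.pid"
# Sleep in short slices so a pending trap runs within about a second.
while :; do sleep 1; done
EOF

# Entrypoint WITHOUT exec: sh stays in front, the server is its child.
cat > "$workdir/entry_noexec.sh" <<'EOF'
#!/bin/sh
sh "$(dirname "$0")/fake_registry.sh" "$1"
echo "server exited"   # a trailing command, so sh must fork and wait
EOF

# Entrypoint WITH exec: the server replaces the shell and keeps its PID.
cat > "$workdir/entry_exec.sh" <<'EOF'
#!/bin/sh
exec sh "$(dirname "$0")/fake_registry.sh" "$1"
EOF

# run_case ENTRYPOINT MARKER: start the entrypoint, signal its PID, and set
# $result to "received" or "not received" depending on the marker file.
run_case() {
    sh "$1" "$2" >/dev/null 2>&1 &
    entry_pid=$!
    sleep 1                          # let the server install its trap
    kill -TERM "$entry_pid"          # signal only the entrypoint's PID
    sleep 2                          # give the trap time to fire
    if [ -f "$2" ]; then result="received"; else result="not received"; fi
    # Kill a server that outlived its wrapper, then reap the entrypoint.
    if [ -f "$2.pid" ]; then
        kill -KILL "$(cat "$2.pid")" 2>/dev/null || true
    fi
    wait "$entry_pid" 2>/dev/null || true
}

run_case "$workdir/entry_noexec.sh" "$workdir/noexec.marker"
noexec_result=$result
run_case "$workdir/entry_exec.sh" "$workdir/exec.marker"
exec_result=$result

echo "without exec: SIGTERM $noexec_result"
echo "with exec:    SIGTERM $exec_result"
rm -rf "$workdir"
```

On a typical Linux system this should report "not received" for the wrapper without exec and "received" for the one with exec, which matches the Cause and Fix described in the Doc Text.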

Comment 2 Antonio Murdaca 2019-07-31 12:58:12 UTC
Has this been cloned and targeted for 4.1? The MCO needs a 4.1 fix for https://github.com/openshift/machine-config-operator/pull/1023

Comment 5 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.