Bug 1729979 - drain takes 600s to evict the image-registry pod and container is still running afterwards
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks: 1737379 1744948
 
Reported: 2019-07-15 13:26 UTC by Antonio Murdaca
Modified: 2019-10-16 06:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the image registry was run from sh, and signals were not propagated to the registry process. Consequence: the registry was unable to receive SIGTERM, so terminating the registry pod took several minutes. Fix: replace sh with the registry process using exec. Result: the registry runs as PID 1 and is able to receive SIGTERM.
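The effect of `exec` described above can be sketched with plain `sh` (illustrative only, not the actual image-registry entrypoint): a command started normally is a forked child with a new PID, leaving the shell as PID 1 in the container to absorb SIGTERM, while a command started with `exec` replaces the shell in place and keeps its PID.

```shell
#!/bin/sh
# Without exec: the inner sh is a forked child, so it gets a new PID
# and the outer shell stays in place, not forwarding signals.
echo "-- without exec --"
sh -c 'echo "shell: $$"; sh -c "echo child: \$\$"'

# With exec: the inner sh replaces the outer one in the same process,
# so the PIDs match -- this is why, after the fix, the registry becomes
# PID 1 and receives SIGTERM directly.
echo "-- with exec --"
sh -c 'echo "shell: $$"; exec sh -c "echo replaced: \$\$"'
```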
Clone Of:
Clones: 1737379 1744948 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:29:43 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift image-registry pull 183 None None None 2019-07-18 14:39:29 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:29:58 UTC

Description Antonio Murdaca 2019-07-15 13:26:13 UTC
Description of problem:

The MCO calls into the drain library to evict pods from nodes. It ignores daemonsets and correctly drains the whole node.

There's a bug in the drain library being fixed in https://github.com/openshift/machine-config-operator/pull/962, which exposed odd behavior from the image-registry pod (it wasn't visible before that PR, since the drain library wasn't correctly waiting on pod evictions).

With the PR linked above, a drain operation on the node where the image-registry pod lives takes 600s+, and even after eviction you can still see the image-registry container lying around.
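The kubelet's termination sequence is roughly: send SIGTERM, wait up to the grace period, then SIGKILL. A process that never sees the SIGTERM burns the entire grace period, which is what the 600 s delay here looks like. A minimal sketch in plain sh (a toy "server", not the registry itself) of a process that does handle SIGTERM and exits promptly:

```shell
#!/bin/sh
# Toy "server": traps SIGTERM and exits right away. The `sleep 30 & wait`
# pattern keeps the shell in the interruptible `wait` builtin so the trap
# can fire immediately instead of after the sleep finishes.
sh -c 'trap "echo got SIGTERM; exit 0" TERM; sleep 30 & wait' &
server=$!

sleep 1                 # give the trap time to be installed
kill -TERM "$server"    # what the kubelet sends on eviction
wait "$server"          # returns almost immediately, not after 30 s
echo "exit status: $?"
```

A process that ignores SIGTERM (or never receives it, as here where sh swallowed it) would instead sit until the grace period expires and the kubelet falls back to SIGKILL.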


Version-Release number of selected component (if applicable):

4.2 for this bug - will open for 4.1 as well later


How reproducible:

Always; a pure kubectl drain reproducer was found as well (see below).


Steps to Reproduce:

$ node=$(oc get pod -o go-template --template '{{.spec.nodeName}}' -n openshift-image-registry $(oc get pods --all-namespaces -l docker-registry=default -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | head -1))
$ kubectl drain --delete-local-data=true --force=true --grace-period=600 --ignore-daemonsets=true $node

Actual results:

The drain takes 600+ s to complete; other pods are evicted just fine.


Expected results:

Drain completes without waiting so long on the image-registry pod.


Additional info:

Comment 2 Antonio Murdaca 2019-07-31 12:58:12 UTC
Has this been cloned and targeted for 4.1? The MCO needs a 4.1 fix for https://github.com/openshift/machine-config-operator/pull/1023

Comment 5 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

