Bug 1628693 - Drain (server side eviction and client side) should ignore pods that are terminal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-13 17:54 UTC by Clayton Coleman
Modified: 2019-07-23 19:56 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where drain failed to evict terminated pods that were DaemonSet-managed or used local storage.
Clone Of:
Environment:
Last Closed: 2019-07-23 19:56:23 UTC
Target Upstream Version:
Embargoed:



Links
System                   ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata   RHBA-2019:1753  0        None      None    None     2019-07-23 19:56:35 UTC

Description Clayton Coleman 2018-09-13 17:54:52 UTC
In 3.11, draining a node that has terminal pods flags those pods as requiring extra work. That is not correct. A terminal pod can be safely deleted at any time. In addition, none of the controller or local-data guarantees need to be enforced, because terminal pods have already run to completion.

Expect:

Any terminal (Succeeded/Failed) pod on a node can be deleted by drain, regardless of whether it uses local volumes or is managed by a controller. Both server-side eviction and the client-side drain command should skip those pods when checking constraints, delete them like normal pods, and wait for their deletion (the kubelet needs to acknowledge it).
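The expected behavior can be sketched as a classification step. This is a minimal Python sketch, not the kubectl or OpenShift implementation: the `Pod` class, its boolean fields, and `classify_for_drain` are hypothetical simplifications, and the keyword arguments only mirror the `--delete-local-data`, `--ignore-daemonsets`, and `--force` drain flags. The point it illustrates is that the terminal-phase check runs before any constraint check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Phase(str, Enum):
    """Pod phases as reported in a Pod's status."""
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Succeeded and Failed pods have run to completion.
TERMINAL_PHASES = {Phase.SUCCEEDED, Phase.FAILED}


@dataclass
class Pod:
    """Hypothetical, simplified stand-in for a Kubernetes Pod."""
    name: str
    phase: Phase
    has_local_storage: bool = False  # e.g. mounts an emptyDir volume
    daemonset_managed: bool = False
    has_controller: bool = True      # owned by a ReplicaSet, Job, etc.


def classify_for_drain(
    pods: List[Pod],
    delete_local_data: bool = False,
    ignore_daemonsets: bool = False,
    force: bool = False,
) -> Tuple[List[str], List[str], List[str]]:
    """Split pods into (to_delete, blocking, left_running) name lists.

    The keyword arguments are hypothetical mirrors of the drain flags
    --delete-local-data, --ignore-daemonsets and --force.
    """
    to_delete, blocking, left_running = [], [], []
    for pod in pods:
        if pod.phase in TERMINAL_PHASES:
            # Terminal pods skip every constraint check and are simply deleted.
            to_delete.append(pod.name)
        elif pod.daemonset_managed:
            # Live DaemonSet pods are never deleted; they block unless ignored.
            if ignore_daemonsets:
                left_running.append(pod.name)
            else:
                blocking.append(pod.name)
        elif pod.has_local_storage and not delete_local_data:
            blocking.append(pod.name)
        elif not pod.has_controller and not force:
            blocking.append(pod.name)
        else:
            to_delete.append(pod.name)
    return to_delete, blocking, left_running


# Hypothetical node with two terminal pods and two running pods.
pods = [
    Pod("finished-job", Phase.SUCCEEDED, has_local_storage=True, has_controller=False),
    Pod("crashed-ds-pod", Phase.FAILED, daemonset_managed=True),
    Pod("web", Phase.RUNNING, has_local_storage=True),
    Pod("node-agent", Phase.RUNNING, daemonset_managed=True),
]
print(classify_for_drain(pods))
print(classify_for_drain(pods, delete_local_data=True, ignore_daemonsets=True))
```

Checking the phase first means a finished pod never needs an override flag, whatever volumes or owners it has. The sketch does not model the last part of the expectation, waiting for the kubelet to acknowledge each deletion.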

High severity because this prevented draining a node that had many jobs on it.

Comment 1 Clayton Coleman 2018-09-13 17:56:16 UTC
$ oc adm drain origin-ci-ig-n-51j0
node/origin-ci-ig-n-51j0 already cordoned
error: unable to drain node "origin-ci-ig-n-51j0", aborting command...

There are pending nodes to be drained:
 origin-ci-ig-n-51j0
error: pods with local storage (use --delete-local-data to override): config-updater-5d8cf557b4-4t2f4, artifacts-build, control-plane-build, node-build, base-build, src-build, e2e-aws-smoke, e2e-gcp, release-latest, integration, src-build, src-build, src-build, test-bin-build, prometheus-sidecar-build, 016cd424-b754-11e8-848c-0a58ac101e26, 136aaaee-b73f-11e8-848c-0a58ac101e26, 138b5b69-b776-11e8-848c-0a58ac101e26, 14411561-b776-11e8-848c-0a58ac101e26, 1495fdb9-b776-11e8-848c-0a58ac101e26, 183e2110-b776-11e8-b9f5-0a58ac10026c, 18453362-b776-11e8-b9f5-0a58ac10026c, 18c95536-b776-11e8-b9f5-0a58ac10026c, 18ca3153-b776-11e8-b9f5-0a58ac10026c, 1bba9408-b753-11e8-848c-0a58ac101e26, 21e1dde9-b754-11e8-b9f5-0a58ac10026c, 21e357b0-b754-11e8-b9f5-0a58ac10026c, 21e71f24-b754-11e8-b9f5-0a58ac10026c, 29be669e-b74c-11e8-b9f5-0a58ac10026c, 29c1b7f7-b74c-11e8-b9f5-0a58ac10026c, 29c267ec-b74c-11e8-b9f5-0a58ac10026c, 359bdefb-b740-11e8-928a-0a58ac100a7b, 359d60c6-b740-11e8-928a-0a58ac100a7b, 39791253-b754-11e8-84ce-0a58ac100061, 3dc71bbe-b737-11e8-928a-0a58ac100a7b, 3dc8cfcf-b737-11e8-928a-0a58ac100a7b, 3dc9ea25-b737-11e8-928a-0a58ac100a7b, 3dcac3b0-b737-11e8-928a-0a58ac100a7b, 4280a7ba-b754-11e8-b9f5-0a58ac10026c, 490eca98-b74e-11e8-848c-0a58ac101e26, 4e4abc53-b776-11e8-b9f5-0a58ac10026c, 57ab7495-b75c-11e8-848c-0a58ac101e26, 57aca054-b75c-11e8-848c-0a58ac101e26, 57ae202c-b75c-11e8-848c-0a58ac101e26, 5e0ab469-b775-11e8-b9f5-0a58ac10026c, 6004c887-b74f-11e8-b9f5-0a58ac10026c, 6006b162-b74f-11e8-b9f5-0a58ac10026c, 6008ada3-b74f-11e8-b9f5-0a58ac10026c, 6009624f-b74f-11e8-b9f5-0a58ac10026c, 600a0a61-b74f-11e8-b9f5-0a58ac10026c, 600aaea4-b74f-11e8-b9f5-0a58ac10026c, 600b5475-b74f-11e8-b9f5-0a58ac10026c, 600bfdfd-b74f-11e8-b9f5-0a58ac10026c, 600ca12b-b74f-11e8-b9f5-0a58ac10026c, 600dc1ec-b74f-11e8-b9f5-0a58ac10026c, 642516f5-b77a-11e8-848c-0a58ac101e26, 7058fa0b-b743-11e8-a9f3-0a58ac100a99, 74673ef9-b75c-11e8-848c-0a58ac101e26, 7b41f9cb-b753-11e8-848c-0a58ac101e26, 
7b42ae11-b753-11e8-848c-0a58ac101e26, 8423c179-b743-11e8-848c-0a58ac101e26, 850e6ff1-b743-11e8-848c-0a58ac101e26, 8539df80-b753-11e8-848c-0a58ac101e26, 853b7a03-b753-11e8-848c-0a58ac101e26, 853d7e8a-b753-11e8-848c-0a58ac101e26, 853e61ec-b753-11e8-848c-0a58ac101e26, 8555c695-b743-11e8-848c-0a58ac101e26, 8b233576-b73f-11e8-928a-0a58ac100a7b, 913076f8-b73d-11e8-928a-0a58ac100a7b, 93700ddf-b743-11e8-928a-0a58ac100a7b, aa04a59d-b751-11e8-848c-0a58ac101e26, b87cec13-b737-11e8-b497-0a58ac100aa1, b87ed6d0-b737-11e8-b497-0a58ac100aa1, b87f90da-b737-11e8-b497-0a58ac100aa1, ba48dd07-b753-11e8-848c-0a58ac101e26, c30973f0-b74d-11e8-b9f5-0a58ac10026c, c30a3313-b74d-11e8-b9f5-0a58ac10026c, c30aecfc-b74d-11e8-b9f5-0a58ac10026c, c8f7827d-b775-11e8-848c-0a58ac101e26, config-updater-12-qntld, d22988e8-b75c-11e8-b9f5-0a58ac10026c, dbf9106f-b74a-11e8-848c-0a58ac101e26, dbfaeb90-b74a-11e8-848c-0a58ac101e26, dbfd2aac-b74a-11e8-848c-0a58ac101e26, dbfe2078-b74a-11e8-848c-0a58ac101e26, dd15ad0b-b773-11e8-848c-0a58ac101e26, eb08b2ba-b751-11e8-b9f5-0a58ac10026c, eb0ad012-b751-11e8-b9f5-0a58ac10026c, f29aa349-b73e-11e8-848c-0a58ac101e26, f29d5b05-b73e-11e8-848c-0a58ac101e26, hook-47-6v5ll, rpm-mirror-fdd546546-9dctl, docker-registry-25-bvrgb, prometheus-k8s-0, telemeter-0; DaemonSet-managed pods (use --ignore-daemonsets to ignore): node-exporter-g67vc, service-cert-sync-rwk28, sync-bgxsr, ovs-kppxl, sdn-ssj6n

All of the pods whose names are UIDs are terminal pods and should not appear in this message at all.
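For an error like the one above, the fix amounts to dropping terminal pods before the message is assembled. A hedged Python sketch, assuming each pod has already been tagged with the check it failed: `drain_error` and the `(name, phase, reason)` tuple layout are hypothetical, and only the two message fragments are copied from the drain output above.

```python
from typing import Iterable, Optional, Tuple

TERMINAL_PHASES = {"Succeeded", "Failed"}

# Message fragments copied from the oc adm drain output in this report.
REASON_TEXT = {
    "local-storage": "pods with local storage (use --delete-local-data to override)",
    "daemonset": "DaemonSet-managed pods (use --ignore-daemonsets to ignore)",
}


def drain_error(pods: Iterable[Tuple[str, str, str]]) -> Optional[str]:
    """Build the drain error from (name, phase, reason) tuples.

    Returns None when nothing blocks the drain. Terminal pods are dropped
    before grouping, so they never appear in the message.
    """
    grouped = {reason: [] for reason in REASON_TEXT}  # dicts keep insertion order
    for name, phase, reason in pods:
        if phase in TERMINAL_PHASES:
            continue  # a finished pod never blocks the drain
        grouped[reason].append(name)
    parts = [f"{REASON_TEXT[r]}: {', '.join(names)}" for r, names in grouped.items() if names]
    return "error: " + "; ".join(parts) if parts else None


# Hypothetical pods; only the Running ones should be listed.
node_pods = [
    ("db-0", "Running", "local-storage"),
    ("finished-job-abc", "Succeeded", "local-storage"),
    ("node-agent-xyz", "Running", "daemonset"),
]
print(drain_error(node_pods))
```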

Comment 2 ravig 2018-09-18 12:28:12 UTC
Upstream PR:

https://github.com/kubernetes/kubernetes/pull/68767

Comment 7 Sunil Choudhary 2019-07-15 12:38:23 UTC
# oc version
oc v3.11.128
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-sunilc-311lb-1:443
openshift v3.11.129
kubernetes v1.11.0+d4cacc0


# oc get pods -o wide
NAME                    READY     STATUS      RESTARTS   AGE       IP          NODE                  NOMINATED NODE
httpd-example-1-build   0/1       Completed   0          15m       10.2.6.10   qe-sunilc-311node-2   <none>
httpd-example-1-wvn54   1/1       Running     0          14m       10.2.6.13   qe-sunilc-311node-2   <none>
nginx-example-1-build   0/1       Completed   0          15m       10.2.6.11   qe-sunilc-311node-2   <none>
nginx-example-1-drw5t   1/1       Running     0          14m       10.2.6.12   qe-sunilc-311node-2   <none>
ruby-ex-1-hb6jr         1/1       Running     0          32m       10.2.12.8   qe-sunilc-311node-1   <none>

# oc adm drain qe-sunilc-311node-2 --ignore-daemonsets=true
node/qe-sunilc-311node-2 cordoned
WARNING: Ignoring DaemonSet-managed pods: node-exporter-qrmpz, sync-gpkws, ovs-nmj4t, sdn-6kzxj
pod/httpd-example-1-build evicted
pod/nginx-example-1-build evicted
pod/nginx-example-1-drw5t evicted
pod/httpd-example-1-wvn54 evicted

Comment 9 errata-xmlrpc 2019-07-23 19:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753

