Description of problem:
The --force option seems to be misbehaving when there is a completely orphaned pod in the cluster (no RC, DC, or daemonset). The drain command with the --force option fails to remove orphaned pods, e.g. daemonset pods (such as fluentd pods) whose DC/RC has somehow been removed.

Version-Release number of selected component (if applicable): 3.4

How reproducible: As reproducible as creating orphaned pods.

Steps to Reproduce:
1. Deploy daemonset pods, like fluentd. Ensure the DC/RC is non-existent.
2. Run "drain" with the --force option.
3. Observe the undrained orphan pod.

Actual results: Normal pods removed, orphaned pods not removed.

Expected results: Normal and orphaned pods removed.

Additional info: None
Just to clear up the above: daemonset pods get cleaned up just fine. The problem is the scenario where someone deletes a DC through the UI and, due to whatever conditions, a pod ends up being left lying around. Once that happens, drain fails and complains that it cannot find an RC for the pod (which doesn't exist, since it was deleted). However, based on the -h text for drain, --force should remove such pods.
(In reply to Boris Kurktchiev from comment #1)
> just to clear up the above. Daemonset pods get cleaned up just fine, its the
> scenario where someone goes and deletes a DC through the UI and due to
> whatever conditions a pod ends up being left lying around, once that
> happens, drain fails and complains that it cannot find an RC for the pod
> (which doesnt exist since it was deleted), however based on the -h for drain
> --force should remove such pods.

Your explanation is consistent with the code, and I don't see -h making a claim about orphaned pods. --force is intended to proceed with the drain when a node is hosting pods that are not managed. If a pod indicates it is managed (as an orphaned pod would) and the managing resource cannot be found, I'm not sure --force should delete it. Having drain fail in that case gives the user a chance to detect that something serious is wrong.
So here is what I am seeing when I view oc adm drain -h:

Examples:
  # Drain node "foo", even if there are pods not managed by a ReplicationController, ReplicaSet, Job, or DaemonSet on it.
  $ oc adm drain foo --force

Reading the line above, it makes it seem as if it would do exactly what I described.

root@osmaster0p:/etc/origin/master: ----> oc version
oc v3.4.0.39
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://api.cloudapps.unc.edu:443
openshift v3.4.0.39
kubernetes v1.4.0+776c994

I do not know if that text has changed in the latest 3.4.1.* release. If it has, then OK; if not, then there is still a mismatch between what the code is doing and what the user is told should happen, according to the help text.
(In reply to Boris Kurktchiev from comment #3)
> SO here is what I am seeing when i view oc adm drain -h:
>
> Examples:
>   # Drain node "foo", even if there are pods not managed by a
> ReplicationController, ReplicaSet, Job, or DaemonSet on it.
>   $ oc adm drain foo --force
>
> Reading the line above, it makes it seem as it would do exactly what I
> described.

Why do you think it would remove orphans, when there is no mention of orphans in the text? When orphans are detected, I think a user needs to figure out what to do with them. I would expect a user to recreate the managing resource, or to run delete with a --selector that targets the orphans, rather than blindly removing orphaned pods.
The text reads:

  Drain node "foo", even if there are pods not managed by a ReplicationController, ReplicaSet, Job, or DaemonSet on it.

My scenario as described: daemonset pods get cleaned up just fine; the problem is the scenario where someone goes and deletes a DC through the UI and, due to whatever conditions, a pod ends up being left lying around. Once that happens, drain fails and complains that it cannot find an RC for the pod.

As I read the current text, said pod should be deleted. I am not saying you are wrong; your assertion that what happens is the correct behavior may well be right. What I am driving at is that I went in assuming something would happen based on what I read, so if anything the help text needs to reflect the actual behavior. Absent some way to make sure that users do NOT end up in the state described (pods lying around because their DC/RCs get deleted and the system doesn't know what to do with them), I built a process around what I assumed the behavior of --force was going to be, based on the information provided by -h.
UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/41864
I have no idea why this was moved to POST or MODIFIED. Per comment 8, we still need this in release-1.5. Moving back to ASSIGNED.
The Origin PR for release-1.5 is here and has merged: https://github.com/openshift/origin/pull/13123
This has been merged into OCP and is in OCP v3.5.0.38 or newer.
Verified on openshift v3.5.0.39. Fixed.

Steps:
1. Create a pod which is not managed by any controller. (pod created and running)
2. # oadm drain <nodeName> --force --delete-local-data
   pod "hello-pod1" evicted
   node "<nodeName>" drained
3. Check pod and node status:
   # oc get pod hello-pod1
   Error from server (NotFound): pods "hello-pod1" not found
   # oc describe node <nodeName>
   <---snip--->
   Non-terminated Pods: (0 in total)
   <---snip--->
Test steps:
1. Create an RC.
   # oc create -f rc.yaml
2. Delete the RC, keeping the pod.
   # oc delete rc <RCname> --cascade=false
3. Drain the node.
   # oadm drain <nodeName> --force --delete-local-data
   node "<nodeName>" drained
4. Check node status:
   # oc describe node <nodeName>
   <---snip--->
   Non-terminated Pods: (0 in total)
   <---snip--->
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884