Description of problem:
IPI on OSP: when a node that has a pod with local storage is unreachable, draining the node fails.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-03-175205

How reproducible:
Always

Steps to Reproduce:
1. Stop kubelet on a worker node that has a pod with local storage/data, such as alertmanager-main-1 in namespace openshift-monitoring
2. Drain the node
$ oc adm drain zhsun9-x9smf-worker-gc2pb --ignore-daemonsets --delete-local-data
node/zhsun9-x9smf-worker-gc2pb cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-qhc69, openshift-dns/dns-default-brjwk, openshift-image-registry/node-ca-pvwk6, openshift-machine-config-operator/machine-config-daemon-kcgvg, openshift-monitoring/node-exporter-h8tp4, openshift-multus/multus-qdlwn, openshift-sdn/ovs-ct6ft, openshift-sdn/sdn-6tdvb
evicting pod "csi-snapshot-controller-operator-6f959d8749-f4vzp"
evicting pod "alertmanager-main-2"
evicting pod "openshift-state-metrics-58dddb6776-6p87v"
evicting pod "csi-snapshot-controller-6cccfcb78b-lvmzm"
evicting pod "grafana-7dd7875c69-sddzn"
evicting pod "router-default-6d84d84994-8j86r"
evicting pod "migrator-6b788867c8-59m8c"
evicting pod "kube-state-metrics-95445d6d7-fm6kf"
evicting pod "redhat-marketplace-9c7d9ffd4-4t9hd"
evicting pod "alertmanager-main-0"
evicting pod "alertmanager-main-1"
evicting pod "prometheus-k8s-0"
evicting pod "redhat-operators-668f9c4f6c-mtr25"
evicting pod "prometheus-adapter-6f949b8994-7mz2s"
evicting pod "prometheus-k8s-1"
evicting pod "image-registry-6b7dc776db-vwzrq"
evicting pod "thanos-querier-678f7fbcbf-cl6fp"
3. Delete the machine associated with that node
4. Check the logs

Actual results:
After 2 hours the machine is still stuck in the Deleting phase.

$ oc get machine
NAME                          PHASE      TYPE        REGION      ZONE   AGE
zhsun9-x9smf-master-0         Running    m1.xlarge   regionOne   nova   29h
zhsun9-x9smf-master-1         Running    m1.xlarge   regionOne   nova   29h
zhsun9-x9smf-master-2         Running    m1.xlarge   regionOne   nova   29h
zhsun9-x9smf-worker-2-tl8bp   Running    m1.xlarge   regionOne   nova   21h
zhsun9-x9smf-worker-cxddw     Running    m1.xlarge   regionOne   nova   16h
zhsun9-x9smf-worker-gc2pb     Deleting   m1.xlarge   regionOne   nova   28h
zhsun9-x9smf-worker-n6xgw     Running    m1.xlarge   regionOne   nova   126m
zhsun9-x9smf-worker-smf2x     Running    m1.xlarge   regionOne   nova   19h
zhsun9-x9smf-worker-wr7dd     Running    m1.xlarge   regionOne   nova   15h

Machine controller logs:
I0305 03:43:46.044197       1 controller.go:164] Reconciling Machine "zhsun9-x9smf-worker-gc2pb"
I0305 03:43:46.044285       1 controller.go:376] Machine "zhsun9-x9smf-worker-gc2pb" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0305 03:43:46.066928       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0305 03:43:46.737975       1 controller.go:284] Reconciling machine "zhsun9-x9smf-worker-gc2pb" triggers idempotent update
I0305 03:51:04.258108       1 controller.go:164] Reconciling Machine "zhsun9-x9smf-worker-gc2pb"
I0305 03:51:04.258160       1 controller.go:376] Machine "zhsun9-x9smf-worker-gc2pb" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
......
I0305 06:21:41.865370       1 info.go:20] unable to drain node "zhsun9-x9smf-worker-gc2pb"
I0305 06:21:41.865378       1 info.go:20] there are pending nodes to be drained: zhsun9-x9smf-worker-gc2pb
W0305 06:21:41.865390       1 controller.go:364] drain failed for machine "zhsun9-x9smf-worker-gc2pb": [error when waiting for pod "csi-snapshot-controller-operator-6f959d8749-f4vzp" terminating: timed out waiting for the condition, error when waiting for pod "image-registry-6b7dc776db-vwzrq" terminating: timed out waiting for the condition, error when waiting for pod "router-default-6d84d84994-8j86r" terminating: timed out waiting for the condition, error when waiting for pod "csi-snapshot-controller-6cccfcb78b-lvmzm" terminating: timed out waiting for the condition, error when waiting for pod "redhat-marketplace-9c7d9ffd4-4t9hd" terminating: timed out waiting for the condition, error when waiting for pod "redhat-operators-668f9c4f6c-mtr25" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-0" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-k8s-1" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-1" terminating: timed out waiting for the condition, error when waiting for pod "thanos-querier-678f7fbcbf-cl6fp" terminating: timed out waiting for the condition, error when waiting for pod "migrator-6b788867c8-59m8c" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-2" terminating: timed out waiting for the condition, error when waiting for pod "openshift-state-metrics-58dddb6776-6p87v" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-k8s-0" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-adapter-6f949b8994-7mz2s" terminating: timed out waiting for the condition, error when waiting for pod "grafana-7dd7875c69-sddzn" terminating: timed out waiting for the condition, error when waiting for pod "kube-state-metrics-95445d6d7-fm6kf" terminating: timed out waiting for the condition]
E0305 06:21:41.865425       1 controller.go:235] Failed to drain node for machine "zhsun9-x9smf-worker-gc2pb": requeue in: 20s
I0305 06:21:41.865443       1 controller.go:409] Actuator returned requeue-after error: requeue in: 20s

$ oc get po --all-namespaces -o wide | grep Terminating
openshift-csi-snapshot-controller-operator   csi-snapshot-controller-operator-6f959d8749-f4vzp   1/1   Terminating   0    29h     10.131.0.4     zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-csi-snapshot-controller            csi-snapshot-controller-6cccfcb78b-lvmzm            1/1   Terminating   63   28h     10.131.0.16    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-image-registry                     image-registry-6b7dc776db-vwzrq                     1/1   Terminating   0    28h     10.131.0.11    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-ingress                            router-default-6d84d84994-8j86r                     1/1   Terminating   0    28h     192.168.0.24   zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-kube-storage-version-migrator      migrator-6b788867c8-59m8c                           1/1   Terminating   0    28h     10.131.0.12    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-marketplace                        redhat-marketplace-9c7d9ffd4-4t9hd                  1/1   Terminating   0    28h     10.131.0.14    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-marketplace                        redhat-operators-668f9c4f6c-mtr25                   1/1   Terminating   0    28h     10.131.0.9     zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         alertmanager-main-0                                 3/3   Terminating   0    28h     10.131.0.29    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         alertmanager-main-1                                 3/3   Terminating   0    28h     10.131.0.28    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         alertmanager-main-2                                 3/3   Terminating   0    28h     10.131.0.24    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         grafana-7dd7875c69-sddzn                            2/2   Terminating   0    28h     10.131.0.25    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         kube-state-metrics-95445d6d7-fm6kf                  3/3   Terminating   0    28h     10.131.0.6     zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         openshift-state-metrics-58dddb6776-6p87v            3/3   Terminating   0    28h     10.131.0.5     zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         prometheus-adapter-6f949b8994-7mz2s                 1/1   Terminating   0    5h12m   10.131.0.43    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         prometheus-k8s-0                                    7/7   Terminating   1    28h     10.131.0.30    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         prometheus-k8s-1                                    7/7   Terminating   1    19h     10.131.0.38    zhsun9-x9smf-worker-gc2pb   <none>   <none>
openshift-monitoring                         thanos-querier-678f7fbcbf-cl6fp                     4/4   Terminating   0    27h     10.131.0.31    zhsun9-x9smf-worker-gc2pb   <none>   <none>

Expected results:
The node is drained successfully and the machine can be deleted.

Additional info:
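For reference, a minimal diagnostic sketch (pod and node names are taken from the output above; the exact commands are illustrative) to confirm that a pod on the node uses local/emptyDir storage, which is what makes --delete-local-data necessary here, and to list everything still scheduled on the affected node:

$ oc -n openshift-monitoring get pod alertmanager-main-1 -o yaml | grep -A1 'emptyDir'
$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=zhsun9-x9smf-worker-gc2pb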
This is probably just honouring the PDB, which is expected. You can signal admin intent to force delete by setting the annotation `machine.openshift.io/exclude-node-draining: ""` on the machine. I would expect the log message to mention the PDB or be more specific, though, so I'm leaving this open to dig into the logs.
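For reference, a minimal sketch of how that annotation could be applied with oc (the machine name is taken from the report above; this is illustrative, not a prescribed fix):

$ oc -n openshift-machine-api annotate machine zhsun9-x9smf-worker-gc2pb \
    machine.openshift.io/exclude-node-draining=""

With the annotation set, the machine controller should skip the drain step on its next reconcile and proceed with deleting the instance.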
@alberto This bug is similar to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1733474. It happens because the node I drained has pods with local storage. On AWS, GCP, and Azure there is no such issue; with the same steps, only OSP fails.
@sunzhaohua Would you mind trying this again with a newer version of the installer? My understanding of the scenario is that this has been fixed in cluster-api and should no longer be a problem.
@egarcia I tried again; the machine still gets stuck in the Deleting status for about 2 hours.

$ oc adm drain zhsun78-jw9rt-worker-z7gfp --ignore-daemonsets --delete-local-data
node/zhsun78-jw9rt-worker-z7gfp cordoned
evicting pod "alertmanager-main-2"
evicting pod "router-default-646f7d7959-fhf2z"
evicting pod "alertmanager-main-1"
evicting pod "prometheus-adapter-7ccc9d5c6d-mdc5c"
evicting pod "grafana-74d584c555-5llss"
evicting pod "prometheus-k8s-0"
evicting pod "telemeter-client-7cc8d5d84f-dfhch"
evicting pod "thanos-querier-6cd75c56f5-ws5zz"
pod/alertmanager-main-1 evicted
E0313 14:08:45.060297   10670 request.go:924] Unexpected error when reading response body: read tcp 10.72.13.139:54482->10.0.101.20:6443: read: invalid argument
pod/router-default-646f7d7959-fhf2z evicted
pod/grafana-74d584c555-5llss evicted
pod/prometheus-adapter-7ccc9d5c6d-mdc5c evicted
pod/alertmanager-main-2 evicted
pod/telemeter-client-7cc8d5d84f-dfhch evicted
pod/thanos-querier-6cd75c56f5-ws5zz evicted
pod/prometheus-k8s-0 evicted
node/zhsun78-jw9rt-worker-z7gfp evicted

$ oc get pod -o wide --all-namespaces | grep Terminating
openshift-ingress      router-default-646f7d7959-fhf2z       1/1   Terminating   0   4h40m   192.168.0.22   zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   alertmanager-main-1                   3/3   Terminating   0   4h32m   10.128.2.10    zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   alertmanager-main-2                   3/3   Terminating   0   4h32m   10.128.2.6     zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   grafana-74d584c555-5llss              2/2   Terminating   0   4h32m   10.128.2.9     zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   prometheus-adapter-7ccc9d5c6d-mdc5c   1/1   Terminating   0   4h32m   10.128.2.5     zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   prometheus-k8s-0                      7/7   Terminating   1   4h31m   10.128.2.13    zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   telemeter-client-7cc8d5d84f-dfhch     3/3   Terminating   0   4h31m   10.128.2.11    zhsun78-jw9rt-worker-z7gfp   <none>   <none>
openshift-monitoring   thanos-querier-6cd75c56f5-ws5zz       4/4   Terminating   0   4h31m   10.128.2.12    zhsun78-jw9rt-worker-z7gfp   <none>   <none>

$ oc get node
NAME                         STATUS                        ROLES    AGE     VERSION
zhsun78-jw9rt-master-0       Ready                         master   4h49m   v1.17.1
zhsun78-jw9rt-master-1       Ready                         master   4h49m   v1.17.1
zhsun78-jw9rt-master-2       Ready                         master   4h49m   v1.17.1
zhsun78-jw9rt-worker-9v2hb   Ready                         worker   4h31m   v1.17.1
zhsun78-jw9rt-worker-jqj26   Ready                         worker   4h39m   v1.17.1
zhsun78-jw9rt-worker-z28r2   Ready                         worker   135m    v1.17.1
zhsun78-jw9rt-worker-z7gfp   NotReady,SchedulingDisabled   worker   4h37m   v1.17.1

Machine controller logs:
I0313 04:01:14.193862       1 controller.go:164] Reconciling Machine "zhsun78-jw9rt-worker-z7gfp"
I0313 04:01:14.193979       1 controller.go:376] Machine "zhsun78-jw9rt-worker-z7gfp" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0313 04:01:14.270598       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0313 04:01:14.817282       1 controller.go:284] Reconciling machine "zhsun78-jw9rt-worker-z7gfp" triggers idempotent update
I0313 06:32:27.089538       1 info.go:16] ignoring DaemonSet-managed pods: tuned-4ppm4, dns-default-qp8p2, node-ca-cnjmq, machine-config-daemon-blkdh, node-exporter-9772q, multus-q2bjt, ovs-phcds, sdn-l9fpn; deleting pods with local storage: alertmanager-main-1, alertmanager-main-2, grafana-74d584c555-5llss, prometheus-adapter-7ccc9d5c6d-mdc5c, prometheus-k8s-0
I0313 06:32:27.089573       1 info.go:20] failed to evict pods from node "zhsun78-jw9rt-worker-z7gfp" (pending pods: alertmanager-main-1,alertmanager-main-2,grafana-74d584c555-5llss,prometheus-adapter-7ccc9d5c6d-mdc5c,prometheus-k8s-0,router-default-646f7d7959-fhf2z,telemeter-client-7cc8d5d84f-dfhch,thanos-querier-6cd75c56f5-ws5zz): [error when waiting for pod "prometheus-adapter-7ccc9d5c6d-mdc5c" terminating: timed out waiting for the condition, error when waiting for pod "grafana-74d584c555-5llss" terminating: timed out waiting for the condition, error when waiting for pod "thanos-querier-6cd75c56f5-ws5zz" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-1" terminating: timed out waiting for the condition, error when waiting for pod "router-default-646f7d7959-fhf2z" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-2" terminating: timed out waiting for the condition, error when waiting for pod "telemeter-client-7cc8d5d84f-dfhch" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-k8s-0" terminating: timed out waiting for the condition]
I0313 06:32:27.089606       1 info.go:16] [error when waiting for pod "prometheus-adapter-7ccc9d5c6d-mdc5c" terminating: timed out waiting for the condition, error when waiting for pod "grafana-74d584c555-5llss" terminating: timed out waiting for the condition, error when waiting for pod "thanos-querier-6cd75c56f5-ws5zz" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-1" terminating: timed out waiting for the condition, error when waiting for pod "router-default-646f7d7959-fhf2z" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-2" terminating: timed out waiting for the condition, error when waiting for pod "telemeter-client-7cc8d5d84f-dfhch" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-k8s-0" terminating: timed out waiting for the condition]
I0313 06:32:27.089616       1 info.go:20] unable to drain node "zhsun78-jw9rt-worker-z7gfp"
I0313 06:32:27.089621       1 info.go:20] there are pending nodes to be drained: zhsun78-jw9rt-worker-z7gfp
W0313 06:32:27.089627       1 controller.go:364] drain failed for machine "zhsun78-jw9rt-worker-z7gfp": [error when waiting for pod "prometheus-adapter-7ccc9d5c6d-mdc5c" terminating: timed out waiting for the condition, error when waiting for pod "grafana-74d584c555-5llss" terminating: timed out waiting for the condition, error when waiting for pod "thanos-querier-6cd75c56f5-ws5zz" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-1" terminating: timed out waiting for the condition, error when waiting for pod "router-default-646f7d7959-fhf2z" terminating: timed out waiting for the condition, error when waiting for pod "alertmanager-main-2" terminating: timed out waiting for the condition, error when waiting for pod "telemeter-client-7cc8d5d84f-dfhch" terminating: timed out waiting for the condition, error when waiting for pod "prometheus-k8s-0" terminating: timed out waiting for the condition]
E0313 06:32:27.089646       1 controller.go:235] Failed to drain node for machine "zhsun78-jw9rt-worker-z7gfp": requeue in: 20s
I0313 06:32:27.089663       1 controller.go:409] Actuator returned requeue-after error: requeue in: 20s
I0313 06:32:47.089897       1 controller.go:164] Reconciling Machine "zhsun78-jw9rt-worker-z7gfp"
I0313 06:32:47.090025       1 controller.go:376] Machine "zhsun78-jw9rt-worker-z7gfp" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0313 06:32:47.090067       1 controller.go:227] Reconciling machine "zhsun78-jw9rt-worker-z7gfp" triggers delete
I0313 06:32:47.173906       1 info.go:16] ignoring DaemonSet-managed pods: tuned-4ppm4, dns-default-qp8p2, node-ca-cnjmq, machine-config-daemon-blkdh, node-exporter-9772q, multus-q2bjt, ovs-phcds, sdn-l9fpn; deleting pods with local storage: alertmanager-main-1, alertmanager-main-2, grafana-74d584c555-5llss, prometheus-adapter-7ccc9d5c6d-mdc5c, prometheus-k8s-0
Hmm, I will look a little further into this. What version of the installer did you get these results with? Also, do you have any custom scripts that might be deleting servers?
There are possibly two or more issues here.

First, the actual VM disappeared from OpenStack before the node was drained. This should never happen (unless you are intentionally skipping drain, which isn't the case here).

Second, the OpenStack provider must be using an outdated version of the machine-api. We have a workaround in place that ignores an unresponsive node (kubelet not communicating) and skips any pods that already have a deletion timestamp; see the diagnostic sketch below. That does not appear to be working in this bug, and it is not specific to any one provider.

Third, there may be something causing nodes on OpenStack to go unready/unreachable when draining. This could be a consequence of the first issue (the actual VM is already gone), or it could be that some pod required for the kubelet to communicate (such as a proxy or similar component) is itself being drained and then effectively blocks communication. If OpenStack has such a component, it should run as a DaemonSet.

The first step is to update your cluster-api dependencies in each release branch. In 4.3 and newer, we have moved the cluster-api library code into the machine-api-operator repo. Here's what we're doing for the AWS provider:
4.4: https://github.com/openshift/cluster-api-provider-aws/blob/release-4.4/go.mod#L13
4.3: https://github.com/openshift/cluster-api-provider-aws/blob/release-4.3/go.mod#L13
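A minimal diagnostic sketch for the two conditions that workaround keys on, i.e. the node reporting unreachable and the stuck pods already carrying a deletion timestamp (node and pod names are taken from the retest output above; the commands are illustrative):

# Ready condition should be "Unknown" (or "False") when the kubelet stops communicating
$ oc get node zhsun78-jw9rt-worker-z7gfp \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'

# A non-empty deletionTimestamp means the eviction already marked the pod for deletion
$ oc -n openshift-monitoring get pod prometheus-k8s-0 \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'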
Verified on 4.5.0-0.nightly-2020-05-04-113741. "oc delete machine" drains the node correctly; the node, the machine, and the OpenStack instance are all deleted.
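For reference, a minimal sketch of that verification flow (the machine name placeholder is illustrative):

$ oc -n openshift-machine-api delete machine <worker-machine-name>
$ oc -n openshift-machine-api get machine    # the machine should drain, pass through Deleting and disappear
$ oc get node                                # the corresponding node should be removed
$ openstack server list                      # the backing OpenStack instance should be gone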
*** Bug 1811400 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409