Note to QE: This BZ is made up of fixes coming from both cluster-api and the MCO; it should be verified only after the MCO patches go in (as a result of vendoring cluster-api into the MCO).
Verify steps for QE once this goes to MODIFIED:
- spin up a cluster
- upgrade to any newer payload from 4.3.0-0.okd in https://origin-release.svc.ci.openshift.org/
- watch it upgrade (with this bug the upgrade can take noticeably longer)
- once it has upgraded, verify that _every_ MCD pod shows no pod taking 10m (600s) to be evicted (example below; a scripted check is sketched after the log)

Example log with the bug:

```
...
I0924 06:03:54.005617 32888 update.go:89] pod "prometheus-k8s-0" removed (evicted)
I0924 06:03:54.806323 32888 update.go:89] pod "dp-9fcb69c69-zgz9m" removed (evicted)
I0924 06:03:55.204997 32888 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0924 06:03:57.075300 32888 update.go:89] pod "prometheus-adapter-fcc9998f5-54p2h" removed (evicted)
I0924 06:13:48.071299 32888 update.go:89] pod "downloads-5f75997b8b-wx9jm" removed (evicted)
...
```

(note the 10 minutes between the prometheus-adapter pod and downloads)
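Optionally, instead of eyeballing each MCD log, the check can be scripted. A minimal sketch (not part of the original steps; it assumes the stock openshift-machine-config-operator namespace, the k8s-app=machine-config-daemon pod label, and klog-style timestamps in the second field) that flags any gap longer than 60s between consecutive evictions:

```
# Sketch: flag gaps >60s between consecutive "removed (evicted)" lines in
# each machine-config-daemon pod's log. Namespace, label, and container
# names are the stock MCO ones; adjust if your cluster differs.
for pod in $(oc -n openshift-machine-config-operator get pods \
    -l k8s-app=machine-config-daemon -o name); do
  echo "== ${pod}"
  oc -n openshift-machine-config-operator logs "${pod}" -c machine-config-daemon |
    awk '/removed \(evicted\)/ {
      split($2, t, /[:.]/)                 # klog timestamp, HH:MM:SS.micros
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (prev != "" && now - prev > 60)
        printf "  gap of %ds before: %s\n", now - prev, $0
      prev = now
    }'
done
```

With the fix in, this should print no gaps; with the bug, the 600s gap from the example log above would be flagged.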
Verified using 4.3.0-0.ci-2019-10-05-213736

```
$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-10-05-213736   True        False         4m30s   Cluster version is 4.3.0-0.ci-2019-10-05-213736

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://origin-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade --force --to-image registry.svc.ci.openshift.org/origin/release:4.3.0-0.okd-2019-10-05-233957
Updating to release image registry.svc.ci.openshift.org/origin/release:4.3.0-0.okd-2019-10-05-233957

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-10-05-213736   True        True          13m     Working towards 4.3.0-0.okd-2019-10-05-233957: 86% complete

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.okd-2019-10-05-233957   True        False         5m14s   Cluster version is 4.3.0-0.okd-2019-10-05-233957

### Inspect node logs

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-129-96.ec2.internal    Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-134-126.ec2.internal   Ready    master   59m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-140-103.ec2.internal   Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-142-191.ec2.internal   Ready    master   60m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-151-200.ec2.internal   Ready    master   60m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-156-177.ec2.internal   Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8

### Example worker node

$ oc debug node/ip-10-0-129-96.ec2.internal
Starting pod/ip-10-0-129-96ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.129.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.603186 31306 update.go:984] Update prepared; beginning drain
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.613485 31306 update.go:89] cordoned node "ip-10-0-129-96.ec2.internal"
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.677000 31306 update.go:93] ignoring DaemonSet-managed pods: tuned-hd4w7, dns-default-dlf7c, node-ca-klp2m, machine-config-daemon-fzxpw, node-exporter-jxdxk, multus-t2cvn, ovs-8xjlh, sdn-v6gz9; deleting pods with local storage: alertmanager-main-2, grafana-57d6fc6694-9mqt5, kube-state-metrics-7d9d74bc87-22kfj, prometheus-adapter-5585c785d-5wnxs, prometheus-k8s-1
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:31.691659 31306 update.go:89] pod "router-default-5db6c5c78b-krh8b" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:32.888417 31306 update.go:89] pod "grafana-57d6fc6694-9mqt5" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:33.085490 31306 update.go:89] pod "openshift-state-metrics-5b487646c9-5hx5h" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:33.285142 31306 update.go:89] pod "kube-state-metrics-7d9d74bc87-22kfj" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:34.085337 31306 update.go:89] pod "telemeter-client-64cd646b7d-9q82l" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:34.486075 31306 update.go:89] pod "alertmanager-main-2" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:35.085534 31306 update.go:89] pod "prometheus-adapter-5585c785d-5wnxs" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:35.771674 31306 update.go:89] pod "downloads-846c4dc468-dwwwx" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770773 31306 update.go:89] pod "prometheus-k8s-1" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770824 31306 update.go:89] drained node "ip-10-0-129-96.ec2.internal"
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770837 31306 update.go:984] drain complete
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.773636 31306 update.go:984] initiating reboot: Node will reboot into config rendered-worker-1ecac007ca00090c4776d1f903ae3c13

### ^^^ ~10s from drain start to drain complete; ~1s for each pod to be evicted

### Example master node

$ oc debug node/ip-10-0-134-126.ec2.internal
Starting pod/ip-10-0-134-126ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.134.126
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.033640 27212 update.go:89] pod "machine-approver-85857fc9cc-vr6ch" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.431310 27212 update.go:89] pod "cluster-samples-operator-6c9bd886df-jjts4" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.625499 27212 update.go:89] pod "apiservice-cabundle-injector-59f56c5c7b-4svk5" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:46.423940 27212 update.go:89] pod "revision-pruner-7-ip-10-0-134-126.ec2.internal" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:48.023452 27212 update.go:89] pod "cloud-credential-operator-69987f75d5-jdw8q" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:48.688140 27212 update.go:89] pod "openshift-apiserver-operator-5b45db86b5-nmtj8" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:51.225761 27212 update.go:89] pod "dns-operator-78d5d78cd7-q6j27" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:51.424234 27212 update.go:89] pod "cluster-image-registry-operator-79f88868f7-b74nq" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.460225 27212 update.go:89] pod "machine-config-operator-584c66db64-lzwzp" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.668818 27212 update.go:89] pod "ingress-operator-7647cb4699-zjt5b" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.831734 27212 update.go:89] pod "catalog-operator-5b8c6d7b54-c9snm" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:53.668602 27212 update.go:89] pod "etcd-quorum-guard-6f6485c57b-wwqcv" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:54.225115 27212 update.go:89] pod "olm-operator-ffcfc494-qmth7" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:54.631607 27212 update.go:89] pod "network-operator-7b99784886-5ld24" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828004 27212 update.go:89] pod "machine-api-controllers-558478f7fd-kcr2z" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828070 27212 update.go:89] drained node "ip-10-0-134-126.ec2.internal"
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828084 27212 update.go:984] drain complete

### ^^^ ~10s from drain start to drain complete; ~1s for each pod to be evicted
```
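As an optional aside (not part of the original verification), the ~10s figures above can be read off mechanically: in the `journalctl -b | grep update.go` output, the klog timestamp is the 7th whitespace-separated field, so a small awk filter over the "beginning drain" / "drain complete" markers prints the drain duration. A minimal sketch, run on the node after `chroot /host`:

```
# Hedged sketch: compute the drain window from the klog timestamps.
# Assumes the journal line layout shown above (klog time in field 7)
# and that the drain starts and completes on the same day.
journalctl -b | grep update.go | awk '
  /beginning drain/ { split($7, t, /[:.]/); start = t[1]*3600 + t[2]*60 + t[3] }
  /drain complete/  { split($7, t, /[:.]/)
                      printf "drain took ~%ds\n", t[1]*3600 + t[2]*60 + t[3] - start }'
```

On the worker example above this would print roughly "drain took ~11s", matching the ~10s eyeballed from the log.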
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062