Note to QE: This BZ is made up of fixes coming from both cluster-api and the MCO; it should be verified only after the MCO patches go in (as a result of vendoring cluster-api into the MCO).
Verify steps for QE once this goes to MODIFIED:
- spin up a cluster
- upgrade to any newer payload from 4.3.0-0.okd in https://origin-release.svc.ci.openshift.org/
- watch it upgrade (with this bug the upgrade can take noticeably longer)
- once it has upgraded, verify that _every_ MCD pod shows no pod taking 10m (600s) to be evicted (example below; a scripted check is sketched after the log)

Example log with the bug:

```
...
I0924 06:03:54.005617 32888 update.go:89] pod "prometheus-k8s-0" removed (evicted)
I0924 06:03:54.806323 32888 update.go:89] pod "dp-9fcb69c69-zgz9m" removed (evicted)
I0924 06:03:55.204997 32888 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0924 06:03:57.075300 32888 update.go:89] pod "prometheus-adapter-fcc9998f5-54p2h" removed (evicted)
I0924 06:13:48.071299 32888 update.go:89] pod "downloads-5f75997b8b-wx9jm" removed (evicted)
...
```

(note the 10 minutes between the prometheus-adapter pod and downloads)
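Optionally, instead of eyeballing each MCD log, the check can be scripted. A minimal sketch (not part of the original steps; it assumes the stock openshift-machine-config-operator namespace, the k8s-app=machine-config-daemon pod label, and klog-style timestamps in the second field) that flags any gap longer than 60s between consecutive evictions:

```
# Sketch: flag gaps >60s between consecutive "removed (evicted)" lines in
# each machine-config-daemon pod's log. Namespace, label, and container
# names are the stock MCO ones; adjust if your cluster differs.
for pod in $(oc -n openshift-machine-config-operator get pods \
    -l k8s-app=machine-config-daemon -o name); do
  echo "== ${pod}"
  oc -n openshift-machine-config-operator logs "${pod}" -c machine-config-daemon |
    awk '/removed \(evicted\)/ {
      split($2, t, /[:.]/)                 # klog timestamp, HH:MM:SS.micros
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (prev != "" && now - prev > 60)
        printf "  gap of %ds before: %s\n", now - prev, $0
      prev = now
    }'
done
```

With the fix in, this should print no gaps; with the bug, the 600s gap from the example log above would be flagged.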
Verified using 4.3.0-0.ci-2019-10-05-213736

```
$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-10-05-213736   True        False         4m30s   Cluster version is 4.3.0-0.ci-2019-10-05-213736

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://origin-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade --force --to-image registry.svc.ci.openshift.org/origin/release:4.3.0-0.okd-2019-10-05-233957
Updating to release image registry.svc.ci.openshift.org/origin/release:4.3.0-0.okd-2019-10-05-233957

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.ci-2019-10-05-213736   True        True          13m     Working towards 4.3.0-0.okd-2019-10-05-233957: 86% complete

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.okd-2019-10-05-233957   True        False         5m14s   Cluster version is 4.3.0-0.okd-2019-10-05-233957

### Inspect node logs

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-129-96.ec2.internal    Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-134-126.ec2.internal   Ready    master   59m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-140-103.ec2.internal   Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-142-191.ec2.internal   Ready    master   60m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-151-200.ec2.internal   Ready    master   60m   v1.16.0-beta.2+1fba8e9b8
ip-10-0-156-177.ec2.internal   Ready    worker   53m   v1.16.0-beta.2+1fba8e9b8

### Example worker node

$ oc debug node/ip-10-0-129-96.ec2.internal
Starting pod/ip-10-0-129-96ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.129.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.603186 31306 update.go:984] Update prepared; beginning drain
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.613485 31306 update.go:89] cordoned node "ip-10-0-129-96.ec2.internal"
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:25.677000 31306 update.go:93] ignoring DaemonSet-managed pods: tuned-hd4w7, dns-default-dlf7c, node-ca-klp2m, machine-config-daemon-fzxpw, node-exporter-jxdxk, multus-t2cvn, ovs-8xjlh, sdn-v6gz9; deleting pods with local storage: alertmanager-main-2, grafana-57d6fc6694-9mqt5, kube-state-metrics-7d9d74bc87-22kfj, prometheus-adapter-5585c785d-5wnxs, prometheus-k8s-1
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:31.691659 31306 update.go:89] pod "router-default-5db6c5c78b-krh8b" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:32.888417 31306 update.go:89] pod "grafana-57d6fc6694-9mqt5" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:33.085490 31306 update.go:89] pod "openshift-state-metrics-5b487646c9-5hx5h" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:33.285142 31306 update.go:89] pod "kube-state-metrics-7d9d74bc87-22kfj" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:34.085337 31306 update.go:89] pod "telemeter-client-64cd646b7d-9q82l" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:34.486075 31306 update.go:89] pod "alertmanager-main-2" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:35.085534 31306 update.go:89] pod "prometheus-adapter-5585c785d-5wnxs" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:35.771674 31306 update.go:89] pod "downloads-846c4dc468-dwwwx" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770773 31306 update.go:89] pod "prometheus-k8s-1" removed (evicted)
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770824 31306 update.go:89] drained node "ip-10-0-129-96.ec2.internal"
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.770837 31306 update.go:984] drain complete
Oct 06 02:40:09 ip-10-0-129-96 hyperkube[1332]: I1006 02:28:36.773636 31306 update.go:984] initiating reboot: Node will reboot into config rendered-worker-1ecac007ca00090c4776d1f903ae3c13

### ^^^ ~10s from drain start to drain complete; ~1s for each pod to be evicted

### Example master node

$ oc debug node/ip-10-0-134-126.ec2.internal
Starting pod/ip-10-0-134-126ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.134.126
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b | grep update.go
<snip>
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.033640 27212 update.go:89] pod "machine-approver-85857fc9cc-vr6ch" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.431310 27212 update.go:89] pod "cluster-samples-operator-6c9bd886df-jjts4" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:45.625499 27212 update.go:89] pod "apiservice-cabundle-injector-59f56c5c7b-4svk5" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:46.423940 27212 update.go:89] pod "revision-pruner-7-ip-10-0-134-126.ec2.internal" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:48.023452 27212 update.go:89] pod "cloud-credential-operator-69987f75d5-jdw8q" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:48.688140 27212 update.go:89] pod "openshift-apiserver-operator-5b45db86b5-nmtj8" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:51.225761 27212 update.go:89] pod "dns-operator-78d5d78cd7-q6j27" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:51.424234 27212 update.go:89] pod "cluster-image-registry-operator-79f88868f7-b74nq" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.460225 27212 update.go:89] pod "machine-config-operator-584c66db64-lzwzp" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.668818 27212 update.go:89] pod "ingress-operator-7647cb4699-zjt5b" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:52.831734 27212 update.go:89] pod "catalog-operator-5b8c6d7b54-c9snm" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:53.668602 27212 update.go:89] pod "etcd-quorum-guard-6f6485c57b-wwqcv" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:54.225115 27212 update.go:89] pod "olm-operator-ffcfc494-qmth7" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:54.631607 27212 update.go:89] pod "network-operator-7b99784886-5ld24" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828004 27212 update.go:89] pod "machine-api-controllers-558478f7fd-kcr2z" removed (evicted)
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828070 27212 update.go:89] drained node "ip-10-0-134-126.ec2.internal"
Oct 06 02:42:58 ip-10-0-134-126 hyperkube[1851]: I1006 02:22:55.828084 27212 update.go:984] drain complete

### ^^^ ~10s from drain start to drain complete; ~1s for each pod to be evicted
```
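As an optional aside (not part of the original verification), the ~10s figures above can be read off mechanically: in the `journalctl -b | grep update.go` output, the klog timestamp is the 7th whitespace-separated field, so a small awk filter over the "beginning drain" / "drain complete" markers prints the drain duration. A minimal sketch, run on the node after `chroot /host`:

```
# Hedged sketch: compute the drain window from the klog timestamps.
# Assumes the journal line layout shown above (klog time in field 7)
# and that the drain starts and completes on the same day.
journalctl -b | grep update.go | awk '
  /beginning drain/ { split($7, t, /[:.]/); start = t[1]*3600 + t[2]*60 + t[3] }
  /drain complete/  { split($7, t, /[:.]/)
                      printf "drain took ~%ds\n", t[1]*3600 + t[2]*60 + t[3] - start }'
```

On the worker example above this would print roughly "drain took ~11s", matching the ~10s eyeballed from the log.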
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062