Description of problem:

A cluster has upgraded to v4.7.17. The control plane has upgraded successfully. We have observed two consecutive occasions of worker nodes not draining successfully, degrading the machine-config operator. Upon examination of the cluster, we found that the worker node being upgraded was not cordoned at all. There is no mention of any cordon attempts in the machine-config-operator log. The impacted worker node remains in this state and blocks the overall progress of the upgrade for the remaining workers in the machine config pool.

Version-Release number of selected component (if applicable):

v4.7.17

How reproducible:

Has happened on two worker nodes in a row during this current upgrade.
Bug 1968019 was fixed for master/4.8 via mco@a29604e [1], and that all seems fine. Bug 1968759 was the 4.7 backport via mco@8bbec98 [2], picked up by 4.7.17 [3], and this bug seems to have snuck into the manual rebase. The 4.7 PR dropped some cordon code:

$ git diff 8bbec98^..8bbec98 | grep -i cordon
- err := drain.RunCordonOrUncordon(dn.drainer, dn.node, true)
- glog.Infof("Cordon failed with: %v, retrying", err)

The 4.8 PR did not:

$ git diff a29604e^..a29604e | grep -i cordon
@@ -45,34 +45,39 @@ func (dn *Daemon) cordonOrUncordonNode(desired bool) error {
	// We are here, that means we need to cordon and drain node

Looking at CI, where we have a tightly controlled, consistent process and good asset gathering, I was hoping we'd have a good signal for "node got cordoned". Master contains "Node has been successfully cordoned" logging [4], but it's from some single-node work that landed in master for 4.8 and was not backported [5], so we never had it in 4.7. But we can look at audit logs. Here's 4.6.13 -> 4.6.16 (so before the regression) [6]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1403169385064960000/artifacts/launch/audit-logs.tar | tar -xz --strip-components 1
$ zgrep -h '"resource":"nodes"' audit_logs/*/*.gz 2>/dev/null | jq -r 'select(.user.username == "system:serviceaccount:openshift-machine-config-operator:machine-config-daemon" and .verb == "patch" and .responseStatus.code == 200) | .stageTimestamp + " " + .objectRef.name' | grep master-0 | sort
2021-06-11T03:22:28.716863Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:22:28.816802Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:27:12.663633Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:27:12.702060Z ci-ln-hrijm42-f76d1-rrqw5-master-0

So two patches pretty close together at 3:22:28. One of those was probably a cordon. And another pair at 3:27:12, one of which was probably an uncordon. Moving to 4.6.13 -> 4.6.17 (so including the regression) [7]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/audit-logs.tar | tar -xz --strip-components 1
$ zgrep -h '"resource":"nodes"' audit_logs/*/*.gz 2>/dev/null | jq -r 'select(.user.username == "system:serviceaccount:openshift-machine-config-operator:machine-config-daemon" and .verb == "patch" and .responseStatus.code == 200) | .stageTimestamp + " " + .objectRef.name' | grep master-0 | sort
2021-06-16T09:40:01.786752Z ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:51:16.450430Z ci-ln-vgg2rck-002ac-t945d-master-0

I'm not sure why we lost two; maybe because we don't have to uncordon if we weren't cordoned already?
Anyhow, looking at the MCD logs for that node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep 'Starting to manage node'
I0616 09:30:46.631349   87085 update.go:1906] Starting to manage node: ci-ln-vgg2rck-002ac-t945d-master-0
I0616 09:51:10.749972    3101 update.go:1906] Starting to manage node: ci-ln-vgg2rck-002ac-t945d-master-0

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep -B2 -A1 'Starting update from'
I0616 09:30:47.744667   87085 daemon.go:1167] In desired config rendered-master-e2befb43e96b67ba330d84c2a0a7a603
I0616 09:40:01.787993   87085 update.go:593] Checking Reconcilable for config rendered-master-e2befb43e96b67ba330d84c2a0a7a603 to rendered-master-9ee8c4da7dbeec4f51b0a58c88e55aea
I0616 09:40:01.825242   87085 update.go:1906] Starting update from rendered-master-e2befb43e96b67ba330d84c2a0a7a603 to rendered-master-9ee8c4da7dbeec4f51b0a58c88e55aea: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
I0616 09:40:01.841381   87085 update.go:1906] Update prepared; beginning drain

So drain starts at 9:40:01, around the first patch, and ends at 9:51 when we come back post-reboot. Look for work scheduled during that window, when we should be cordoned:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/events.json | jq -r '.items[] | select(.reason == "Scheduled" and (.message | contains("master-0"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .message' | grep 'T09:4[2-9]:' | sort
2021-06-16T09:46:33Z Successfully assigned openshift-cluster-storage-operator/csi-snapshot-controller-operator-6965fb996b-z2tzq to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-cluster-version/cluster-version-operator-677d8ffdd4-dnbtt to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-console-operator/console-operator-7b5b9c47d6-66ft7 to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-console/downloads-7c44f96fbf-xqrtn to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:34Z Successfully assigned openshift-operator-lifecycle-manager/packageserver-574c5ddd56-hftmm to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:36Z Successfully assigned openshift-etcd/etcd-quorum-guard-7c7768f8bb-r464j to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:57Z Successfully assigned openshift-oauth-apiserver/apiserver-5d6f768bf5-m85kd to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:47:53Z Successfully assigned openshift-apiserver/apiserver-84b454bf4-xxxzx to ci-ln-vgg2rck-002ac-t945d-master-0

Doesn't seem like we're cordoned.
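As an aside for anyone checking a live cluster rather than CI artifacts: a cordoned node carries spec.unschedulable: true (and shows SchedulingDisabled in `oc get nodes`), so a quick sanity check for whether the node under update is actually cordoned is:

$ oc get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable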
Picking on downloads:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep downloads-
I0616 09:40:02.703533   87085 daemon.go:328] evicting pod openshift-console/downloads-7c44f96fbf-kkjb9
I0616 09:40:21.826808   87085 daemon.go:314] Evicted pod openshift-console/downloads-7c44f96fbf-kkjb9
I0616 09:46:33.649381   87085 daemon.go:328] evicting pod openshift-console/downloads-7c44f96fbf-cqmvx
I0616 09:46:46.083065   87085 daemon.go:314] Evicted pod openshift-console/downloads-7c44f96fbf-cqmvx

So the MCD had to kick it off twice. And checking the tolerations:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods.json | jq -c '.items[] | select(.metadata.name | startswith("downloads-")).spec.tolerations[]' | sort | uniq
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}

I dunno why they want it to ignore memory-pressure on schedule, but there's no node.kubernetes.io/unschedulable toleration that would allow it to schedule on a node that was actually cordoned. So... it would be nice if 4.7 had a logged line around successful cordon like master does, but CI at least looks pretty suspicious for missing cordons in 4.7.17, and that meshes with my reading of the code changes.

Suggested mitigation is to look for nodes where the machineconfiguration.openshift.io/currentConfig annotation doesn't match the machineconfiguration.openshift.io/desiredConfig annotation, and cordon those nodes by hand (a rough sketch follows the reference list below). Although I haven't tested that out; you might also need to manually uncordon once they reboot.

[1]: https://github.com/openshift/machine-config-operator/pull/2597
[2]: https://github.com/openshift/machine-config-operator/pull/2605
[3]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.17
[4]: https://github.com/openshift/machine-config-operator/blame/a29604e/pkg/daemon/drain.go#L92
[5]: https://github.com/openshift/machine-config-operator/pull/2457/commits/1261271ddbc0e3ab28bfc34be80f379db813528a#diff-58db213cf6f[…]f66f0b1e78a946R87
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1403169385064960000
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056
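A rough, untested sketch of that mitigation (it assumes cluster-admin access, and the node names come from the jq output):

# Find nodes the MCD considers mid-update (currentConfig != desiredConfig):
$ oc get nodes -o json | jq -r '.items[] | select(.metadata.annotations["machineconfiguration.openshift.io/currentConfig"] != .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"]) | .metadata.name'

# Cordon each of those by hand so the drain behaves as if the MCD had cordoned it:
$ oc adm cordon <node-name-from-above>

# And, if needed once the node reboots into the new config, manually uncordon it:
$ oc adm uncordon <node-name-from-above>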
Trevor: Fantastic debugging!
Indeed, great debugging Trevor. From the log provided, my analysis is as follows:

1. The upgrade is stuck because one of the nodes took a long time to drain. From pod machine-config-daemon-7bpbt/machine-config-daemon/machine-config-daemon/logs/current.log, the MCD kept trying to drain the rhsm-subscriptions-cron-hourly-tally-1622822400-n8wwm pod, as its eviction kept failing across multiple attempts (a quick way to spot workloads likely to block a drain like this is sketched after this list):

...
2021-06-16T16:18:57.511490296Z I0616 16:18:57.511440  741378 update.go:1906] Starting update from rendered-worker-60d93a47abd796e50f45bb4a5d10c1a0 to rendered-worker-06834005d5f3137c8a8671dd784f0833: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
2021-06-16T16:18:57.526287340Z I0616 16:18:57.526243  741378 update.go:1906] Update prepared; beginning drain
...
2021-06-17T00:19:07.869189167Z E0617 00:19:07.869132  741378 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-110-163-219.ec2.internal after 1 hour
...
2021-06-17T00:45:19.678032427Z I0617 00:45:19.678017  741378 update.go:1906] drain complete
2021-06-17T00:45:19.680626718Z I0617 00:45:19.680593  741378 update.go:231] Successful drain took 1570.45894659 seconds
...

The actual OS update started at 2021-06-16T16:18:57 but was stuck at drain until 2021-06-17T00:45:19.

2. In the past we didn't have a log line for successful cordoning of the node. This was added in 4.8: https://github.com/openshift/machine-config-operator/pull/2457/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R87.

3. As Trevor figured out, we need to bring back the code lost during the backport in cordonOrUncordonNode(), and we can add the "successfully cordoned" log message in 4.7 as well.
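As a generic triage note (not taken from the logs above): one common reason evictions keep failing until the MCD's 1-hour drain timeout marks the pool degraded is a workload whose PodDisruptionBudget currently allows zero disruptions. A minimal way to list such PDBs on a live cluster:

$ oc get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.namespace + "/" + .metadata.name'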
created PR - https://github.com/openshift/machine-config-operator/pull/2623
Credit to Matt for pointing out that OSD's managed-upgrade operator wasn't stomping the slow-to-drain workload because the draining node wasn't cordoned :). All I did was try to follow that lead up in the code and CI.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-20-093308   True        False         30m     Cluster version is 4.7.0-0.nightly-2021-06-20-093308

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     4s

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-143-129.us-west-2.compute.internal   Ready    master   52m   v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal   Ready    worker   42m   v1.20.0+87cc9a4

$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
    kubernetes.io/hostname: ip-10-0-222-240
          f:kubernetes.io/hostname: {}

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          4s

$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF

$ oc create file -f file-ig3.yaml
error: Unexpected args: [file]
See 'oc create -h' for help and examples.
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
00-worker                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-ssh                                                                                 3.2.0             62m
99-worker-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-worker-ssh                                                                                 3.2.0             62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
rendered-worker-96cc244920a4ac522616e39024e1d35d   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
test-file                                                                                     3.1.0             3s

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-6gkv4   2/2   Running   0   45m

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep -5 'Node has been successfully'
I0621 13:25:27.562892    1747 update.go:1943] completed update for config rendered-worker-96cc244920a4ac522616e39024e1d35d
I0621 13:25:27.566468    1747 daemon.go:1167] In desired config rendered-worker-96cc244920a4ac522616e39024e1d35d
I0621 14:13:52.569029    1747 update.go:630] Checking Reconcilable for config rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46
I0621 14:13:52.616935    1747 update.go:1943] Starting update from rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0621 14:13:52.664672    1747 update.go:542] File diff: /etc/test was added
I0621 14:13:52.676667    1747 update.go:1943] Node has been successfully cordoned
I0621 14:13:52.680568    1747 update.go:1943] Update prepared; beginning drain
E0621 14:13:52.945865    1747 daemon.go:328] WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/dont-evict-this-pod, openshift-marketplace/certified-operators-jxlc7, openshift-marketplace/community-operators-jt2cn, openshift-marketplace/redhat-marketplace-5wf9z, openshift-marketplace/redhat-operators-t5jwl; ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-668pp, openshift-cluster-node-tuning-operator/tuned-c2m87, openshift-dns/dns-default-2q4wc, openshift-image-registry/node-ca-jzgwx, openshift-ingress-canary/ingress-canary-hsqns, openshift-machine-config-operator/machine-config-daemon-6gkv4, openshift-monitoring/node-exporter-qxnl4, openshift-multus/multus-8g2g2, openshift-multus/network-metrics-daemon-9rshb, openshift-network-diagnostics/network-check-target-mp7cg, openshift-sdn/sdn-mqw9g
I0621 14:13:52.951733    1747 daemon.go:328] evicting pod openshift-image-registry/image-registry-6c75544c-6sqw4
I0621 14:13:52.951744    1747 daemon.go:328] evicting pod openshift-ingress/router-default-6fd598466c-qb8jr
I0621 14:13:52.951733    1747 daemon.go:328] evicting pod openshift-network-diagnostics/network-check-source-575c4794b8-7b4tm
--
I0621 15:13:28.922201    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
E0621 15:13:52.683567    1747 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-222-240.us-west-2.compute.internal after 1 hour
I0621 15:13:52.711672    1747 update.go:630] Checking Reconcilable for config rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46
I0621 15:13:52.741981    1747 update.go:1943] Starting update from rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0621 15:13:52.770285    1747 update.go:542] File diff: /etc/test was added
I0621 15:13:52.770394    1747 update.go:1943] Node has been successfully cordoned
I0621 15:13:52.773180    1747 update.go:1943] Update prepared; beginning drain
E0621 15:13:53.006501    1747 daemon.go:328] WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/dont-evict-this-pod; ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-668pp, openshift-cluster-node-tuning-operator/tuned-c2m87, openshift-dns/dns-default-2q4wc, openshift-image-registry/node-ca-jzgwx, openshift-ingress-canary/ingress-canary-hsqns, openshift-machine-config-operator/machine-config-daemon-6gkv4, openshift-monitoring/node-exporter-qxnl4, openshift-multus/multus-8g2g2, openshift-multus/network-metrics-daemon-9rshb, openshift-network-diagnostics/network-check-target-mp7cg, openshift-sdn/sdn-mqw9g
I0621 15:13:53.009706    1747 daemon.go:328] evicting pod default/dont-evict-this-pod
E0621 15:13:53.017078    1747 daemon.go:328] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0621 15:13:58.017197    1747 daemon.go:328] evicting pod default/dont-evict-this-pod
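Not shown in the transcript above, but once the "Node has been successfully cordoned" line has been confirmed, removing the test PDB and pod created earlier should let the blocked drain finish so the node can complete the update:

$ oc delete pdb dontevict
$ oc delete pod dont-evict-this-pod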
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2502