Bug 1973006
| Summary: | Worker nodes are not being cordoned for upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matt Bargenquast <mbargenq> |
| Component: | Machine Config Operator | Assignee: | Sinny Kumari <skumari> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.7 | CC: | bleanhar, jerzhang, skumari, smilner, wking |
| Target Milestone: | --- | Keywords: | Regression, ServiceDeliveryImpact, Upgrades |
| Target Release: | 4.7.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-29 04:20:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1968019 | | |
| Bug Blocks: | | | |
Description
Matt Bargenquast
2021-06-17 05:26:08 UTC
Bug 1968019 was fixed for master/4.8 via mco@a29604e [1], and that all seems fine. Bug 1968759 was the 4.7 backport via mco@8bbec98 [2], picked up by 4.7.17 [3], and this bug seems to have snuck into the manual rebase. The 4.7 PR dropped some cordon code:

$ git diff 8bbec98^..8bbec98 | grep -i cordon
-	err := drain.RunCordonOrUncordon(dn.drainer, dn.node, true)
-	glog.Infof("Cordon failed with: %v, retrying", err)

The 4.8 PR did not:

$ git diff a29604e^..a29604e | grep -i cordon
@@ -45,34 +45,39 @@ func (dn *Daemon) cordonOrUncordonNode(desired bool) error {
	// We are here, that means we need to cordon and drain node

Looking at CI, where we have a tightly controlled, consistent process and good asset gathering, I was hoping we'd have a good signal for "node got cordoned". Master contains "Node has been successfully cordoned" logging [4], but it's from some single-node work that landed in master for 4.8 and was not backported [5], so we never had it in 4.7. But we can look at audit logs. Here's 4.6.13 -> 4.6.16 (so before the regression) [6]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1403169385064960000/artifacts/launch/audit-logs.tar | tar -xz --strip-components 1
$ zgrep -h '"resource":"nodes"' audit_logs/*/*.gz 2>/dev/null | jq -r 'select(.user.username == "system:serviceaccount:openshift-machine-config-operator:machine-config-daemon" and .verb == "patch" and .responseStatus.code == 200) | .stageTimestamp + " " + .objectRef.name' | grep master-0 | sort
2021-06-11T03:22:28.716863Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:22:28.816802Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:27:12.663633Z ci-ln-hrijm42-f76d1-rrqw5-master-0
2021-06-11T03:27:12.702060Z ci-ln-hrijm42-f76d1-rrqw5-master-0

So two patches pretty close together at 3:22:28. One of those was probably a cordon. And another pair at 3:27:12, one of which was probably an uncordon.
Moving to 4.6.13 -> 4.6.17 (so including the regression) [7]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/audit-logs.tar | tar -xz --strip-components 1
$ zgrep -h '"resource":"nodes"' audit_logs/*/*.gz 2>/dev/null | jq -r 'select(.user.username == "system:serviceaccount:openshift-machine-config-operator:machine-config-daemon" and .verb == "patch" and .responseStatus.code == 200) | .stageTimestamp + " " + .objectRef.name' | grep master-0 | sort
2021-06-16T09:40:01.786752Z ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:51:16.450430Z ci-ln-vgg2rck-002ac-t945d-master-0

I'm not sure why we lost two, maybe because we don't have to uncordon if we weren't cordoned already? Anyhow, looking at the MCD logs for that node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep 'Starting to manage node'
I0616 09:30:46.631349 87085 update.go:1906] Starting to manage node: ci-ln-vgg2rck-002ac-t945d-master-0
I0616 09:51:10.749972 3101 update.go:1906] Starting to manage node: ci-ln-vgg2rck-002ac-t945d-master-0
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep -B2 -A1 'Starting update from'
I0616 09:30:47.744667 87085 daemon.go:1167] In desired config rendered-master-e2befb43e96b67ba330d84c2a0a7a603
I0616 09:40:01.787993 87085 update.go:593] Checking Reconcilable for config rendered-master-e2befb43e96b67ba330d84c2a0a7a603 to rendered-master-9ee8c4da7dbeec4f51b0a58c88e55aea
I0616 09:40:01.825242 87085 update.go:1906] Starting update from rendered-master-e2befb43e96b67ba330d84c2a0a7a603 to rendered-master-9ee8c4da7dbeec4f51b0a58c88e55aea: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
I0616 09:40:01.841381 87085 update.go:1906] Update prepared; beginning drain

So drain starts at 9:40:01, around the first patch, and ends at 9:51 when we come back post-reboot. Look for work scheduled during that window, when we should be cordoned:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/events.json | jq -r '.items[] | select(.reason == "Scheduled" and (.message | contains("master-0"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .message' | grep 'T09:4[2-9]:' | sort
2021-06-16T09:46:33Z Successfully assigned openshift-cluster-storage-operator/csi-snapshot-controller-operator-6965fb996b-z2tzq to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-cluster-version/cluster-version-operator-677d8ffdd4-dnbtt to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-console-operator/console-operator-7b5b9c47d6-66ft7 to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:33Z Successfully assigned openshift-console/downloads-7c44f96fbf-xqrtn to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:34Z Successfully assigned openshift-operator-lifecycle-manager/packageserver-574c5ddd56-hftmm to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:36Z Successfully assigned openshift-etcd/etcd-quorum-guard-7c7768f8bb-r464j to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:46:57Z Successfully assigned openshift-oauth-apiserver/apiserver-5d6f768bf5-m85kd to ci-ln-vgg2rck-002ac-t945d-master-0
2021-06-16T09:47:53Z Successfully assigned openshift-apiserver/apiserver-84b454bf4-xxxzx to ci-ln-vgg2rck-002ac-t945d-master-0

Doesn't seem like we're cordoned.
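When node manifests are gathered too, a more direct signal than scheduling events is the node's spec.unschedulable field, which cordoning sets to true. A minimal jq sketch, run here against an invented stub payload shaped like `oc get nodes -o json` output (the field names are standard Kubernetes; the node names are made up):

```shell
# jq filter: names of nodes that are currently cordoned (spec.unschedulable: true).
# Against a live cluster this would be fed by: oc get nodes -o json
FILTER='.items[] | select(.spec.unschedulable == true) | .metadata.name'

# Stub payload standing in for real gathered nodes.json (names invented):
PAYLOAD='{"items":[
  {"metadata":{"name":"master-0"},"spec":{"unschedulable":true}},
  {"metadata":{"name":"worker-0"},"spec":{}}
]}'

# Only the cordoned node should be printed:
printf '%s\n' "$PAYLOAD" | jq -r "$FILTER"
# prints: master-0
```

During the 09:40-09:51 drain window above, a correctly cordoned master-0 would show up in such a listing; the scheduling events suggest it would not have.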
Picking on downloads:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods/openshift-machine-config-operator_machine-config-daemon-fxjsm_machine-config-daemon.log | grep downloads-
I0616 09:40:02.703533 87085 daemon.go:328] evicting pod openshift-console/downloads-7c44f96fbf-kkjb9
I0616 09:40:21.826808 87085 daemon.go:314] Evicted pod openshift-console/downloads-7c44f96fbf-kkjb9
I0616 09:46:33.649381 87085 daemon.go:328] evicting pod openshift-console/downloads-7c44f96fbf-cqmvx
I0616 09:46:46.083065 87085 daemon.go:314] Evicted pod openshift-console/downloads-7c44f96fbf-cqmvx

So the MCD had to kick it off twice. And checking the tolerations:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056/artifacts/launch/pods.json | jq -c '.items[] | select(.metadata.name | startswith("downloads-")).spec.tolerations[]' | sort | uniq
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}

I dunno why they want it to ignore memory-pressure on schedule, but there's no node.kubernetes.io/unschedulable exception that would allow it to schedule on a node that was actually cordoned. So... it would be nice if 4.7 had a logged line around successful cordon like master does, but CI at least looks pretty suspicious for missing cordons in 4.7.17, and that meshes with my reading of the code changes.
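That toleration check generalizes to any pod: cordoning taints the node with node.kubernetes.io/unschedulable, so a pod can only land on a cordoned node if it tolerates that key. A jq sketch of the check, demonstrated on a stub pod spec carrying the downloads tolerations quoted above (the toleration keys are standard Kubernetes; the stub itself is invented):

```shell
# jq filter: does this pod tolerate the unschedulable taint that cordoning applies?
# Live usage would be: oc get pod <name> -o json | jq "$FILTER"
FILTER='[.spec.tolerations[]? | select(.key == "node.kubernetes.io/unschedulable")] | length > 0'

# Stub pod spec using a subset of the downloads pod's tolerations:
POD='{"spec":{"tolerations":[
  {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120},
  {"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"},
  {"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
]}}'

printf '%s\n' "$POD" | jq "$FILTER"
# prints: false
```

Since the answer is false, the replacement downloads pod landing on master-0 at 09:46:33 is only explainable if the node was not actually cordoned.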
Suggested mitigation is to look for nodes where the machineconfiguration.openshift.io/currentConfig annotation doesn't match the machineconfiguration.openshift.io/desiredConfig annotation, and cordon those nodes by hand. I haven't tested that out, though; you might also need to manually uncordon once they reboot.

[1]: https://github.com/openshift/machine-config-operator/pull/2597
[2]: https://github.com/openshift/machine-config-operator/pull/2605
[3]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.17
[4]: https://github.com/openshift/machine-config-operator/blame/a29604e/pkg/daemon/drain.go#L92
[5]: https://github.com/openshift/machine-config-operator/pull/2457/commits/1261271ddbc0e3ab28bfc34be80f379db813528a#diff-58db213cf6f[…]f66f0b1e78a946R87
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1403169385064960000
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1405073849015341056

Trevor: Fantastic debugging!

Indeed, great debugging Trevor. From the log provided, my analysis is as follows:

1. The upgrade is stuck because one of the nodes took a long time to drain. From pod machine-config-daemon-7bpbt/machine-config-daemon/machine-config-daemon/logs/current.log, the MCD kept trying to drain the rhsm-subscriptions-cron-hourly-tally-1622822400-n8wwm pod, as it had been failing across multiple attempts:

...
2021-06-16T16:18:57.511490296Z I0616 16:18:57.511440 741378 update.go:1906] Starting update from rendered-worker-60d93a47abd796e50f45bb4a5d10c1a0 to rendered-worker-06834005d5f3137c8a8671dd784f0833: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
2021-06-16T16:18:57.526287340Z I0616 16:18:57.526243 741378 update.go:1906] Update prepared; beginning drain
...
2021-06-17T00:19:07.869189167Z E0617 00:19:07.869132 741378 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-110-163-219.ec2.internal after 1 hour
...
2021-06-17T00:45:19.678032427Z I0617 00:45:19.678017 741378 update.go:1906] drain complete
2021-06-17T00:45:19.680626718Z I0617 00:45:19.680593 741378 update.go:231] Successful drain took 1570.45894659 seconds
...

The actual OS update started at 2021-06-16T16:18:57 but was stuck at drain until 2021-06-17T00:45:19.

2. In the past we didn't have a log line for successful cordoning of a node. This was added in 4.8: https://github.com/openshift/machine-config-operator/pull/2457/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R87

3. As Trevor figured out, we need to bring back the code lost during the backport of cordonOrUncordonNode(), and we can add the "Successful cordoning" log message in 4.7 as well.

Credit to Matt for pointing out that OSD's managed-upgrade operator wasn't stomping the slow-to-drain workload because the draining node wasn't cordoned :). All I did was try to follow that lead up in the code and CI.

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-06-20-093308 True False 30m Cluster version is 4.7.0-0.nightly-2021-06-20-093308
$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: dontevict
spec:
minAvailable: 1
selector:
matchLabels:
app: dontevict
$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
dontevict 1 N/A 0 4s
$ oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-143-129.us-west-2.compute.internal Ready master 52m v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal Ready worker 47m v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal Ready worker 47m v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal Ready master 53m v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal Ready master 53m v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal Ready worker 42m v1.20.0+87cc9a4
$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
kubernetes.io/hostname: ip-10-0-222-240
f:kubernetes.io/hostname: {}
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created
$ oc get pod
NAME READY STATUS RESTARTS AGE
dont-evict-this-pod 0/1 ContainerCreating 0 4s
$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
> labels:
> machineconfiguration.openshift.io/role: worker
> name: test-file
> spec:
> config:
> ignition:
> version: 3.1.0
> storage:
> files:
> - contents:
> source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
> filesystem: root
> mode: 0644
> path: /etc/test
> EOF
$ oc create file -f file-ig3.yaml
error: Unexpected args: [file]
See 'oc create -h' for help and examples.
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
00-worker 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
01-master-container-runtime 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
01-master-kubelet 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
01-worker-container-runtime 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
01-worker-kubelet 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
99-master-generated-registries 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
99-master-ssh 3.2.0 62m
99-worker-generated-registries 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
99-worker-ssh 3.2.0 62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
rendered-worker-96cc244920a4ac522616e39024e1d35d 8530c27d3d9b6155923d348058bc025a6a98ec3c 3.2.0 53m
test-file 3.1.0 3s
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator machine-config-daemon-6gkv4 2/2 Running 0 45m
$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep -5 'Node has been successfully'
I0621 13:25:27.562892 1747 update.go:1943] completed update for config rendered-worker-96cc244920a4ac522616e39024e1d35d
I0621 13:25:27.566468 1747 daemon.go:1167] In desired config rendered-worker-96cc244920a4ac522616e39024e1d35d
I0621 14:13:52.569029 1747 update.go:630] Checking Reconcilable for config rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46
I0621 14:13:52.616935 1747 update.go:1943] Starting update from rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0621 14:13:52.664672 1747 update.go:542] File diff: /etc/test was added
I0621 14:13:52.676667 1747 update.go:1943] Node has been successfully cordoned
I0621 14:13:52.680568 1747 update.go:1943] Update prepared; beginning drain
E0621 14:13:52.945865 1747 daemon.go:328] WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/dont-evict-this-pod, openshift-marketplace/certified-operators-jxlc7, openshift-marketplace/community-operators-jt2cn, openshift-marketplace/redhat-marketplace-5wf9z, openshift-marketplace/redhat-operators-t5jwl; ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-668pp, openshift-cluster-node-tuning-operator/tuned-c2m87, openshift-dns/dns-default-2q4wc, openshift-image-registry/node-ca-jzgwx, openshift-ingress-canary/ingress-canary-hsqns, openshift-machine-config-operator/machine-config-daemon-6gkv4, openshift-monitoring/node-exporter-qxnl4, openshift-multus/multus-8g2g2, openshift-multus/network-metrics-daemon-9rshb, openshift-network-diagnostics/network-check-target-mp7cg, openshift-sdn/sdn-mqw9g
I0621 14:13:52.951733 1747 daemon.go:328] evicting pod openshift-image-registry/image-registry-6c75544c-6sqw4
I0621 14:13:52.951744 1747 daemon.go:328] evicting pod openshift-ingress/router-default-6fd598466c-qb8jr
I0621 14:13:52.951733 1747 daemon.go:328] evicting pod openshift-network-diagnostics/network-check-source-575c4794b8-7b4tm
--
I0621 15:13:28.922201 1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
E0621 15:13:52.683567 1747 writer.go:135] Marking Degraded due to: failed to drain node : ip-10-0-222-240.us-west-2.compute.internal after 1 hour
I0621 15:13:52.711672 1747 update.go:630] Checking Reconcilable for config rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46
I0621 15:13:52.741981 1747 update.go:1943] Starting update from rendered-worker-96cc244920a4ac522616e39024e1d35d to rendered-worker-67458a674a2808ed356c90b65125bc46: &{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false extensions:false}
I0621 15:13:52.770285 1747 update.go:542] File diff: /etc/test was added
I0621 15:13:52.770394 1747 update.go:1943] Node has been successfully cordoned
I0621 15:13:52.773180 1747 update.go:1943] Update prepared; beginning drain
E0621 15:13:53.006501 1747 daemon.go:328] WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/dont-evict-this-pod; ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-668pp, openshift-cluster-node-tuning-operator/tuned-c2m87, openshift-dns/dns-default-2q4wc, openshift-image-registry/node-ca-jzgwx, openshift-ingress-canary/ingress-canary-hsqns, openshift-machine-config-operator/machine-config-daemon-6gkv4, openshift-monitoring/node-exporter-qxnl4, openshift-multus/multus-8g2g2, openshift-multus/network-metrics-daemon-9rshb, openshift-network-diagnostics/network-check-target-mp7cg, openshift-sdn/sdn-mqw9g
I0621 15:13:53.009706 1747 daemon.go:328] evicting pod default/dont-evict-this-pod
E0621 15:13:53.017078 1747 daemon.go:328] error when evicting pods/"dont-evict-this-pod" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0621 15:13:58.017197 1747 daemon.go:328] evicting pod default/dont-evict-this-pod
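For clusters already mid-update on an affected 4.7.z release, the mitigation suggested in the description (cordoning by hand any node whose currentConfig annotation disagrees with desiredConfig) can be sketched as below. The annotation keys are the real MCO ones quoted above; the node names in the stub payload are invented, and, as the description notes, this approach is untested:

```shell
# jq filter: names of nodes whose currentConfig lags desiredConfig, i.e. nodes
# that are mid-update and should be cordoned.
FILTER='.items[]
  | select(.metadata.annotations["machineconfiguration.openshift.io/currentConfig"]
        != .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"])
  | .metadata.name'

# Live usage would be roughly:
#   oc get nodes -o json | jq -r "$FILTER" | xargs -r -n1 oc adm cordon
# and possibly a manual `oc adm uncordon <node>` after each node reboots.

# Demonstrated here on a stub payload (node names and rendered-config names invented):
PAYLOAD='{"items":[
  {"metadata":{"name":"worker-a","annotations":{
    "machineconfiguration.openshift.io/currentConfig":"rendered-worker-old",
    "machineconfiguration.openshift.io/desiredConfig":"rendered-worker-new"}}},
  {"metadata":{"name":"worker-b","annotations":{
    "machineconfiguration.openshift.io/currentConfig":"rendered-worker-new",
    "machineconfiguration.openshift.io/desiredConfig":"rendered-worker-new"}}}
]}'
printf '%s\n' "$PAYLOAD" | jq -r "$FILTER"
# prints: worker-a
```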
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502