The MCO is also affected by this bug, as we use the library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:

When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain: the library believes a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports completion before pods are actually evicted or deleted.

Expected results:
We should wait for the drain to finish to ensure services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875
Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1
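For illustration, here is a minimal sketch of the wait step the buggy code path skipped: after requesting eviction or deletion, poll until the pod is actually gone. This assumes a modern client-go API; `waitForDelete` and its parameters are illustrative names, not the actual kubernetes-drain or PR code.

```
// Sketch only: the names and polling intervals here are illustrative
// assumptions, not the openshift/kubernetes-drain API.
package drainsketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDelete polls until the pod no longer exists, or has been replaced
// by a new pod with the same name but a different UID. Returning before this
// check passes is what let the drain report success prematurely.
func waitForDelete(ctx context.Context, client kubernetes.Interface, namespace, name string, uid types.UID) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil // pod is gone: eviction/deletion finished
			}
			if err != nil {
				return false, err // real API error: stop polling
			}
			if pod.UID != uid {
				return true, nil // same name, new UID: original pod was deleted
			}
			return false, nil // still terminating: keep waiting
		})
}
```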
*** Bug 1744948 has been marked as a duplicate of this bug. ***
Discussed with Antonio how to reproduce this for the MCO side of things. Any new MachineConfig will trigger a drain of the node as the new MC is applied, so the steps look like:

1. Create a cluster
2. Create a MachineConfig for the worker pool
3. `oc logs -f <worker MCD pod>`
4. Watch for messages about draining the node

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-08-30-035445   True        False         47m     Cluster version is 4.1.0-0.nightly-2019-08-30-035445

### Figure out which nodes are workers
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-140-86.us-west-2.compute.internal    Ready    master   62m   v1.13.4+3bd346709
ip-10-0-143-236.us-west-2.compute.internal   Ready    worker   55m   v1.13.4+3bd346709
ip-10-0-155-84.us-west-2.compute.internal    Ready    worker   55m   v1.13.4+3bd346709
ip-10-0-157-48.us-west-2.compute.internal    Ready    master   62m   v1.13.4+3bd346709
ip-10-0-166-99.us-west-2.compute.internal    Ready    worker   54m   v1.13.4+3bd346709
ip-10-0-168-59.us-west-2.compute.internal    Ready    master   63m   v1.13.4+3bd346709

### Find a MCD pod on a worker node
$ oc -n openshift-machine-config-operator describe po machine-config-daemon-4njpl | grep Node:
Node:  ip-10-0-155-84.us-west-2.compute.internal/10.0.155.84

### Create a MachineConfig
$ cat ../machineConfigs/chrony.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 60-examplecorp-chrony
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/chrony.conf

### Apply the MachineConfig
$ oc apply -f ../machineConfigs/chrony.yaml
machineconfig.machineconfiguration.openshift.io/60-examplecorp-chrony created

### Watch the MCD pod for drain messages
$ oc -n openshift-machine-config-operator logs -f pod/machine-config-daemon-4njpl
I0903 13:30:57.790426    4480 start.go:67] Version: 4.1.14-201908291507-dirty (e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab)
I0903 13:30:57.791293    4480 start.go:100] Starting node writer
I0903 13:30:57.796310    4480 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0903 13:30:58.038685    4480 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b68f108c63ae9f7245621be001cda06d0d909565dc63624530144570c11dba3 (410.8.20190830.0)
I0903 13:30:58.044703    4480 start.go:196] Calling chroot("/rootfs")
I0903 13:30:58.044739    4480 start.go:206] Starting MachineConfigDaemon
I0903 13:30:58.044811    4480 update.go:847] Starting to manage node: ip-10-0-155-84.us-west-2.compute.internal
I0903 13:30:58.074577    4480 run.go:22] Running captured: rpm-ostree status
I0903 13:30:58.473083    4480 daemon.go:740] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b68f108c63ae9f7245621be001cda06d0d909565dc63624530144570c11dba3
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190830.0 (2019-08-30T02:31:46Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/ootpa@sha256:683a6a866a8ec789fedb5da63b6a2ff68c1b0788ec90e7def778f0c4c13197a4
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T20:10:04Z)
I0903 13:30:58.473113    4480 run.go:22] Running captured: journalctl --list-boots
I0903 13:30:58.526842    4480 daemon.go:747] journalctl --list-boots:
-1 d7495f00adbb4f6e836df08416314f18 Tue 2019-09-03 13:25:34 UTC—Tue 2019-09-03 13:28:33 UTC
 0 7aa2913ce3b943ad8dfbca79db301fac Tue 2019-09-03 13:28:46 UTC—Tue 2019-09-03 13:30:58 UTC
I0903 13:30:58.526939    4480 daemon.go:494] Enabling Kubelet Healthz Monitor
I0903 13:30:59.451443    4480 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ip-10-0-155-84.us-west-2.compute.internal: map[machine.openshift.io/machine:openshift-machine-api/miabbott-4-1-14-5t67r-worker-us-west-2b-vcpjd volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0903 13:30:59.457046    4480 node.go:45] Setting initial node config: rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.501338    4480 update.go:734] logger doesn't support --jounald, grepping the journal
I0903 13:30:59.542539    4480 update.go:847] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0903 13:30:59.554603    4480 daemon.go:667] In bootstrap mode
I0903 13:30:59.554627    4480 daemon.go:695] Current+desired config: rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.561204    4480 daemon.go:854] No bootstrap pivot required; unlinking bootstrap node annotations
I0903 13:30:59.561255    4480 update.go:847] Validating against current config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.563831    4480 daemon.go:898] Validating against current config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
I0903 13:30:59.576253    4480 daemon.go:904] Validated on-disk state
I0903 13:30:59.585952    4480 daemon.go:951] In desired config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25
E0903 13:36:10.918151    4480 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=9, ErrCode=NO_ERROR, debug=""
E0903 13:36:10.918151    4480 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=9, ErrCode=NO_ERROR, debug=""
W0903 13:36:11.141549    4480 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 12587 (14989)
I0903 14:26:18.060959    4480 update.go:185] Checking reconcilable for config rendered-worker-ed526cce1edf9d8c4d643d17760ccd25 to rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:18.062693    4480 update.go:847] Starting update from rendered-worker-ed526cce1edf9d8c4d643d17760ccd25 to rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:18.065515    4480 update.go:399] Updating files
I0903 14:26:18.065528    4480 update.go:601] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0903 14:26:18.067103    4480 update.go:601] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0903 14:26:18.068593    4480 update.go:601] Writing file "/var/lib/kubelet/config.json"
I0903 14:26:18.070069    4480 update.go:601] Writing file "/etc/kubernetes/ca.crt"
I0903 14:26:18.071754    4480 update.go:601] Writing file "/etc/sysctl.d/forward.conf"
I0903 14:26:18.073279    4480 update.go:601] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0903 14:26:18.074157    4480 update.go:601] Writing file "/etc/containers/registries.conf"
I0903 14:26:18.075597    4480 update.go:601] Writing file "/etc/containers/storage.conf"
I0903 14:26:18.077298    4480 update.go:601] Writing file "/etc/crio/crio.conf"
I0903 14:26:18.079188    4480 update.go:601] Writing file "/etc/kubernetes/cloud.conf"
I0903 14:26:18.080075    4480 update.go:601] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0903 14:26:18.081642    4480 update.go:601] Writing file "/etc/kubernetes/kubelet.conf"
I0903 14:26:18.083098    4480 update.go:601] Writing file "/etc/chrony.conf"
I0903 14:26:18.084600    4480 update.go:556] Writing systemd unit "kubelet.service"
I0903 14:26:18.086184    4480 update.go:573] Enabling systemd unit "kubelet.service"
I0903 14:26:18.086275    4480 update.go:493] /etc/systemd/system/multi-user.target.wants/kubelet.service already exists. Not making a new symlink
I0903 14:26:18.086287    4480 update.go:418] Deleting stale data
I0903 14:26:18.086306    4480 update.go:663] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0903 14:26:18.089973    4480 update.go:812] logger doesn't support --jounald, logging json directly
I0903 14:26:18.093733    4480 update.go:847] Update prepared; beginning drain
I0903 14:26:18.101639    4480 update.go:89] cordoned node "ip-10-0-155-84.us-west-2.compute.internal"
I0903 14:26:18.144107    4480 update.go:93] ignoring DaemonSet-managed pods: tuned-jvg2z, dns-default-268kq, node-ca-rwrml, machine-config-daemon-4njpl, node-exporter-ldmv7, multus-tpwj4, ovs-4c5sg, sdn-4lml2; deleting pods with local storage: alertmanager-main-2, kube-state-metrics-64c4bdf979-m6jsj, prometheus-adapter-7f77c6997f-dbbnz, prometheus-adapter-7f77c6997f-ptlrr
I0903 14:26:22.056307    4480 update.go:89] pod "router-default-5cf946464-l6st8" removed (evicted)
I0903 14:26:22.856358    4480 update.go:89] pod "certified-operators-c4ff7fc9b-8kclm" removed (evicted)
I0903 14:26:24.256264    4480 update.go:89] pod "prometheus-adapter-7f77c6997f-ptlrr" removed (evicted)
I0903 14:26:24.656182    4480 update.go:89] pod "redhat-operators-7699569b86-fnvtm" removed (evicted)
I0903 14:26:26.456437    4480 update.go:89] pod "image-registry-f7cd488bf-xmth9" removed (evicted)
I0903 14:26:26.656932    4480 update.go:89] pod "alertmanager-main-2" removed (evicted)
I0903 14:26:27.456095    4480 update.go:89] pod "telemeter-client-b65c57ff7-kh6h9" removed (evicted)
I0903 14:26:27.856365    4480 update.go:89] pod "community-operators-6f95cf67c6-fh8l6" removed (evicted)
I0903 14:26:28.056288    4480 update.go:89] pod "prometheus-adapter-7f77c6997f-dbbnz" removed (evicted)
I0903 14:26:29.411252    4480 update.go:89] pod "kube-state-metrics-64c4bdf979-m6jsj" removed (evicted)
I0903 14:26:29.411283    4480 update.go:89] drained node "ip-10-0-155-84.us-west-2.compute.internal"
I0903 14:26:29.411291    4480 update.go:847] drain complete
I0903 14:26:29.413821    4480 update.go:847] initiating reboot: Node will reboot into config rendered-worker-24ffb68b002cb7511301664cb41c0a68
I0903 14:26:29.471061    4480 start.go:215] Shutting down MachineConfigDaemon

### Verify MachineConfigPool is updated and all nodes are healthy
$ oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
00-master                                                   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
00-worker                                                   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-master-container-runtime                                 e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-master-kubelet                                           e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-worker-container-runtime                                 e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
01-worker-kubelet                                           e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
60-examplecorp-chrony                                                                                  2.2.0             12m
99-master-f1c24a07-ce4d-11e9-99b1-02f047b21df0-registries   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
99-master-ssh                                                                                          2.2.0             75m
99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries   e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
99-worker-ssh                                                                                          2.2.0             75m
rendered-master-8b9815ae15d192479af438faa89e5271            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m
rendered-worker-24ffb68b002cb7511301664cb41c0a68            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             12m
rendered-worker-ed526cce1edf9d8c4d643d17760ccd25            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             74m

$ oc describe machineconfigpool/worker
Name:         worker
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-09-03T13:22:57Z
  Generation:          3
  Resource Version:    33272
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 f1c3fa01-ce4d-11e9-99b1-02f047b21df0
Spec:
  Configuration:
    Name:  rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         60-examplecorp-chrony
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-09-03T13:23:31Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-09-03T13:23:36Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-09-03T13:23:36Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-09-03T14:35:45Z
    Message:
    Reason:                All nodes are updated with rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-09-03T14:35:45Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-24ffb68b002cb7511301664cb41c0a68
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         60-examplecorp-chrony
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-f1c3fa01-ce4d-11e9-99b1-02f047b21df0-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        3
  Unavailable Machine Count:  0
  Updated Machine Count:      3
Events:  <none>

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-140-86.us-west-2.compute.internal    Ready    master   76m   v1.13.4+3bd346709
ip-10-0-143-236.us-west-2.compute.internal   Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-155-84.us-west-2.compute.internal    Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-157-48.us-west-2.compute.internal    Ready    master   76m   v1.13.4+3bd346709
ip-10-0-166-99.us-west-2.compute.internal    Ready    worker   69m   v1.13.4+3bd346709
ip-10-0-168-59.us-west-2.compute.internal    Ready    master   77m   v1.13.4+3bd346709
```
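The `removed (evicted)` lines in the MCD log above come from the drain's eviction path. For orientation, here is a hedged sketch of how such an eviction is typically requested through the Kubernetes policy API (modern client-go shown; this is not the exact openshift/kubernetes-drain code). The key point of the fix: this call returning nil only means the eviction request was accepted, so it must be paired with a wait for actual deletion like the `waitForDelete` sketch earlier.

```
// Sketch only: evictPod is an illustrative name, and the policy/v1 client
// shown here is an assumption; the 4.1-era library used an older API.
package drainsketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod asks the API server to evict a pod, honoring PodDisruptionBudgets.
func evictPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	}
	err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
	if apierrors.IsTooManyRequests(err) {
		// A PodDisruptionBudget is blocking the eviction; callers retry later.
		return fmt.Errorf("eviction of %s/%s blocked by a disruption budget: %w",
			pod.Namespace, pod.Name, err)
	}
	if err != nil {
		return err
	}
	// Returning success here without polling for deletion is exactly the bug
	// described in this report: the pod may still be running at this point.
	return nil
}
```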
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2594