Bug 1702626 - machine-config-daemon marks node as degraded due to "open /etc/machine-config-daemon/node-annotations.json: no such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard: 4.1.2
Depends On:
Blocks: 1717970 1718956
 
Reported: 2019-04-24 09:53 UTC by weiwei jiang
Modified: 2019-06-19 06:45 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 1717970
Environment:
Last Closed: 2019-06-19 06:45:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1382 0 None None None 2019-06-19 06:45:44 UTC

Description weiwei jiang 2019-04-24 09:53:14 UTC
Description of problem:
Installed the cluster via the bare metal (UPI) method; after a couple of hours, clusterversion reported that the machine-config operator was degraded. Only 2/5 nodes on the cluster hit this issue.


Version-Release number of selected component (if applicable):
# oc get clusterversion                                                                 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version    4.1.0-0.nightly-2019-04-22-005054   True        False         20h      Cluster version is 4.1.0-0.nightly-2019-04-22-005054



How reproducible:
Sometimes

Steps to Reproduce:
1. Install the cluster via the bare metal (UPI) method
2. Check the machine-config operator after a couple of hours

Actual results:
# oc get machineconfig rendered-master-c1f7f8d33728c3a053f65041b5f91b98 -o yaml | grep -i path:
        path: /etc/tmpfiles.d/cleanup-cni.conf
        path: /etc/kubernetes/manifests/etcd-member.yaml
        path: /etc/kubernetes/static-pod-resources/etcd-member/ca.crt
        path: /etc/kubernetes/static-pod-resources/etcd-member/metric-ca.crt
        path: /etc/kubernetes/static-pod-resources/etcd-member/root-ca.crt
        path: /etc/systemd/system.conf.d/kubelet-cgroups.conf
        path: /var/lib/kubelet/config.json
        path: /etc/kubernetes/ca.crt
        path: /etc/sysctl.d/forward.conf
        path: /etc/kubernetes/kubelet-plugins/volume/exec/.dummy
        path: /etc/containers/registries.conf
        path: /etc/containers/storage.conf
        path: /etc/crio/crio.conf
        path: /etc/kubernetes/kubelet.conf
# oc get machineconfig rendered-worker-c8835fe6fb163ad0a3613ade0af0950f -o yaml |grep -i path:
        path: /etc/tmpfiles.d/cleanup-cni.conf
        path: /etc/systemd/system.conf.d/kubelet-cgroups.conf
        path: /var/lib/kubelet/config.json
        path: /etc/kubernetes/ca.crt
        path: /etc/sysctl.d/forward.conf
        path: /etc/kubernetes/kubelet-plugins/volume/exec/.dummy
        path: /etc/containers/registries.conf
        path: /etc/containers/storage.conf
        path: /etc/crio/crio.conf
        path: /etc/kubernetes/kubelet.conf

# oc -n openshift-machine-config-operator logs machine-config-daemon-k4mxd
W0424 02:29:35.466552  119274 daemon.go:308] Booting the MCD errored with failed to read initial annotations from "/etc/machine-config-daemon/node-annotations.json": open /etc/machine-config-daemon/node-annotations.json: no such file or directory
E0424 02:29:35.466747  119274 writer.go:119] Marking Degraded due to: failed to read initial annotations from "/etc/machine-config-daemon/node-annotations.json": open /etc/machine-config-daemon/node-annotations.json: no such file or directory

# oc -n openshift-machine-config-operator logs machine-config-operator-5fddfb5cc6-f6nfv
I0424 00:41:44.228867       1 start.go:42] Version: 4.1.0-201904211700-dirty
I0424 00:41:44.231718       1 leaderelection.go:205] attempting to acquire leader lease  openshift-machine-config-operator/machine-config...
E0424 00:43:39.876927       1 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"e6f71464-659a-11e9-a21d-801844eef6b8", ResourceVersion:"802862", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63691601958, loc:(*time.Location)(0x1d93560)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-5fddfb5cc6-f6nfv_bbd1df79-6629-11e9-98e4-0a580a80024f\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-04-24T00:43:39Z\",\"renewTime\":\"2019-04-24T00:43:39Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-5fddfb5cc6-f6nfv_bbd1df79-6629-11e9-98e4-0a580a80024f became leader'
I0424 00:43:39.877661       1 leaderelection.go:214] successfully acquired lease openshift-machine-config-operator/machine-config
I0424 00:43:39.880140       1 operator.go:193] Starting MachineConfigOperator
E0424 00:45:10.299126       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:46:34.134281       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:48:03.760504       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:49:33.768100       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:51:03.802832       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:52:33.801943       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:54:03.814132       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:55:33.855980       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:57:03.850759       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)
E0424 00:58:33.874057       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 2, unavailable: 0)

# oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-22-005054   True        False         18h     Error while reconciling 4.1.0-0.nightly-2019-04-22-005054: the cluster operator machine-config is degraded


Expected results:
The cluster should not hit this issue.

Additional info:

Comment 1 Antonio Murdaca 2019-04-24 10:48:49 UTC
Can you provide must-gather logs for the openshift-machine-config-operator namespace?

Comment 2 Antonio Murdaca 2019-04-24 10:58:21 UTC
OK, figured it out; I'm creating a patch. In any case, those 2 nodes are having issues talking to the API server if they got to that code path (which is not related to the MCO).

Comment 3 Antonio Murdaca 2019-04-24 13:14:41 UTC
Never mind; it looks like the machine-config-server logs would shed some light.

Also, what does 'after a couple of hours' mean? Does the cluster come up with every node initially, and then you start seeing this?

There are 2 possible scenarios as to why we're getting there:

- assuming the real bootstrap went well, we remove that file, and the next iteration of the MCD should just grab the annotations from the node object coming from the informer
- no initial annotation file was ever written (which would be very bad); a quick check to tell these two cases apart is sketched below
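
To tell these two cases apart, one quick check is whether the node object itself still carries the MCD annotations (a sketch; <node-name> is a placeholder):

$ oc get node <node-name> -o template --template='{{index .metadata.annotations "machineconfiguration.openshift.io/currentConfig"}}{{"\n"}}'

If this prints nothing, the node object has lost (or never had) the annotations; if it prints a rendered config name, only the on-disk file is missing.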

If you can provide the must-gather output I asked for above, that will hopefully help.

Comment 4 weiwei jiang 2019-04-25 02:22:43 UTC
(In reply to Antonio Murdaca from comment #3)
> Never mind; it looks like the machine-config-server logs would shed some
> light.
> 
> Also, what does 'after a couple of hours' mean? Does the cluster come up
> with every node initially, and then you start seeing this?

Yeah, every node comes up during installation, but during our function testing
2/5 nodes went down due to this.
Since there are no stable steps to reproduce this, I just say "after a couple of hours".
To keep this from blocking our testing, I also applied a temporary workaround (sketched below):
create /etc/machine-config-daemon/node-annotations.json with content of the following form:
{"machineconfiguration.openshift.io/currentConfig":"3aef043ad5aa416e240b6f207c5cd3b0","machineconfiguration.openshift.io/desiredConfig":"3aef043ad5aa416e240b6f207c5cd3b0"}
After that, all nodes came back into service.
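
As a sketch, that workaround amounts to running the following on the affected node as root, where RENDERED is the rendered machine config of the node's pool (the value below is the one from this cluster; substitute your own):

# RENDERED=3aef043ad5aa416e240b6f207c5cd3b0
# mkdir -p /etc/machine-config-daemon
# cat > /etc/machine-config-daemon/node-annotations.json <<EOF
{"machineconfiguration.openshift.io/currentConfig":"$RENDERED","machineconfiguration.openshift.io/desiredConfig":"$RENDERED"}
EOF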

> 
> There are 2 possible scenarios as to why we're getting there:
> 
> - assuming the real bootstrap went well, we remove that file, and the next
> iteration of the MCD should just grab the annotations from the node object
> coming from the informer
> - no initial annotation file was ever written (which would be very bad)
> 
> If you can provide the must-gather output I asked for above, that will
> hopefully help.

During the function testing we had already rebooted the whole cluster, so I
think all of the useful logs have been flushed; my fault.

Comment 5 Antonio Murdaca 2019-04-25 11:54:06 UTC
We need full logs, or steps to retest and reproduce; there's little we can debug other than looking at the code. Looking at the code tells us that the node object doesn't contain the current/desired annotations, which is something that shouldn't happen; if it is happening, something in the logs (and the nodes' yaml) would shed some light.
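
For reference, the requested data could be gathered with something like this (a sketch; the daemon pod name is a placeholder):

$ oc adm must-gather
$ oc get nodes -o yaml > nodes.yaml
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> > mcd.log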

Comment 6 Antonio Murdaca 2019-04-25 12:03:55 UTC
Also, if you're hitting that code path, it means that those 2/5 nodes rebooted, right? Can you confirm that and give us an overview of what your testing does?

Comment 7 Antonio Murdaca 2019-04-29 14:09:03 UTC
Any update on this and/or logs to provide to us to further debug this issue?

Comment 8 weiwei jiang 2019-04-30 07:22:17 UTC
(In reply to Antonio Murdaca from comment #7)
> Any update on this and/or logs to provide to us to further debug this issue?

Sorry, we have not hit this in subsequent versions so far.
We can keep this open for a while; if we hit this issue again, I will try to gather some info for debugging.

Comment 9 Antonio Murdaca 2019-05-02 13:51:13 UTC
So should we drop the beta blocker from this, or close it and have you reopen it if you come across this again?

Comment 10 Antonio Murdaca 2019-05-02 17:35:47 UTC
This apparently isn't happening anymore, so I'm leaning toward closing it; if it happens again, we'll reopen.

Comment 12 Antonio Murdaca 2019-05-30 10:33:09 UTC
I'm playing around with the cluster provided. It's not super clear why the current/desired annotations aren't on the node object from the API. The cluster/nodes aren't bootstrapping, so those annotations must have been there, but they aren't anymore for some reason. I've asked in forum-api-review whether it's at all possible to lose annotations (somehow during a strategic merge patch?). The node-annotations.json file is _correctly_ not on the host either, so that's just a red herring; the real issue is that we have a node without MCD annotations. For context, this is UPI on bare metal with mixed RHEL7/RHCOS workers.

I'm hesitant to go ahead and change the code to reconcile this situation (by adding annotations based on the current config on disk).

Comment 13 Antonio Murdaca 2019-05-30 10:46:15 UTC
Journal shows that we correctly went and rebooted into the desired config (so we _had_ annotations):

May 28 05:42:11 dell-r730-063.dsal.lab.eng.rdu2.redhat.com root[108702]: machine-config-daemon[3558]: Starting update from rendered-master-91b1217fc1c1f69a0ff9d8f29a4c268d to rendered-master-d7ebe137310773da1c561ca3e2eb4752
May 28 05:42:11 dell-r730-063.dsal.lab.eng.rdu2.redhat.com hyperkube[1939]: I0528 05:42:11.820222    1939 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml"
May 28 05:42:11 dell-r730-063.dsal.lab.eng.rdu2.redhat.com root[108708]: {"MESSAGE": "rendered-master-d7ebe137310773da1c561ca3e2eb4752", "BOOT_ID": "3fd10968-b436-47d1-a8d7-6d5498875ffa", "PENDING": "1", "OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK": "1"}
May 28 05:42:11 dell-r730-063.dsal.lab.eng.rdu2.redhat.com root[108709]: machine-config-daemon[3558]: Update prepared; beginning drain
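
These entries come from the node's system journal; a filter along these lines on the host should surface them (a sketch, not a command taken from this bug):

# journalctl --no-pager | grep machine-config-daemon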


I'm testing with the first master in https://bugzilla.redhat.com/show_bug.cgi?id=1702626#c11

On this node I don't see anything obvious about why we lost the annotations, though; there's just an error unmounting /var:

May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: systemd-fsck@dev-disk-by\x2duuid-68453234\x2d96ce\x2d47e7\x2db9ad\x2>
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Removed slice system-systemd\x2dfsck.slice.
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: system-systemd\x2dfsck.slice: Consumed 13ms CPU time
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Unmounting /var/lib/containers/storage/overlay...
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Stopped target Swap.
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Unmounted /var/lib/containers/storage/overlay.
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: var-lib-containers-storage-overlay.mount: Consumed 2ms CPU time
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Unmounting /var...
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com umount[111614]: umount: /var: target is busy.
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: var.mount: Mount process exited, code=exited status=32
May 28 05:42:21 dell-r730-063.dsal.lab.eng.rdu2.redhat.com systemd[1]: Failed unmounting /var.

Comment 14 Antonio Murdaca 2019-05-30 14:27:56 UTC
Can you shed some light on what this "function testing" (https://bugzilla.redhat.com/show_bug.cgi?id=1702626#c4) does to the cluster? Is it rejoining/adding nodes to the cluster?

There are also 2 workarounds for this issue: one is in https://bugzilla.redhat.com/show_bug.cgi?id=1702626#c4, and the other is adding those annotations directly to the nodes (sketched below).
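
The annotation workaround, sketched (node name and rendered config name are placeholders):

$ oc annotate node <node-name> \
    machineconfiguration.openshift.io/currentConfig=<rendered-config> \
    machineconfiguration.openshift.io/desiredConfig=<rendered-config>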

Comment 15 Antonio Murdaca 2019-05-30 14:44:55 UTC
The 2 failing nodes look "newer" as well:

16:40:55 [~] oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
dell-r730-063.dsal.lab.eng.rdu2.redhat.com   Ready    master   28h    v1.13.4+cb455d664
dell-r730-064.dsal.lab.eng.rdu2.redhat.com   Ready    master   6d7h   v1.13.4+cb455d664
dell-r730-065.dsal.lab.eng.rdu2.redhat.com   Ready    master   6d7h   v1.13.4+cb455d664
dell-r730-066.dsal.lab.eng.rdu2.redhat.com   Ready    worker   28h    v1.13.4+cb455d664
dell-r730-067.dsal.lab.eng.rdu2.redhat.com   Ready    worker   6d7h   v1.13.4+cb455d664
dell-r730-068.dsal.lab.eng.rdu2.redhat.com   Ready    worker   6d6h   v1.13.4+54aa63688


How are those brought in in the bare metal UPI scenario? Do you grab the Ignition config from somewhere?

Comment 16 Antonio Murdaca 2019-05-30 14:53:12 UTC
Looks like those nodes were added yesterday:

CreationTimestamp:  Wed, 29 May 2019 11:58:29 +0200


So, how were they added?

Comment 19 Antonio Murdaca 2019-05-31 08:16:21 UTC
What it now looks like to me is that those nodes were added without properly going through the necessary setup/Ignition process, and therefore they're missing the initial node annotations, which are required for us to function correctly.

Comment 20 Antonio Murdaca 2019-05-31 13:03:53 UTC
I'm tentatively moving this to 4.1.z because:

- we have 2 workarounds
- it's still unclear if there's manual intervention that could cause this (hence the bug may be somewhere else)
- it's not disrupting upgrades (it happens after some idle time)
- nodes are being removed from and added back to the cluster, and it's still unclear how/when/what is doing that
- there's also a PR against the MCO that attempts to reconcile this scenario

Comment 22 Colin Walters 2019-05-31 13:11:36 UTC
> What it now looks like to me is that those nodes were added without properly going through the necessary setup/Ignition process, and therefore they're missing the initial node annotations, which are required for us to function correctly.

If that's being tested here, it's not something we have thought about upstream; Antonio is looking at it, but it would be very helpful for us to have the code being used for the test.

Concur with Antonio in general; we have workarounds, and I don't think there is any realistic way this is a 4.1.0 blocker.

Comment 24 Antonio Murdaca 2019-05-31 13:41:46 UTC
Is DNS set up correctly, and is the bootstrap node gone for that bare metal installation? I was pointed to the fact that those nodes could still be getting the initial Ignition config from the bootstrap node if it's not gone.

Comment 27 Antonio Murdaca 2019-05-31 13:59:16 UTC
OK, reproduced on AWS as well :)

- oc delete node <node>
- ssh to the node
- systemctl restart kubelet
- the node comes up again and is registered with the apiserver, but the annotations are gone (sketched as commands below)
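
As a command sketch of the above (the node name is a placeholder; ssh access as the core user is assumed):

$ oc delete node <node>
$ ssh core@<node> sudo systemctl restart kubelet
$ oc get node <node> -o template --template='{{.metadata.annotations}}{{"\n"}}'

The last command shows the node object back in the API, but without the machineconfiguration.openshift.io annotations.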

What's the component for the above scenario?

Comment 28 Antonio Murdaca 2019-05-31 14:03:25 UTC
If the MCO is still supposed to be in charge of this scenario, I believe we'll need something like https://github.com/openshift/machine-config-operator/pull/807, but I'm unsure about that. It's actually expected that we don't have the annotations after deleting the node object from etcd, right?

Comment 29 Antonio Murdaca 2019-05-31 14:50:30 UTC
The fix on the MCO side is here, and I've tested it as well: https://github.com/openshift/machine-config-operator/pull/807 (we might want a proper test on the QE side too, based on my steps in https://bugzilla.redhat.com/show_bug.cgi?id=1702626#c27).

Comment 30 Antonio Murdaca 2019-06-04 11:44:06 UTC
PR merged, moving to MODIFIED

Comment 33 Antonio Murdaca 2019-06-05 19:19:33 UTC
https://github.com/openshift/machine-config-operator/pull/821

Cherry-pick created there; who's approving backports? Clayton?

Comment 34 W. Trevor King 2019-06-05 23:12:07 UTC
https://github.com/openshift/machine-config-operator/pull/821 is still open, moving to POST.

Comment 37 Antonio Murdaca 2019-06-07 07:57:56 UTC
Steps to reproduce and test are here: https://bugzilla.redhat.com/show_bug.cgi?id=1702626#c27

Comment 38 Micah Abbott 2019-06-07 15:17:16 UTC
This appears to be fixed in 4.1.0-0.nightly-2019-06-06-160120

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                     
version   4.1.0-0.nightly-2019-06-06-160120   True        False         20m     Cluster version is 4.1.0-0.nightly-2019-06-06-160120

$ oc get nodes                                             
NAME                                         STATUS   ROLES    AGE   VERSION                               
ip-10-0-133-148.us-west-2.compute.internal   Ready    master   39m   v1.13.4+4b47d3394                          
ip-10-0-140-148.us-west-2.compute.internal   Ready    worker   33m   v1.13.4+4b47d3394                     
ip-10-0-155-214.us-west-2.compute.internal   Ready    worker   33m   v1.13.4+4b47d3394                          
ip-10-0-156-57.us-west-2.compute.internal    Ready    master   39m   v1.13.4+4b47d3394            
ip-10-0-161-196.us-west-2.compute.internal   Ready    master   39m   v1.13.4+4b47d3394                     
ip-10-0-172-201.us-west-2.compute.internal   Ready    worker   33m   v1.13.4+4b47d3394

### check annotations

$ oc get nodes -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'                                           
===> node:> ip-10-0-133-148.us-west-2.compute.internal                                                     
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-0                         
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821 
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821 
machineconfiguration.openshift.io/state : Done                                                     
volumes.kubernetes.io/controller-managed-attach-detach : true                                                   
                                                                                        
===> node:> ip-10-0-140-148.us-west-2.compute.internal                                             
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2a-h4zhv
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

===> node:> ip-10-0-155-214.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2b-g6ltv
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

===> node:> ip-10-0-156-57.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-1
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

===> node:> ip-10-0-161-196.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-2
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

===> node:> ip-10-0-172-201.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2c-2sjdn
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

### delete node

$ oc delete node ip-10-0-140-148.us-west-2.compute.internal
node "ip-10-0-140-148.us-west-2.compute.internal" deleted

$ oc get nodes                                                                                                    
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-133-148.us-west-2.compute.internal   Ready    master   39m   v1.13.4+4b47d3394
ip-10-0-155-214.us-west-2.compute.internal   Ready    worker   33m   v1.13.4+4b47d3394
ip-10-0-156-57.us-west-2.compute.internal    Ready    master   39m   v1.13.4+4b47d3394
ip-10-0-161-196.us-west-2.compute.internal   Ready    master   39m   v1.13.4+4b47d3394
ip-10-0-172-201.us-west-2.compute.internal   Ready    worker   33m   v1.13.4+4b47d3394

### ssh to node; restart kubelet

$ ./ssh.sh ip-10-0-140-148.us-west-2.compute.internal                                                              
[root@ip-10-0-140-148 ~]#                                                                                                                                                                                                                                                                                                                                                                                      
[root@ip-10-0-140-148 ~]# systemctl restart kubelet                                                                                                                                                                  
[root@ip-10-0-140-148 ~]# Connection to ip-10-0-140-148.us-west-2.compute.internal closed by remote host.

### check for return of node (takes a bit of time to become Ready)
### note how "young" the returned node is

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION    
ip-10-0-133-148.us-west-2.compute.internal   Ready                      master   40m   v1.13.4+4b47d3394                                                                                                            
ip-10-0-140-148.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   46s   v1.13.4+4b47d3394
ip-10-0-155-214.us-west-2.compute.internal   Ready                      worker   34m   v1.13.4+4b47d3394
ip-10-0-156-57.us-west-2.compute.internal    Ready                      master   40m   v1.13.4+4b47d3394
ip-10-0-161-196.us-west-2.compute.internal   Ready                      master   40m   v1.13.4+4b47d3394
ip-10-0-172-201.us-west-2.compute.internal   Ready                      worker   34m   v1.13.4+4b47d3394

$ oc get nodes
NAME                                         STATUS                        ROLES    AGE   VERSION 
ip-10-0-133-148.us-west-2.compute.internal   Ready                         master   40m   v1.13.4+4b47d3394
ip-10-0-140-148.us-west-2.compute.internal   NotReady,SchedulingDisabled   worker   66s   v1.13.4+4b47d3394
ip-10-0-155-214.us-west-2.compute.internal   Ready                         worker   35m   v1.13.4+4b47d3394
ip-10-0-156-57.us-west-2.compute.internal    Ready                         master   40m   v1.13.4+4b47d3394
ip-10-0-161-196.us-west-2.compute.internal   Ready                         master   41m   v1.13.4+4b47d3394
ip-10-0-172-201.us-west-2.compute.internal   Ready                         worker   35m   v1.13.4+4b47d3394

$ oc get nodes                                                                                                    
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-133-148.us-west-2.compute.internal   Ready    master   44m     v1.13.4+4b47d3394                                                                                                                            
ip-10-0-140-148.us-west-2.compute.internal   Ready    worker   4m15s   v1.13.4+4b47d3394
ip-10-0-155-214.us-west-2.compute.internal   Ready    worker   38m     v1.13.4+4b47d3394
ip-10-0-156-57.us-west-2.compute.internal    Ready    master   43m     v1.13.4+4b47d3394
ip-10-0-161-196.us-west-2.compute.internal   Ready    master   44m     v1.13.4+4b47d3394
ip-10-0-172-201.us-west-2.compute.internal   Ready    worker   38m     v1.13.4+4b47d3394

### check annotations again

$ oc get nodes -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
===> node:> ip-10-0-133-148.us-west-2.compute.internal                                             
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-0                 
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821                                                                                                                  
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/state : Done                                                                                                                                                                      
volumes.kubernetes.io/controller-managed-attach-detach : true                           
                                                                                        
===> node:> ip-10-0-140-148.us-west-2.compute.internal                                  
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2a-h4zhv
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a                                                                                                                   
machineconfiguration.openshift.io/reason :                           
machineconfiguration.openshift.io/ssh : accessed      
machineconfiguration.openshift.io/state : Done                                    
volumes.kubernetes.io/controller-managed-attach-detach : true                                     
                                                                                                  
===> node:> ip-10-0-155-214.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2b-g6ltv
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/state : Done                                                   
volumes.kubernetes.io/controller-managed-attach-detach : true                                     
                                                                                                  
===> node:> ip-10-0-156-57.us-west-2.compute.internal
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-1
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
                                                                                                 
===> node:> ip-10-0-161-196.us-west-2.compute.internal                                            
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-master-2                
machineconfiguration.openshift.io/currentConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/desiredConfig : rendered-master-ed6a478d088833b088254cbc5709e821
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
                                                                                  
===> node:> ip-10-0-172-201.us-west-2.compute.internal                                            
machine.openshift.io/machine : openshift-machine-api/miabbott-4-1-1-m7wwc-worker-us-west-2c-2sjdn 
machineconfiguration.openshift.io/currentConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/desiredConfig : rendered-worker-71b0bb2aa39c67b5a48ee05f1de4c28a
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true

### check cluster health

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-06-06-160120   True        False         31m     Cluster version is 4.1.0-0.nightly-2019-06-06-160120

```

Thanks to Antonio and Weiwei for excellent reproduction steps!

Comment 40 errata-xmlrpc 2019-06-19 06:45:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1382

