I thought we had this fixed, but I guess not. Nodes are going Degraded and the MCO is jammed up because the nodes' currentConfig has been deleted.

$ oc get clusteroperator machine-config -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-04-18T21:03:17Z"
  generation: 1
  name: machine-config
  resourceVersion: "18671"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 63b2c5a8-621d-11e9-91aa-fa163e87d254
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-04-18T21:03:17Z"
    message: Cluster not available for 4.1.0-0.okd-2019-04-18-203943
    status: "False"
    type: Available
  - lastTransitionTime: "2019-04-18T21:03:17Z"
    message: Cluster is bootstrapping 4.1.0-0.okd-2019-04-18-203943
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-04-18T21:04:17Z"
    message: 'Failed to resync 4.1.0-0.okd-2019-04-18-203943 because: error pool master
      is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 1)'
    reason: 'error pool master is not ready, retrying. Status: (total: 3, updated: 0,
      unavailable: 1)'
    status: "True"
    type: Failing
  extension:
    master: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-2a44c08aee389f30a3b9fac27fc7b655
    worker: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-worker-385d6756366aa653245dfd791f5d348c
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  versions:
  - name: operator
    version: 4.1.0-0.okd-2019-04-18-203943

$ oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER               IGNITIONVERSION   CREATED
00-master                                                   4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
00-worker                                                   4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
01-master-container-runtime                                 4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
01-master-kubelet                                           4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
01-worker-container-runtime                                 4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
01-worker-kubelet                                           4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
99-master-63ba5a7a-621d-11e9-91aa-fa163e87d254-registries   4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
99-worker-63bf91c8-621d-11e9-91aa-fa163e87d254-registries   4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
rendered-master-2a44c08aee389f30a3b9fac27fc7b655            4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m
rendered-worker-385d6756366aa653245dfd791f5d348c            4.0.0-alpha.0-211-gbdedef5b-dirty   2.2.0             20m

$ for i in $(oc get pod | grep daemon | awk '{print $1}'); do echo $i; oc logs $i | tail -n 5 | grep Degraded; done
machine-config-daemon-4c6nz
E0418 21:23:03.591366 2524 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-worker-32f2cd05a2b97700624d8f5b231a56e8" not found
machine-config-daemon-6m9n8
E0418 21:23:04.389970 4615 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-5600e489f20c5c6a4d4419503d738e09" not found
machine-config-daemon-kr74d
E0418 21:23:05.163767 5242 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-5600e489f20c5c6a4d4419503d738e09" not found
machine-config-daemon-l48fv
E0418 21:23:05.885322 5588 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-5600e489f20c5c6a4d4419503d738e09" not found
machine-config-daemon-n46fn
E0418 21:23:06.733116 2867 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-worker-32f2cd05a2b97700624d8f5b231a56e8" not found
machine-config-daemon-wpdpv
E0418 21:23:07.326743 2522 writer.go:119] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-worker-32f2cd05a2b97700624d8f5b231a56e8" not found

$ oc get nodes -oyaml | grep -e " name:" -e machineconfig
      machineconfiguration.openshift.io/currentConfig: rendered-master-5600e489f20c5c6a4d4419503d738e09
      machineconfiguration.openshift.io/desiredConfig: rendered-master-5600e489f20c5c6a4d4419503d738e09
      machineconfiguration.openshift.io/state: Degraded
    name: master-0
      machineconfiguration.openshift.io/currentConfig: rendered-master-5600e489f20c5c6a4d4419503d738e09
      machineconfiguration.openshift.io/desiredConfig: rendered-master-5600e489f20c5c6a4d4419503d738e09
      machineconfiguration.openshift.io/state: Degraded
    name: master-1
      machineconfiguration.openshift.io/currentConfig: rendered-master-5600e489f20c5c6a4d4419503d738e09
      machineconfiguration.openshift.io/desiredConfig: rendered-master-2a44c08aee389f30a3b9fac27fc7b655
      machineconfiguration.openshift.io/state: Degraded
    name: master-2
      machineconfiguration.openshift.io/currentConfig: rendered-worker-32f2cd05a2b97700624d8f5b231a56e8
      machineconfiguration.openshift.io/desiredConfig: rendered-worker-32f2cd05a2b97700624d8f5b231a56e8
      machineconfiguration.openshift.io/state: Degraded
    name: worker-0
      machineconfiguration.openshift.io/currentConfig: rendered-worker-32f2cd05a2b97700624d8f5b231a56e8
      machineconfiguration.openshift.io/desiredConfig: rendered-worker-385d6756366aa653245dfd791f5d348c
      machineconfiguration.openshift.io/state: Degraded
    name: worker-1
      machineconfiguration.openshift.io/currentConfig: rendered-worker-32f2cd05a2b97700624d8f5b231a56e8
      machineconfiguration.openshift.io/desiredConfig: rendered-worker-32f2cd05a2b97700624d8f5b231a56e8
      machineconfiguration.openshift.io/state: Degraded
    name: worker-2
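A quick way to confirm this failure mode (a sketch using jsonpath against the annotation keys shown above; adjust the namespace and escaping as needed for your environment) is to walk every node and check whether the rendered configs its annotations point at still exist:

$ for node in $(oc get nodes -o name); do
    for key in currentConfig desiredConfig; do
      # Annotation keys as shown in the node output above.
      mc=$(oc get "$node" -o jsonpath="{.metadata.annotations.machineconfiguration\.openshift\.io/$key}")
      # If the referenced MachineConfig has been deleted, the MCD has nothing to reconcile against.
      oc get machineconfig "$mc" >/dev/null 2>&1 && state=present || state=MISSING
      echo "$node $key=$mc ($state)"
    done
  done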
(In reply to Seth Jennings from comment #0)
> I thought we had this fixed, but I guess not. Nodes are going Degraded and
> the MCO is jammed up because the nodes' currentConfig has been deleted.

Degraded is _fixed_ in the sense that the MCD keeps retrying, but this is a condition we can't really reconcile in any way, so there's little we can fix... This is caused by a drift between what the installer renders at bootstrap and what the in-cluster MCO renders at bringup.

I can't see how this failed when the release page shows a green build for that payload: https://origin-release.svc.ci.openshift.org/ You're using 4.1.0-0.okd-2019-04-18-203943, which is green.
Also, is this a plain install from installer master with that payload? Has something changed in the installer's bootstrap path that would cause a drift from what the MCO renders in-cluster? Is installing from that payload 100% reproducible?
Last thing to check: the installer version. Did you build installer master and just run it, or did you grab that payload, extract the installer from it, and install with that payload's installer?
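For reference, the installer baked into a payload can be pulled out with oc adm release extract. This is just a sketch; the registry path below is the usual OKD CI location and is an assumption here, not something taken from this bug:

# Extract the openshift-install binary that shipped in the payload, so the
# installer and the in-cluster operators come from the same release.
$ oc adm release extract \
    --command=openshift-install \
    --to=./payload-installer \
    registry.svc.ci.openshift.org/origin/release:4.1.0-0.okd-2019-04-18-203943
$ ./payload-installer/openshift-install version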
Is this libvirt or AWS?
OK, clarified on Slack that this is a bare metal install. Copying what I wrote on Slack:

@sjenning I'm not sure how to proceed with this bug, though. It looks like the MCs generated at install bootstrap differ from the ones generated once the MCO comes up in the cluster; let me know if you have any idea that might help.
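To see exactly where the drift is, the idea is to diff whatever copy of the bootstrap-rendered MC is still around against the one the in-cluster controller rendered. A rough sketch, with the caveats that the on-disk path below is an assumption (it may not survive past firstboot on every MCD version) and that jq must be available locally:

# Grab the config the node actually booted with (path is an assumption).
$ oc debug node/master-2 -- chroot /host cat /etc/machine-config-daemon/currentconfig > bootstrap-mc.json

# Grab the config the in-cluster controller rendered for the same pool.
$ oc get machineconfig rendered-master-2a44c08aee389f30a3b9fac27fc7b655 -o json > cluster-mc.json

# Any difference in spec is the skew the MCD cannot reconcile.
$ diff <(jq -S .spec bootstrap-mc.json) <(jq -S .spec cluster-mc.json)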
Per the Slack conversation, this was the result of a certificate change for the kubeconfig, so it's not something reproducible from the builds (just a test, I guess). We'll follow up on GitHub/Slack to assist with the change.
OK, it turns out this is caused by skew between the installer and the MCO. Basically, the MCO assumes it can reconstruct the exact MC that the bootstrap process constructed. If there is any mismatch, the cluster is hosed on arrival with all nodes in an unrecoverable Degraded state. In this particular case I was testing a change in the installer, which introduced the skew. I'll close this since I introduced the skew myself, but this is brittle...
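For anyone who lands here with a cluster already in this state: since the only broken link is the node annotation pointing at a deleted MachineConfig, one very much unsupported way to poke at it is to repoint a node at a rendered config that still exists and watch whether the MCD clears the Degraded state. This is a sketch to illustrate the moving parts, not a recovery procedure; it only makes sense if the files on disk already match the target config, and the daemon pod label used below is an assumption:

$ NODE=master-0
$ TARGET=rendered-master-2a44c08aee389f30a3b9fac27fc7b655

# Point both annotations at a MachineConfig that actually exists in the cluster.
$ oc annotate node "$NODE" --overwrite \
    machineconfiguration.openshift.io/currentConfig="$TARGET" \
    machineconfiguration.openshift.io/desiredConfig="$TARGET"

# Follow the MCD on that node to see whether it stops marking itself Degraded.
$ oc -n openshift-machine-config-operator logs -f \
    "$(oc -n openshift-machine-config-operator get pod \
        -l k8s-app=machine-config-daemon \
        --field-selector spec.nodeName="$NODE" -o name)"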