Bug 1701409
| Summary: | All nodes degraded and MCO Available=False on fresh install | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED NOTABUG | QA Contact: | Micah Abbott <miabbott> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | fshaikh, kgarriso |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-07 17:36:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description

Seth Jennings 2019-04-18 21:35:09 UTC
(In reply to Seth Jennings from comment #0)
> I thought we had this fixed but I guess not. Nodes going degraded and MCO
> jammed up because node's currentConfig has been deleted.

"Degraded" is _fixed_ in the sense that the MCD keeps retrying, but this is a condition we can't really reconcile in any way, so there is little we can fix. This is caused by a drift between the installer bootstrap and the in-cluster MCO bringup.

I can't see how this failed when the release page shows a green build for that payload: https://origin-release.svc.ci.openshift.org/ You're using 4.1.0-0.okd-2019-04-18-203943, which is green.

A few things to check:
- Is this a plain install from installer master with that payload?
- Has something changed in the installer bootstrap that causes a drift from what the MCO has in-cluster?
- Is installing from that payload 100% reproducible?
- Installer version: did you build installer master and just run it, or did you grab that payload, extract the installer, and install from that payload's installer?
- Is this libvirt or AWS?

Ok, clarified on Slack that this is a bare metal install. Copying what I wrote on Slack:

> @sjenning I'm not sure how to proceed with that bug, though. It looks like the MCs generated at install bootstrap differ from the ones generated once the MCO comes up in the cluster. If you have any idea, that might help.

As per the Slack conversation, this was the result of a certificate change for the kubeconfig, so it is not something reproducible on builds (just a test, I guess). We'll follow up on GitHub/Slack to assist with the change.

Ok, turns out this is caused by skew between the installer and the MCO. Basically, the MCO assumes that it can reconstruct the exact MachineConfig that the bootstrap process constructs. If there is any mismatch, the cluster is hosed on arrival, with all nodes in an unrecoverable Degraded state.

In this particular case, I was testing a change in the installer, resulting in the skew. I'll close since I introduced the skew, but this is brittle.
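The failure mode described above can be sketched as follows. This is a minimal, hypothetical model, not the real MCD code: the class and function names are illustrative. It shows how a rendered-config name derived from the config contents makes any installer/MCO skew fatal, since the node's currentConfig no longer matches anything the in-cluster MCO can regenerate.

```python
# Hypothetical sketch of the reconcile mismatch described in this bug.
# Names (MachineConfig, reconcile) are illustrative, not the real MCD code.
import hashlib

def render_hash(ignition: str) -> str:
    """Content hash standing in for the rendered-MachineConfig name suffix."""
    return hashlib.sha256(ignition.encode()).hexdigest()[:12]

class MachineConfig:
    """Toy stand-in for a rendered MachineConfig: name derives from contents."""
    def __init__(self, ignition: str):
        self.ignition = ignition
        self.name = f"rendered-master-{render_hash(ignition)}"

def reconcile(node_current_config: str, in_cluster_rendered: MachineConfig) -> str:
    """Return the node state a daemon like the MCD would report.

    The brittleness: the in-cluster MCO is assumed to regenerate exactly the
    config the installer's bootstrap rendered. Any skew in the inputs (here,
    a different ignition payload) changes the rendered name, the node's
    currentConfig no longer resolves, and the node stays Degraded.
    """
    if node_current_config != in_cluster_rendered.name:
        return "Degraded"
    return "Done"

# Bootstrap and in-cluster MCO render from identical input: healthy.
bootstrap = MachineConfig("ignition-v1")
print(reconcile(bootstrap.name, MachineConfig("ignition-v1")))       # Done

# An installer-side change (e.g. the kubeconfig certificate change in this
# bug) alters the rendered payload: every node arrives Degraded.
print(reconcile(bootstrap.name, MachineConfig("ignition-v1-cert2"))) # Degraded
```

Since the state is derived purely from the content-addressed name, there is no partial reconciliation path: the daemon cannot tell a benign drift from a real divergence, which is why the cluster is "hosed on arrival" rather than self-healing.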