Bug 1701409
| Summary: | All nodes degraded and MCO Available=False on fresh install | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED NOTABUG | QA Contact: | Micah Abbott <miabbott> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | fshaikh, kgarriso |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-07 17:36:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description

Seth Jennings 2019-04-18 21:35:09 UTC
(In reply to Seth Jennings from comment #0)
> I thought we had this fixed but I guess not. Nodes going degraded and MCO
> jammed up because node's currentConfig has been deleted.

"Degraded" is _fixed_ in the sense that the MCD keeps retrying, but this is a condition we can't really reconcile in any way, so there is little we can fix. This is caused by a drift between the installer bootstrap and the in-cluster MCO bringup.

I can't see how this failed when the release page shows a green build for that payload: https://origin-release.svc.ci.openshift.org/ You're using 4.1.0-0.okd-2019-04-18-203943, which is green.

A few things to check:
- Is this a plain install from installer master with that payload?
- Has something changed in the installer bootstrap that causes a drift from what the MCO has in-cluster?
- Is installing from that payload 100% reproducible?
- Installer version: did you build installer master and just run it, or did you grab that payload, extract the installer, and install from that payload's installer?
- Is this libvirt or AWS?

Ok, clarified on Slack that this is a bare metal install. Copying what I wrote on Slack:

> @sjenning I'm not sure how to proceed with that bug, though. It looks like the MCs generated at install bootstrap differ from the ones generated once the MCO comes up in the cluster. If you have any idea, that might help.

As per the Slack conversation, this was the result of a certificate change for the kubeconfig, so it is not something reproducible on builds (just a test, I guess). We'll follow up on GitHub/Slack to assist with the change.

Ok, turns out this is caused by skew between the installer and the MCO. Basically, the MCO assumes that it can reconstruct the exact MachineConfig that the bootstrap process constructs. If there is any mismatch, the cluster is hosed on arrival, with all nodes in an unrecoverable Degraded state.

In this particular case, I was testing a change in the installer, resulting in the skew. I'll close since I introduced the skew, but this is brittle.
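The failure mode described above can be sketched as follows. This is a minimal, hypothetical model, not the real MCD code: the class and function names are illustrative. It shows how a rendered-config name derived from the config contents makes any installer/MCO skew fatal, since the node's currentConfig no longer matches anything the in-cluster MCO can regenerate.

```python
# Hypothetical sketch of the reconcile mismatch described in this bug.
# Names (MachineConfig, reconcile) are illustrative, not the real MCD code.
import hashlib

def render_hash(ignition: str) -> str:
    """Content hash standing in for the rendered-MachineConfig name suffix."""
    return hashlib.sha256(ignition.encode()).hexdigest()[:12]

class MachineConfig:
    """Toy stand-in for a rendered MachineConfig: name derives from contents."""
    def __init__(self, ignition: str):
        self.ignition = ignition
        self.name = f"rendered-master-{render_hash(ignition)}"

def reconcile(node_current_config: str, in_cluster_rendered: MachineConfig) -> str:
    """Return the node state a daemon like the MCD would report.

    The brittleness: the in-cluster MCO is assumed to regenerate exactly the
    config the installer's bootstrap rendered. Any skew in the inputs (here,
    a different ignition payload) changes the rendered name, the node's
    currentConfig no longer resolves, and the node stays Degraded.
    """
    if node_current_config != in_cluster_rendered.name:
        return "Degraded"
    return "Done"

# Bootstrap and in-cluster MCO render from identical input: healthy.
bootstrap = MachineConfig("ignition-v1")
print(reconcile(bootstrap.name, MachineConfig("ignition-v1")))       # Done

# An installer-side change (e.g. the kubeconfig certificate change in this
# bug) alters the rendered payload: every node arrives Degraded.
print(reconcile(bootstrap.name, MachineConfig("ignition-v1-cert2"))) # Degraded
```

Since the state is derived purely from the content-addressed name, there is no partial reconciliation path: the daemon cannot tell a benign drift from a real divergence, which is why the cluster is "hosed on arrival" rather than self-healing.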