Bug 1706632
| Summary: | [upgrade] Upgrade didn't complete with one master still degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED DUPLICATE | QA Contact: | Micah Abbott <miabbott> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, cwalters, jokerman, mmccomas, nstielau, sponnaga, walters |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-05-06 20:31:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-05-05 19:59:38 UTC
Also:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1208
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1212

Looks like 3/5 runs have failed so far. The regression was likely introduced between 14 UTC and 21 UTC on 05/04 (I first wrote 5am and 9am EDT, but those times were wrong).

Reasonable ranges for when the regression was introduced:

https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.nightly/release/4.1.0-0.nightly-2019-05-04-210601?from=4.1.0-0.nightly-2019-05-04-070249

or

https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.ci/release/4.1.0-0.ci-2019-05-04-221522?from=4.1.0-0.ci-2019-05-04-132829

I see a few MCD changes and a machine-os-content change in there.

As reported in Derek's BZ, I've spotted that we report writing a file, but after the reboot the file isn't there anymore, and I'm not sure whether it's a race in the runtimes/Kubernetes/chroot: https://bugzilla.redhat.com/show_bug.cgi?id=1706606#c6

The failure, tl;dr, is:

- the MCD writes the config files and upgrades the OS, then writes a "pending config" file
- the node reboots
- the MCD starts again and uses the pending config to finalize the upgrade

What I'm witnessing is that the pending config file is no longer on disk after the reboot, and I'm not sure why that is (a sketch of the durable-write sequence this flow depends on is at the end of this report).

This PR removes all the mounts we don't need (everything except the rootfs, which we chroot into anyway): https://github.com/openshift/machine-config-operator/pull/704

My theory is that since the pending config lives under /etc/machine-config-daemon, and that path is a separate mount in the current code, we might be hitting a race; the failure isn't consistent, which fits (a way to check for such a mount boundary is sketched below).

This error and the error in Derek's BZ predate the range given in https://bugzilla.redhat.com/show_bug.cgi?id=1706632#c1 — I can see jobs failing with it on 05/03/2019: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/697/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/148/

An OS update is indeed at play here. One unfortunate thing is that we lost the original MCD logs since they apparently scrolled off. (Maybe we should back off retries in this case?) Looking into this more.

In this BZ we're definitely booted into the osImageURL for the *desired* config, not the current one. Why that is... I think Antonio is on the right track that this has something to do with the "pending config" file going wrong. But we need more logging.

Philosophically, though: since the MCD is now a controller, if we're booted into what should be the desired state, we could just carry on (a sketch of that decision is at the end of this report). My worry is reboot loops; see also https://github.com/openshift/machine-config-operator/pull/245

Pretty sure this is a dup of bug 1706606.

*** This bug has been marked as a duplicate of bug 1706606 ***
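
A minimal sketch of the pending-config write step described above, assuming hypothetical helper and file names (the real code lives in the MCD's daemon package and may use a different path). The point is the write-to-temp/fsync/rename/fsync-directory sequence: if any of those steps is skipped, a write can appear to succeed yet vanish across an immediate reboot, which is exactly the symptom reported here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePendingConfig durably records the config we expect to boot into.
// Hypothetical sketch: the path and file name are assumptions, but the
// durability steps are the point.
func writePendingConfig(dir, name string, data []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// fsync the file so the bytes reach disk before the rename.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Atomically replace the target.
	if err := os.Rename(tmp.Name(), filepath.Join(dir, name)); err != nil {
		return err
	}
	// fsync the directory so the rename itself survives a reboot.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	data := []byte(`{"desiredConfig":"rendered-master-abc"}`)
	if err := writePendingConfig("/etc/machine-config-daemon", "pending-config.json", data); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```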
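To probe the mount-race theory, a small check like the following could be run on a node or dropped into the daemon's startup logging. It detects a mount boundary by comparing device IDs; one caveat is that a bind mount of the same filesystem shares a device ID, so /proc/self/mountinfo remains the authoritative source — this is only a quick test.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// isMountPoint reports whether path sits on a different device than its
// parent directory. A "true" result means there is definitely a mount
// boundary; "false" is inconclusive for same-filesystem bind mounts.
func isMountPoint(path, parent string) (bool, error) {
	var a, b syscall.Stat_t
	if err := syscall.Stat(path, &a); err != nil {
		return false, err
	}
	if err := syscall.Stat(parent, &b); err != nil {
		return false, err
	}
	return a.Dev != b.Dev, nil
}

func main() {
	mp, err := isMountPoint("/etc/machine-config-daemon", "/etc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("separate mount: %v\n", mp)
}
```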
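And on the "just carry on" idea: a hedged sketch, with invented types and field names rather than the actual MCO API, of what the startup decision could look like. If the booted OS image already matches the desired config, the daemon finalizes instead of degrading or scheduling another reboot, which also closes off one path into the reboot loops mentioned above.

```go
package main

import "fmt"

// nodeState is an invented stand-in for what the MCD knows at startup.
type nodeState struct {
	BootedOSImageURL  string
	CurrentConfig     string
	DesiredConfig     string
	DesiredOSImageURL string
}

// action describes what the daemon should do next.
type action int

const (
	actionFinalize action = iota // booted into desired state: mark done
	actionUpdate                 // apply the desired config, then reboot
	actionDegrade                // inconsistent state: stop and report
)

// decide implements the "if we're booted into the desired state, carry on"
// idea: never schedule a reboot when the OS already matches the desired
// config, even if the pending-config file went missing.
func decide(s nodeState) action {
	if s.CurrentConfig == s.DesiredConfig {
		return actionFinalize // nothing to do
	}
	if s.BootedOSImageURL == s.DesiredOSImageURL {
		// The pending-config file may be gone, but the OS update
		// clearly happened; finalize rather than degrade or re-reboot.
		return actionFinalize
	}
	return actionUpdate
}

func main() {
	s := nodeState{
		BootedOSImageURL:  "quay.io/openshift/machine-os-content@sha256:new",
		CurrentConfig:     "rendered-master-old",
		DesiredConfig:     "rendered-master-new",
		DesiredOSImageURL: "quay.io/openshift/machine-os-content@sha256:new",
	}
	fmt.Println(decide(s)) // prints 0 (actionFinalize)
}
```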