Description of problem:
During upgrades from 4.6 to 4.7, machine-config-operator fails to progress.

Example failing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408

The status of the machine-config cluster operator contains:

```
{
  "type": "Upgradeable",
  "status": "False",
  "lastTransitionTime": "2021-01-24T19:31:41Z",
  "reason": "One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading",
  "message": "Cluster operator machine-config cannot be upgraded between minor versions: "
}
```

The MachineConfigPool status reports:

NodeDegraded: True
reason: "1 nodes are reporting degraded status on sync"
message: "Node ip-10-0-245-129.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-12ac9ddfdf7cec94b7b5ae2469b3fc5a: content mismatch for file \\\"/etc/systemd/system/pivot.service.d/10-mco-default-env.conf\\\"\""

Version-Release number of selected component (if applicable):
4.7

Additional info:
This is currently blocking CI and nightly payload promotion. The first CI payload in which we see it show up is:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-01-24-175228

That payload contained this PR, which seems like the most likely culprit among all the listed changes:
https://github.com/openshift/machine-config-operator/pull/2310
At a glance, I think this is most likely a regression from https://github.com/openshift/machine-config-operator/commit/fbf712a5fdd577d07d65bacdfe3c1bb2c46a6df7#diff-9c6641c1f9cfb0c678ea58b1a913a5d0d528e98047e04827dcf28e9d6e51e8ee, based on these machine-config-daemon logs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-549x6_machine-config-daemon.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-cf5k6_machine-config-daemon.log

Assigning to Ben.
Interesting. The contents of `pivot.service.yaml` [1] are:

name: pivot.service
dropins:
  - name: 10-mco-default-env.conf
    contents: |
      {{if .Proxy -}}
      [Service]
      EnvironmentFile=/etc/mco/proxy.env
      {{end -}}

However, the diff yields `[]uint8{...}`, and when converted to text you get "[Unit]" [2].

[1] https://github.com/openshift/machine-config-operator/blob/130722159901d909a64fe9781a2ae78d96fd47e3/templates/common/_base/units/pivot.service.yaml#L1-L8
[2] https://play.golang.org/p/3pUfdk-0N8R
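To make the mismatch concrete, here is a rough Go sketch that renders the drop-in template without a proxy and compares it with bytes that decode to "[Unit]". The `renderConfig` type and `renderDropin` helper are hypothetical, the template's line breaks are assumed, and the byte slice is illustrative rather than copied from the daemon log.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// dropinTemplate reproduces the drop-in body from pivot.service.yaml;
// the exact line breaks are an assumption.
const dropinTemplate = "{{if .Proxy -}}\n[Service]\nEnvironmentFile=/etc/mco/proxy.env\n{{end -}}\n"

// renderConfig is a hypothetical stand-in for the real template data.
type renderConfig struct {
	Proxy bool
}

// renderDropin renders the drop-in body for the given config.
func renderDropin(cfg renderConfig) (string, error) {
	tmpl, err := template.New("dropin").Parse(dropinTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, cfg); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Without a proxy the template renders to an empty string.
	expected, err := renderDropin(renderConfig{Proxy: false})
	if err != nil {
		panic(err)
	}

	// Illustrative bytes for "[Unit]", in the []uint8 form a Go diff prints.
	onDisk := []uint8{0x5b, 0x55, 0x6e, 0x69, 0x74, 0x5d}

	fmt.Printf("expected: %q\n", expected)
	fmt.Printf("on disk:  %q\n", string(onDisk))
	fmt.Println("match:", bytes.Equal([]byte(expected), onDisk))
}
```

Under these assumptions the expected content is empty while the file on disk is not, which is the kind of difference a content-mismatch check would flag.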
I think I know what's going on. The diff you see above is due to the [Unit] being removed in https://github.com/openshift/machine-config-operator/commit/fbf712a5fdd577d07d65bacdfe3c1bb2c46a6df7#diff-c70a0bfc46a1f4b7c0898eef8f9f84ae68875eeecd34b7e6ed2c9ef2bfdef802L5; see how the drop-in used to still contain [Unit] when it was otherwise empty.

Compound that with https://github.com/openshift/machine-config-operator/commit/0ad77557399cabc276750d35262632e04eae5da9, which skipped the write BUT did not update https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L1369

That means the write was skipped entirely (so the file stayed the same as it was pre-update) and the delete was skipped too (since the drop-in is technically still there). The daemon therefore expects the file to be empty, but it still has [Unit] in it.

Either an update to the write (don't skip if empty, just write an empty file) or an update to the delete (if empty, delete) should be fine.
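The failure mode and the "if empty, delete" option can be sketched in Go as follows. This is a simplified model, not the daemon's code: `dropin`, `disk`, `applySkippingEmpty`, `applyDeletingEmpty` and `validate` are all hypothetical names, the filesystem is an in-memory map, and the validation is assumed to accept a missing file when the expected content is empty.

```go
package main

import "fmt"

// dropin is a hypothetical stand-in for a systemd drop-in in a rendered config.
type dropin struct {
	Name     string
	Contents string
}

// disk is a hypothetical in-memory stand-in for the node's filesystem.
type disk map[string]string

// applySkippingEmpty models the behaviour described in this comment:
// an empty drop-in is not written, and nothing removes the stale file.
func applySkippingEmpty(d disk, path string, in dropin) {
	if in.Contents == "" {
		return // write skipped; the pre-update file survives
	}
	d[path] = in.Contents
}

// applyDeletingEmpty models one of the suggested fixes: if empty, delete.
func applyDeletingEmpty(d disk, path string, in dropin) {
	if in.Contents == "" {
		delete(d, path)
		return
	}
	d[path] = in.Contents
}

// validate compares on-disk content with the expected drop-in. Treating a
// missing file as valid for an empty drop-in is an assumption of this sketch.
func validate(d disk, path string, in dropin) error {
	got, exists := d[path]
	if !exists && in.Contents == "" {
		return nil
	}
	if got != in.Contents {
		return fmt.Errorf("content mismatch for file %q", path)
	}
	return nil
}

func main() {
	const path = "/etc/systemd/system/pivot.service.d/10-mco-default-env.conf"
	updated := dropin{Name: "10-mco-default-env.conf", Contents: ""}

	stale := disk{path: "[Unit]"} // file left behind by the pre-update config
	applySkippingEmpty(stale, path, updated)
	fmt.Println("skip empty:  ", validate(stale, path, updated))

	cleaned := disk{path: "[Unit]"}
	applyDeletingEmpty(cleaned, path, updated)
	fmt.Println("delete empty:", validate(cleaned, path, updated))
}
```

The other suggested option, writing an empty file instead of skipping, would amount to dropping the early return so that `d[path]` is always overwritten.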
*** Bug 1920483 has been marked as a duplicate of this bug. ***
Hit a similar issue on profile 14_Disconnected IPI on Azure & Private Cluster. Below is the link to the must-gather:
http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1920027/must-gather.local.4604239068622020283.tar.gz
*** Bug 1922238 has been marked as a duplicate of this bug. ***
*** Bug 1922127 has been marked as a duplicate of this bug. ***
We might have forgotten to do the same drop-in file separation for the cri-o service as was done in https://github.com/openshift/machine-config-operator/pull/2365. In a proxy-enabled cluster, the cri-o service didn't get the expected "EnvironmentFile=/etc/mco/proxy.env" configured in the 10-mco-default-env.conf drop-in file:

[root@control-plane-0 crio.service.d]# cat 10-mco-default-env.conf
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"
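A check along these lines could catch the missing directive; this is only a sketch, with `hasDirective` being a hypothetical helper and the drop-in content copied from the `cat` output in this comment.

```go
package main

import (
	"fmt"
	"strings"
)

// hasDirective reports whether a systemd drop-in contains the exact
// directive line, ignoring surrounding whitespace (hypothetical helper).
func hasDirective(contents, directive string) bool {
	for _, line := range strings.Split(contents, "\n") {
		if strings.TrimSpace(line) == directive {
			return true
		}
	}
	return false
}

func main() {
	// Content of crio.service.d/10-mco-default-env.conf as reported.
	crioDropin := "[Service]\nEnvironment=\"GODEBUG=x509ignoreCN=0,madvdontneed=1\"\n"

	fmt.Println("has proxy env file:",
		hasDirective(crioDropin, "EnvironmentFile=/etc/mco/proxy.env"))
}
```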
Verified on upi-on-vsphere behind proxy with 4.7.0-0.nightly-2021-02-02-223803. Fresh installation is successful.
*** Bug 1922187 has been marked as a duplicate of this bug. ***
4.6.13-x86_64 to 4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64 to 4.7.0-0.nightly-2021-02-03-165316

Verified to be fixed for both of the paths.
Upgrade 4.6 cluster with proxy enabled from 4.6.16 to 4.7.0-0.nightly-2021-02-04-012305 finished successfully. No machine-config operator degraded issue.
*** Bug 1926474 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475