1920027 – machine-config-operator consistently failing during 4.6 to 4.7 upgrades and clusters do not install successfully with proxy configuration


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Ben Howard
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:	TechnicalReleaseBlocker
Duplicates (5):	1920483 1922127 1922187 1922238 1926474
Depends On:
Blocks:	1915235 1933075 1978041

Reported:	2021-01-25 15:15 UTC by Fabian von Feilitzsch
Modified:	2021-07-01 03:53 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1933075
Environment:
Last Closed:	2021-02-24 15:55:53 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2363	None	closed	Bug 1920027: daemon: handle zero-length dropins/units	2021-02-18 16:07:55 UTC
Github	openshift machine-config-operator pull 2365	None	closed	Bug 1920027: use separate dropin files for kubelet	2021-02-18 16:07:55 UTC
Github	openshift machine-config-operator pull 2378	None	closed	Bug 1920027: templates: split crio dropins into separate files	2021-02-18 16:07:54 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:56:25 UTC

Description Fabian von Feilitzsch 2021-01-25 15:15:48 UTC

Description of problem:
During upgrades from 4.6 to 4.7, machine-config-operator fails to progress.


Example failing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408

The status of the machine-config cluster operator contains:
```
      {
        "type": "Upgradeable",
        "status": "False",
        "lastTransitionTime": "2021-01-24T19:31:41Z",
        "reason": "One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading",
        "message": "Cluster operator machine-config cannot be upgraded between minor versions: "
      }
```

The MachineConfigPool status reports:

NodeDegraded: True

reason: "1 nodes are reporting degraded status on sync"

message: "Node ip-10-0-245-129.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-12ac9ddfdf7cec94b7b5ae2469b3fc5a: content mismatch for file \\\"/etc/systemd/system/pivot.service.d/10-mco-default-env.conf\\\"\""


Version-Release number of selected component (if applicable):
4.7


Additional info:
This is currently blocking CI and nightly payload promotion.

The first CI payload we see it show up in is here: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-01-24-175228

That payload contained this PR, which seems like the most likely culprit among all the listed changes: https://github.com/openshift/machine-config-operator/pull/2310

Comment 1 Yu Qi Zhang 2021-01-25 17:31:20 UTC

At a glance, I think this is most likely a regression from https://github.com/openshift/machine-config-operator/commit/fbf712a5fdd577d07d65bacdfe3c1bb2c46a6df7#diff-9c6641c1f9cfb0c678ea58b1a913a5d0d528e98047e04827dcf28e9d6e51e8ee, based on these machine-config-daemon logs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-549x6_machine-config-daemon.log

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-cf5k6_machine-config-daemon.log

Assigning to ben

Comment 2 Ben Howard 2021-01-25 18:39:49 UTC

Interesting. The content of `pivot.service.yaml` [1] is:

name: pivot.service
dropins:
  - name: 10-mco-default-env.conf
    contents: |
      {{if .Proxy -}}
      [Service]
      EnvironmentFile=/etc/mco/proxy.env
      {{end -}}

However, the diff yields `[]uint8{...}`, and when converted to text you get "[Unit]" [2].


[1] https://github.com/openshift/machine-config-operator/blob/130722159901d909a64fe9781a2ae78d96fd47e3/templates/common/_base/units/pivot.service.yaml#L1-L8
[2] https://play.golang.org/p/3pUfdk-0N8R
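
A minimal Go sketch of the two observations above, assuming the dropin contents are rendered on their own with `text/template`. The `renderConfig` type and `renderDropin` helper are hypothetical stand-ins, not the MCO's actual rendering code. It shows that the template produces zero bytes when no proxy is configured, and how a `[]uint8` prints as text.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// dropinTemplate mirrors the dropin contents quoted above.
const dropinTemplate = `{{if .Proxy -}}
[Service]
EnvironmentFile=/etc/mco/proxy.env
{{end -}}
`

// renderConfig is a hypothetical stand-in for the data passed to the template.
type renderConfig struct {
	Proxy *string // nil means no proxy is configured
}

// renderDropin executes the template and returns the rendered dropin text.
func renderDropin(cfg renderConfig) (string, error) {
	tmpl, err := template.New("dropin").Parse(dropinTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, cfg); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Without a proxy the {{if}} body is dropped and the trim markers
	// remove the remaining whitespace, leaving a zero-length dropin.
	noProxy, err := renderDropin(renderConfig{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("without proxy: %q (len %d)\n", noProxy, len(noProxy))

	proxyURL := "http://proxy.example:3128" // illustrative value only
	withProxy, err := renderDropin(renderConfig{Proxy: &proxyURL})
	if err != nil {
		panic(err)
	}
	fmt.Printf("with proxy: %q\n", withProxy)

	// A []uint8 is a []byte; converting it to a string shows the text form.
	onDisk := []uint8{'[', 'U', 'n', 'i', 't', ']'}
	fmt.Printf("on disk as text: %q\n", string(onDisk))
}
```

So on a cluster without a proxy the rendered config expects an empty dropin, while the file left on disk by the older template is not empty.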

Comment 3 Yu Qi Zhang 2021-01-25 19:45:56 UTC

I think I know what's going on. The diff you see above is due to the [Unit] being removed in https://github.com/openshift/machine-config-operator/commit/fbf712a5fdd577d07d65bacdfe3c1bb2c46a6df7#diff-c70a0bfc46a1f4b7c0898eef8f9f84ae68875eeecd34b7e6ed2c9ef2bfdef802L5; see how the dropin used to still have [Unit] in it when it was otherwise empty.

Compound that with https://github.com/openshift/machine-config-operator/commit/0ad77557399cabc276750d35262632e04eae5da9, which skipped the write BUT did not update https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L1369

That means the write was skipped entirely (so the file stayed the same as it was pre-update) and the delete was skipped too (since the dropin is technically still there), so the daemon thinks the file should be empty but it still has [Unit] in it.

Either an update to the write (don't skip if empty, just write empty file) or an update to delete (if empty, delete) should be fine.
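
The failure mode and the "if empty, delete" remedy can be sketched in Go with an in-memory map standing in for the node's filesystem. All names here (`dropin`, `disk`, `applyBuggy`, `applyFixed`, `validate`) are hypothetical and simplified; this is not the code in `pkg/daemon/update.go`, and treating a missing file as equal to an empty dropin is an assumption of the sketch.

```go
package main

import "fmt"

// dropin is a simplified, hypothetical view of a dropin in the rendered config.
type dropin struct {
	Path     string
	Contents string
}

// disk is an in-memory stand-in for the files on the node.
type disk map[string]string

// applyBuggy models the behaviour described above: zero-length dropins are
// skipped on write, and nothing removes the stale file.
func applyBuggy(d disk, dropins []dropin) {
	for _, di := range dropins {
		if di.Contents == "" {
			continue // write skipped, so the old contents stay on disk
		}
		d[di.Path] = di.Contents
	}
}

// applyFixed models one suggested remedy: if the dropin is empty, delete it.
func applyFixed(d disk, dropins []dropin) {
	for _, di := range dropins {
		if di.Contents == "" {
			delete(d, di.Path)
			continue
		}
		d[di.Path] = di.Contents
	}
}

// validate returns the paths whose on-disk contents differ from the config.
// A missing file reads as "" and therefore matches an empty dropin.
func validate(d disk, dropins []dropin) []string {
	var mismatched []string
	for _, di := range dropins {
		if d[di.Path] != di.Contents {
			mismatched = append(mismatched, di.Path)
		}
	}
	return mismatched
}

func main() {
	path := "/etc/systemd/system/pivot.service.d/10-mco-default-env.conf"
	// The new rendered config expects an empty dropin (no proxy configured).
	want := []dropin{{Path: path, Contents: ""}}

	// The old template left "[Unit]" behind on disk.
	buggy := disk{path: "[Unit]"}
	applyBuggy(buggy, want)
	fmt.Println("buggy mismatches:", len(validate(buggy, want)))

	fixed := disk{path: "[Unit]"}
	applyFixed(fixed, want)
	fmt.Println("fixed mismatches:", len(validate(fixed, want)))
}
```

The other suggested remedy, writing an empty file instead of skipping the write, would clear the mismatch in the same way under this model.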

Comment 4 Ben Howard 2021-01-27 22:55:32 UTC

*** Bug 1920483 has been marked as a duplicate of this bug. ***

Comment 5 RamaKasturi 2021-01-28 06:35:00 UTC

Hit a similar issue on profile 14_Disconnected IPI on Azure & Private Cluster. The must-gather is linked below.

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1920027/must-gather.local.4604239068622020283.tar.gz

Comment 6 Michael Gugino 2021-01-29 15:39:19 UTC

*** Bug 1922238 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2021-01-29 22:20:20 UTC

*** Bug 1922127 has been marked as a duplicate of this bug. ***

Comment 9 Gaoyun Pei 2021-02-01 03:41:28 UTC

We may have forgotten to do the same work of separating dropin files for the cri-o service as was done in https://github.com/openshift/machine-config-operator/pull/2365.

In a proxy-enabled cluster, the cri-o service didn't get the expected "EnvironmentFile=/etc/mco/proxy.env" configured in the 10-mco-default-env.conf dropin file:
[root@control-plane-0 crio.service.d]# cat 10-mco-default-env.conf 
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"
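
A check like the one above can be scripted. The Go sketch below uses a hypothetical helper, `hasDirective`, that scans a dropin's text for a `key=value` line inside a given section. It is deliberately small and ignores systemd's quoting and line-continuation rules, so treat it as an illustration rather than a full unit-file parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasDirective reports whether the dropin text sets key=value inside section.
// Simplified on purpose: no quoting rules, no line continuations.
func hasDirective(contents, section, key, value string) bool {
	current := ""
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // blank line or comment
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			current = line[1 : len(line)-1] // entering a new section
			continue
		}
		k, v, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		if current == section && strings.TrimSpace(k) == key && strings.TrimSpace(v) == value {
			return true
		}
	}
	return false
}

func main() {
	// The dropin contents as observed on the proxy-enabled node above.
	observed := "[Service]\nEnvironment=\"GODEBUG=x509ignoreCN=0,madvdontneed=1\"\n"
	ok := hasDirective(observed, "Service", "EnvironmentFile", "/etc/mco/proxy.env")
	fmt.Println("proxy env file configured:", ok)
}
```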

Comment 11 jima 2021-02-03 03:18:38 UTC

Verified on upi-on-vsphere behind a proxy with 4.7.0-0.nightly-2021-02-02-223803. Fresh installation is successful.

Comment 12 Ben Howard 2021-02-03 18:17:04 UTC

*** Bug 1922187 has been marked as a duplicate of this bug. ***

Comment 13 Weinan Liu 2021-02-04 07:32:14 UTC

4.6.13-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
Verified to be fixed for both of these upgrade paths.

Comment 14 Gaoyun Pei 2021-02-04 07:36:57 UTC

Upgrading a proxy-enabled 4.6 cluster from 4.6.16 to 4.7.0-0.nightly-2021-02-04-012305 finished successfully.
The machine-config operator did not report a degraded state.

Comment 15 To Hung Sze 2021-02-09 22:39:51 UTC

*** Bug 1926474 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2021-02-24 15:55:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 19 W. Trevor King 2021-04-05 17:48:08 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
