Bug 1920027

Summary: machine-config-operator consistently failing during 4.6 to 4.7 upgrades and clusters do not install successfully with proxy configuration
Product: OpenShift Container Platform
Reporter: Fabian von Feilitzsch <fabian>
Component: Machine Config Operator
Assignee: Ben Howard <behoward>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.7
CC: ccoleman, esimard, fdeutsch, gpei, jerzhang, jhou, jima, knarra, lmohanty, mgugino, mkrejci, pmuller, tsze, weinliu, wking, wsun, yanyang, yunjiang
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: TechnicalReleaseBlocker
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1933075
Environment:
Last Closed: 2021-02-24 15:55:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1915235, 1933075, 1978041

Description Fabian von Feilitzsch 2021-01-25 15:15:48 UTC
Description of problem:
During upgrades from 4.6 to 4.7, machine-config-operator fails to progress.


Example failing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1353402225170321408

The status of the machine-config cluster operator contains:
```
      {
        "type": "Upgradeable",
        "status": "False",
        "lastTransitionTime": "2021-01-24T19:31:41Z",
        "reason": "One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading",
        "message": "Cluster operator machine-config cannot be upgraded between minor versions: "
      }
```

The MachineConfigPool status reports:

NodeDegraded: True

reason: "1 nodes are reporting degraded status on sync"

message: "Node ip-10-0-245-129.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against rendered-master-12ac9ddfdf7cec94b7b5ae2469b3fc5a: content mismatch for file \\\"/etc/systemd/system/pivot.service.d/10-mco-default-env.conf\\\"\""


Version-Release number of selected component (if applicable):
4.7


Additional info:
This is currently blocking CI and nightly payload promotion.

The first CI payload we see it show up in is here: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-01-24-175228

That payload contained this PR, which seems like the most likely culprit among all the listed changes: https://github.com/openshift/machine-config-operator/pull/2310

Comment 2 Ben Howard 2021-01-25 18:39:49 UTC
Interesting. The contents of `pivot.service.yaml` [1] are:

name: pivot.service
dropins:
  - name: 10-mco-default-env.conf
    contents: |
      {{if .Proxy -}}
      [Service]
      EnvironmentFile=/etc/mco/proxy.env
      {{end -}}

However, the diff yields `[]uint8{...}`, and when converted to text [2] you get "[Unit]".


[1] https://github.com/openshift/machine-config-operator/blob/130722159901d909a64fe9781a2ae78d96fd47e3/templates/common/_base/units/pivot.service.yaml#L1-L8
[2] https://play.golang.org/p/3pUfdk-0N8R

Comment 3 Yu Qi Zhang 2021-01-25 19:45:56 UTC
I think I know what's going on. The diff you see above is due to the [Unit] being removed in https://github.com/openshift/machine-config-operator/commit/fbf712a5fdd577d07d65bacdfe3c1bb2c46a6df7#diff-c70a0bfc46a1f4b7c0898eef8f9f84ae68875eeecd34b7e6ed2c9ef2bfdef802L5; note how the drop-in used to still contain [Unit] when it was otherwise empty.

Compound that with https://github.com/openshift/machine-config-operator/commit/0ad77557399cabc276750d35262632e04eae5da9, which skipped the write BUT did not update https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L1369

This means the write was skipped entirely (so the file stayed as it was pre-update) and the delete was skipped as well (since the drop-in is technically still in the config), so the daemon expects the file to be empty but it still has [Unit] in it.

Either an update to the write (don't skip if empty, just write an empty file) or an update to the delete (if empty, delete) should be fine.
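
A rough Go sketch of the sequence described in this comment, using an in-memory map in place of the node's filesystem. Everything here (`disk`, `dropin`, `applyBuggy`, `applyFixed`, `validate`) is a hypothetical simplification and not the actual update.go code. It also assumes that validation treats a missing file the same as an empty one, and that the old drop-in contained only "[Unit]".

```go
package main

import "fmt"

// disk is a hypothetical in-memory stand-in for files on the node.
type disk map[string]string

// dropin is a simplified model of a systemd drop-in in a rendered config.
type dropin struct {
	Path     string
	Contents string
}

// removeStale deletes drop-ins that were in the old config but are
// absent from the new one.
func removeStale(d disk, oldCfg, newCfg []dropin) {
	inNew := map[string]bool{}
	for _, n := range newCfg {
		inNew[n.Path] = true
	}
	for _, o := range oldCfg {
		if !inNew[o.Path] {
			delete(d, o.Path)
		}
	}
}

// applyBuggy skips writing empty drop-ins, but the delete step does not
// know about that skip, so stale contents survive.
func applyBuggy(d disk, oldCfg, newCfg []dropin) {
	removeStale(d, oldCfg, newCfg)
	for _, n := range newCfg {
		if n.Contents == "" {
			continue // write skipped, old file left untouched
		}
		d[n.Path] = n.Contents
	}
}

// applyFixed takes the "if empty, delete" option from the comment.
func applyFixed(d disk, oldCfg, newCfg []dropin) {
	removeStale(d, oldCfg, newCfg)
	for _, n := range newCfg {
		if n.Contents == "" {
			delete(d, n.Path)
			continue
		}
		d[n.Path] = n.Contents
	}
}

// validate compares on-disk contents with the new config. A missing file
// reads as "" here, which is an assumption of this sketch.
func validate(d disk, cfg []dropin) error {
	for _, c := range cfg {
		if d[c.Path] != c.Contents {
			return fmt.Errorf("content mismatch for file %q", c.Path)
		}
	}
	return nil
}

func main() {
	path := "/etc/systemd/system/pivot.service.d/10-mco-default-env.conf"
	oldCfg := []dropin{{Path: path, Contents: "[Unit]\n"}}
	newCfg := []dropin{{Path: path, Contents: ""}}

	buggy := disk{path: "[Unit]\n"}
	applyBuggy(buggy, oldCfg, newCfg)
	fmt.Println("buggy:", validate(buggy, newCfg))

	fixed := disk{path: "[Unit]\n"}
	applyFixed(fixed, oldCfg, newCfg)
	fmt.Println("fixed:", validate(fixed, newCfg))
}
```

The other option from the comment (always write the file, even when empty) would amount to dropping the `continue` branch in `applyBuggy`.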

Comment 4 Ben Howard 2021-01-27 22:55:32 UTC
*** Bug 1920483 has been marked as a duplicate of this bug. ***

Comment 5 RamaKasturi 2021-01-28 06:35:00 UTC
Hit a similar issue on profile 14_Disconnected IPI on Azure & Private Cluster. Below is the link to the must-gather.

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1920027/must-gather.local.4604239068622020283.tar.gz

Comment 6 Michael Gugino 2021-01-29 15:39:19 UTC
*** Bug 1922238 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2021-01-29 22:20:20 UTC
*** Bug 1922127 has been marked as a duplicate of this bug. ***

Comment 9 Gaoyun Pei 2021-02-01 03:41:28 UTC
We may have forgotten to do the same drop-in file separation for the cri-o service as was done in https://github.com/openshift/machine-config-operator/pull/2365.

In a proxy-enabled cluster, the cri-o service did not get the expected "EnvironmentFile=/etc/mco/proxy.env" configured in the 10-mco-default-env.conf drop-in file:
[root@control-plane-0 crio.service.d]# cat 10-mco-default-env.conf 
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"

Comment 11 jima 2021-02-03 03:18:38 UTC
Verified on upi-on-vsphere behind a proxy with 4.7.0-0.nightly-2021-02-02-223803. Fresh installation is successful.

Comment 12 Ben Howard 2021-02-03 18:17:04 UTC
*** Bug 1922187 has been marked as a duplicate of this bug. ***

Comment 13 Weinan Liu 2021-02-04 07:32:14 UTC
4.6.13-x86_64	4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64	4.7.0-0.nightly-2021-02-03-165316
Verified to be fixed for both upgrade paths.

Comment 14 Gaoyun Pei 2021-02-04 07:36:57 UTC
Upgrade of a proxy-enabled 4.6 cluster from 4.6.16 to 4.7.0-0.nightly-2021-02-04-012305 finished successfully.
The machine-config operator did not report a degraded state.

Comment 15 To Hung Sze 2021-02-09 22:39:51 UTC
*** Bug 1926474 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2021-02-24 15:55:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 19 W. Trevor King 2021-04-05 17:48:08 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475