+++ This bug was initially created as a clone of Bug #1773870 +++

Description of problem: After an upgrade from 4.1.23 -> 4.2.4, the network operator is found to be crashlooping.

This bug should have been fixed by having a component migrate the status automatically when we changed the schema. The team that makes schema changes to core config status is responsible for migrating and backfilling values like this. We need to prevent this from happening globally, not just operator by operator. We need to review all global config changes for 4.1 to 4.2 and 4.2 to 4.3 to ensure this doesn't happen across all components.

This is assigned to installer because they set the field, but all impacted teams need to review. Marking urgent: we broke our API contract, it broke users, and we now have 4.3 upgrades that may never have been tested against a 4.1-installed system. This may not be deferred from 4.3 before GA.
All components need to:
1. Review their status or spec fields in global config that changed from 4.1.0 onward
2. Identify any fields that should have been filled in during/after upgrade
3. Test a 4.1 to 4.2 to 4.3 upgrade on any configuration that is relevant
4. Normalize current status to the appropriate value.
We also need a CI job that installs 4.1 and upgrades serially.
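To make step 4 above concrete, here is a minimal sketch of what backfilling a missing status value could look like. It works on a plain parsed dict rather than a real client, and it assumes the 4.1-era object carries the platform in a legacy `status.platform` field while 4.2 consumers read `status.platformStatus.type`; that mapping is an assumption for illustration, not a description of what any operator actually does. `backfill_platform_status` is a hypothetical helper name.

```python
import copy


def backfill_platform_status(infra: dict) -> dict:
    """Return a copy of a parsed Infrastructure object with
    status.platformStatus.type filled in from status.platform when missing.

    Assumption: status.platform is the legacy field set at install time.
    Other platformStatus details (e.g. a region) cannot be derived from it
    and are deliberately left alone.
    """
    result = copy.deepcopy(infra)
    status = result.get("status") or {}
    legacy = status.get("platform")
    platform_status = status.get("platformStatus") or {}
    # Only write when there is a legacy value and nothing to overwrite.
    if legacy and not platform_status.get("type"):
        platform_status["type"] = legacy
        status["platformStatus"] = platform_status
        result["status"] = status
    return result


# Made-up example of an object as it might look after a 4.1 install.
upgraded = {"kind": "Infrastructure", "status": {"platform": "AWS"}}
migrated = backfill_platform_status(upgraded)
print(migrated["status"]["platformStatus"])
```

The function is idempotent and never overwrites an existing value, so it would be safe to run on every reconcile rather than as a one-shot migration.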
In case it falls under the same scope, BZ 1786246 is another case where 4.1 settings break under 4.2. In that case the Jenkins image can no longer be pulled with settings that were set under 4.1 (presumably by the installer). Ideally it would be caught by a CI job such as the one proposed. But it's hard to test everything so I wanted to raise awareness in case it's in scope.
*** Bug 1779299 has been marked as a duplicate of this bug. ***
Comparing 4.2->4.3 vs. 4.3 (using 4.3->4.3 as a stand-in for a raw 4.3 job, because it's easier for me to find update jobs). Looking at *->4.3.3 update CI [1] and picking successful jobs: * 4.2.20 -> 4.3.3 [2]. Drilling down to the config.openshift.io directory [3]. * 4.3.0 -> 4.3.3 [4]. Drilling down to the config.openshift.io directory [5]. Fetching: $ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/ $ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/ Diffing, and removing fields with timestamps and other expected divergence: $ diff -ru storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/* | $ grep '^+\|^-' | grep -v '^---\|^+++\|resourceVersion\|creationTimestamp\|lastTransitionTime:\| uid:\|apiServerInternalURI:\|apiServerURL:\|infrastructureName:\|versionHash:\|clusterID:\|consoleURL:\|baseDomain:\|ci-op-\|message:\|lastReportTime:\|domain:\|etcdDiscoveryDomain:\|startedTime:\|completionTime:\|are at latest configuration\|/release@sha256' | sort | uniq +--- - 4.3.3 not found in the "stable-4.2" channel' + 4.3.3 not found in the "stable-4.3" channel' - channel: stable-4.2 - channel: stable-4.2 + channel: stable-4.3 + channel: stable-4.3 - finishes. - finishes. 
- image: registry.svc.ci.openshift.org/ocp/release:4.3.3 - image: registry.svc.ci.openshift.org/ocp/release:4.3.3 - name: certified-operators + name: certified-operators - name: community-operators + name: community-operators - - name: operator - - name: operator + - name: operator + - name: operator - name: redhat-operators + name: redhat-operators - not found in the "stable-4.2" channel' + not found in the "stable-4.3" channel' - reason: RollOutInProgress - reason: RollOutInProgress - region: us-east-2 - region: us-east-2 + region: us-west-1 + region: us-west-1 - verified: false - verified: false + verified: true + verified: true - version: 4.2.20 - version: 4.2.20 + version: 4.3.0 + version: 4.3.0 - version: 4.3.3 - version: 4.3.3 + version: 4.3.3 + version: 4.3.3 And the bulk of those changes are just unstable ordering. For example: $ diff -u $(grep -rl redhat-operators storage.googleapis.com) --- storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/operatorhubs.yaml 2020-02-19 13:59:43.000000000 -0800 +++ storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/operatorhubs.yaml 2020-02-19 10:55:21.000000000 -0800 @@ -6,26 +6,26 @@ metadata: annotations: release.openshift.io/create-only: "true" - creationTimestamp: "2020-02-19T20:51:51Z" + creationTimestamp: "2020-02-19T17:42:22Z" generation: 1 name: cluster - resourceVersion: "23400" + resourceVersion: "47685" selfLink: /apis/config.openshift.io/v1/operatorhubs/cluster - uid: e921c768-d891-4ea9-9047-2099d9a7c912 + uid: 
2eedf533-533f-11ea-bf80-02b1edf5936c spec: {} status: sources: - disabled: false - name: community-operators - status: Success - - disabled: false name: redhat-operators status: Success - disabled: false name: certified-operators status: Success + - disabled: false + name: community-operators + status: Success kind: OperatorHubList metadata: continue: "" - resourceVersion: "43003" + resourceVersion: "53562" selfLink: /apis/config.openshift.io/v1/operatorhubs Still would be nice to check 4.3->4.3, 4.1->4.2->4.3->4.4, etc. [1]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.3.3#upgrades-from [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062 [3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/ [4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080 [5]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/
Repeating the above with *->4.2.4 [1]. * 4.1.23 -> 4.2.4 [2]. Drilling down to the config.openshift.io directory [3]. * 4.2.2 -> 4.2.4 [4]. Drilling down to the config.openshift.io directory [5]. Fetching: $ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/ $ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/ Diffing, and removing fields with timestamps and other expected divergence (and removing a $ from '$ grep' from a sloppy copy/paste from my previous comment): $ diff -ru storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/* | grep '^+\|^-' | grep -v '^---\|^+++\|resourceVersion\|creationTimestamp\|lastTransitionTime:\| uid:\|apiServerInternalURI:\|apiServerURL:\|infrastructureName:\|versionHash:\|clusterID:\|consoleURL:\|baseDomain:\|ci-op-\|message:\|lastReportTime:\|domain:\|etcdDiscoveryDomain:\|startedTime:\|completionTime:\|are at latest configuration\|/release@sha256' | sort | uniq - (0.3) - 4.2.4 not found in the "stable-4.1" channel' + 4.2.4 not found in the "stable-4.2" channel' - annotations: - annotations: + aws: + aws: - channel: stable-4.1 - channel: stable-4.1 + channel: stable-4.2 + channel: stable-4.2 + externalIP: - - group: cloudcredential.openshift.io - - group: cloudcredential.openshift.io + mastersSchedulable: false + name: "" + name: "" - name: certified-operators + name: certified-operators - name: 
community-operators + name: community-operators - - name: kube-apiserver - - name: kube-apiserver + - name: kube-apiserver + - name: kube-apiserver - - name: kube-controller-manager - - name: kube-controller-manager + - name: kube-controller-manager + - name: kube-controller-manager - - name: oauth-openshift - - name: oauth-openshift + - name: oauth-openshift + - name: oauth-openshift - name: openshift-machine-api - name: openshift-machine-api - name: redhat-operators + name: redhat-operators - namespace: openshift-cloud-credential-operator - namespace: openshift-cloud-credential-operator - not found in the "stable-4.1" channel' + not found in the "stable-4.2" channel' + platformStatus: + platformStatus: + policy: {} + policy: - reason: AsExpected - reason: AsExpected + reason: AsExpected + reason: AsExpected - reason: OperandTransitionsSucceeding - reason: OperandTransitionsSucceeding + reason: OperandTransitionsSucceeding + reason: OperandTransitionsSucceeding + region: us-east-1 + region: us-east-1 - release.openshift.io/create-only: "true" - release.openshift.io/create-only: "true" - resource: CredentialsRequest - resource: CredentialsRequest - spec: {} -spec: {} + spec: +spec: + status: {} +status: {} - status: "False" - status: "False" + status: "False" + status: "False" - status: "True" - status: "True" + status: "True" + status: "True" + trustedCA: + trustedCA: + type: AWS + type: AWS - type: Degraded - type: Degraded + type: Degraded + type: Degraded - type: Progressing - type: Progressing + type: Progressing + type: Progressing - type: Upgradeable - type: Upgradeable + type: Upgradeable + type: Upgradeable - version: 1.14.6 - version: 1.14.6 + version: 1.14.6 + version: 1.14.6 - version: 4.1.23 - version: 4.1.23 + version: 4.2.2 + version: 4.2.2 - version: 4.2.4_openshift - version: 4.2.4_openshift + version: 4.2.4_openshift + version: 4.2.4_openshift so you can see the platformStatus bit from bug 1773870. 
[1]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.2.4#upgrades-from [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737 [3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/ [4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738 [5]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/
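The diff/grep pipelines in these comparisons are noisy mostly because of per-cluster fields and unstable list ordering. A rough sketch of a structural comparison that sidesteps both is below. It assumes the YAML has already been parsed into dicts and lists; the `VOLATILE_KEYS` set is a partial, illustrative copy of the grep -v filter, the helper names are hypothetical, and the sample objects are made up (only the field names are taken from the diff output).

```python
# Keys expected to differ between any two clusters (illustrative subset).
VOLATILE_KEYS = {
    "resourceVersion", "creationTimestamp", "lastTransitionTime", "uid",
    "clusterID", "infrastructureName", "apiServerURL", "message",
}


def normalize(value):
    """Drop volatile keys and sort lists of named entries by name."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()
                if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        if all(isinstance(i, dict) and "name" in i for i in items):
            items.sort(key=lambda i: str(i["name"]))
        return items
    return value


def key_paths(value, prefix=""):
    """Collect dotted paths of every key present in a parsed object."""
    paths = set()
    if isinstance(value, dict):
        for k, v in value.items():
            path = f"{prefix}.{k}" if prefix else k
            paths.add(path)
            paths |= key_paths(v, path)
    elif isinstance(value, list):
        for v in value:
            paths |= key_paths(v, prefix + "[]")
    return paths


def missing_keys(upgraded, greenfield):
    """Paths present in the fresh install but absent after an upgrade."""
    return sorted(key_paths(normalize(greenfield))
                  - key_paths(normalize(upgraded)))


upgraded = {"metadata": {"name": "cluster", "uid": "a"},
            "status": {"platform": "AWS"}}
greenfield = {"metadata": {"name": "cluster", "uid": "b"},
              "status": {"platform": "AWS",
                         "platformStatus": {"type": "AWS",
                                            "aws": {"region": "us-east-1"}}}}
print(missing_keys(upgraded, greenfield))
```

Comparing key paths instead of values means differences like region or channel do not show up at all, which is the point here: the interesting signal is a field that a fresh install has and an upgraded cluster lacks.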
We also have chained update jobs, although on nightlies, not release candidates. E.g. here's 4.1->4.2->4.3 [1]. Nothing in that chain includes 4.4 yet, but we can probably bump that since we've had 4.4 nightlies for a while now. [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/10
We've performed a one-time audit of config drift between upgraded and greenfield installations and found only the platform issue already identified. The team will continue the upgrade chaining work that Clayton started in the 4.1 to 4.2 to 4.3 upgrade jobs, and that work will be tracked via Jira. If not completed ahead of 4.5 we'll track this as a 4.5 blocker bug to audit again.
We now have 4.1->4.2->4.3->4.4 CI since [1]. [1]: https://github.com/openshift/release/pull/7230
We have CI jobs that test this now.