Bug 1787765

Summary: Audit for schema/defaulting config changes between releases
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Installer
Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: high
Priority: high
CC: aconstan, adahiya, arghosh, ccoleman, ChetRHosey, jmalde, lmohanty, nagrawal, openshift-bugs-escalate, ricarril, ssadhale, svaughn, trees, wking, zzhao
Version: 4.3.0
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1773870
Environment:
Last Closed: 2020-04-27 17:37:14 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1773870
Bug Blocks: 1778235, 1781558

Description Clayton Coleman 2020-01-05 06:46:53 UTC
+++ This bug was initially created as a clone of Bug #1773870 +++

Description of problem:

After an upgrade from 4.1.23 -> 4.2.4, the network operator is found to be crashlooping.


---


This bug should have been fixed by having a component migrate the status automatically when we changed the schema.  The team that makes schema changes to core config status is responsible for migrating and backfilling values like this.

We need to prevent this from happening globally, not just operator by operator.

We need to review all global config changes for 4.1 to 4.2 and 4.2 to 4.3 to ensure this doesn’t happen across all components.

This is assigned to installer because they set the field, but all impacted teams need to review.

Marking urgent: we broke our API contract, it broke users, and we now have 4.3 upgrades that may never have been tested against a system originally installed with 4.1.

This may not be deferred out of 4.3 before GA.

Comment 1 Clayton Coleman 2020-01-05 16:30:42 UTC
All components need to:

1. Review their status or spec fields in global config that changed from 4.1.0 onward
2. Identify any fields that should have been filled in during/after upgrade
3. Test a 4.1 to 4.2 to 4.3 upgrade on any configuration that is relevant
4. Normalize current status to the appropriate value.
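
Steps 1 and 2 can be approximated mechanically. The following is a minimal sketch, not an existing tool: `yaml_keys` and `keys_missing_after_upgrade` are hypothetical helper names, and the comparison is a line-based approximation rather than a real YAML parse. It lists keys that appear in a dump from a fresh install but not in a dump from an upgraded cluster, which are candidates for backfilling.

```shell
# Hypothetical helper: print the sorted, unique set of YAML keys in a file.
# Line-based approximation only; it ignores nesting and is not a YAML parser.
yaml_keys() {
  sed -n 's|^[ -]*\([A-Za-z][A-Za-z0-9_./-]*\):.*|\1|p' "$1" | LC_ALL=C sort -u
}

# Hypothetical helper: keys present in the fresh-install dump ($2)
# but absent from the upgraded-cluster dump ($1).
keys_missing_after_upgrade() {
  kmau_upgraded=$(mktemp)
  kmau_fresh=$(mktemp)
  yaml_keys "$1" > "$kmau_upgraded"
  yaml_keys "$2" > "$kmau_fresh"
  # comm -13 prints only the lines unique to the second file.
  LC_ALL=C comm -13 "$kmau_upgraded" "$kmau_fresh"
  rm -f "$kmau_upgraded" "$kmau_fresh"
}
```

Usage would be `keys_missing_after_upgrade upgraded/infrastructures.yaml fresh/infrastructures.yaml`, where both paths are placeholders. Because nesting is ignored, a key that merely moved to a different parent will not be reported, so the output is a starting point for review rather than a verdict.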

Comment 2 Clayton Coleman 2020-01-05 16:32:05 UTC
We also need a CI job that installs 4.1 and upgrades serially.
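
As a rough illustration of what "upgrades serially" means for such a job, the sketch below only enumerates the hops in a chain. `upgrade_hops` is a hypothetical name, and the actual install and update commands are deliberately left out because they depend on the CI environment.

```shell
# Hypothetical helper: print each consecutive hop of an upgrade chain.
# A real job would run an update and capture config.openshift.io after each hop.
upgrade_hops() {
  uh_prev=""
  for uh_version in "$@"; do
    if [ -n "$uh_prev" ]; then
      printf '%s -> %s\n' "$uh_prev" "$uh_version"
    fi
    uh_prev=$uh_version
  done
}
```

Calling `upgrade_hops 4.1 4.2 4.3` prints one line per hop, so a serial job is one that walks every line in order instead of testing only the final hop.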

Comment 3 Chet Hosey 2020-01-06 07:06:22 UTC
In case it falls under the same scope, BZ 1786246 is another case where 4.1 settings break under 4.2. In that case the Jenkins image can no longer be pulled with settings that were set under 4.1 (presumably by the installer).

Ideally it would be caught by a CI job such as the one proposed. But it's hard to test everything so I wanted to raise awareness in case it's in scope.

Comment 4 Daneyon Hansen 2020-01-13 18:29:07 UTC
*** Bug 1779299 has been marked as a duplicate of this bug. ***

Comment 6 W. Trevor King 2020-02-24 17:06:28 UTC
Comparing 4.2->4.3 vs. 4.3 (using 4.3->4.3 as a stand-in for a raw 4.3 job, because it's easier for me to find update jobs).  Looking at *->4.3.3 update CI [1] and picking successful jobs:

* 4.2.20 -> 4.3.3 [2].  Drilling down to the config.openshift.io directory [3].
* 4.3.0 -> 4.3.3 [4].  Drilling down to the config.openshift.io directory [5].

Fetching:

$ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/
$ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/

Diffing, and removing fields with timestamps and other expected divergence:

$ diff -ru storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/* | $ grep '^+\|^-' | grep -v '^---\|^+++\|resourceVersion\|creationTimestamp\|lastTransitionTime:\| uid:\|apiServerInternalURI:\|apiServerURL:\|infrastructureName:\|versionHash:\|clusterID:\|consoleURL:\|baseDomain:\|ci-op-\|message:\|lastReportTime:\|domain:\|etcdDiscoveryDomain:\|startedTime:\|completionTime:\|are at latest configuration\|/release@sha256' | sort | uniq
+---
-        4.3.3 not found in the "stable-4.2" channel'
+        4.3.3 not found in the "stable-4.3" channel'
-    channel: stable-4.2
-  channel: stable-4.2
+    channel: stable-4.3
+  channel: stable-4.3
-        finishes.
-      finishes.
-      image: registry.svc.ci.openshift.org/ocp/release:4.3.3
-    image: registry.svc.ci.openshift.org/ocp/release:4.3.3
-      name: certified-operators
+      name: certified-operators
-      name: community-operators
+      name: community-operators
-    - name: operator
-  - name: operator
+    - name: operator
+  - name: operator
-      name: redhat-operators
+      name: redhat-operators
-      not found in the "stable-4.2" channel'
+      not found in the "stable-4.3" channel'
-      reason: RollOutInProgress
-    reason: RollOutInProgress
-        region: us-east-2
-      region: us-east-2
+        region: us-west-1
+      region: us-west-1
-      verified: false
-    verified: false
+      verified: true
+    verified: true
-      version: 4.2.20
-    version: 4.2.20
+      version: 4.3.0
+    version: 4.3.0
-      version: 4.3.3
-    version: 4.3.3
+      version: 4.3.3
+    version: 4.3.3

And the bulk of those changes are just unstable ordering.  For example:

$ diff -u $(grep -rl redhat-operators storage.googleapis.com)
--- storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/operatorhubs.yaml	2020-02-19 13:59:43.000000000 -0800
+++ storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/operatorhubs.yaml	2020-02-19 10:55:21.000000000 -0800
@@ -6,26 +6,26 @@
   metadata:
     annotations:
       release.openshift.io/create-only: "true"
-    creationTimestamp: "2020-02-19T20:51:51Z"
+    creationTimestamp: "2020-02-19T17:42:22Z"
     generation: 1
     name: cluster
-    resourceVersion: "23400"
+    resourceVersion: "47685"
     selfLink: /apis/config.openshift.io/v1/operatorhubs/cluster
-    uid: e921c768-d891-4ea9-9047-2099d9a7c912
+    uid: 2eedf533-533f-11ea-bf80-02b1edf5936c
   spec: {}
   status:
     sources:
     - disabled: false
-      name: community-operators
-      status: Success
-    - disabled: false
       name: redhat-operators
       status: Success
     - disabled: false
       name: certified-operators
       status: Success
+    - disabled: false
+      name: community-operators
+      status: Success
 kind: OperatorHubList
 metadata:
   continue: ""
-  resourceVersion: "43003"
+  resourceVersion: "53562"
   selfLink: /apis/config.openshift.io/v1/operatorhubs
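
One crude way to suppress this kind of pure ordering noise is to compare the two files as sorted sets of lines. This is a sketch under that assumption, with `order_insensitive_diff` as a hypothetical name. It can hide a field that moved between objects, so it complements the plain diff rather than replacing it.

```shell
# Hypothetical helper: diff two files while ignoring line order.
# Returns diff's exit status: 0 if the sorted contents match, 1 if they differ.
order_insensitive_diff() {
  oid_a=$(mktemp)
  oid_b=$(mktemp)
  LC_ALL=C sort "$1" > "$oid_a"
  LC_ALL=C sort "$2" > "$oid_b"
  oid_rc=0
  diff -u "$oid_a" "$oid_b" || oid_rc=$?
  rm -f "$oid_a" "$oid_b"
  return "$oid_rc"
}
```

For the operatorhubs.yaml pair above, the reordered `sources` entries would no longer show up, leaving only the lines whose values really differ (timestamps, uid, resourceVersion).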

Still would be nice to check 4.3->4.3, 4.1->4.2->4.3->4.4, etc.
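
To repeat the comparison for other update paths, the filtering step could be wrapped in a small helper. This is a sketch only: `config_drift` is a hypothetical name, and it carries just a few of the exclusion patterns from the command above; the rest of that list can be appended as further `-e` arguments.

```shell
# Hypothetical helper: read a unified diff on stdin and keep only the
# added/removed lines that are not expected to differ between clusters.
config_drift() {
  grep '^[+-]' |
    grep -v \
      -e '^---' \
      -e '^+++' \
      -e 'resourceVersion' \
      -e 'creationTimestamp' \
      -e 'lastTransitionTime:' \
      -e ' uid:' |
    sort | uniq
}
```

Usage would be `diff -ru DIR_A DIR_B | config_drift`, where the two directories are placeholders for the fetched config.openshift.io dumps.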

[1]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.3.3#upgrades-from
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18062/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080
[5]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/18080/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8c22ff0f9629be3dab2f1f9b773ae800438e7db3a57e1d422aa6ee9a8ff6abfc/cluster-scoped-resources/config.openshift.io/

Comment 7 W. Trevor King 2020-02-24 17:16:30 UTC
Repeating the above with *->4.2.4 [1].

* 4.1.23 -> 4.2.4 [2].  Drilling down to the config.openshift.io directory [3].
* 4.2.2 -> 4.2.4 [4].  Drilling down to the config.openshift.io directory [5].

Fetching:

$ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/
$ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/

Diffing, and removing fields with timestamps and other expected divergence (and removing a $ from '$ grep' from a sloppy copy/paste from my previous comment):

$ diff -ru storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/* | grep '^+\|^-' | grep -v '^---\|^+++\|resourceVersion\|creationTimestamp\|lastTransitionTime:\| uid:\|apiServerInternalURI:\|apiServerURL:\|infrastructureName:\|versionHash:\|clusterID:\|consoleURL:\|baseDomain:\|ci-op-\|message:\|lastReportTime:\|domain:\|etcdDiscoveryDomain:\|startedTime:\|completionTime:\|are at latest configuration\|/release@sha256' | sort | uniq
-        (0.3)
-        4.2.4 not found in the "stable-4.1" channel'
+        4.2.4 not found in the "stable-4.2" channel'
-    annotations:
-  annotations:
+      aws:
+    aws:
-    channel: stable-4.1
-  channel: stable-4.1
+    channel: stable-4.2
+  channel: stable-4.2
+    externalIP:
-    - group: cloudcredential.openshift.io
-  - group: cloudcredential.openshift.io
+    mastersSchedulable: false
+      name: ""
+    name: ""
-      name: certified-operators
+      name: certified-operators
-      name: community-operators
+      name: community-operators
-    - name: kube-apiserver
-  - name: kube-apiserver
+    - name: kube-apiserver
+  - name: kube-apiserver
-    - name: kube-controller-manager
-  - name: kube-controller-manager
+    - name: kube-controller-manager
+  - name: kube-controller-manager
-    - name: oauth-openshift
-  - name: oauth-openshift
+    - name: oauth-openshift
+  - name: oauth-openshift
-      name: openshift-machine-api
-    name: openshift-machine-api
-      name: redhat-operators
+      name: redhat-operators
-      namespace: openshift-cloud-credential-operator
-    namespace: openshift-cloud-credential-operator
-      not found in the "stable-4.1" channel'
+      not found in the "stable-4.2" channel'
+    platformStatus:
+  platformStatus:
+      policy: {}
+    policy:
-      reason: AsExpected
-    reason: AsExpected
+      reason: AsExpected
+    reason: AsExpected
-      reason: OperandTransitionsSucceeding
-    reason: OperandTransitionsSucceeding
+      reason: OperandTransitionsSucceeding
+    reason: OperandTransitionsSucceeding
+        region: us-east-1
+      region: us-east-1
-      release.openshift.io/create-only: "true"
-    release.openshift.io/create-only: "true"
-      resource: CredentialsRequest
-    resource: CredentialsRequest
-  spec: {}
-spec: {}
+  spec:
+spec:
+  status: {}
+status: {}
-      status: "False"
-    status: "False"
+      status: "False"
+    status: "False"
-      status: "True"
-    status: "True"
+      status: "True"
+    status: "True"
+    trustedCA:
+  trustedCA:
+      type: AWS
+    type: AWS
-      type: Degraded
-    type: Degraded
+      type: Degraded
+    type: Degraded
-      type: Progressing
-    type: Progressing
+      type: Progressing
+    type: Progressing
-      type: Upgradeable
-    type: Upgradeable
+      type: Upgradeable
+    type: Upgradeable
-      version: 1.14.6
-    version: 1.14.6
+      version: 1.14.6
+    version: 1.14.6
-      version: 4.1.23
-    version: 4.1.23
+      version: 4.2.2
+    version: 4.2.2
-      version: 4.2.4_openshift
-    version: 4.2.4_openshift
+      version: 4.2.4_openshift
+    version: 4.2.4_openshift

so you can see the platformStatus bit from bug 1773870.
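
A check for that specific gap could look like the sketch below. It assumes, as an illustration rather than something established in this bug, that an unmigrated Infrastructure dump still carries a `platform:` key but has no `platformStatus:` key. `needs_platform_status_backfill` is a hypothetical name.

```shell
# Hypothetical helper: succeed (exit 0) when a dump has the assumed legacy
# shape, i.e. a platform: key without a platformStatus: key.
needs_platform_status_backfill() {
  grep -q '^ *platform:' "$1" && ! grep -q '^ *platformStatus:' "$1"
}
```

Running this over the must-gather output of each update job would flag clusters that came through the update without the field being populated.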

[1]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.2.4#upgrades-from
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10737/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738
[5]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/10738/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2bebbc3d547d70cb8caea206a567642f5ab1c7e098ddba55bf7e64b5c58534f2/cluster-scoped-resources/config.openshift.io/

Comment 8 W. Trevor King 2020-02-24 17:22:02 UTC
We also have chained update jobs, although on nightlies, not release candidates.  E.g. here's 4.1->4.2->4.3 [1].  Nothing in that chain includes 4.4 yet, but we can probably bump that since we've had 4.4 nightlies for a while now.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/10

Comment 9 Scott Dodson 2020-02-26 20:17:31 UTC
We've performed a one-time audit of config drift between upgraded and greenfield installations and found only the platform issue that was already known. The team will continue the upgrade-chaining work that Clayton started with the 4.1 to 4.2 to 4.3 upgrade jobs, and that work will be tracked via Jira. If it is not completed ahead of 4.5, we'll track this as a 4.5 blocker bug to audit again.

Comment 10 W. Trevor King 2020-03-18 21:26:56 UTC
We now have 4.1->4.2->4.3->4.4 CI since [1].

[1]: https://github.com/openshift/release/pull/7230

Comment 11 Abhinav Dahiya 2020-04-27 17:37:14 UTC
We have CI jobs that test this now.