1877899 – [GCP] Upgrade from 4.5 to 4.6 then back to 4.5 doesn't complete

Keywords:
Status:	CLOSED DUPLICATE of bug 1882394
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Over the Air Updates
QA Contact:	To Hung Sze
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-10 17:16 UTC by To Hung Sze
Modified:	2022-05-06 12:29 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-08 18:08:26 UTC
Target Upstream Version:
Embargoed:

Description To Hung Sze 2020-09-10 17:16:43 UTC

Description of problem:
Upgrading from 4.5 to 4.6 and then downgrading back to 4.5 doesn't complete.
The dns, machine-config, network, and storage operators stay at 4.6.

How reproducible:
Always

Steps to Reproduce:
1. Start with a good 4.5 installation (I used 4.5.0-0.nightly-2020-09-04-102546)
 
2. Upgrade to a 4.6 build (I used 4.6.0-fc.4)
./oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.6.0-fc.4-x86_64 --force --allow-explicit-upgrade

3. Check that the upgrade finishes without problems

4. Downgrade back to 4.5
./oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-09-04-102546 --force --allow-explicit-upgrade

Actual results:
./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.5.0-0.nightly-2020-09-04-102546: the cluster operator storage has not yet successfully rolled out

warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently installed version 4.5.0-0.nightly-2020-09-04-102546 not found in the "stable-4.5" channel

./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-09-04-102546   True        False         False      123m
cloud-credential                           4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h32m
cluster-autoscaler                         4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h23m
config-operator                            4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h24m
console                                    4.5.0-0.nightly-2020-09-04-102546   True        False         False      97m
csi-snapshot-controller                    4.5.0-0.nightly-2020-09-04-102546   True        False         False      125m
dns                                        4.6.0-fc.4                          True        False         False      142m
etcd                                       4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h28m
image-registry                             4.5.0-0.nightly-2020-09-04-102546   True        False         False      131m
ingress                                    4.5.0-0.nightly-2020-09-04-102546   True        False         False      98m
insights                                   4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h25m
kube-apiserver                             4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h27m
kube-controller-manager                    4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h28m
kube-scheduler                             4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h28m
kube-storage-version-migrator              4.5.0-0.nightly-2020-09-04-102546   True        False         False      131m
machine-api                                4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h22m
machine-approver                           4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h27m
machine-config                             4.6.0-fc.4                          True        False         False      122m
marketplace                                4.5.0-0.nightly-2020-09-04-102546   True        False         False      97m
monitoring                                 4.5.0-0.nightly-2020-09-04-102546   True        False         False      124m
network                                    4.6.0-fc.4                          True        False         False      3h30m
node-tuning                                4.5.0-0.nightly-2020-09-04-102546   True        False         False      99m
openshift-apiserver                        4.5.0-0.nightly-2020-09-04-102546   True        False         False      99m
openshift-controller-manager               4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h24m
openshift-samples                          4.5.0-0.nightly-2020-09-04-102546   True        False         False      98m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h29m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h29m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-09-04-102546   True        False         False      98m
service-ca                                 4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h30m
storage                                    4.6.0-fc.4  

Expected results:
Downgrade works and all operators are back to 4.5.
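
A minimal sketch, not part of the original report, of how the "all operators are back to 4.5" check could be automated against pasted `oc get co` output. The function name `stale_operators` and the abridged `SAMPLE` table are hypothetical; the rows are copied from the output above, and the code assumes the operator name and version are the first two whitespace-separated columns.

```python
def stale_operators(table: str, target: str) -> list[tuple[str, str]]:
    """Return (operator, version) pairs whose VERSION column differs from target."""
    stale = []
    for line in table.strip().splitlines():
        fields = line.split()
        # Skip the header row and any row too short to carry a version.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, version = fields[0], fields[1]
        if version != target:
            stale.append((name, version))
    return stale


# Abridged copy of the table from this report (hypothetical variable name).
SAMPLE = """\
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-09-04-102546   True        False         False      123m
dns                                        4.6.0-fc.4                          True        False         False      142m
etcd                                       4.5.0-0.nightly-2020-09-04-102546   True        False         False      3h28m
machine-config                             4.6.0-fc.4                          True        False         False      122m
network                                    4.6.0-fc.4                          True        False         False      3h30m
storage                                    4.6.0-fc.4
"""

if __name__ == "__main__":
    for name, version in stale_operators(SAMPLE, "4.5.0-0.nightly-2020-09-04-102546"):
        print(f"{name} is still at {version}")
```

Run against the sample, this lists dns, machine-config, network, and storage, matching the operators named in the problem description.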

Comment 1 Vadim Rutkovsky 2020-09-10 17:33:25 UTC

Downgrades are not supported

Comment 2 W. Trevor King 2020-09-12 21:10:02 UTC

Vadim is correct that downgrades are not supported.  But it's still nice to understand how they break.  These are timing out in CI [1], and bugs in the CI tooling mean we don't gather post-test assets when the test times out, so there are no must-gathers there.  Hung, can you attach a must-gather from your run or a reproducer?  And we'll follow up with the test-platform folks about the lack of CI gathers.  Without a must-gather, it's hard to move forward.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.5-to-4.6/1304491478306787328

Comment 3 W. Trevor King 2020-10-04 02:34:56 UTC

Still no must-gather.  [1] will help with the timing-out CI if we can get it working and land it.

[1]: https://github.com/openshift/ci-tools/pull/1257

Comment 4 To Hung Sze 2020-10-05 21:09:18 UTC

Sorry.
I missed the request for must-gather.
Reproduced it today with 4.5.14 -> 4.6.0-fc.9 -> 4.5.14.
Have the must-gather.
Let me know what I should do.

Comment 5 To Hung Sze 2020-10-15 01:52:58 UTC

I reproduced this with 4.5.13 -> 4.6 -> 4.5.13

./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.5.13: the cluster operator storage has not yet successfully rolled out

Updates:

VERSION IMAGE
4.5.14  quay.io/openshift-release-dev/ocp-release@sha256:95cfe9273aecb9a0070176210477491c347f8e69e41759063642edf8bb8aceb6

./oc get  co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
csi-snapshot-controller                    4.5.13       True        False         False      101m
dns                                        4.6.0-rc.4   True        False         False      124m
machine-config                             4.6.0-rc.4   True        False         False      98m
marketplace                                4.5.13       True        False         False      66m
monitoring                                 4.5.13       True        False         False      65m
network                                    4.6.0-rc.4   True        False         False      5h45m
storage                                    4.6.0-rc.4   True        False         False      143m

./oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": null,
    "image": "quay.io/openshift-release-dev/ocp-release:4.5.13-x86_64",
    "startedTime": "2020-10-15T00:37:13Z",
    "state": "Partial",
    "verified": false,
    "version": "4.5.13"
  },
  {
    "completionTime": "2020-10-15T00:11:58Z",
    "image": "quay.io/openshift-release-dev/ocp-release:4.6.0-rc.4-x86_64",
    "startedTime": "2020-10-14T23:05:55Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-rc.4"
  },
  {
    "completionTime": "2020-10-14T20:27:28Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:8d104847fc2371a983f7cb01c7c0a3ab35b7381d6bf7ce355d9b32a08c0031f0",
    "startedTime": "2020-10-14T20:01:02Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.13"
  }
]
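
A minimal sketch, not part of the original comment, of how the stuck entry can be picked out of the `status.history` array shown above. The function name `incomplete_updates` and the `SAMPLE_HISTORY` variable are hypothetical; the JSON is copied from the output above, and the code assumes only that finished updates carry `"state": "Completed"`.

```python
import json


def incomplete_updates(history_json: str) -> list[dict]:
    """Return history entries whose state is anything other than Completed."""
    history = json.loads(history_json)
    return [entry for entry in history if entry.get("state") != "Completed"]


# History copied from the jq output in this comment (hypothetical variable name).
SAMPLE_HISTORY = """
[
  {
    "completionTime": null,
    "image": "quay.io/openshift-release-dev/ocp-release:4.5.13-x86_64",
    "startedTime": "2020-10-15T00:37:13Z",
    "state": "Partial",
    "verified": false,
    "version": "4.5.13"
  },
  {
    "completionTime": "2020-10-15T00:11:58Z",
    "image": "quay.io/openshift-release-dev/ocp-release:4.6.0-rc.4-x86_64",
    "startedTime": "2020-10-14T23:05:55Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-rc.4"
  },
  {
    "completionTime": "2020-10-14T20:27:28Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:8d104847fc2371a983f7cb01c7c0a3ab35b7381d6bf7ce355d9b32a08c0031f0",
    "startedTime": "2020-10-14T20:01:02Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.13"
  }
]
"""

if __name__ == "__main__":
    for entry in incomplete_updates(SAMPLE_HISTORY):
        print(entry["version"], entry["state"], "started", entry["startedTime"])
```

Here the only entry returned is the 4.5.13 downgrade, which is in state `Partial` with no completion time.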

I have the must-gather, but it is too big to attach here. Please let me know who I should share it with.

Thanks.

Comment 6 To Hung Sze 2020-10-15 17:30:42 UTC

Closing this as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1882394

*** This bug has been marked as a duplicate of bug 1882394 ***

Comment 7 Yang Yang 2021-01-06 05:12:12 UTC

I'm seeing this when taking an AWS cluster from 4.6.9 -> 4.7.0-fc.1 -> 4.6.9. A must-gather is available online: https://drive.google.com/file/d/1ykM5ikJb-SwDZ29dJqJcsMk76-AlydlG/view?usp=sharing

$ oc describe co/storage
Status:
  Conditions:
    Last Transition Time:  2021-01-06T04:15:30Z
    Message:               AWSEBSCSIDriverOperatorCRDegraded: ResourceSyncControllerDegraded: configmaps "kube-cloud-config" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:aws-ebs-csi-driver-operator" cannot get resource "configmaps" in API group "" in the namespace "openshift-config-managed"
    Reason:                AWSEBSCSIDriverOperatorCR_ResourceSyncController_Error
    Status:                True
    Type:                  Degraded

Comment 8 Lalatendu Mohanty 2021-01-08 18:08:26 UTC

Yang, this bug was raised for a different issue, i.e. "the cluster operator storage has not yet successfully rolled out". Please file a new bug for the error you reported.

*** This bug has been marked as a duplicate of bug 1882394 ***
