Bug 1879919

Summary: [External] Upgrade mechanism from OCS 4.5 to OCS 4.6 needs to be fixed
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Rachael <rgeorge>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Rachael <rgeorge>
Severity: high
Priority: unspecified
Version: 4.6
CC: asachan, assingh, madam, muagarwa, nberry, ocs-bugs, shan, tnielsen
Keywords: AutomationBackLog
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.6.0-110.ci
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:24:14 UTC
Type: Bug
Bug Blocks: 1881071 (view as bug list)

Description Rachael 2020-09-17 11:03:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OCS upgrade from 4.5 to 4.6 in external mode is broken. The following error messages were seen in the rook-operator logs after the upgrade:

2020-09-17 10:01:51.840563 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-external-storagecluster-cephcluster": failed to configure external ceph cluster: failed to configure external cluster monitoring: failed to create or update mgr endpoint: failed to create endpoint "rook-ceph-mgr-external". Endpoints "rook-ceph-mgr-external" is invalid: [subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address, (e.g. 10.9.8.7), subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address]
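The empty-IP error above comes from the kube-apiserver's Endpoints validation: every address in an Endpoints subset must carry a parsable IPv4/IPv6 literal, so the empty string left by the missing external-cluster monitoring details is rejected. A minimal sketch of that check (hypothetical helper, not Rook or Kubernetes code):

```python
import ipaddress

def valid_endpoint_ip(ip: str) -> bool:
    # Mirrors the apiserver rule that rejected "rook-ceph-mgr-external":
    # the ip field must parse as an IPv4/IPv6 address, so "" fails.
    try:
        ipaddress.ip_address(ip)
        return True
    except ValueError:
        return False
```

An empty monitoring IP in the connection details therefore only surfaces at reconcile time, when Rook tries to create the endpoint object.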

It was also observed that RGW OBCs were stuck in Pending state after the upgrade with the following error:

2020-09-17 09:47:18.134717 I | op-bucket-prov: getting storage class "ocs-external-storagecluster-ceph-rgw"
E0917 09:47:18.136502       8 controller.go:197] error syncing 'failure/rgw-1': error provisioning bucket: failed to get cephObjectStore: error getting cephObjectStore: resource name may not be empty, requeuing

OCS 4.6 introduces a monitoring IP and requires additional permissions for the external Ceph cluster; these details need to be updated on the 4.5 external mode cluster before or during the upgrade.
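Since 4.6 expects the new monitoring entry in the external cluster connection details, a pre-upgrade sanity check of that JSON could catch the gap before the operator reconcile fails. A sketch assuming a hypothetical layout of the exporter output (the key names here are illustrative, not the exact OCS secret format):

```python
import json

def has_monitoring_endpoint(blob: str) -> bool:
    # Illustrative check: the connection details are assumed to be a JSON
    # list of {"name", "kind", "data"} entries, one of which should carry
    # a non-empty monitoring endpoint for 4.6.
    entries = json.loads(blob)
    return any(
        e.get("name") == "monitoring-endpoint"
        and e.get("data", {}).get("MonitoringEndpoint")
        for e in entries
    )

# Sample blob in the assumed layout, with the 4.6 monitoring entry present.
sample = json.dumps([
    {"name": "rook-ceph-mon-endpoints", "kind": "ConfigMap",
     "data": {"data": "a=10.0.0.1:6789"}},
    {"name": "monitoring-endpoint", "kind": "CephCluster",
     "data": {"MonitoringEndpoint": "10.0.0.1"}},
])
```

A 4.5-era blob without the monitoring entry would fail this check, which matches the empty-IP reconcile error seen after the upgrade.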

Version of all relevant components (if applicable):
OCP version: 4.6.0-0.nightly-2020-09-17-004654

$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.6.0-564.ci   OpenShift Container Storage   4.6.0-564.ci   ocs-operator.v4.5.0-560.ci   Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it is not possible to perform a successful upgrade to the newer OCS version.

Is there any workaround available to the best of your knowledge?
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Not a regression

Steps to Reproduce:

1. Upgrade OCS 4.5 external mode cluster to OCS 4.6

  - oc edit catsrc/ocs-catalogsource -n openshift-marketplace and change the image to the OCS 4.6 catalog image


Actual results:
The upgrade is not successful. The RGW OBC is stuck in Pending state and the CephCluster reconcile fails.


Expected results:
Upgrade should be successful

Comment 13 Travis Nielsen 2020-09-30 19:40:45 UTC
@Rachael Please share the rook operator log from the latest repro to confirm if the error is exactly the same or if it is different now.

Comment 19 errata-xmlrpc 2020-12-17 06:24:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605