Bug 2091594 - [MS] RHODF MS add on deployer upgrade failed for v2.0.1 to v2.0.2 on OCP 4.8.36 cluster
Summary: [MS] RHODF MS add on deployer upgrade failed for v2.0.1 to v2.0.2 on OCP 4.8....
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ohad
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 2056697 2093205
Blocks:
 
Reported: 2022-05-30 12:20 UTC by suchita
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-20 09:50:26 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2056697 1 unspecified CLOSED odf-csi-addons-operator subscription failed while using custom catalog source 2023-08-09 17:00:26 UTC

Description suchita 2022-05-30 12:20:26 UTC
Description of problem:
A cluster with addon version v2.0.1 and OCP version 4.8.36 failed to upgrade to addon deployer version v2.0.2.

While preparing for the deployer upgrade from v2.0.1 to v2.0.2 on the staging stable add-on, we had 2 types of cluster setup:
Setup 1. Provider with OCP 4.10.14 + ODF addon v2.0.1, and 2 consumers with OCP 4.10.14 and the ODF consumer add-on v2.0.1
Setup 2. Provider with OCP 4.10.14 + ODF addon v2.0.1, and 2 consumers with OCP 4.8.36 and the ODF consumer add-on v2.0.1

The upgrade succeeded on the setup 1 provider and consumers, but failed on the consumers of the setup 2 clusters.
 

Version-Release number of selected component (if applicable):

oc get csv -n openshift-storage
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.2                      NooBaa Operator               4.10.2            mcg-operator.v4.10.1                      Succeeded
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                      Succeeded
ocs-osd-deployer.v2.0.1                   OCS OSD Deployer              2.0.1             ocs-osd-deployer.v2.0.0                   Succeeded
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0                                                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.418-6459408   Route Monitor Operator        0.1.418-6459408   route-monitor-operator.v0.1.408-c2256a2   Succeeded


OpenShift version: 4.8.36
Addon: ocs-consumer in the staging env

How reproducible:
4/4

Steps to Reproduce:
1. Create an appliance provider cluster with OCP 4.10 and the ocs-provider addon
(rosa create service --type ocs-provider --name $CLUSTER_NAME --size 20 --onboarding-validation-key $CONSUMER_KEY  --subnet-ids $SUBNET_IDS )

2. Create a ROSA consumer cluster with OCP 4.8 and the ocs-consumer addon

3. Initiate the upgrade
(https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/2376
https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/2377)

Actual results:
Consumer cluster with OCP 4.8 and ODF add-on v2.0.1 failed to upgrade to deployer version v2.0.2


Expected results:
Consumer cluster with OCP 4.8 and ODF add-on v2.0.1 should also upgrade to deployer version v2.0.2

Additional info:
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-m26c1/sgatfane-m26c1_20220526T145938/openshift-cluster-dir/bz_upgrade_2091594/


A similar issue was observed while upgrading the QE add-on.
More discussion in Slack threads:
https://coreos.slack.com/archives/C01L46M0FQC/p1652954205830049 => https://coreos.slack.com/archives/C01L46M0FQC/p1653912629377239?thread_ts=1652954205.830049&cid=C01L46M0FQC
https://coreos.slack.com/archives/C01L46M0FQC/p1653914999880409
Gchat room thread: https://chat.google.com/room/AAAASHA9vWs/xmAh4PDRZh0

The probable reason mentioned in the Google Chat thread:
`Previously addon catalogSource was created in openshift-marketplace but MT-SRE have updated tooling such that addon catalog will get created in targetNamespace
That’s the reason they have to create network policy for catalogSource`
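
A quick way to confirm where the addon catalog actually ends up (the namespace names below are an assumption based on the CSV listing above, not something verified in this report):

oc get catalogsource -n openshift-marketplace
oc get catalogsource -n openshift-storage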

Comment 2 Ohad 2022-05-30 15:09:54 UTC
RCA:

ODF 4.10 deployments include an operator named odf-csi-addons-operator, for which odf-operator creates a Subscription object directly in code.
Because the subscription is created manually, rather than through OLM dependency resolution, it is created with a static catalog namespace, which is openshift-marketplace.
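
Roughly, the resulting Subscription looks like the sketch below (the channel and catalog source names are illustrative assumptions; the hard-coded sourceNamespace is the relevant part):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-csi-addons-operator
  namespace: openshift-storage
spec:
  name: odf-csi-addons-operator
  channel: stable-4.10                      # assumed channel, for illustration
  source: redhat-operators                  # assumed catalog name, for illustration
  sourceNamespace: openshift-marketplace    # static namespace hard-coded in odf-operator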

On ODF MS deployments, we override the marketplace catalog with a local catalog inside the openshift-storage namespace.
This works for all dependencies that come in via OLM dependency resolution, including ocs-operator and mcg-operator.
 
But for odf-csi-addons-operator, the subscription still refers to the openshift-marketplace catalog.
On OCP 4.8 deployments, OLM is unable to satisfy that subscription.

OLM's operator upgrade model is "all or nothing" within a single namespace: a single unsatisfied subscription blocks all other subscription updates/changes in that namespace until the issue is resolved.
Because we have a broken subscription in the namespace, the addon (deployer) upgrade is halted and will not continue until the odf-csi-addons-operator subscription is either deleted or updated.
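
One way to see this state on an affected consumer (a sketch assuming the openshift-storage namespace from the CSV listing; exact condition messages may differ):

oc get subscriptions -n openshift-storage
oc describe subscription odf-csi-addons-operator -n openshift-storage
(look for ResolutionFailed / constraint-related conditions in the subscription status)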

--------------------------------------------------

Manual Mitigation (workaround):
An SRE will have to go into the openshift-storage namespace and edit the subscription for odf-csi-addons-operator, changing the catalog namespace from openshift-marketplace to openshift-storage.
This workaround was tried and proven successful.
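
One possible way to apply that edit non-interactively (a sketch; it assumes the subscription is named odf-csi-addons-operator and lives in openshift-storage):

oc patch subscriptions.operators.coreos.com odf-csi-addons-operator -n openshift-storage --type merge -p '{"spec":{"sourceNamespace":"openshift-storage"}}'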

--------------------------------------------------

Fix:
The product needs to add odf-csi-addons-operator to odf-operator's dependencies.yaml so that it is resolved by OLM.
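
For illustration, an olm.package entry in the odf-operator bundle's dependencies.yaml might look like the sketch below (the version range is an assumption, not taken from the actual fix):

dependencies:
  - type: olm.package
    value:
      packageName: odf-csi-addons-operator
      version: ">=4.10.0"   # assumed constraint, for illustration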

