Bug 2188588

Summary: [ODFMS] Private link appliance mode provider addon deployment of ODF 4.11 with ROSA 4.11 failed with StorageCluster stuck in Pending state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-managed-service
Version: 4.11
Reporter: suchita <sgatfane>
Assignee: Rewant <resoni>
QA Contact: Neha Berry <nberry>
CC: ebenahar, ikave, lgangava, odf-bz-bot, owasserm, resoni, sgatfane
Status: ASSIGNED
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Embargoed:

Description suchita 2023-04-21 09:00:34 UTC
Description of problem:

Private link appliance mode provider dev-addon deployment failed with the StorageCluster stuck in a Pending state, whereas non-private link cluster deployment succeeds with the same configuration.

$ oc get csv| grep -v Succeeded
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed

-----storage cluster-----------------
Status:
  Conditions:
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Error while reconciling: Operation cannot be fulfilled on services "ocs-provider-server": the object has been modified; please apply your changes to the latest version and try again
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a42c490ba7aa8732ebc53a90ce33c4cb9cf8e556395cc9598f8808e0b719ebe7
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel8@sha256:f46f471baf226674d9ec79babd33a77633716801e041fbe07890b25d95f29d16
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
  Kms Server Connection:
  Phase:  Progressing
Events:
  Type    Reason   Age               From                       Message
  ----    ------   ----              ----                       -------
  Normal  Waiting  30m (x9 over 8h)  controller_storagecluster  Waiting for Ingress on service ocs-provider-server
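
The "Waiting for Ingress on service ocs-provider-server" event suggests the controller is blocked on the Service's load balancer being provisioned. As an illustrative sketch (assuming ocs-provider-server is a Service of type LoadBalancer; the hostname below is a made-up placeholder), the cloud controller must populate status.loadBalancer.ingress before the wait can end:

```yaml
# Pending: AWS has not provisioned the private-link load balancer,
# so status.loadBalancer stays empty and the controller keeps waiting.
status:
  loadBalancer: {}
# Once provisioning succeeds, the cloud controller fills in something like:
# status:
#   loadBalancer:
#     ingress:
#     - hostname: internal-example.elb.amazonaws.com   # placeholder
```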

-------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

ROSA 4.11.36 
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded

This change is on the dev addon, where the deployer is updated with the ocs-osd-deployer.v2.0.12 + ODF 4.11 changes.

How reproducible:
5/5

Steps to Reproduce:
1. Deploy the RHODF private link appliance mode provider addon.

Actual results:
Deployment of the provider addon failed

Expected results:
Deployment should succeed

Additional info:
status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Pending



$ oc get pods
NAME                                                              READY   STATUS              RESTARTS   AGE
8507126c9da8e5eb41ac05d78e5d2f645deb81344ed9875c033afd3fd15h6kj   0/1     Completed           0          8h
addon-ocs-provider-dev-catalog-dwd5n                              1/1     Running             0          8h
alertmanager-managed-ocs-alertmanager-0                           2/2     Running             0          8h
csi-addons-controller-manager-c4bc7c984-s65wh                     2/2     Running             0          8h
d038f9a4d3f085fb4ec83447140c7847cd8f4a5321dde382d9dc6c9cd5lsw27   0/1     Completed           0          8h
ocs-metrics-exporter-778fdbf4b4-hvzb6                             1/1     Running             0          8h
ocs-operator-56cb54879b-rhklx                                     1/1     Running             0          8h
ocs-osd-aws-data-gather-5dcf5f8bb-p9wjb                           1/1     Running             0          8h
ocs-osd-controller-manager-755bdbc9bc-ghv9z                       2/3     Running             0          8h
odf-console-5979c56c7f-fftp8                                      1/1     Running             0          8h
odf-operator-controller-manager-5c74658b55-6kn55                  2/2     Running             0          8h
prometheus-managed-ocs-prometheus-0                               3/3     Running             0          8h
prometheus-operator-c74f5f6c9-fmkvw                               1/1     Running             0          8h
rook-ceph-operator-7cdd5b8fc5-q5f2b                               1/1     Running             0          10m
rook-ceph-tools-64fc7d784d-mh7lw                                  0/1     ContainerCreating   0          8h



$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded


Discussion Chat thread: https://chat.google.com/room/AAAASHA9vWs/ETkPa8M2w6c
The cluster's must-gather and other logs are here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c4420/sgatfane-c4420_20230420T021443/logs/

Comment 1 Leela Venkaiah Gangavarapu 2023-04-24 08:47:24 UTC
- One of the private link clusters I looked into has an issue getting a load balancer up for the provider server.
- The deployer is waiting for the storagecluster to be Ready, and it was Pending due to the provider server service.
- Preliminary deduction: it is a bug in how we configure AWS for private link, and it needs a fix.

Comment 2 Itzhak 2023-05-28 13:22:58 UTC
I deployed provider and consumer non-private link clusters. The provider deployed successfully, including the ocs-provider-qe addon. The consumer cluster deployed successfully, but the ocs-consumer-qe addon failed to install. After editing the consumer addon with the command "rosa edit addon -c ikave-np45-c1 ocs-consumer-qe --storage-provider-endpoint 10.0.13.178:31659" (where 10.0.13.178 is one of the provider worker node IPs), the ocs-osd-deployer changed to a Succeeded state after a few seconds.
So it seems that the issue is with the load balancer.

Additional info:

Link to the multicluster Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1713/.

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.11.5                      OpenShift Container Storage   4.11.5            ocs-operator.v4.11.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h47m   Error while reconciling 4.11.40: the cluster operator machine-config is degraded

Rook version:
rook: v4.11.5-0.d4bc197c9a967840c92dc0298fbd340b75a21836
go: go1.17.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)


Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h43m   Cluster version is 4.11.40

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Comment 6 Rewant 2023-05-29 13:27:53 UTC
Found the issue with the non-private link deployment resulting in an Error state: the operator was at ODF version 4.10.

[jenkins@odf-ms-stage privateLink]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.12                     NooBaa Operator               4.10.12           mcg-operator.v4.10.11                     Succeeded
observability-operator.v0.0.21            Observability Operator        0.0.21            observability-operator.v0.0.20            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.1.0                   OCS OSD Deployer              2.1.0             ocs-osd-deployer.v2.0.13                  Failed
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.500-6152b76   Route Monitor Operator        0.1.500-6152b76   route-monitor-operator.v0.1.498-e33e391   Succeeded
[jenkins@odf-ms-stage privateLink]$ oc get subs
NAME                                                                           PACKAGE                   SOURCE                          CHANNEL
addon-ocs-consumer-qe                                                          ocs-osd-deployer          addon-ocs-consumer-qe-catalog   alpha
mcg-operator-stable-4.10-redhat-operators-openshift-storage                    mcg-operator              redhat-operators                stable-4.10
ocs-operator-stable-4.10-redhat-operators-openshift-storage                    ocs-operator              redhat-operators                stable-4.10
odf-csi-addons-operator-stable-4.10-redhat-operators-openshift-storage         odf-csi-addons-operator   redhat-operators                stable-4.10
odf-operator-stable-4.10-redhat-operators-openshift-storage                    odf-operator              redhat-operators                stable-4.11
ose-prometheus-operator-beta-addon-ocs-consumer-qe-catalog-openshift-storage   ose-prometheus-operator   addon-ocs-consumer-qe-catalog   beta

The subscription was unhealthy with:

  - message: 'constraints not satisfiable: no operators found in channel stable-4.11
      of package odf-operator in the catalog referenced by subscription odf-operator-stable-4.10-redhat-operators-openshift-storage,
      subscription odf-operator-stable-4.10-redhat-operators-openshift-storage exists'

On consumers we create an additional CatalogSource; we need to update the CatalogSource image to the 4.11 registry.
Opened the PR for the same: https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/4117
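
For illustration, the shape of the fix is a CatalogSource whose index image matches the channel the subscriptions request. This is a hedged sketch only: the metadata.name is taken from the subscription list above, but the image path is a placeholder, not the actual registry reference from the merge request:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: addon-ocs-consumer-qe-catalog     # name as seen in `oc get subs` above
  namespace: openshift-storage
spec:
  sourceType: grpc
  # Placeholder image: the fix points this at a 4.11 index image so that
  # the stable-4.11 channel referenced by the odf-operator subscription
  # can be resolved, clearing the "constraints not satisfiable" error.
  image: registry.example.com/odf4/odf-index:v4.11
  displayName: OCS Consumer Catalog
```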

Comment 7 Rewant 2023-05-31 11:12:52 UTC
We also need to update the egress rules on the consumer to allow access to the load balancer endpoint. Opened the PR for it: https://github.com/red-hat-storage/ocs-osd-deployer/pull/289
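
As a sketch of the kind of rule meant (assuming the egress is expressed as a Kubernetes NetworkPolicy; the name, CIDR, and port below are placeholders, not the values from the PR):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-provider-endpoint-egress    # illustrative name
  namespace: openshift-storage
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16   # placeholder: provider load balancer network
    ports:
    - protocol: TCP
      port: 31659           # placeholder: provider server endpoint port
```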

Comment 8 Elad 2023-06-05 15:44:26 UTC
Moving back to ASSIGNED as testing is still blocked