Bug 2188588 - [ODFMS] PrivateLink appliance mode provider addon deployment of ODF 4.11 with ROSA 4.11 failed with StorageCluster stuck in Pending state
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Rewant
QA Contact: suchita
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-21 09:00 UTC by suchita
Modified: 2023-09-12 11:56 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-osd-deployer pull 289 0 None open update the egress rules to allow loadBalancer IP on consumer side 2023-05-31 11:12:51 UTC

Description suchita 2023-04-21 09:00:34 UTC
Description of problem:

PrivateLink appliance mode provider dev-addon deployment failed with the StorageCluster stuck in a Pending state,
whereas a non-PrivateLink cluster deployment succeeds with the same configuration.

$ oc get csv| grep -v Succeeded
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed

-----storage cluster-----------------
Status:
  Conditions:
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Error while reconciling: Operation cannot be fulfilled on services "ocs-provider-server": the object has been modified; please apply your changes to the latest version and try again
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a42c490ba7aa8732ebc53a90ce33c4cb9cf8e556395cc9598f8808e0b719ebe7
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel8@sha256:f46f471baf226674d9ec79babd33a77633716801e041fbe07890b25d95f29d16
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
  Kms Server Connection:
  Phase:  Progressing
Events:
  Type    Reason   Age               From                       Message
  ----    ------   ----              ----                       -------
  Normal  Waiting  30m (x9 over 8h)  controller_storagecluster  Waiting for Ingress on service ocs-provider-server
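
The Waiting event above clears only once AWS provisions the load balancer for the Service. For comparison, on a healthy provider the Service status carries an ingress entry roughly like the following (the hostname below is a placeholder, not taken from this cluster):

```yaml
# Illustrative status of the ocs-provider-server Service after the AWS
# load balancer is provisioned; the hostname is a placeholder.
status:
  loadBalancer:
    ingress:
    - hostname: internal-a1b2c3d4.us-east-2.elb.amazonaws.com
```

On the failing PrivateLink cluster, `status.loadBalancer.ingress` stays empty, which is why the deployer keeps waiting.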

-------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

ROSA 4.11.36 
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded

This is on the dev addon, where the deployer is updated to ocs-osd-deployer.v2.0.12 plus the ODF 4.11 changes.

How reproducible:
5/5

Steps to Reproduce:
1. Deploy RHODF private link appliance mode provider

Actual results:
Deployment of the provider addon failed

Expected results:
Deployment should succeed

Additional info:
status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Pending



$ oc get pods
NAME                                                              READY   STATUS              RESTARTS   AGE
8507126c9da8e5eb41ac05d78e5d2f645deb81344ed9875c033afd3fd15h6kj   0/1     Completed           0          8h
addon-ocs-provider-dev-catalog-dwd5n                              1/1     Running             0          8h
alertmanager-managed-ocs-alertmanager-0                           2/2     Running             0          8h
csi-addons-controller-manager-c4bc7c984-s65wh                     2/2     Running             0          8h
d038f9a4d3f085fb4ec83447140c7847cd8f4a5321dde382d9dc6c9cd5lsw27   0/1     Completed           0          8h
ocs-metrics-exporter-778fdbf4b4-hvzb6                             1/1     Running             0          8h
ocs-operator-56cb54879b-rhklx                                     1/1     Running             0          8h
ocs-osd-aws-data-gather-5dcf5f8bb-p9wjb                           1/1     Running             0          8h
ocs-osd-controller-manager-755bdbc9bc-ghv9z                       2/3     Running             0          8h
odf-console-5979c56c7f-fftp8                                      1/1     Running             0          8h
odf-operator-controller-manager-5c74658b55-6kn55                  2/2     Running             0          8h
prometheus-managed-ocs-prometheus-0                               3/3     Running             0          8h
prometheus-operator-c74f5f6c9-fmkvw                               1/1     Running             0          8h
rook-ceph-operator-7cdd5b8fc5-q5f2b                               1/1     Running             0          10m
rook-ceph-tools-64fc7d784d-mh7lw                                  0/1     ContainerCreating   0          8h



$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded


Discussion Chat thread: https://chat.google.com/room/AAAASHA9vWs/ETkPa8M2w6c
Cluster's mustgather and other logs are here: 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c4420/sgatfane-c4420_20230420T021443/logs/

Comment 1 Leela Venkaiah Gangavarapu 2023-04-24 08:47:24 UTC
- One of the PrivateLink clusters I looked into has an issue getting a load balancer up for the provider server
- The deployer is waiting for the StorageCluster to be Ready, and it was Pending due to the provider server Service
- Preliminary deduction: this is a bug in how we configure AWS for PrivateLink, and it needs a fix

Comment 2 Itzhak 2023-05-28 13:22:58 UTC
I deployed provider and consumer non-PrivateLink clusters. The provider deployed successfully, including the ocs-provider-qe addon. The consumer deployed successfully, but the ocs-consumer-qe addon failed to install. After editing the consumer addon with the command "rosa edit addon -c ikave-np45-c1 ocs-consumer-qe --storage-provider-endpoint 10.0.13.178:31659" (where 10.0.13.178 is one of the provider worker node IPs), the ocs-osd-deployer changed to a Succeeded state after a few seconds.
So it seems the issue is with the load balancer.

Additional info:

Link to the multicluster Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1713/.

Provider versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.11.5                      OpenShift Container Storage   4.11.5            ocs-operator.v4.11.4                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h47m   Error while reconciling 4.11.40: the cluster operator machine-config is degraded

Rook version:
rook: v4.11.5-0.d4bc197c9a967840c92dc0298fbd340b75a21836
go: go1.17.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)


Consumer versions:

OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded

Cluster version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h43m   Cluster version is 4.11.40

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Comment 6 Rewant 2023-05-29 13:27:53 UTC
Found the issue with the non-PrivateLink deployment ending in an Error state: the operator was at ODF version 4.10.

[jenkins@odf-ms-stage privateLink]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.12                     NooBaa Operator               4.10.12           mcg-operator.v4.10.11                     Succeeded
observability-operator.v0.0.21            Observability Operator        0.0.21            observability-operator.v0.0.20            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.1.0                   OCS OSD Deployer              2.1.0             ocs-osd-deployer.v2.0.13                  Failed
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.500-6152b76   Route Monitor Operator        0.1.500-6152b76   route-monitor-operator.v0.1.498-e33e391   Succeeded
[jenkins@odf-ms-stage privateLink]$ oc get subs
NAME                                                                           PACKAGE                   SOURCE                          CHANNEL
addon-ocs-consumer-qe                                                          ocs-osd-deployer          addon-ocs-consumer-qe-catalog   alpha
mcg-operator-stable-4.10-redhat-operators-openshift-storage                    mcg-operator              redhat-operators                stable-4.10
ocs-operator-stable-4.10-redhat-operators-openshift-storage                    ocs-operator              redhat-operators                stable-4.10
odf-csi-addons-operator-stable-4.10-redhat-operators-openshift-storage         odf-csi-addons-operator   redhat-operators                stable-4.10
odf-operator-stable-4.10-redhat-operators-openshift-storage                    odf-operator              redhat-operators                stable-4.11
ose-prometheus-operator-beta-addon-ocs-consumer-qe-catalog-openshift-storage   ose-prometheus-operator   addon-ocs-consumer-qe-catalog   beta

The subscription was unhealthy with:

  - message: 'constraints not satisfiable: no operators found in channel stable-4.11
      of package odf-operator in the catalog referenced by subscription odf-operator-stable-4.10-redhat-operators-openshift-storage,
      subscription odf-operator-stable-4.10-redhat-operators-openshift-storage exists'

On consumers we create an additional CatalogSource, so we need to update the CatalogSource image to the 4.11 registry.
Opened an MR for this: https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/4117
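
For reference, a consumer-side CatalogSource pinned to a 4.11 index would look roughly like this. The name matches the source shown in the `oc get subs` output above; the image reference is a placeholder, not the actual value from the MR:

```yaml
# Illustrative CatalogSource sketch; the spec.image below is a placeholder
# and must point at the 4.11 index image actually used by the addon.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: addon-ocs-consumer-qe-catalog
  namespace: openshift-storage
spec:
  sourceType: grpc
  image: <4.11-index-image>
  displayName: OCS Consumer Addon Catalog
```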

Comment 7 Rewant 2023-05-31 11:12:52 UTC
We also need to update the egress rules on the consumer to allow access to the load balancer endpoint. Opened a PR for it: https://github.com/red-hat-storage/ocs-osd-deployer/pull/289
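
As a rough illustration only (the actual rule lives in the PR above), such an egress allowance can be expressed as a standard Kubernetes NetworkPolicy; the policy name, pod selector, and CIDR below are all placeholders:

```yaml
# Minimal sketch of an egress allowance toward the provider's load balancer
# endpoint; every concrete value here is a placeholder, not the PR's content.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-provider-loadbalancer-egress   # placeholder name
  namespace: openshift-storage
spec:
  podSelector: {}                            # placeholder: all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/16                    # placeholder: load balancer endpoint range
```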

Comment 8 Elad 2023-06-05 15:44:26 UTC
Moving back to ASSIGNED as testing is still blocked

Comment 10 Rewant 2023-08-30 05:53:47 UTC
As ODF 4.11 is GA'ed, moving this to ON_QA.

Comment 11 suchita 2023-09-12 11:56:53 UTC
Tested and verified on a PrivateLink multi-CIDR setup:

Below are observations from the provider cluster 
----------------------------------------------------
$ rosa list service
SERVICE_ID                   SERVICE          SERVICE_STATE  CLUSTER_NAME
2VHeBrvqByGQhcsIXg1kmKn8oNi  ocs-provider-qe  ready          sgatfane-mpr12

$ ocm list cluster
ID                                NAME                          API URL                                                     OPENSHIFT_VERSION   PRODUCT ID      CLOUD_PROVIDER  REGION ID       STATE        
266qn9knre7ha7j4b05iube0a12hco81  sgatfane-12sc1                https://api.sgatfane-12sc1.p8co.s1.devshift.org:6443        4.12.31             rosa            aws             us-east-2       ready        
266qo2kbg2an40ktf80dnp5ab367ecbg  sgatfane-mpr12                https://api.sgatfane-mpr12.q7cc.s1.devshift.org:6443        4.11.48             rosa            aws             us-east-2       ready        

$ rosa list addon -c sgatfane-mpr12 | grep ready
ocs-provider-qe            Red Hat OpenShift Data Foundation Managed Service Provider (QE)       ready

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.48   True        False         74m     Cluster version is 4.11.48

$ oc get csv -n openshift-storage
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.10                     NooBaa Operator               4.11.10           mcg-operator.v4.11.9                      Succeeded
observability-operator.v0.0.25            Observability Operator        0.0.25            observability-operator.v0.0.25-rc         Succeeded
ocs-operator.v4.11.10                     OpenShift Container Storage   4.11.10           ocs-operator.v4.11.9                      Succeeded
ocs-osd-deployer.v2.1.0                   OCS OSD Deployer              2.1.0             ocs-osd-deployer.v2.0.13                  Succeeded
odf-csi-addons-operator.v4.11.10          CSI Addons                    4.11.10           odf-csi-addons-operator.v4.11.9           Succeeded
odf-operator.v4.11.10                     OpenShift Data Foundation     4.11.10           odf-operator.v4.11.9                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.570-71112a2   Route Monitor Operator        0.1.570-71112a2   route-monitor-operator.v0.1.568-8024e29   Succeeded

$ oc get storageconsumer -n openshift-storage
NAME                                                   AGE
storageconsumer-750c94d0-e592-4f25-b07a-f5f5b3ab193e   30m

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
5e87ba6013d480df457c555e3aafb3a75dc5aa5ebda0a5d2c21a056557tkzs8   0/1     Completed   0          61m
935c0c67a00574d6b59eb0879ec238ae307e60a6d3ca2f26f759e1a4e4lp5gc   0/1     Completed   0          61m
addon-ocs-provider-qe-catalog-scn8q                               1/1     Running     0          62m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          60m
csi-addons-controller-manager-5dbd89bd55-8hn9g                    2/2     Running     0          60m
ocs-metrics-exporter-7668759db-8phqw                              1/1     Running     0          60m
ocs-operator-56f49b5655-d4zw8                                     1/1     Running     0          60m
ocs-osd-aws-data-gather-d6f5cc576-gkjhl                           1/1     Running     0          61m
ocs-osd-controller-manager-5fffb456-cfnkm                         3/3     Running     0          61m
ocs-provider-server-7c4cc59445-wmdq9                              1/1     Running     0          60m
odf-console-676c76b5f6-cb998                                      1/1     Running     0          60m
odf-operator-controller-manager-7d885fbd9-8kkb8                   2/2     Running     0          60m
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          60m
prometheus-operator-c74f5f6c9-bg6pl                               1/1     Running     0          60m
rook-ceph-crashcollector-37dd6761c28e5dc6f65529850349a13a-d7x8x   1/1     Running     0          53m
rook-ceph-crashcollector-5638522c8f83ae7e5e4d8a044e666498-b8kgz   1/1     Running     0          56m
rook-ceph-crashcollector-d3929878ee0abed545a66823e3959720-qqgzl   1/1     Running     0          56m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c89c4d45k4qqg   2/2     Running     0          55m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5d9fc4547bpw2   2/2     Running     0          55m
rook-ceph-mgr-a-556c579f76-vxk2v                                  2/2     Running     0          56m
rook-ceph-mon-a-6dfc5d68c-jbh8v                                   2/2     Running     0          58m
rook-ceph-mon-b-78f6d55cb5-ctrhz                                  2/2     Running     0          55m
rook-ceph-mon-c-d5cd7d4d-lz4nl                                    2/2     Running     0          57m
rook-ceph-operator-65994df86f-5gb48                               1/1     Running     0          60m
rook-ceph-osd-0-8fcc9d547-xptnk                                   2/2     Running     0          55m
rook-ceph-osd-1-6ccfcf8d78-5lvkj                                  2/2     Running     0          56m
rook-ceph-osd-2-5bbfc6969-4t9fd                                   2/2     Running     0          55m
rook-ceph-osd-prepare-default-0-data-0n9jsc-fkk5f                 0/1     Completed   0          56m
rook-ceph-osd-prepare-default-1-data-0lqqfb-5jg67                 0/1     Completed   0          56m
rook-ceph-tools-6d5c885f76-mb2sn                                  1/1     Running     0          60m


$ oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   62m   Ready              2023-09-12T10:50:59Z 


$ oc get managedocs -n openshift-storage -o yaml 
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2023-09-12T10:50:43Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "80724"
    uid: 97cc5d09-39e5-4a73-9599-7a4d8abebe56
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


------------------------------------------------------------


Based on the above observations, marking this BZ as VERIFIED.
Tested during v2.1.0 release qualification.

