Description of problem:
Private link appliance mode provider dev-addon deployment failed with the storage cluster stuck in a Pending state, whereas a non-private link cluster deployment succeeds with the same configuration.

$ oc get csv | grep -v Succeeded
NAME                       DISPLAY            VERSION   REPLACES                   PHASE
ocs-osd-deployer.v2.0.12   OCS OSD Deployer   2.0.12    ocs-osd-deployer.v2.0.11   Failed

----- storage cluster -----
Status:
  Conditions:
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Error while reconciling: Operation cannot be fulfilled on services "ocs-provider-server": the object has been modified; please apply your changes to the latest version and try again
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a42c490ba7aa8732ebc53a90ce33c4cb9cf8e556395cc9598f8808e0b719ebe7
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel8@sha256:f46f471baf226674d9ec79babd33a77633716801e041fbe07890b25d95f29d16
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
  Kms Server Connection:
  Phase:  Progressing
Events:
  Type    Reason   Age   From   Message
  ----    ------   ----  ----   -------
  Normal  Waiting  30m (x9 over 8h)  controller_storagecluster  Waiting for Ingress on service ocs-provider-server

-------------------------------------------------------------------------------
Version-Release number of selected component (if applicable):
ROSA 4.11.36

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded

This change is on the dev addon, where the deployer is updated to ocs-osd-deployer.v2.0.12 + ODF 4.11 changes.

How reproducible:
5/5

Steps to Reproduce:
1. Deploy the RHODF private link appliance mode provider
2.
3.
Actual results:
Deployment of the provider addon failed

Expected results:
Deployment should succeed

Additional info:
status:
  components:
    alertmanager:
      state: Ready
    prometheus:
      state: Ready
    storageCluster:
      state: Pending

$ oc get pods
NAME                                                              READY   STATUS              RESTARTS   AGE
8507126c9da8e5eb41ac05d78e5d2f645deb81344ed9875c033afd3fd15h6kj   0/1     Completed           0          8h
addon-ocs-provider-dev-catalog-dwd5n                              1/1     Running             0          8h
alertmanager-managed-ocs-alertmanager-0                           2/2     Running             0          8h
csi-addons-controller-manager-c4bc7c984-s65wh                     2/2     Running             0          8h
d038f9a4d3f085fb4ec83447140c7847cd8f4a5321dde382d9dc6c9cd5lsw27   0/1     Completed           0          8h
ocs-metrics-exporter-778fdbf4b4-hvzb6                             1/1     Running             0          8h
ocs-operator-56cb54879b-rhklx                                     1/1     Running             0          8h
ocs-osd-aws-data-gather-5dcf5f8bb-p9wjb                           1/1     Running             0          8h
ocs-osd-controller-manager-755bdbc9bc-ghv9z                       2/3     Running             0          8h
odf-console-5979c56c7f-fftp8                                      1/1     Running             0          8h
odf-operator-controller-manager-5c74658b55-6kn55                  2/2     Running             0          8h
prometheus-managed-ocs-prometheus-0                               3/3     Running             0          8h
prometheus-operator-c74f5f6c9-fmkvw                               1/1     Running             0          8h
rook-ceph-operator-7cdd5b8fc5-q5f2b                               1/1     Running             0          10m
rook-ceph-tools-64fc7d784d-mh7lw                                  0/1     ContainerCreating   0          8h

$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.6                      NooBaa Operator               4.11.6            mcg-operator.v4.11.5                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.11.6                      OpenShift Container Storage   4.11.6            ocs-operator.v4.11.5                      Succeeded
ocs-osd-deployer.v2.0.12                  OCS OSD Deployer              2.0.12            ocs-osd-deployer.v2.0.11                  Failed
odf-csi-addons-operator.v4.11.6           CSI Addons                    4.11.6            odf-csi-addons-operator.v4.11.5           Succeeded
odf-operator.v4.11.6                      OpenShift Data Foundation     4.11.6            odf-operator.v4.11.5                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.496-7e66488   Route Monitor Operator        0.1.496-7e66488   route-monitor-operator.v0.1.494-a973226   Succeeded

Discussion chat thread:
https://chat.google.com/room/AAAASHA9vWs/ETkPa8M2w6c Cluster's mustgather and other logs are here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c4420/sgatfane-c4420_20230420T021443/logs/
- One of the private link clusters I looked into has an issue getting a load balancer up for the provider server
- The deployer is waiting for the storagecluster to be Ready, and it was Pending due to the provider server service
- Preliminary deduction: it is a bug in how we configure AWS for private link, and it needs a fix
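A minimal sketch of the check the deployer is effectively blocked on: whether the ocs-provider-server Service has been assigned a load balancer ingress. The real value would come from `oc -n openshift-storage get svc ocs-provider-server -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'`; here `ingress_of` is a hypothetical stand-in that just returns its argument so the logic is self-contained, and the ELB hostname is an invented example.

```shell
#!/bin/sh
# ingress_of is a placeholder for the oc jsonpath query above; in a real run
# it would return the load balancer hostname, or nothing while provisioning
# is stuck (as on the affected private link clusters).
ingress_of() { printf '%s' "$1"; }

check_ingress() {
  if [ -z "$(ingress_of "$1")" ]; then
    echo "pending"   # matches the observed "Waiting for Ingress" event
  else
    echo "ready"
  fi
}

check_ingress ""                                    # prints "pending"
check_ingress "a1b2c3.elb.us-east-2.amazonaws.com"  # prints "ready"
```

On a healthy cluster the jsonpath returns a hostname within a few minutes; on the broken private link setup it stays empty indefinitely, which is why the StorageCluster never leaves Pending.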
I deployed provider and consumer non-private link clusters. The provider deployed successfully, including the ocs-provider-qe addon. The consumer deployed successfully, but the ocs-consumer-qe addon failed to install. After editing the consumer addon with the command "rosa edit addon -c ikave-np45-c1 ocs-consumer-qe --storage-provider-endpoint 10.0.13.178:31659" (where "10.0.13.178" is one of the provider worker node IPs), the ocs-osd-deployer changed to a Succeeded state after a few seconds. So it seems that the issue is with the load balancer.

Additional info:
Link to the multicluster Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1713/

Provider versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.11.5   OpenShift Container Storage   4.11.5   ocs-operator.v4.11.4   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h47m   Error while reconciling 4.11.40: the cluster operator machine-config is degraded

Rook version:
rook: v4.11.5-0.d4bc197c9a967840c92dc0298fbd340b75a21836
go: go1.17.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338

OCS version:
ocs-operator.v4.10.9   OpenShift Container Storage   4.10.9   ocs-operator.v4.10.8   Succeeded

Cluster version:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.40   True        False         4h43m   Cluster version is 4.11.40

Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
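A sketch of how the fallback endpoint used in the workaround above is composed: a provider worker node's internal IP joined with the NodePort of the provider server Service. The two `oc` queries shown in comments are illustrative assumptions; the literal values below are the ones from the comment above.

```shell
#!/bin/sh
# In a real run the two values would be looked up on the provider cluster,
# e.g. (illustrative jsonpath queries, not verified against this cluster):
#   node_ip=$(oc get nodes -l node-role.kubernetes.io/worker \
#     -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
#   node_port=$(oc -n openshift-storage get svc ocs-provider-server \
#     -o jsonpath='{.spec.ports[0].nodePort}')
node_ip="10.0.13.178"   # example worker node IP from the comment above
node_port="31659"       # example NodePort from the comment above

endpoint="${node_ip}:${node_port}"
echo "$endpoint"        # prints 10.0.13.178:31659
# Then apply it to the consumer addon, bypassing the broken load balancer:
#   rosa edit addon -c <consumer-cluster> ocs-consumer-qe \
#     --storage-provider-endpoint "$endpoint"
```

Pointing the consumer directly at a node IP works around the load balancer, which is consistent with the load balancer itself being the faulty piece.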
Found the issue with the non-private link deployment resulting in an Error state: the operator was at ODF version 4.10.

[jenkins@odf-ms-stage privateLink]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.12                     NooBaa Operator               4.10.12           mcg-operator.v4.10.11                     Succeeded
observability-operator.v0.0.21            Observability Operator        0.0.21            observability-operator.v0.0.20            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.1.0                   OCS OSD Deployer              2.1.0             ocs-osd-deployer.v2.0.13                  Failed
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.500-6152b76   Route Monitor Operator        0.1.500-6152b76   route-monitor-operator.v0.1.498-e33e391   Succeeded

[jenkins@odf-ms-stage privateLink]$ oc get subs
NAME                                                                           PACKAGE                   SOURCE                          CHANNEL
addon-ocs-consumer-qe                                                          ocs-osd-deployer          addon-ocs-consumer-qe-catalog   alpha
mcg-operator-stable-4.10-redhat-operators-openshift-storage                    mcg-operator              redhat-operators                stable-4.10
ocs-operator-stable-4.10-redhat-operators-openshift-storage                    ocs-operator              redhat-operators                stable-4.10
odf-csi-addons-operator-stable-4.10-redhat-operators-openshift-storage         odf-csi-addons-operator   redhat-operators                stable-4.10
odf-operator-stable-4.10-redhat-operators-openshift-storage                    odf-operator              redhat-operators                stable-4.11
ose-prometheus-operator-beta-addon-ocs-consumer-qe-catalog-openshift-storage   ose-prometheus-operator   addon-ocs-consumer-qe-catalog   beta

The subscription was unhealthy with:
- message: 'constraints not satisfiable: no operators found in channel stable-4.11
  of package odf-operator in the catalog referenced by subscription
  odf-operator-stable-4.10-redhat-operators-openshift-storage, subscription
  odf-operator-stable-4.10-redhat-operators-openshift-storage exists'

On consumers we create an additional CatalogSource, so we need to update the catalogSource image to the 4.11 registry. Opened a PR for the same: https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/4117
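A small sketch of the triage step that surfaced this root cause: scanning Subscription condition messages for OLM's "constraints not satisfiable" resolution failure. In a real run the message would come from something like `oc -n openshift-storage get subs -o yaml` (grepping the status conditions); here the message seen in this BZ is inlined as sample data so the check is self-contained.

```shell
#!/bin/sh
# Sample condition message copied from the failing subscription above.
sample='constraints not satisfiable: no operators found in channel stable-4.11 of package odf-operator'

# Flag the subscription as unhealthy if OLM reported a resolution failure.
case "$sample" in
  *"constraints not satisfiable"*) status="unhealthy" ;;
  *)                               status="healthy"   ;;
esac

echo "$status"   # prints "unhealthy"
```

The message itself names the mismatch: the subscription asks for channel stable-4.11 while the addon's CatalogSource still serves the 4.10 registry, hence the catalogSource image bump in the linked MR.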
We also need to update the egress rules on the consumer to allow access to the load balancer endpoint. Opened a PR for it: https://github.com/red-hat-storage/ocs-osd-deployer/pull/289
Moving back to ASSIGNED as testing is still blocked
As ODF 4.11 is GA'ed, moving it to ON_QA
Tested and verified on a private link multi-CIDR setup. Below are observations from the provider cluster:
----------------------------------------------------

$ rosa list service
SERVICE_ID                    SERVICE           SERVICE_STATE   CLUSTER_NAME
2VHeBrvqByGQhcsIXg1kmKn8oNi   ocs-provider-qe   ready           sgatfane-mpr12

$ ocm list cluster
ID                                 NAME             API URL                                                OPENSHIFT_VERSION   PRODUCT ID   CLOUD_PROVIDER   REGION ID   STATE
266qn9knre7ha7j4b05iube0a12hco81   sgatfane-12sc1   https://api.sgatfane-12sc1.p8co.s1.devshift.org:6443   4.12.31             rosa         aws              us-east-2   ready
266qo2kbg2an40ktf80dnp5ab367ecbg   sgatfane-mpr12   https://api.sgatfane-mpr12.q7cc.s1.devshift.org:6443   4.11.48             rosa         aws              us-east-2   ready

$ rosa list addon -c sgatfane-mpr12 | grep ready
ocs-provider-qe   Red Hat OpenShift Data Foundation Managed Service Provider (QE)   ready

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.48   True        False         74m     Cluster version is 4.11.48

$ oc get csv -n openshift-storage
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.10                     NooBaa Operator               4.11.10           mcg-operator.v4.11.9                      Succeeded
observability-operator.v0.0.25            Observability Operator        0.0.25            observability-operator.v0.0.25-rc         Succeeded
ocs-operator.v4.11.10                     OpenShift Container Storage   4.11.10           ocs-operator.v4.11.9                      Succeeded
ocs-osd-deployer.v2.1.0                   OCS OSD Deployer              2.1.0             ocs-osd-deployer.v2.0.13                  Succeeded
odf-csi-addons-operator.v4.11.10          CSI Addons                    4.11.10           odf-csi-addons-operator.v4.11.9           Succeeded
odf-operator.v4.11.10                     OpenShift Data Foundation     4.11.10           odf-operator.v4.11.9                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.570-71112a2   Route Monitor Operator        0.1.570-71112a2   route-monitor-operator.v0.1.568-8024e29   Succeeded

$ oc get storageconsumer -n openshift-storage
NAME                                                   AGE
storageconsumer-750c94d0-e592-4f25-b07a-f5f5b3ab193e   30m

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
5e87ba6013d480df457c555e3aafb3a75dc5aa5ebda0a5d2c21a056557tkzs8   0/1     Completed   0          61m
935c0c67a00574d6b59eb0879ec238ae307e60a6d3ca2f26f759e1a4e4lp5gc   0/1     Completed   0          61m
addon-ocs-provider-qe-catalog-scn8q                               1/1     Running     0          62m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          60m
csi-addons-controller-manager-5dbd89bd55-8hn9g                    2/2     Running     0          60m
ocs-metrics-exporter-7668759db-8phqw                              1/1     Running     0          60m
ocs-operator-56f49b5655-d4zw8                                     1/1     Running     0          60m
ocs-osd-aws-data-gather-d6f5cc576-gkjhl                           1/1     Running     0          61m
ocs-osd-controller-manager-5fffb456-cfnkm                         3/3     Running     0          61m
ocs-provider-server-7c4cc59445-wmdq9                              1/1     Running     0          60m
odf-console-676c76b5f6-cb998                                      1/1     Running     0          60m
odf-operator-controller-manager-7d885fbd9-8kkb8                   2/2     Running     0          60m
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          60m
prometheus-operator-c74f5f6c9-bg6pl                               1/1     Running     0          60m
rook-ceph-crashcollector-37dd6761c28e5dc6f65529850349a13a-d7x8x   1/1     Running     0          53m
rook-ceph-crashcollector-5638522c8f83ae7e5e4d8a044e666498-b8kgz   1/1     Running     0          56m
rook-ceph-crashcollector-d3929878ee0abed545a66823e3959720-qqgzl   1/1     Running     0          56m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c89c4d45k4qqg   2/2     Running     0          55m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5d9fc4547bpw2   2/2     Running     0          55m
rook-ceph-mgr-a-556c579f76-vxk2v                                  2/2     Running     0          56m
rook-ceph-mon-a-6dfc5d68c-jbh8v                                   2/2     Running     0          58m
rook-ceph-mon-b-78f6d55cb5-ctrhz                                  2/2     Running     0          55m
rook-ceph-mon-c-d5cd7d4d-lz4nl                                    2/2     Running     0          57m
rook-ceph-operator-65994df86f-5gb48                               1/1     Running     0          60m
rook-ceph-osd-0-8fcc9d547-xptnk                                   2/2     Running     0          55m
rook-ceph-osd-1-6ccfcf8d78-5lvkj                                  2/2     Running     0          56m
rook-ceph-osd-2-5bbfc6969-4t9fd                                   2/2     Running     0          55m
rook-ceph-osd-prepare-default-0-data-0n9jsc-fkk5f                 0/1     Completed   0          56m
rook-ceph-osd-prepare-default-1-data-0lqqfb-5jg67                 0/1     Completed   0          56m
rook-ceph-tools-6d5c885f76-mb2sn                                  1/1     Running     0          60m

$ oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   62m   Ready              2023-09-12T10:50:59Z

$ oc get managedocs -n openshift-storage -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1alpha1
  kind: ManagedOCS
  metadata:
    creationTimestamp: "2023-09-12T10:50:43Z"
    finalizers:
    - managedocs.ocs.openshift.io
    generation: 1
    name: managedocs
    namespace: openshift-storage
    resourceVersion: "80724"
    uid: 97cc5d09-39e5-4a73-9599-7a4d8abebe56
  spec: {}
  status:
    components:
      alertmanager:
        state: Ready
      prometheus:
        state: Ready
      storageCluster:
        state: Ready
    reconcileStrategy: strict
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

------------------------------------------------------------
Based on the above observations, marking this BZ as verified. Tested during v2.1.0 release qualification.