Bug 2188588
| Summary: | [ODFMS] PrivateLink appliance mode provider addon deployment of ODF 4.11 with ROSA 4.11 failed with StorageCluster stuck in Pending state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | suchita <sgatfane> |
| Component: | odf-managed-service | Assignee: | Rewant <resoni> |
| Status: | ASSIGNED | QA Contact: | Neha Berry <nberry> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.11 | CC: | ebenahar, ikave, lgangava, odf-bz-bot, owasserm, resoni, sgatfane |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
- One of the private link clusters I looked into has an issue getting a load balancer up for the provider server.
- The deployer is waiting for the StorageCluster to be Ready, and it was Pending due to the provider server service.
- Preliminary deduction: it is a bug in how we configure AWS for private link and needs a fix.

I deployed provider and consumer non-private-link clusters. The provider deployed successfully, including the ocs-provider-qe addon. The consumer deployed successfully, but the ocs-consumer-qe addon failed to install. After editing the consumer addon with the command "rosa edit addon -c ikave-np45-c1 ocs-consumer-qe --storage-provider-endpoint 10.0.13.178:31659" (where "10.0.13.178" is one of the provider worker node IPs), the ocs-osd-deployer changed to a Succeeded state after a few seconds. So it seems that the issue is with the load balancer (a reachability check for the provider endpoint is sketched further below).

Additional info:
Link to the multicluster Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-odf-multicluster/1713/

Provider versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338
OCS version:
ocs-operator.v4.11.5 OpenShift Container Storage 4.11.5 ocs-operator.v4.11.4 Succeeded
Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.40 True False 4h47m Error while reconciling 4.11.40: the cluster operator machine-config is degraded
Rook version:
rook: v4.11.5-0.d4bc197c9a967840c92dc0298fbd340b75a21836
go: go1.17.12
Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Consumer versions:
OC version:
Client Version: 4.10.24
Server Version: 4.11.40
Kubernetes Version: v1.24.12+ceaf338
OCS version:
ocs-operator.v4.10.9 OpenShift Container Storage 4.10.9 ocs-operator.v4.10.8 Succeeded
Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.40 True False 4h43m Cluster version is 4.11.40
Rook version:
rook: v4.10.9-0.b7b3a0044169fd9364683e2e4e6968361f8f3c08
go: go1.16.12
Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Found out the issue with the non-private-link deployment resulting in an Error state: the operator was at ODF version 4.10:
[jenkins@odf-ms-stage privateLink]$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.10.12 NooBaa Operator 4.10.12 mcg-operator.v4.10.11 Succeeded
observability-operator.v0.0.21 Observability Operator 0.0.21 observability-operator.v0.0.20 Succeeded
ocs-operator.v4.10.9 OpenShift Container Storage 4.10.9 ocs-operator.v4.10.8 Succeeded
ocs-osd-deployer.v2.1.0 OCS OSD Deployer 2.1.0 ocs-osd-deployer.v2.0.13 Failed
odf-csi-addons-operator.v4.10.9 CSI Addons 4.10.9 odf-csi-addons-operator.v4.10.8 Succeeded
odf-operator.v4.10.9 OpenShift Data Foundation 4.10.9 odf-operator.v4.10.8 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.500-6152b76 Route Monitor Operator 0.1.500-6152b76 route-monitor-operator.v0.1.498-e33e391 Succeeded
[jenkins@odf-ms-stage privateLink]$ oc get subs
NAME PACKAGE SOURCE CHANNEL
addon-ocs-consumer-qe ocs-osd-deployer addon-ocs-consumer-qe-catalog alpha
mcg-operator-stable-4.10-redhat-operators-openshift-storage mcg-operator redhat-operators stable-4.10
ocs-operator-stable-4.10-redhat-operators-openshift-storage ocs-operator redhat-operators stable-4.10
odf-csi-addons-operator-stable-4.10-redhat-operators-openshift-storage odf-csi-addons-operator redhat-operators stable-4.10
odf-operator-stable-4.10-redhat-operators-openshift-storage odf-operator redhat-operators stable-4.11
ose-prometheus-operator-beta-addon-ocs-consumer-qe-catalog-openshift-storage ose-prometheus-operator addon-ocs-consumer-qe-catalog beta
The odf-operator subscription was unhealthy with:
- message: 'constraints not satisfiable: no operators found in channel stable-4.11
of package odf-operator in the catalog referenced by subscription odf-operator-stable-4.10-redhat-operators-openshift-storage,
subscription odf-operator-stable-4.10-redhat-operators-openshift-storage exists'
On consumers we create an additional CatalogSource; we need to update the CatalogSource image to the 4.11 registry.
Opened the PR for the same: https://gitlab.cee.redhat.com/service/managed-tenants/-/merge_requests/4117
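To confirm the channel mismatch on a consumer before the catalog image is bumped, the index images and the channels they actually serve can be compared against what the subscription asks for. This is only a sketch: it assumes cluster-admin access and that the failing subscription sits in openshift-storage, as in the output above.
# Index image each CatalogSource currently points at (the consumer addon catalog is expected to still be a 4.10 index)
$ oc get catalogsource -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMAGE:.spec.image
# Channels actually exposed for odf-operator by the catalogs on the cluster
$ oc get packagemanifests -o custom-columns='NAME:.metadata.name,CATALOG:.status.catalogSource,CHANNELS:.status.channels[*].name' | grep odf-operator
# Channel the failing subscription is asking for (stable-4.11 per the resolver error above)
$ oc -n openshift-storage get sub odf-operator-stable-4.10-redhat-operators-openshift-storage -o jsonpath='{.spec.channel}{"\n"}'
If stable-4.11 is missing from the catalog the subscription references, that matches the resolver error above and the MR updating the CatalogSource image to the 4.11 registry is the fix.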
We also need to update the egress rules on the consumer to allow access to the load balancer endpoint; opened the PR for it: https://github.com/red-hat-storage/ocs-osd-deployer/pull/289

Moving back to ASSIGNED as testing is still blocked.
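For the egress side, a crude reachability probe from the consumer toward the provider endpoint can show whether traffic to the load balancer is being blocked before the deployer fix lands. A minimal sketch only: it assumes the endpoint is recorded under spec.externalStorage.storageProviderEndpoint of the consumer StorageCluster, that bash is available in the node debug image, and that <node>, <host>, <port> and <consumer-cluster> are placeholders to fill in.
# Endpoint (host:port) the consumer is configured to use, assuming it is stored under spec.externalStorage
$ oc -n openshift-storage get storagecluster -o jsonpath='{.items[0].spec.externalStorage.storageProviderEndpoint}{"\n"}'
# Crude TCP probe from a consumer node; /dev/tcp is a bash builtin, so no extra tools are needed in the debug image
$ oc debug node/<node> -- timeout 5 bash -c 'exec 3<>/dev/tcp/<host>/<port>' && echo reachable || echo blocked
# Workaround used earlier in this bug (with 10.0.13.178:31659 there): point the addon at a provider worker node instead of the load balancer
$ rosa edit addon -c <consumer-cluster> ocs-consumer-qe --storage-provider-endpoint <provider-worker-ip>:<nodeport>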
Description of problem:
PrivateLink appliance mode provider dev-addon deployment failed with the StorageCluster stuck in a Pending state, whereas a non-private-link cluster deployment succeeds with the same configuration.

$ oc get csv | grep -v Succeeded
NAME DISPLAY VERSION REPLACES PHASE
ocs-osd-deployer.v2.0.12 OCS OSD Deployer 2.0.12 ocs-osd-deployer.v2.0.11 Failed

-----storage cluster-----------------
Status:
  Conditions:
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Error while reconciling: Operation cannot be fulfilled on services "ocs-provider-server": the object has been modified; please apply your changes to the latest version and try again
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-04-20T07:01:11Z
    Last Transition Time:  2023-04-20T07:01:11Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Images:
    Ceph:
      Desired Image:  registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a42c490ba7aa8732ebc53a90ce33c4cb9cf8e556395cc9598f8808e0b719ebe7
    Noobaa Core:
      Desired Image:  registry.redhat.io/odf4/mcg-core-rhel8@sha256:f46f471baf226674d9ec79babd33a77633716801e041fbe07890b25d95f29d16
    Noobaa DB:
      Desired Image:  registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d
  Kms Server Connection:
  Phase:  Progressing
Events:
  Type    Reason   Age               From                       Message
  ----    ------   ----              ----                       -------
  Normal  Waiting  30m (x9 over 8h)  controller_storagecluster  Waiting for Ingress on service ocs-provider-server
-------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):
ROSA 4.11.36

$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.11.6 NooBaa Operator 4.11.6 mcg-operator.v4.11.5 Succeeded
observability-operator.v0.0.20 Observability Operator 0.0.20 observability-operator.v0.0.19 Succeeded
ocs-operator.v4.11.6 OpenShift Container Storage 4.11.6 ocs-operator.v4.11.5 Succeeded
ocs-osd-deployer.v2.0.12 OCS OSD Deployer 2.0.12 ocs-osd-deployer.v2.0.11 Failed
odf-csi-addons-operator.v4.11.6 CSI Addons 4.11.6 odf-csi-addons-operator.v4.11.5 Succeeded
odf-operator.v4.11.6 OpenShift Data Foundation 4.11.6 odf-operator.v4.11.5 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.496-7e66488 Route Monitor Operator 0.1.496-7e66488 route-monitor-operator.v0.1.494-a973226 Succeeded

This change is on the dev addon, where the deployer is updated to ocs-osd-deployer.v2.0.12+ with the ODF 4.11 changes.

How reproducible:
5/5

Steps to Reproduce:
1. Deploy the RHODF private link appliance mode provider addon
Actual results:
Deployment of the provider addon failed.

Expected results:
Deployment should succeed.

Additional info:
status:
  components:
    alertmanager:
      state: Ready
    prometheus:
      state: Ready
    storageCluster:
      state: Pending

$ oc get pods
NAME READY STATUS RESTARTS AGE
8507126c9da8e5eb41ac05d78e5d2f645deb81344ed9875c033afd3fd15h6kj 0/1 Completed 0 8h
addon-ocs-provider-dev-catalog-dwd5n 1/1 Running 0 8h
alertmanager-managed-ocs-alertmanager-0 2/2 Running 0 8h
csi-addons-controller-manager-c4bc7c984-s65wh 2/2 Running 0 8h
d038f9a4d3f085fb4ec83447140c7847cd8f4a5321dde382d9dc6c9cd5lsw27 0/1 Completed 0 8h
ocs-metrics-exporter-778fdbf4b4-hvzb6 1/1 Running 0 8h
ocs-operator-56cb54879b-rhklx 1/1 Running 0 8h
ocs-osd-aws-data-gather-5dcf5f8bb-p9wjb 1/1 Running 0 8h
ocs-osd-controller-manager-755bdbc9bc-ghv9z 2/3 Running 0 8h
odf-console-5979c56c7f-fftp8 1/1 Running 0 8h
odf-operator-controller-manager-5c74658b55-6kn55 2/2 Running 0 8h
prometheus-managed-ocs-prometheus-0 3/3 Running 0 8h
prometheus-operator-c74f5f6c9-fmkvw 1/1 Running 0 8h
rook-ceph-operator-7cdd5b8fc5-q5f2b 1/1 Running 0 10m
rook-ceph-tools-64fc7d784d-mh7lw 0/1 ContainerCreating 0 8h

$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
mcg-operator.v4.11.6 NooBaa Operator 4.11.6 mcg-operator.v4.11.5 Succeeded
observability-operator.v0.0.20 Observability Operator 0.0.20 observability-operator.v0.0.19 Succeeded
ocs-operator.v4.11.6 OpenShift Container Storage 4.11.6 ocs-operator.v4.11.5 Succeeded
ocs-osd-deployer.v2.0.12 OCS OSD Deployer 2.0.12 ocs-osd-deployer.v2.0.11 Failed
odf-csi-addons-operator.v4.11.6 CSI Addons 4.11.6 odf-csi-addons-operator.v4.11.5 Succeeded
odf-operator.v4.11.6 OpenShift Data Foundation 4.11.6 odf-operator.v4.11.5 Succeeded
ose-prometheus-operator.4.10.0 Prometheus Operator 4.10.0 ose-prometheus-operator.4.8.0 Succeeded
route-monitor-operator.v0.1.496-7e66488 Route Monitor Operator 0.1.496-7e66488 route-monitor-operator.v0.1.494-a973226 Succeeded

Discussion chat thread: https://chat.google.com/room/AAAASHA9vWs/ETkPa8M2w6c
Cluster must-gather and other logs are here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-c4420/sgatfane-c4420_20230420T021443/logs/
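Since the StorageCluster is Pending on the "Waiting for Ingress on service ocs-provider-server" event, the first provider-side checks are whether that Service ever received a load-balancer ingress and what the service controller reported. A minimal triage sketch, assuming the Service lives in openshift-storage:
# Service type and whether an EXTERNAL-IP / ingress hostname was ever assigned
$ oc -n openshift-storage get service ocs-provider-server -o wide
$ oc -n openshift-storage get service ocs-provider-server -o jsonpath='{.status.loadBalancer.ingress}{"\n"}'
# Events on the Service usually explain why the cloud load balancer was not provisioned
$ oc -n openshift-storage describe service ocs-provider-server
$ oc -n openshift-storage get events --field-selector involvedObject.kind=Service,involvedObject.name=ocs-provider-server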