+++ This bug was initially created as a clone of Bug #2257982 +++

Description of problem (please be detailed as possible and provide log snippets):

Using the storage system wizard to create a storage system with an external Postgres database, the noobaa-core and noobaa-db pods failed to create. Not all the pods are created and running:

$ oc get pod
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-5dbfb55df9-85wxv                    2/2     Running   0             14h
csi-cephfsplugin-9xnsz                                            2/2     Running   0             14h
csi-cephfsplugin-provisioner-58c69cfb78-7h5tv                     6/6     Running   0             14h
csi-cephfsplugin-provisioner-58c69cfb78-ccnng                     6/6     Running   1 (14h ago)   14h
csi-cephfsplugin-s2ngm                                            2/2     Running   0             14h
csi-cephfsplugin-vbf98                                            2/2     Running   0             14h
csi-rbdplugin-8s279                                               3/3     Running   0             14h
csi-rbdplugin-p9rn5                                               3/3     Running   0             14h
csi-rbdplugin-provisioner-d65774655-d5vtk                         6/6     Running   0             14h
csi-rbdplugin-provisioner-d65774655-lq24w                         6/6     Running   0             14h
csi-rbdplugin-tbr2v                                               3/3     Running   0             14h
noobaa-operator-68b69cd44b-vdszf                                  2/2     Running   0             14h
ocs-operator-859d787c7-vzzgf                                      1/1     Running   0             14h
odf-console-8485dc45db-wpv28                                      1/1     Running   0             14h
odf-operator-controller-manager-64fbbbdc4d-j25c6                  2/2     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-1-7qp5lv872   1/1     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-2-zfcrmjs8s   1/1     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-3-dxr7dhnjj   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-1-7qp5j-7769pzhlc   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-2-zfcrm-bf5689l5n   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-3-dxr7h-6b47b8jvd   1/1     Running   0             14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-fd986dd9sz762   2/2     Running   0             14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-57f86548mc7z7   2/2     Running   0             14h
rook-ceph-mgr-a-56747bb7c5-drh59                                  3/3     Running   0             14h
rook-ceph-mgr-b-59d95b9d88-h4d9h                                  3/3     Running   0             14h
rook-ceph-mon-a-64d7c864cc-wctc5                                  2/2     Running   0             14h
rook-ceph-mon-b-55f5c4696d-xrkqs                                  2/2     Running   0             14h
rook-ceph-mon-c-7dddc877c5-z2857                                  2/2     Running   0             14h
rook-ceph-operator-b8cf888cf-jldx9                                1/1     Running   0             14h
ux-backend-server-695548595d-mjtzc                                2/2     Running   0             14h

Version of all relevant components (if applicable):
ODF 4.15 build 4.15.0-112

$ oc get csv -n openshift-storage
NAME                                         DISPLAY                       VERSION             REPLACES   PHASE
mcg-operator.v4.15.0-112.stable              NooBaa Operator               4.15.0-112.stable              Succeeded
ocs-operator.v4.15.0-112.stable              OpenShift Container Storage   4.15.0-112.stable              Succeeded
odf-csi-addons-operator.v4.15.0-112.stable   CSI Addons                    4.15.0-112.stable              Succeeded
odf-operator.v4.15.0-112.stable              OpenShift Data Foundation     4.15.0-112.stable              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, happy-path testing is failing for epic https://issues.redhat.com/browse/RHSTOR-4749

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Deploy an OCP cluster
2. Install ODF build 4.15.0-112
3. Create a storage system using the storage system wizard
4. Select external postgres and input the database connection info
5. Complete the wizard and check the installation progress

Actual results:
The storage system failed to create; the noobaa pods failed to create.
Expected results:
The storage system and noobaa pods should be created and running without any issue.

Additional info:
Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2257982/
The failing cluster is available for investigation.

--- Additional comment from RHEL Program Management on 2024-01-11 22:51:13 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Utkarsh Srivastava on 2024-01-15 11:14:49 UTC ---

Hi,

I talked about this with Romy, and it seems that the COSI CRDs are probably missing from the cluster, which is stalling the NooBaa operator's progress (so it seems to be unrelated to external postgres). These CRDs are supposed to be installed by ODF. Romy shared the following command to install the CRDs:

`kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api`

Regards,
Utkarsh Srivastava

--- Additional comment from RHEL Program Management on 2024-01-16 11:51:43 UTC ---

This BZ is being approved for the ODF 4.15.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.15.0'.

--- Additional comment from RHEL Program Management on 2024-01-16 11:51:43 UTC ---

Since this bug has been approved for the ODF 4.15.0 release through release flag 'odf-4.15.0+', the Target Release is being set to 'ODF 4.15.0'.

--- Additional comment from Jacky Albo on 2024-01-16 12:52:31 UTC ---

Continuing from comment #2: it doesn't feel related to me, although I'm not sure why the CRDs are not being installed. Tiffany, can you try to install the CRDs on the cluster and see if it helps?

I think the main issue here is that no NooBaa CR was created, probably due to an issue in previous steps in the OCS operator. It looks like there are some Ceph errors, but I'm not sure... I can see this in the events log in the must-gather:

> 14:44:36 (x54) openshift-storage rook-ceph-file-controller ocs-storagecluster-cephfilesystem ReconcileFailed
> failed to reconcile CephFilesystem "openshift-storage/ocs-storagecluster-cephfilesystem". failed to create filesystem "ocs-storagecluster-cephfilesystem": failed to create subvolume group "csi": failed to create subvolume group "ocs-storagecluster-cephfilesystem". . Error ETIMEDOUT: error calling ceph_mount: exit status 110

> 14:51:09 (x70) openshift-storage rook-ceph-block-pool-controller ocs-storagecluster-cephblockpool ReconcileFailed
> failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool". failed to create pool "ocs-storagecluster-cephblockpool".: failed to create pool "ocs-storagecluster-cephblockpool".: failed to initialize pool "ocs-storagecluster-cephblockpool" for RBD use.
> : signal: interrupt

--- Additional comment from Tiffany Nguyen on 2024-01-17 23:00:34 UTC ---

I installed the CRDs manually using the command below:

$ kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api
customresourcedefinition.apiextensions.k8s.io/bucketaccessclasses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketaccesses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketclaims.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketclasses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/buckets.objectstorage.k8s.io created

However, the noobaa-db and noobaa-core pods are not created:

$ oc get pod | grep noobaa
noobaa-operator-798cd44446-hgwpq   2/2   Running   0   46m

--- Additional comment from krishnaram Karthick on 2024-01-22 08:08:53 UTC ---

(In reply to Tiffany Nguyen from comment #6)
> I installed the CRDs manually using the command below:
>
> $ kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api
> customresourcedefinition.apiextensions.k8s.io/bucketaccessclasses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketaccesses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketclaims.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketclasses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/buckets.objectstorage.k8s.io created
>
> However, the noobaa-db and noobaa-core pods are not created:
>
> $ oc get pod | grep noobaa
> noobaa-operator-798cd44446-hgwpq   2/2   Running   0   46m

Jacky, could you please take a look?

--- Additional comment from Jacky Albo on 2024-01-22 10:19:24 UTC ---

Was a NooBaa CR created? As I said earlier, if not, there is an issue with the ODF operator, which is supposed to create the NooBaa CR for the NooBaa operator to start reconciling. From the previous logs it seems Ceph has an issue, and that is probably why NooBaa wasn't started. But we need the ODF operator/Ceph team to take a look. To validate that no NooBaa CR is around, you can run `oc get noobaa`.

--- Additional comment from Nitin Goyal on 2024-01-23 06:27:12 UTC ---

I looked at the cluster and found that the storagecluster was missing the crucial information of `storageDeviceSets` and `multiCloudGateway`. When someone wants to use this feature, the UI should do 2 operations:
1. Create the secret.
2. Pass the secret to the storagecluster.

The UI is creating the secret, but it is not passing it to the storagecluster. I am moving the bug to the console team to take a look. The StorageCluster CR spec allows passing the secret as demonstrated below:

```
spec:
  multiCloudGateway:
    externalPgConfig:
      pgSecretName: noobaa-external-pg
```

--- Additional comment from Vineet on 2024-01-29 11:03:03 UTC ---

There is an issue with how the UI passes the spec values. I am working on the RCA and will send an update soon.

--- Additional comment from errata-xmlrpc on 2024-02-01 11:39:41 UTC ---

This bug has been added to advisory RHBA-2023:118688 by the ceph-build service account (ceph-build.COM).

--- Additional comment from Tiffany Nguyen on 2024-02-05 23:11:34 UTC ---

Verifying the fix using build 4.15.0-130. The storagecluster now gets "externalPgConfig" and "pgSecretName". However, a few more issues are seen when configuring an external postgres; as a result, the storagecluster doesn't deploy correctly and noobaa does not get created.

1. In the secret, "db_url" is incorrect.
   Provided link: postgres://postgres:postgres.99.1.namespace.svc:5432/Tiffany
   Correct link: postgresql://postgres:postgres.99.1:5432/tiffany

2. externalPgSSLRequired is set to "true" in noobaa.yaml even when no SecureSSL database is selected. This is causing the database connection error.
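For reference, the secret named by `pgSecretName` is expected to carry the connection string under the `db_url` key discussed above. A minimal illustrative sketch (the secret name follows comment #9; user, password, host, and database name are placeholders, not values from this cluster):

```
apiVersion: v1
kind: Secret
metadata:
  name: noobaa-external-pg
  namespace: openshift-storage
type: Opaque
stringData:
  db_url: postgresql://<user>:<password>@<host>:5432/<dbname>
```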
Tested with ODF 4.15.0-139; noobaa still does not deploy with the new build. No `externalPgSSLRequired` flag is set, and there is also no 'storageDeviceSets:' section in the storagecluster yaml.

$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                 Succeeded
openshift-storage                      mcg-operator.v4.15.0-139.stable              NooBaa Operator               4.15.0-139.stable              Succeeded
openshift-storage                      ocs-operator.v4.15.0-139.stable              OpenShift Container Storage   4.15.0-139.stable              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-139.stable   CSI Addons                    4.15.0-139.stable              Succeeded
openshift-storage                      odf-operator.v4.15.0-139.stable              OpenShift Data Foundation     4.15.0-139.stable              Succeeded

$ oc get pod
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-555bcf9c9d-n5v8g                    2/2     Running   0             27m
csi-cephfsplugin-dszbm                                            2/2     Running   0             25m
csi-cephfsplugin-provisioner-5b5575d8d5-2j4kx                     6/6     Running   0             25m
csi-cephfsplugin-provisioner-5b5575d8d5-pjdtn                     6/6     Running   0             25m
csi-cephfsplugin-sfmss                                            2/2     Running   0             25m
csi-cephfsplugin-v2q6l                                            2/2     Running   1 (24m ago)   25m
csi-rbdplugin-64k5k                                               3/3     Running   0             25m
csi-rbdplugin-fscs6                                               3/3     Running   0             25m
csi-rbdplugin-k7w7n                                               3/3     Running   1 (24m ago)   25m
csi-rbdplugin-provisioner-df8895f7b-qxmcb                         6/6     Running   0             25m
csi-rbdplugin-provisioner-df8895f7b-sc6f5                         6/6     Running   4 (23m ago)   25m
noobaa-operator-7cccc64c59-mf6nf                                  2/2     Running   0             27m
ocs-operator-5bc895b594-p6mgh                                     1/1     Running   0             27m
odf-console-7c7d845fb-qwc66                                       1/1     Running   0             27m
odf-operator-controller-manager-5ccc94dd7b-skswv                  2/2     Running   0             27m
rook-ceph-crashcollector-compute-0-755b9c4cf4-dqhgf               1/1     Running   0             22m
rook-ceph-crashcollector-compute-1-5698884fdc-f8z7s               1/1     Running   0             22m
rook-ceph-crashcollector-compute-2-6dc4dd7b4-xqhhw                1/1     Running   0             22m
rook-ceph-exporter-compute-0-5f6768cb7b-dk2zz                     1/1     Running   0             22m
rook-ceph-exporter-compute-1-85676cbdd5-dm5xf                     1/1     Running   0             22m
rook-ceph-exporter-compute-2-7cd9d965d-rjhwp                      1/1     Running   0             22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-657ff949lnpns   2/2     Running   0             22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-54f46f58nsl76   2/2     Running   0             22m
rook-ceph-mgr-a-7777dc7c77-7z4tb                                  3/3     Running   0             22m
rook-ceph-mgr-b-5d4685d7f8-p46nt                                  3/3     Running   0             22m
rook-ceph-mon-a-8584b5768-mfw55                                   2/2     Running   0             23m
rook-ceph-mon-b-5b5f99bcbd-r6vrc                                  2/2     Running   0             23m
rook-ceph-mon-c-6d8bfdc8b4-wk8jc                                  2/2     Running   0             22m
rook-ceph-operator-94b6546d-72hrq                                 1/1     Running   0             25m
ux-backend-server-687cddc8b7-ldf72                                2/2     Running   0             27m

$ oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-02-12T23:12:13Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    name: ocs-storagecluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: odf.openshift.io/v1alpha1
      kind: StorageSystem
      name: ocs-storagecluster-storagesystem
      uid: b34322f9-cf0e-4158-b6a1-f500279b5caf
    resourceVersion: "96513"
    uid: 88d089e6-1dde-4f31-bac8-d2748509d02c
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephNonResilientPools:
        count: 1
      cephObjectStoreUsers: {}
      cephObjectStores: {}
      cephRBDMirror:
        daemonCount: 1
      cephToolbox: {}
    mirroring: {}
    multiCloudGateway:
      externalPgConfig:
        pgSecretName: noobaa-external-pg
    resourceProfile: balanced
  status:
    conditions:
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Version check successful
      reason: VersionMatched
      status: "False"
      type: VersionMismatch
    - lastHeartbeatTime: "2024-02-12T23:40:46Z"
      lastTransitionTime: "2024-02-12T23:12:15Z"
      message: 'Error while reconciling: some StorageClasses were skipped while waiting
        for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]'
      reason: ReconcileFailed
      status: "False"
      type: ReconcileComplete
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Available
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: Unknown
      type: Upgradeable
    currentMonCount: 3
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack2
    images:
      ceph:
        actualImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
        desiredImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
      noobaaCore:
        desiredImage: registry.redhat.io/odf4/mcg-core-rhel9@sha256:1d79a2ac176ca6e69c3198d0e35537aaf29373440d214d324d0d433d1473d9a1
      noobaaDB:
        desiredImage: registry.redhat.io/rhel9/postgresql-15@sha256:10e53e191e567248a514a7344c6d78432640aedbc1fa1f7b0364d3b88f8bde2c
    kmsServerConnection: {}
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2
    phase: Progressing
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "96510"
      uid: 191f41bb-f5d5-4a5b-bd95-c780f8089605
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2262974/ocs_must_gather_v212/
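In the dump above, `multiCloudGateway.externalPgConfig.pgSecretName` is populated but `storageDeviceSets` is absent. To spot both fields without reading the whole object, something like the following can be used (illustrative jsonpath queries against the resource names from this cluster; empty output means the field is missing from the spec):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets}'
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.multiCloudGateway.externalPgConfig.pgSecretName}'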
The issue is now fixed in build 4.15.0-142. I can successfully deploy a cluster with external PostgreSQL.
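For verification, the same checks used earlier in this bug apply (resource names as in the comments above); both the NooBaa CR and the noobaa-core/noobaa-db pods should now be present and running:

$ oc get noobaa -n openshift-storage
$ oc get pod -n openshift-storage | grep noobaa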
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383