+++ This bug was initially created as a clone of Bug #2257982 +++

Description of problem (please be detailed as possible and provide log snippets):

Using the storage system wizard to create a storage system with an external Postgres database, the noobaa-core and noobaa-db pods failed to create. Not all the pods are created and running:

$ oc get pod
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-5dbfb55df9-85wxv                    2/2     Running   0             14h
csi-cephfsplugin-9xnsz                                            2/2     Running   0             14h
csi-cephfsplugin-provisioner-58c69cfb78-7h5tv                     6/6     Running   0             14h
csi-cephfsplugin-provisioner-58c69cfb78-ccnng                     6/6     Running   1 (14h ago)   14h
csi-cephfsplugin-s2ngm                                            2/2     Running   0             14h
csi-cephfsplugin-vbf98                                            2/2     Running   0             14h
csi-rbdplugin-8s279                                               3/3     Running   0             14h
csi-rbdplugin-p9rn5                                               3/3     Running   0             14h
csi-rbdplugin-provisioner-d65774655-d5vtk                         6/6     Running   0             14h
csi-rbdplugin-provisioner-d65774655-lq24w                         6/6     Running   0             14h
csi-rbdplugin-tbr2v                                               3/3     Running   0             14h
noobaa-operator-68b69cd44b-vdszf                                  2/2     Running   0             14h
ocs-operator-859d787c7-vzzgf                                      1/1     Running   0             14h
odf-console-8485dc45db-wpv28                                      1/1     Running   0             14h
odf-operator-controller-manager-64fbbbdc4d-j25c6                  2/2     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-1-7qp5lv872   1/1     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-2-zfcrmjs8s   1/1     Running   0             14h
rook-ceph-crashcollector-tunguyen-111p-szz92-worker-3-dxr7dhnjj   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-1-7qp5j-7769pzhlc   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-2-zfcrm-bf5689l5n   1/1     Running   0             14h
rook-ceph-exporter-tunguyen-111p-szz92-worker-3-dxr7h-6b47b8jvd   1/1     Running   0             14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-fd986dd9sz762   2/2     Running   0             14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-57f86548mc7z7   2/2     Running   0             14h
rook-ceph-mgr-a-56747bb7c5-drh59                                  3/3     Running   0             14h
rook-ceph-mgr-b-59d95b9d88-h4d9h                                  3/3     Running   0             14h
rook-ceph-mon-a-64d7c864cc-wctc5                                  2/2     Running   0             14h
rook-ceph-mon-b-55f5c4696d-xrkqs                                  2/2     Running   0             14h
rook-ceph-mon-c-7dddc877c5-z2857                                  2/2     Running   0             14h
rook-ceph-operator-b8cf888cf-jldx9                                1/1     Running   0             14h
ux-backend-server-695548595d-mjtzc                                2/2     Running   0             14h

Version of all relevant components (if applicable):
ODF 4.15 build 4.15.0-112

$ oc get csv -n openshift-storage
NAME                                         DISPLAY                       VERSION             REPLACES   PHASE
mcg-operator.v4.15.0-112.stable              NooBaa Operator               4.15.0-112.stable              Succeeded
ocs-operator.v4.15.0-112.stable              OpenShift Container Storage   4.15.0-112.stable              Succeeded
odf-csi-addons-operator.v4.15.0-112.stable   CSI Addons                    4.15.0-112.stable              Succeeded
odf-operator.v4.15.0-112.stable              OpenShift Data Foundation     4.15.0-112.stable              Succeeded

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, happy-path testing is failing for epic https://issues.redhat.com/browse/RHSTOR-4749

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Deploy an OCP cluster
2. Install ODF build 4.15.0-112
3. Create a storage system using the storage system wizard
4. Select external postgres and input the database connection info
5. Complete the wizard and check the installation progress

Actual results:
The storage system failed to create; the noobaa pods failed to create.
Expected results:
The storage system and noobaa pods should be created and running without any issue.

Additional info:
Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2257982/
The failing cluster is available for investigation.

--- Additional comment from RHEL Program Management on 2024-01-11 22:51:13 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Utkarsh Srivastava on 2024-01-15 11:14:49 UTC ---

Hi,

I talked about this with Romy, and it seems that the COSI CRDs are probably missing from the cluster, which is stalling the NooBaa operator's progress (so it seems to be unrelated to external postgres). These CRDs are supposed to be installed by ODF. Romy shared the following command to install the CRDs:

`kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api`

Regards,
Utkarsh Srivastava

--- Additional comment from RHEL Program Management on 2024-01-16 11:51:43 UTC ---

This BZ is being approved for the ODF 4.15.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.15.0'.

--- Additional comment from RHEL Program Management on 2024-01-16 11:51:43 UTC ---

Since this bug has been approved for the ODF 4.15.0 release through release flag 'odf-4.15.0+', the Target Release is being set to 'ODF 4.15.0'.

--- Additional comment from Jacky Albo on 2024-01-16 12:52:31 UTC ---

Continuing from comment #2: it doesn't feel related to me, although I'm not sure why the CRDs are not being installed. Tiffany, can you try to install the CRDs on the cluster and see if it helps?

I think the main issue here is that no NooBaa CR was created, probably due to an issue in previous steps in the OCS operator. It looks like there are some Ceph errors, but I'm not sure... I can see this in the events log in the must-gather:

> 14:44:36 (x54) openshift-storage rook-ceph-file-controller ocs-storagecluster-cephfilesystem ReconcileFailed
> failed to reconcile CephFilesystem "openshift-storage/ocs-storagecluster-cephfilesystem". failed to create filesystem "ocs-storagecluster-cephfilesystem": failed to create subvolume group "csi": failed to create subvolume group "ocs-storagecluster-cephfilesystem". . Error ETIMEDOUT: error calling ceph_mount: exit status 110

> 14:51:09 (x70) openshift-storage rook-ceph-block-pool-controller ocs-storagecluster-cephblockpool ReconcileFailed
> failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool". failed to create pool "ocs-storagecluster-cephblockpool".: failed to create pool "ocs-storagecluster-cephblockpool".: failed to initialize pool "ocs-storagecluster-cephblockpool" for RBD use.
> : signal: interrupt

--- Additional comment from Tiffany Nguyen on 2024-01-17 23:00:34 UTC ---

I installed the CRDs manually using the command below:

$ kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api
customresourcedefinition.apiextensions.k8s.io/bucketaccessclasses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketaccesses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketclaims.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/bucketclasses.objectstorage.k8s.io created
customresourcedefinition.apiextensions.k8s.io/buckets.objectstorage.k8s.io created

However, the noobaa-db and noobaa-core pods are not created:

$ oc get pod | grep noobaa
noobaa-operator-798cd44446-hgwpq   2/2   Running   0   46m

--- Additional comment from krishnaram Karthick on 2024-01-22 08:08:53 UTC ---

(In reply to Tiffany Nguyen from comment #6)
> I installed the CRDs manually using the command below:
>
> $ kubectl create -k github.com/kubernetes-sigs/container-object-storage-interface-api
> customresourcedefinition.apiextensions.k8s.io/bucketaccessclasses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketaccesses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketclaims.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/bucketclasses.objectstorage.k8s.io created
> customresourcedefinition.apiextensions.k8s.io/buckets.objectstorage.k8s.io created
>
> However, the noobaa-db and noobaa-core pods are not created:
>
> $ oc get pod | grep noobaa
> noobaa-operator-798cd44446-hgwpq   2/2   Running   0   46m

Jacky, could you please take a look?

--- Additional comment from Jacky Albo on 2024-01-22 10:19:24 UTC ---

Was a NooBaa CR created? As I said earlier, if not, there is an issue with the ODF operator, which is supposed to create the NooBaa CR for the NooBaa operator to start reconciling. From the previous logs it seems Ceph has an issue, and that is probably why NooBaa wasn't started. But we need the ODF operator/Ceph team to take a look. To validate that no NooBaa CR is around, you can run `oc get noobaa`.

--- Additional comment from Nitin Goyal on 2024-01-23 06:27:12 UTC ---

I looked at the cluster and found that the storagecluster was missing the crucial information of `storageDeviceSets` and `multiCloudGateway`. When someone wants to use this feature, the UI should do 2 operations:
1. Create the secret.
2. Pass the secret to the storagecluster.

The UI is creating the secret, but it is not passing it to the storagecluster. I am moving the bug to the console team to take a look. The StorageCluster CR spec allows passing the secret as demonstrated below:

```
spec:
  multiCloudGateway:
    externalPgConfig:
      pgSecretName: noobaa-external-pg
```

--- Additional comment from Vineet on 2024-01-29 11:03:03 UTC ---

There is an issue with how the UI passes the spec values. I am working on the RCA and will send an update soon.

--- Additional comment from errata-xmlrpc on 2024-02-01 11:39:41 UTC ---

This bug has been added to advisory RHBA-2023:118688 by the ceph-build service account (ceph-build.COM).

--- Additional comment from Tiffany Nguyen on 2024-02-05 23:11:34 UTC ---

Verifying the fix using build 4.15.0-130. The storagecluster now gets "externalPgConfig" and "pgSecretName". However, a few more issues are seen when configuring an external postgres; as a result, the storagecluster doesn't deploy correctly and noobaa does not get created.

1. In the secret, "db_url" is incorrect.
   Provided link: postgres://postgres:postgres.99.1.namespace.svc:5432/Tiffany
   Correct link: postgresql://postgres:postgres.99.1:5432/tiffany

2. externalPgSSLRequired is set to "true" in noobaa.yaml even when no SecureSSL database is selected. This is causing the database connection error.
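For reference, the secret named by `pgSecretName` is expected to carry the connection string under the `db_url` key discussed above. A minimal illustrative sketch (the secret name follows comment #9; user, password, host, and database name are placeholders, not values from this cluster):

```
apiVersion: v1
kind: Secret
metadata:
  name: noobaa-external-pg
  namespace: openshift-storage
type: Opaque
stringData:
  db_url: postgresql://<user>:<password>@<host>:5432/<dbname>
```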
Tested with ODF 4.15.0-139; noobaa still does not deploy with the new build. No `externalPgSSLRequired` flag is set, and there is also no 'storageDeviceSets:' section in the storagecluster yaml.

$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION             REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.0.1-snapshot                 Succeeded
openshift-storage                      mcg-operator.v4.15.0-139.stable              NooBaa Operator               4.15.0-139.stable              Succeeded
openshift-storage                      ocs-operator.v4.15.0-139.stable              OpenShift Container Storage   4.15.0-139.stable              Succeeded
openshift-storage                      odf-csi-addons-operator.v4.15.0-139.stable   CSI Addons                    4.15.0-139.stable              Succeeded
openshift-storage                      odf-operator.v4.15.0-139.stable              OpenShift Data Foundation     4.15.0-139.stable              Succeeded

$ oc get pod
NAME                                                              READY   STATUS    RESTARTS      AGE
csi-addons-controller-manager-555bcf9c9d-n5v8g                    2/2     Running   0             27m
csi-cephfsplugin-dszbm                                            2/2     Running   0             25m
csi-cephfsplugin-provisioner-5b5575d8d5-2j4kx                     6/6     Running   0             25m
csi-cephfsplugin-provisioner-5b5575d8d5-pjdtn                     6/6     Running   0             25m
csi-cephfsplugin-sfmss                                            2/2     Running   0             25m
csi-cephfsplugin-v2q6l                                            2/2     Running   1 (24m ago)   25m
csi-rbdplugin-64k5k                                               3/3     Running   0             25m
csi-rbdplugin-fscs6                                               3/3     Running   0             25m
csi-rbdplugin-k7w7n                                               3/3     Running   1 (24m ago)   25m
csi-rbdplugin-provisioner-df8895f7b-qxmcb                         6/6     Running   0             25m
csi-rbdplugin-provisioner-df8895f7b-sc6f5                         6/6     Running   4 (23m ago)   25m
noobaa-operator-7cccc64c59-mf6nf                                  2/2     Running   0             27m
ocs-operator-5bc895b594-p6mgh                                     1/1     Running   0             27m
odf-console-7c7d845fb-qwc66                                       1/1     Running   0             27m
odf-operator-controller-manager-5ccc94dd7b-skswv                  2/2     Running   0             27m
rook-ceph-crashcollector-compute-0-755b9c4cf4-dqhgf               1/1     Running   0             22m
rook-ceph-crashcollector-compute-1-5698884fdc-f8z7s               1/1     Running   0             22m
rook-ceph-crashcollector-compute-2-6dc4dd7b4-xqhhw                1/1     Running   0             22m
rook-ceph-exporter-compute-0-5f6768cb7b-dk2zz                     1/1     Running   0             22m
rook-ceph-exporter-compute-1-85676cbdd5-dm5xf                     1/1     Running   0             22m
rook-ceph-exporter-compute-2-7cd9d965d-rjhwp                      1/1     Running   0             22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-657ff949lnpns   2/2     Running   0             22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-54f46f58nsl76   2/2     Running   0             22m
rook-ceph-mgr-a-7777dc7c77-7z4tb                                  3/3     Running   0             22m
rook-ceph-mgr-b-5d4685d7f8-p46nt                                  3/3     Running   0             22m
rook-ceph-mon-a-8584b5768-mfw55                                   2/2     Running   0             23m
rook-ceph-mon-b-5b5f99bcbd-r6vrc                                  2/2     Running   0             23m
rook-ceph-mon-c-6d8bfdc8b4-wk8jc                                  2/2     Running   0             22m
rook-ceph-operator-94b6546d-72hrq                                 1/1     Running   0             25m
ux-backend-server-687cddc8b7-ldf72                                2/2     Running   0             27m

$ oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-02-12T23:12:13Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    name: ocs-storagecluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: odf.openshift.io/v1alpha1
      kind: StorageSystem
      name: ocs-storagecluster-storagesystem
      uid: b34322f9-cf0e-4158-b6a1-f500279b5caf
    resourceVersion: "96513"
    uid: 88d089e6-1dde-4f31-bac8-d2748509d02c
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephNonResilientPools:
        count: 1
      cephObjectStoreUsers: {}
      cephObjectStores: {}
      cephRBDMirror:
        daemonCount: 1
      cephToolbox: {}
    mirroring: {}
    multiCloudGateway:
      externalPgConfig:
        pgSecretName: noobaa-external-pg
    resourceProfile: balanced
  status:
    conditions:
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Version check successful
      reason: VersionMatched
      status: "False"
      type: VersionMismatch
    - lastHeartbeatTime: "2024-02-12T23:40:46Z"
      lastTransitionTime: "2024-02-12T23:12:15Z"
      message: 'Error while reconciling: some StorageClasses were skipped while waiting
        for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]'
      reason: ReconcileFailed
      status: "False"
      type: ReconcileComplete
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Available
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2024-02-12T23:12:14Z"
      lastTransitionTime: "2024-02-12T23:12:14Z"
      message: Initializing StorageCluster
      reason: Init
      status: Unknown
      type: Upgradeable
    currentMonCount: 3
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack2
    images:
      ceph:
        actualImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
        desiredImage: registry.redhat.io/rhceph/rhceph-6-rhel9@sha256:9dbd051cfcdb334aad33a536cc115ae1954edaea5f8cb5943ad615f1b41b0226
      noobaaCore:
        desiredImage: registry.redhat.io/odf4/mcg-core-rhel9@sha256:1d79a2ac176ca6e69c3198d0e35537aaf29373440d214d324d0d433d1473d9a1
      noobaaDB:
        desiredImage: registry.redhat.io/rhel9/postgresql-15@sha256:10e53e191e567248a514a7344c6d78432640aedbc1fa1f7b0364d3b88f8bde2c
    kmsServerConnection: {}
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack2
    phase: Progressing
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "96510"
      uid: 191f41bb-f5d5-4a5b-bd95-c780f8089605
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2262974/ocs_must_gather_v212/
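In the dump above, `multiCloudGateway.externalPgConfig.pgSecretName` is populated but `storageDeviceSets` is absent. To spot both fields without reading the whole object, something like the following can be used (illustrative jsonpath queries against the resource names from this cluster; empty output means the field is missing from the spec):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets}'
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.multiCloudGateway.externalPgConfig.pgSecretName}'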
The issue is now fixed in build 4.15.0-142. I can successfully deploy a cluster with external PostgreSQL.
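For verification, the same checks used earlier in this bug apply (resource names as in the comments above); both the NooBaa CR and the noobaa-core/noobaa-db pods should now be present and running:

$ oc get noobaa -n openshift-storage
$ oc get pod -n openshift-storage | grep noobaa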
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383