Bug 2051249

Summary:	[GSS]noobaa-db-pg-0 Pod stuck CrashLoopBackOff state
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	khover
Component:	Multi-Cloud Object Gateway	Assignee:	Alexander Indenbaum <aindenba>
Status:	CLOSED ERRATA	QA Contact:	Ben Eli <belimele>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.8	CC:	aindenba, belimele, etamir, mhackett, mmuench, muagarwa, nbecker, ocs-bugs, odf-bz-bot, tdesala
Target Milestone:	---
Target Release:	ODF 4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.10.0-168	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-04-13 18:52:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description khover 2022-02-06 18:29:17 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

After rebuilding ODF and NooBaa, the noobaa-db-pg-0 pod is stuck in CrashLoopBackOff (CLBO) state.

grep -A6 'conditions' noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml

  conditions:
  - lastHeartbeatTime: "2022-02-04T19:22:35Z"
    lastTransitionTime: "2022-02-04T19:22:35Z"
    message: 'RPC: connection (0xc0016d0000) already closed &{RPC:0xc000502a50 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
    reason: TemporaryError
    status: "False"
    type: Available



 omg get pods | grep -iv running
NAME                                                             READY  STATUS     RESTARTS  AGE
noobaa-db-pg-0                                                   0/1    Pending    0         1h44m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-0zs6mfhwlmm  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-12p4t7qvxp6  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-2q4mnmwg254  0/1    Succeeded  0         2h47m


namespaces/openshift-storage/pods/noobaa-db-pg-0/noobaa-db-pg-0.yaml

    state:
      waiting:
        message: back-off 5m0s restarting failed container=initialize-database pod=noobaa-db-pg-0_openshift-storage(546f8321-e79b-4a81-b5d7-e69fa8e45b61)
        reason: CrashLoopBackOff
  phase: Pending


logs noobaa-operator-5746b8bf88-jqnjk

2022-02-04T22:08:10.611587696Z time="2022-02-04T22:08:10Z" level=info msg="Update event detected for ocs-storagecluster-cephcluster (openshift-storage), queuing Reconcile"
2022-02-04T22:08:10.632800121Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists:  \"ocs-storagecluster-cephcluster\"\n"
2022-02-04T22:08:10.643587507Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
2022-02-04T22:08:10.646713150Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: Service \"noobaa-mgmt\"\n"
2022-02-04T22:08:10.649641242Z time="2022-02-04T22:08:10Z" level=info msg="❌ Not Found: Secret \"noobaa-operator\"\n"
2022-02-04T22:08:10.649665470Z time="2022-02-04T22:08:10Z" level=error msg="Could not connect to system Connect(): SecretOp not found"


openshift-storage		noobaa-db-pg-0	
FailedScheduling

0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
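
The checks above can be gathered in one pass. The following is a minimal sketch, not something that was run for this report: db_pod_triage and the OC, NS and POD variables are hypothetical names, while the pod, container and namespace values are taken from the output above. Setting OC=echo gives a dry run that only prints the arguments that would be passed to oc.

```shell
#!/bin/sh
# Hypothetical triage helper for the noobaa-db-pg-0 pod (illustrative only).
# Defaults come from the must-gather output quoted in this report.
NS="${NS:-openshift-storage}"
POD="${POD:-noobaa-db-pg-0}"

db_pod_triage() {
    oc_bin="${OC:-oc}"   # override with OC=echo for a dry run

    # Logs of the init container named in the CrashLoopBackOff message
    "$oc_bin" logs "$POD" -c initialize-database -n "$NS"
    # Logs of the previous (crashed) attempt of the same container
    "$oc_bin" logs "$POD" -c initialize-database -n "$NS" --previous
    # Pod events, which is where FailedScheduling would show up
    "$oc_bin" describe pod "$POD" -n "$NS"
    # Binding state of the PVCs in the namespace
    "$oc_bin" get pvc -n "$NS"
}

# Usage (against a live cluster): db_pod_triage
# Dry run:                        OC=echo db_pod_triage
```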









Version of all relevant components (if applicable):


ocs-operator.v4.8.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Blocking ODF deployment

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 3 khover 2022-02-07 15:32:38 UTC

Had the customer run the following:

$ oc rsh noobaa-operator-5746b8bf88-jqnjk
sh-4.4$ curl -kv https://noobaa-mgmt.openshift-storage.svc:443
* Rebuilt URL to: https://noobaa-mgmt.openshift-storage.svc:443/
*   Trying 172.30.88.33...
* TCP_NODELAY set
* Connected to noobaa-mgmt.openshift-storage.svc (172.30.88.33) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=noobaa-mgmt.openshift-storage.svc
*  start date: Feb  4 19:22:39 2022 GMT
*  expire date: Feb  4 19:22:40 2024 GMT
*  issuer: CN=openshift-service-serving-signer@1638811499
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET / HTTP/1.1
> Host: noobaa-mgmt.openshift-storage.svc
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Location: /fe/
< Vary: Accept, Accept-Encoding
< Content-Type: text/plain; charset=utf-8
< Content-Length: 26
< Date: Mon, 07 Feb 2022 15:25:49 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< 
* Connection #0 to host noobaa-mgmt.openshift-storage.svc left intact
Found. Redirecting to /fe/
sh-4.4$

Comment 6 khover 2022-02-08 14:11:14 UTC

Hello,

Are there any updates or a workaround we can share with the customer?

The customer is losing patience and faith in the product.

Comment 10 khover 2022-02-09 21:44:11 UTC

Hello, 

So if the workaround includes disabling huge pages, what are the steps for that?

Comment 11 Alexander Indenbaum 2022-02-10 07:13:26 UTC

Hello

According to the ODF must-gather, the Postgres container image is
registry.redhat.io/rhel8/postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100
i.e. this image uses RHEL8 as a base:
                     ^^^^^

> bash-4.4$ cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.5 (Ootpa)

Previously Postgres container images used RHEL7 as a base.
                                          ^^^^^
For instance, the upstream uses:

> bash-4.2$ cat /etc/redhat-release
> CentOS Linux release 7.8.2003 (Core)

Is this change of the Postgres container image's base OS expected?
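
To check which image a given cluster is running, something along the following lines could be used. This is only a sketch: pg_image and image_base_hint are hypothetical helper names, and guessing the base OS from the repository path is a heuristic (it mirrors the "rhel8" in the image reference above), not an authoritative check. Reading /etc/redhat-release inside the container, as shown above, remains the reliable way.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative only).

# Print the image reference(s) from the DB pod spec.
# OC, POD and NS are assumed override variables; OC=echo gives a dry run.
pg_image() {
    "${OC:-oc}" get pod "${POD:-noobaa-db-pg-0}" -n "${NS:-openshift-storage}" \
        -o jsonpath='{.spec.containers[*].image}'
}

# Heuristic: guess the base OS from the image reference string alone.
image_base_hint() {
    case "$1" in
        */rhel8/*) echo rhel8 ;;
        *rhel7*)   echo rhel7 ;;
        *)         echo unknown ;;
    esac
}

# Usage (against a live cluster): image_base_hint "$(pg_image)"
```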

@khover the posted workaround PR would support the RHEL8 filesystem layout and run Postgres with huge pages disabled.

Best regards!

Comment 12 khover 2022-02-10 11:38:18 UTC

My customer already has hugepages enabled in the cluster.

As per my understanding so far, the workaround is to uninstall OCS, disable hugepages, and reinstall OCS.

How do I disable huge pages temporarily during OCS installation?


The customer temperature is high, so I just want to be sure that what we try next succeeds.

Comment 13 Alexander Indenbaum 2022-02-10 12:08:47 UTC

Comment 14 Alexander Indenbaum 2022-02-10 12:26:37 UTC

@khover any input about why a RHEL8-based Postgres container is used?

As a workaround see https://github.com/noobaa/noobaa-operator/pull/853/files, it is a tiny change.

To get through DB initialization, run "oc edit cm noobaa-postgres-initdb-sh" and add the if block below, marked by plus signs. This procedure does not require reinstalling OCS.
 

          # Wrap the postgres binary, force huge_pages=off for initdb
          # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792
          p=/opt/rh/rh-postgresql12/root/usr/bin/postgres
+
+          # Latest RH images moved the postgres binary
+          # from /opt/rh/rh-postgresql12/root/usr/bin/postgres to /usr/bin/postgres
+          # see https://bugzilla.redhat.com/show_bug.cgi?id=2051249
+          if [ ! -x $p ]; then
+            p=/usr/bin/postgres
+          fi
+
          mv $p $p.orig
          echo exec $p.orig \"\$@\" -c huge_pages=off > $p
          chmod 755 $p
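
For reference, the same idea can be written as a standalone function that is easy to try outside the cluster. This is a sketch under stated assumptions, not the content of the PR: wrap_postgres is a hypothetical name, and the quoting, the "#!/bin/sh" line in the generated wrapper, and the "already wrapped" guard are additions that the snippet above does not have. The two candidate paths in the usage comment are the ones named above.

```shell
#!/bin/sh
# Hypothetical standalone variant of the initdb wrapper step (illustrative only).
# wrap_postgres CANDIDATE...
# Wraps the first executable candidate so it always runs with huge_pages=off.
wrap_postgres() {
    p=""
    for candidate in "$@"; do
        if [ -x "$candidate" ]; then
            p="$candidate"
            break
        fi
    done
    if [ -z "$p" ]; then
        echo "no executable postgres binary found in: $*" >&2
        return 1
    fi

    # Addition: do nothing if a previous run already wrapped this binary
    if [ -e "$p.orig" ]; then
        echo "$p is already wrapped" >&2
        return 0
    fi

    mv "$p" "$p.orig"
    # The wrapper forwards all arguments and appends the huge_pages override
    printf '#!/bin/sh\nexec "%s" "$@" -c huge_pages=off\n' "$p.orig" > "$p"
    chmod 755 "$p"
}

# Usage, RHEL7 layout first, then the RHEL8 layout:
# wrap_postgres /opt/rh/rh-postgresql12/root/usr/bin/postgres /usr/bin/postgres
```

Passing the candidates as arguments keeps the lookup order explicit and lets the function be exercised against a scratch directory before touching a real image.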

Alternatively:
- you could disable huge pages during OCS/NooBaa installation and then re-enable them
- use a RHEL7-based Postgres container

Let me know if you need any additional help.

Best regards!

Comment 15 khover 2022-02-10 14:27:07 UTC

Hi Alex,

Thanks for all your help on this.

Re:

@khover any input about why RHEL8 based Postgres container is used?

I honestly don't know; this is an install of OCS ocs-operator.v4.8.6.

If there is some way to check, or if any info is needed, I'd be happy to help you collect it.

Comment 26 errata-xmlrpc 2022-04-13 18:52:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

Comment 27 Red Hat Bugzilla 2023-12-08 04:27:37 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days