Bug 2051249 - [GSS] noobaa-db-pg-0 Pod stuck in CrashLoopBackOff state
Summary: [GSS] noobaa-db-pg-0 Pod stuck in CrashLoopBackOff state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Alexander Indenbaum
QA Contact: Ben Eli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-06 18:29 UTC by khover
Modified: 2023-12-08 04:27 UTC
CC: 10 users

Fixed In Version: 4.10.0-168
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-13 18:52:48 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-operator pull 853 0 None Merged Fix noobaa-db-pg-0 Pod stuck CrashLoopBackOff state 2022-02-23 10:03:12 UTC
Github noobaa noobaa-operator pull 856 0 None Merged Backport to 5.10 2022-02-23 10:03:56 UTC
Github noobaa noobaa-operator pull 856/commits 0 None None None 2022-02-23 10:01:38 UTC
Github red-hat-storage ocs-ci pull 4461 0 None Merged Add function to enable huge pages 2022-06-20 12:56:09 UTC
Red Hat Issue Tracker INSIGHTOCP-589 0 None None None 2022-02-11 17:10:24 UTC
Red Hat Issue Tracker INSIGHTOCP-590 0 None None None 2022-02-11 17:10:53 UTC
Red Hat Product Errata RHSA-2022:1372 0 None None None 2022-04-13 18:52:58 UTC

Description khover 2022-02-06 18:29:17 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

After a rebuild of ODF and NooBaa, noobaa-db-pg-0 is stuck in a CrashLoopBackOff (CLBO) state.

grep -A6 'conditions' noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml

  conditions:
  - lastHeartbeatTime: "2022-02-04T19:22:35Z"
    lastTransitionTime: "2022-02-04T19:22:35Z"
    message: 'RPC: connection (0xc0016d0000) already closed &{RPC:0xc000502a50 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:closed WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}'
    reason: TemporaryError
    status: "False"
    type: Available
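
On a live cluster the same condition can be read without a must-gather. A minimal sketch, assuming the default CR name "noobaa" in the openshift-storage namespace:

  # Print the Available condition of the NooBaa custom resource
  oc get noobaa noobaa -n openshift-storage \
    -o jsonpath='{.status.conditions[?(@.type=="Available")]}'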



 omg get pods | grep -iv running
NAME                                                             READY  STATUS     RESTARTS  AGE
noobaa-db-pg-0                                                   0/1    Pending    0         1h44m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-0zs6mfhwlmm  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-12p4t7qvxp6  0/1    Succeeded  0         2h47m
rook-ceph-osd-prepare-ocs-deviceset-localocs-0-data-2q4mnmwg254  0/1    Succeeded  0         2h47m


namespaces/openshift-storage/pods/noobaa-db-pg-0/noobaa-db-pg-0.yaml

    state:
      waiting:
        message: back-off 5m0s restarting failed container=initialize-database pod=noobaa-db-pg-0_openshift-storage(546f8321-e79b-4a81-b5d7-e69fa8e45b61)
        reason: CrashLoopBackOff
  phase: Pending


logs noobaa-operator-5746b8bf88-jqnjk

2022-02-04T22:08:10.611587696Z time="2022-02-04T22:08:10Z" level=info msg="Update event detected for ocs-storagecluster-cephcluster (openshift-storage), queuing Reconcile"
2022-02-04T22:08:10.632800121Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists:  \"ocs-storagecluster-cephcluster\"\n"
2022-02-04T22:08:10.643587507Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
2022-02-04T22:08:10.646713150Z time="2022-02-04T22:08:10Z" level=info msg="✅ Exists: Service \"noobaa-mgmt\"\n"
2022-02-04T22:08:10.649641242Z time="2022-02-04T22:08:10Z" level=info msg="❌ Not Found: Secret \"noobaa-operator\"\n"
2022-02-04T22:08:10.649665470Z time="2022-02-04T22:08:10Z" level=error msg="Could not connect to system Connect(): SecretOp not found"
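
The missing secret named in the last log line can be checked directly (a quick sketch, assuming the default openshift-storage namespace):

  # "NotFound" here matches the operator error above
  oc get secret noobaa-operator -n openshift-storage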


Event for pod noobaa-db-pg-0 (namespace openshift-storage):

FailedScheduling: 0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
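
The unbound-PVC message points at the DB PVC and its storage class. A diagnostic sketch; "db-noobaa-db-pg-0" is the usual PVC name for this pod, adjust if yours differs:

  # Is the DB PVC Pending or Bound, and which storage class does it request?
  oc get pvc -n openshift-storage
  oc describe pvc db-noobaa-db-pg-0 -n openshift-storage
  # The Events section of the describe output usually names the provisioning problem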

Version of all relevant components (if applicable):


ocs-operator.v4.8.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Blocking ODF deployment

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 3 khover 2022-02-07 15:32:38 UTC
Had the customer run the following:

$ oc rsh noobaa-operator-5746b8bf88-jqnjk
sh-4.4$ 
sh-4.4$ curl -kv https://noobaa-mgmt.openshift-storage.svc:443
* Rebuilt URL to: https://noobaa-mgmt.openshift-storage.svc:443/
*   Trying 172.30.88.33...
* TCP_NODELAY set
* Connected to noobaa-mgmt.openshift-storage.svc (172.30.88.33) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=noobaa-mgmt.openshift-storage.svc
*  start date: Feb  4 19:22:39 2022 GMT
*  expire date: Feb  4 19:22:40 2024 GMT
*  issuer: CN=openshift-service-serving-signer@1638811499
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET / HTTP/1.1
> Host: noobaa-mgmt.openshift-storage.svc
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< Location: /fe/
< Vary: Accept, Accept-Encoding
< Content-Type: text/plain; charset=utf-8
< Content-Length: 26
< Date: Mon, 07 Feb 2022 15:25:49 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< 
* Connection #0 to host noobaa-mgmt.openshift-storage.svc left intact
Found. Redirecting to /fe/sh-4.4$

Comment 6 khover 2022-02-08 14:11:14 UTC
Hello,

Any updates or a workaround we can share with the customer?

The customer is losing patience and faith in the product.

Comment 10 khover 2022-02-09 21:44:11 UTC
Hello, 

So if the workaround includes disabling huge pages, what are the steps for that?

Comment 11 Alexander Indenbaum 2022-02-10 07:13:26 UTC
Hello

According to the ODF must-gather, the Postgres container image is
registry.redhat.io/rhel8/postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100
i.e. this image uses RHEL8 as a base:
                     ^^^^^

> bash-4.4$ cat /etc/redhat-release
> Red Hat Enterprise Linux release 8.5 (Ootpa)

Previously Postgres container images used RHEL7 as a base.
                                          ^^^^^
For instance, the upstream uses:

> bash-4.2$ cat /etc/redhat-release
> CentOS Linux release 7.8.2003 (Core)

Is the Postgres container image base OS change expected?
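
One way to verify the base OS and binary path of a given image without a running pod (a sketch, assuming podman and pull access to registry.redhat.io; the digest is copied from the must-gather above):

  IMG=registry.redhat.io/rhel8/postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100
  # Base OS of the image
  podman run --rm --entrypoint cat "$IMG" /etc/redhat-release
  # Where the postgres binary lives in this layout
  podman run --rm --entrypoint ls "$IMG" -l /usr/bin/postgres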

@khover the posted workaround PR would support the RHEL8 filesystem layout and run Postgres with huge pages disabled.

Best regards!

Comment 12 khover 2022-02-10 11:38:18 UTC
My customer already has huge pages enabled in the cluster.

As per my understanding so far, the workaround is to uninstall OCS, disable huge pages, and reinstall OCS.

How do I disable huge pages temporarily during OCS installation?


The customer's temper is running high, so I just want to be sure that what we try next succeeds.

Comment 13 Alexander Indenbaum 2022-02-10 12:08:47 UTC
@

Comment 14 Alexander Indenbaum 2022-02-10 12:26:37 UTC
@khover any input on why a RHEL8-based Postgres container is used?

As a workaround, see https://github.com/noobaa/noobaa-operator/pull/853/files; it is a tiny change.

To get through DB initialization, run "oc edit cm noobaa-postgres-initdb-sh" and add the if block below, marked by plus signs. This procedure does not require reinstalling OCS.
 

          # Wrap the postgres binary, force huge_pages=off for initdb
          # see https://bugzilla.redhat.com/show_bug.cgi?id=1946792
          p=/opt/rh/rh-postgresql12/root/usr/bin/postgres
+
+          # Latest RH images moved the postgres binary
+          # from /opt/rh/rh-postgresql12/root/usr/bin/postgres to /usr/bin/postgres
+          # see https://bugzilla.redhat.com/show_bug.cgi?id=2051249
+          if [ ! -x $p ]; then
+            p=/usr/bin/postgres
+          fi
+
          mv $p $p.orig
          echo exec $p.orig \"\$@\" -c huge_pages=off > $p
          chmod 755 $p
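
After saving the ConfigMap edit, recreate the DB pod so the initialize-database init container reruns with the patched script (a sketch, assuming the default openshift-storage namespace; the StatefulSet recreates the pod automatically):

  oc delete pod noobaa-db-pg-0 -n openshift-storage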

Alternatively (see the inspection sketch after this list):
- you could disable huge pages during OCS/NooBaa installation and then re-enable huge pages
- use a RHEL7-based Postgres container
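
How huge pages were enabled varies per cluster (MachineConfig kernel arguments, a Tuned profile, or a performance profile), so the exact disable step depends on your setup. A sketch for inspecting the current state, with <node> as a placeholder:

  # Are huge pages currently allocated on the node?
  oc debug node/<node> -- chroot /host grep -i hugepages /proc/meminfo
  # Which MachineConfig (if any) sets hugepages kernel arguments?
  oc get machineconfig -o yaml | grep -i -B5 hugepages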

Let me know if you need any additional help.

Best regards!

Comment 15 khover 2022-02-10 14:27:07 UTC
Hi Alex,

Thanks for all your help on this.

Re:

@khover any input about why RHEL8 based Postgres container is used?

I honestly don't know; this is an install of OCS ocs-operator.v4.8.6.

If there is some way to check, or info is needed, I'd be happy to help you collect it.

Comment 26 errata-xmlrpc 2022-04-13 18:52:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

Comment 27 Red Hat Bugzilla 2023-12-08 04:27:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

