Bug 2216626

Summary: Noobaa Postgres Container will not Start, noobaa db stuck in CLBO
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: Multi-Cloud Object Gateway
Version: 4.11
Hardware: x86_64
OS: Linux
Status: ASSIGNED
Severity: medium
Priority: unspecified
Reporter: Anjali <amenon>
Assignee: Utkarsh Srivastava <usrivast>
QA Contact: krishnaram Karthick <kramdoss>
CC: odf-bz-bot, smitra, usrivast
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Type: Bug

Description Anjali 2023-06-22 05:54:38 UTC
Description of problem (please be as detailed as possible and provide log snippets):

- noobaa db is stuck in CLBO; all other pods in the openshift-storage namespace are up and running

noobaa-db-pg-0                                                    0/1     CrashLoopBackOff   219 (97s ago)   18h   10.126.12.20    oscinfra-ldc65-storage-mhbxg   <none>           <none>

97s         Normal    Pulled           pod/noobaa-db-pg-0                                                        Container image "registry.redhat.io/rhel8/postgresql-12@sha256:aa65868b9684f7715214f5f3fac3139245c212019cc17742f237965a7508222d" already present on machine
6m34s       Warning   BackOff          pod/noobaa-db-pg-0                                                        Back-off restarting failed container
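
For reference, the pod status and events above can be collected with commands along these lines (assuming the default openshift-storage namespace):

oc get pods -n openshift-storage -o wide | grep noobaa-db-pg        # pod status
oc get events -n openshift-storage \
    --field-selector involvedObject.name=noobaa-db-pg-0             # recent events for the pod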

- Checking the pod logs, we see the error "ERROR:  tuple already updated by self":

[amenon@supportshell-1 logs]$ cat current.log 
2023-06-13T13:02:53.407141058Z pg_ctl: another server might be running; trying to start server anyway
2023-06-13T13:02:53.417167652Z waiting for server to start....2023-06-13 13:02:53.434 UTC [22] LOG:  starting PostgreSQL 12.12 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
2023-06-13T13:02:53.434720093Z 2023-06-13 13:02:53.434 UTC [22] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-13T13:02:53.439280827Z 2023-06-13 13:02:53.439 UTC [22] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-06-13T13:02:53.521852556Z 2023-06-13 13:02:53.521 UTC [22] LOG:  redirecting log output to logging collector process
2023-06-13T13:02:53.521852556Z 2023-06-13 13:02:53.521 UTC [22] HINT:  Future log output will appear in directory "log".
2023-06-13T13:02:53.717865176Z  done
2023-06-13T13:02:53.717865176Z server started
2023-06-13T13:02:53.726249227Z /var/run/postgresql:5432 - accepting connections
2023-06-13T13:02:53.730107559Z => sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
2023-06-13T13:02:53.737096718Z ERROR:  tuple already updated by self

[amenon@supportshell-1 logs]$ cat current.log 
2023-06-12T18:47:52.329814485Z + export PGDATA=/var/lib/pgsql/data/userdata
2023-06-12T18:47:52.329814485Z + PGDATA=/var/lib/pgsql/data/userdata
2023-06-12T18:47:52.329892997Z postgresql.conf file is found
2023-06-12T18:47:52.329900468Z + '[' -f /var/lib/pgsql/data/userdata/postgresql.conf ']'
2023-06-12T18:47:52.329900468Z + echo postgresql.conf file is found
2023-06-12T18:47:52.329900468Z + exit 0
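
The snippets above are from the must-gather; to pull the same logs live from the cluster, something along these lines should work (listing the container names first, since they may differ):

oc get pod noobaa-db-pg-0 -n openshift-storage \
    -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}'   # init + main container names
oc logs noobaa-db-pg-0 -n openshift-storage -c <container-name> --previous    # logs from the last crashed run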

Version of all relevant components (if applicable):
- ODF version 4.11.8
- Cluster version is 4.12.19

- This is similar to the issue addressed in Bug https://bugzilla.redhat.com/show_bug.cgi?id=2010702

- We already applied KCS https://access.redhat.com/solutions/7011877, but it didn't help.

- When applying the above KCS, the attempt to stop Postgres fails with:

sh-4.4$ pg_ctl stop -D /var/lib/pgsql/data/userdata
pg_ctl: could not send stop signal (PID: 22): No such process
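
A guess at what is happening here: pg_ctl reads the PID from postmaster.pid under PGDATA, and in an oc debug copy of the pod that process does not exist, so the stop signal has nothing to target. A quick way to confirm from the same debug session (sketch, assuming the same PGDATA path):

cat /var/lib/pgsql/data/userdata/postmaster.pid   # PID recorded by the previous server start
ps -ef | grep [p]ostgres                          # is any postmaster actually running in this container?
pg_ctl status -D /var/lib/pgsql/data/userdata     # what pg_ctl itself reports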

- We also tried the steps below, adding an extra step 2 to run run-postgresql (which should create those files and allow the remaining steps to run), but that did not help either (see the command sketch after the list).

1. Start a debug session using oc debug pod/noobaa-db-pg-0

2. From the command line of the debug session, run run-postgresql
   
3. Run pg_ctl stop -D /var/lib/pgsql/data/userdata to cleanly shut down Postgres.

4. Run pg_ctl start -D /var/lib/pgsql/data/userdata to start Postgres. You should see the output as mentioned in [1], and it should wait there indefinitely (no errors):

5. Press enter.

6. Run pg_ctl stop -D /var/lib/pgsql/data/userdata and wait for Postgres to shut down cleanly.
   
7. Exit the debug session
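
For clarity, the sequence we ran, condensed into the actual commands (sketch; numbers match the steps above):

oc debug pod/noobaa-db-pg-0                       # 1. start the debug session
run-postgresql                                    # 2. recreate runtime files / start the server
pg_ctl stop  -D /var/lib/pgsql/data/userdata      # 3. cleanly shut down Postgres
pg_ctl start -D /var/lib/pgsql/data/userdata      # 4. start Postgres; it should wait here with no errors
                                                  # 5. press Enter
pg_ctl stop  -D /var/lib/pgsql/data/userdata      # 6. wait for a clean shutdown
exit                                              # 7. exit the debug session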

- After trying the above KCS, the error we see in the noobaa-db-pg-0 pod logs is:

2023-06-16T12:34:30.425331231Z pg_ctl: another server might be running; trying to start server anyway
2023-06-16T12:34:30.434503981Z waiting for server to start....2023-06-16 12:34:30.497 UTC [22] LOG:  starting PostgreSQL 12.12 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
2023-06-16T12:34:30.498224687Z 2023-06-16 12:34:30.498 UTC [22] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-16T12:34:30.502843334Z 2023-06-16 12:34:30.502 UTC [22] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-06-16T12:34:30.560324086Z 2023-06-16 12:34:30.560 UTC [22] LOG:  redirecting log output to logging collector process
2023-06-16T12:34:30.560324086Z 2023-06-16 12:34:30.560 UTC [22] HINT:  Future log output will appear in directory "log".
2023-06-16T12:34:30.735134682Z  done
2023-06-16T12:34:30.735134682Z server started
2023-06-16T12:34:30.744075842Z /var/run/postgresql:5432 - accepting connections
2023-06-16T12:34:30.748291698Z => sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
2023-06-16T12:34:30.754964819Z ERROR:  tuple concurrently updated
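
Both errors surface immediately after set_passwords.sh is sourced. To my understanding, that start script in the rhel8/postgresql-12 image only issues ALTER USER ... PASSWORD statements through psql, so the failing statement can likely be reproduced by hand from a shell in the pod once the server is up; a sketch, using the POSTGRESQL_USER/POSTGRESQL_PASSWORD environment variables already set on the container:

psql -c "ALTER USER \"$POSTGRESQL_USER\" WITH ENCRYPTED PASSWORD '$POSTGRESQL_PASSWORD';"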

- We need help from engineering on how to proceed.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
This is an infrastructure cluster for testing changes and upgrades. The issue is preventing the customer from doing their testing.

Is there any workaround available to the best of your knowledge?
No

Additional info:
- All must-gathers are attached to supportshell under ~/03536312