2280834 – noobaa-db-pg-0 in CLBO post ODF upgrade from 4.14 to 4.15, while OCP is 4.14

Summary: noobaa-db-pg-0 in CLBO post ODF upgrade from 4.14 to 4.15, while OCP is 4.14

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	Multi-Cloud Object Gateway
Sub Component:
Version:	4.16
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.16.0
Assignee:	Danny
QA Contact:	Mahesh Shetty
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2292175

Reported:	2024-05-16 14:40 UTC by Elad
Modified:	2024-07-30 10:32 UTC
CC List:	4 users
Fixed In Version:	4.16.0-130
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	2292175
Environment:
Last Closed:	2024-07-17 13:23:02 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	noobaa noobaa-operator pull 1380	None	Merged	[Direct to 5.15] Added error handling and retries in postgres upgrade scripts	2024-06-19 12:45:56 UTC
Github	noobaa noobaa-operator pull 1381	None	Merged	[Direct to 5.15] removed redundant code	2024-06-19 14:19:13 UTC
Red Hat Product Errata	RHSA-2024:4591	None	None	None	2024-07-17 13:23:10 UTC
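PR 1380 is titled "Added error handling and retries in postgres upgrade scripts". The generic retry pattern that title refers to can be sketched in shell as below. This is only an illustration of the pattern under stated assumptions: the `retry` helper and its argument order are hypothetical and are not the code merged in that PR.

```shell
# Hypothetical helper illustrating a bounded retry loop; NOT the code from PR 1380.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  while :; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up (and report) once the attempt budget is used.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after $attempt attempts" >&2
      return 1
    fi
    echo "retry: attempt $attempt of $max_attempts failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 5 10 pg_dump nbcore` would try a dump up to five times, ten seconds apart, and return non-zero only if every attempt fails.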

Description Elad 2024-05-16 14:40:22 UTC

Description of problem:

After performing the 4.14 to 4.16 EUS-to-EUS upgrade procedure for ODF and OCP, and then running functional-level MCG operations (tier1 tests), noobaa-db-pg-0 went into CrashLoopBackOff.


Version of all relevant components (if applicable):

Initial versions installed:
ODF 4.14 + OCP 4.14

Upgraded versions:
OCP 4.16.0-0.nightly-2024-05-15-001800
odf-operator.v4.16.0-101.stable


Is there any workaround available to the best of your knowledge?
Not that I am aware of.


Can this issue be reproducible?
Tried this procedure with tier1 test execution only once so far


Can this issue be reproduced from the UI?
N/A


If this is a regression, please provide more details to justify this:
It is hard to say whether this is a regression at this point, as this is the first time we are trying the EUS-to-EUS upgrade.


Steps to Reproduce:
On IBM Cloud VPC:
1. Install OCP 4.14 (IPI), ODF 4.14
2. Upgrade ODF to 4.16, sequentially (4.14 to 4.15, 4.15 to 4.16)
3. Perform OCP EUS to EUS upgrade procedure:

# oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.16"}}' --type=merge

# oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-05-15-001800 --allow-explicit-upgrade --force

# oc patch mcp/worker --type merge --patch '{"spec":{"paused":false}}'

4. Perform functional-level NooBaa operations (tier1 tests of ocs-ci)
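The OCP commands in step 3 can be wrapped so that the sequence stops at the first failing command. This is a minimal sketch, not part of ocs-ci: the function names `eus_start`/`eus_finish` and the `OC` override variable are hypothetical. The unpause is split into its own function on the assumption that the worker pool should only be unpaused after the control plane has finished updating.

```shell
# Hypothetical wrapper around the step 3 commands; not part of ocs-ci.
# OC may name an alternative oc binary or shell function (assumption, useful for dry runs).

# eus_start: switch the channel, pause the worker pool, start the upgrade.
eus_start() {
  local release_image=$1
  local oc_cmd=${OC:-oc}
  "$oc_cmd" patch clusterversions/version -p '{"spec":{"channel":"stable-4.16"}}' --type=merge || return 1
  "$oc_cmd" patch mcp/worker --type merge --patch '{"spec":{"paused":true}}' || return 1
  "$oc_cmd" adm upgrade --to-image="$release_image" --allow-explicit-upgrade --force || return 1
}

# eus_finish: unpause the worker pool. Run it only after the control plane
# reports the target release.
eus_finish() {
  local oc_cmd=${OC:-oc}
  "$oc_cmd" patch mcp/worker --type merge --patch '{"spec":{"paused":false}}'
}
```

Usage: run `eus_start registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-05-15-001800`, wait until `oc get clusterversion` shows the control plane on the target release, then run `eus_finish`.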


Actual results:
noobaa-db-pg-0 started crashing during the setup phase of test_bucket_creation_deletion.py::TestBucketCreationAndDeletion::test_bucket_creation_deletion[3-S3-DEFAULT-BACKINGSTORE], in which the resources listed below were created.

2024-05-16 14:43:24  tests/functional/object/mcg/test_bucket_creation_deletion.py::TestBucketCreationAndDeletion::test_bucket_creation_deletion[3-S3-DEFAULT-BACKINGSTORE] 
2024-05-16 14:43:24  -------------------------------- live log setup --------------------------------


Resources created in that setup phase: the AWS CLI ConfigMap, the s3cli StatefulSet, and the AWS, Azure, GCP and IBM Cloud COS secrets.

After that, connections to the DB started breaking:

2024-05-16 14:55:52  07:55:40 - ThreadPoolExecutor-30_0 - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage rsh noobaa-db-pg-0 bash -c "pg_dump nbcore | gzip > /tmp/nbcore.gz"
2024-05-16 14:55:52  07:55:41 - ThreadPoolExecutor-30_0 - ocs_ci.utility.utils - WARNING  - Command stderr: error: unable to upgrade connection: container not found ("db")
2024-05-16 14:55:52  
2024-05-16 14:55:52  07:55:41 - ThreadPoolExecutor-30_0 - ocs_ci.ocs.utils - ERROR  - Failed to dump noobaa DB! Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage rsh noobaa-db-pg-0 bash -c "pg_dump nbcore | gzip > /tmp/nbcore.gz".
2024-05-16 14:55:52  Error is error: unable to upgrade connection: container not found ("db")



% oc get pod noobaa-db-pg-0     
NAME             READY   STATUS             RESTARTS          AGE
noobaa-db-pg-0   0/1     CrashLoopBackOff   206 (4m58s ago)   17h



% oc describe pod noobaa-db-pg-0

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   157m (x177 over 17h)    kubelet  Container image "registry.redhat.io/rhel9/postgresql-15@sha256:1aeac23901c0147e4c6e9a1b8bb5f41dd6f95532b0d96adac55d609d0eed32fe" already present on machine
  Warning  BackOff  2m49s (x4742 over 17h)  kubelet  Back-off restarting failed container db in pod noobaa-db-pg-0_openshift-storage(c7d42c55-f3de-45ca-ad7a-408fe7419820)
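Because the db container keeps restarting, the useful log is usually the one from the last terminated instance, which `oc logs` returns with `--previous`. The sketch below fetches that log and reports whether it contains the data-directory mismatch message. The function name `db_crash_reason`, its return codes, and the `OC` override are hypothetical; the pod, namespace and container names are taken from the output above.

```shell
# Hypothetical helper: inspect the log of the last crashed "db" container instance.
# Prints "incompatible-data-directory" or "other"; returns 2 if the log cannot be fetched.
db_crash_reason() {
  local pod=${1:-noobaa-db-pg-0}
  local ns=${2:-openshift-storage}
  local oc_cmd=${OC:-oc}
  local logs
  # --previous returns the log of the previous (terminated) container instance.
  logs=$("$oc_cmd" -n "$ns" logs "$pod" -c db --previous) || return 2
  if printf '%s\n' "$logs" | grep -q 'Incompatible data directory'; then
    echo "incompatible-data-directory"
  else
    echo "other"
  fi
}
```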


The following error appears in the noobaa-db-pg-0 logs:

Incompatible data directory.  This container image provides
PostgreSQL '15', but data directory is of
version '12'.

This image supports automatic data directory upgrade from
'13', please _carefully_ consult image documentation
about how to use the '$POSTGRESQL_UPGRADE' startup option.
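The message implies a startup check that compares the major version recorded in the data directory (PostgreSQL writes it to the `PG_VERSION` file at the top of every initialized data directory) with the image's own version and its one supported upgrade source. The decision can be sketched as below. This is a hypothetical re-creation based only on the wording of the message, not the image's actual startup script; `classify_pgdata` and its output labels are illustrative.

```shell
# Hypothetical re-creation of the decision described by the error message;
# NOT the container image's actual startup script.
# Usage: classify_pgdata DATA_DIR IMAGE_MAJOR UPGRADE_FROM_MAJOR
classify_pgdata() {
  local data_dir=$1 image_major=$2 upgrade_from=$3
  local data_major
  # An initialized PostgreSQL data directory records its major version in PG_VERSION.
  if [ ! -f "$data_dir/PG_VERSION" ]; then
    echo "uninitialized"
    return 0
  fi
  data_major=$(cat "$data_dir/PG_VERSION")
  if [ "$data_major" = "$image_major" ]; then
    echo "compatible"
  elif [ "$data_major" = "$upgrade_from" ]; then
    # Upgradable in place, but only via the POSTGRESQL_UPGRADE startup option.
    echo "upgradable"
  else
    echo "incompatible"
  fi
}
```

With the values from this log (data directory 12, image 15, upgrade supported from 13), the result is "incompatible": a version 12 directory is neither the image's version nor its supported upgrade source, so the container cannot start.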



Additional info:
ODF Must Gather - https://url.corp.redhat.com/dc96623
Live cluster details - https://url.corp.redhat.com/331727d

Comment 3 Elad 2024-05-16 14:53:20 UTC

Correction:

Steps to Reproduce:
2. Upgrade ODF to **4.16**, sequentially (4.14 to 4.15, 4.15 to 4.16)

Comment 4 Danny 2024-05-20 07:30:16 UTC

The DB logs show that the DB data directory is still in PostgreSQL 12 format rather than PostgreSQL 15; that conversion should have happened during the upgrade to 4.15. Was the upgrade to 4.15 successful before you proceeded to 4.16?
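One way to answer that question before moving on to 4.16 is to ask the running 4.15 DB which PostgreSQL major version it is on. The sketch below reads `server_version_num` (which, for PostgreSQL 10 and later, encodes the version as major * 10000 + minor) and refuses to continue unless the major version is the expected one. The function names and the `OC` override are hypothetical, and it assumes `psql` in the db container can connect locally without extra arguments, as the `pg_dump nbcore` call in the ocs-ci log above does.

```shell
# Hypothetical pre-upgrade gate; function names and OC override are illustrative.

# server_version_num is major*10000 + minor for PostgreSQL 10+ (e.g. 150007 -> 15).
pg_major_from_version_num() {
  echo $(( $1 / 10000 ))
}

# Usage: require_pg_major EXPECTED_MAJOR [POD] [NAMESPACE]
# Returns 0 if the DB runs the expected major, 1 if not, 2 if the query fails.
require_pg_major() {
  local expected=$1
  local pod=${2:-noobaa-db-pg-0}
  local ns=${3:-openshift-storage}
  local oc_cmd=${OC:-oc}
  local num actual
  # Assumption: psql can connect over the local socket with default settings.
  num=$("$oc_cmd" -n "$ns" exec "$pod" -c db -- psql -tAc 'SHOW server_version_num') || return 2
  actual=$(pg_major_from_version_num "$num")
  if [ "$actual" -ne "$expected" ]; then
    echo "noobaa-db is on PostgreSQL $actual, expected $expected; do not continue the upgrade" >&2
    return 1
  fi
  echo "noobaa-db is on PostgreSQL $actual"
}
```

Running `require_pg_major 15` after the 4.15 upgrade, and only proceeding to 4.16 if it succeeds, would have flagged a DB still on PostgreSQL 12.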

Comment 18 Sunil Kumar Acharya 2024-06-25 12:09:21 UTC

Please update the RDT flag/text appropriately.

Comment 19 errata-xmlrpc 2024-07-17 13:23:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591
