Bug 2280834

Summary: noobaa-db-pg-0 in CLBO post ODF upgrade from 4.14 to 4.15, while OCP is 4.14
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Elad <ebenahar>
Component: Multi-Cloud Object Gateway Assignee: Danny <dzaken>
Status: CLOSED ERRATA QA Contact: Mahesh Shetty <mashetty>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.16 CC: dzaken, etamir, nbecker, odf-bz-bot
Target Milestone: ---   
Target Release: ODF 4.16.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.16.0-130 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2292175 (view as bug list) Environment:
Last Closed: 2024-07-17 13:23:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2292175    

Description Elad 2024-05-16 14:40:22 UTC
Description of problem:

After performing the 4.14 to 4.16 EUS-to-EUS upgrade procedure for ODF and OCP, and then running functional-level MCG operations (tier1 tests), noobaa-db-pg-0 went into CrashLoopBackOff.


Version of all relevant components (if applicable):

Initial versions installed:
ODF 4.14 + OCP 4.14

Upgraded versions:
OCP 4.16.0-0.nightly-2024-05-15-001800
odf-operator.v4.16.0-101.stable


Is there any workaround available to the best of your knowledge?
Not aware


Can this issue be reproducible?
Tried this procedure with tier1 test execution only once so far


Can this issue be reproduced from the UI?
N/A


If this is a regression, please provide more details to justify this:
Hard to say if it's a regression at this point, as this is the first time we are trying EUS to EUS upgrade 


Steps to Reproduce:
On IBM Cloud VPC:
1. Install OCP 4.14 (IPI), ODF 4.14
2. Upgrade ODF to 4.16, sequentially (4.14 to 4.15, 4.15 to 4.16)
3. Perform OCP EUS to EUS upgrade procedure:

# oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.16"}}' --type=merge

# oc patch mcp/worker --type merge --patch '{"spec":{"paused":true}}'

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-05-15-001800 --allow-explicit-upgrade --force

# oc patch mcp/worker --type merge --patch '{"spec":{"paused":false}}'

Perform functional level operations of NooBaa (tier1 tests of ocs-ci)


Actual results:
noobaa-db-pg-0 started crashing during the setup phase of test_bucket_creation_deletion.py::TestBucketCreationAndDeletion::test_bucket_creation_deletion[3-S3-DEFAULT-BACKINGSTORE], in which the following resources were created: AWS CLI configMap, s3cli StatefulSet, and AWS, Azure, GCP and IBM Cloud COS secrets.

2024-05-16 14:43:24  tests/functional/object/mcg/test_bucket_creation_deletion.py::TestBucketCreationAndDeletion::test_bucket_creation_deletion[3-S3-DEFAULT-BACKINGSTORE] 
2024-05-16 14:43:24  -------------------------------- live log setup --------------------------------

Connections to the DB then started breaking:

2024-05-16 14:55:52  07:55:40 - ThreadPoolExecutor-30_0 - ocs_ci.utility.utils - INFO  - Executing command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage rsh noobaa-db-pg-0 bash -c "pg_dump nbcore | gzip > /tmp/nbcore.gz"
2024-05-16 14:55:52  07:55:41 - ThreadPoolExecutor-30_0 - ocs_ci.utility.utils - WARNING  - Command stderr: error: unable to upgrade connection: container not found ("db")
2024-05-16 14:55:52  
2024-05-16 14:55:52  07:55:41 - ThreadPoolExecutor-30_0 - ocs_ci.ocs.utils - ERROR  - Failed to dump noobaa DB! Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig -n openshift-storage rsh noobaa-db-pg-0 bash -c "pg_dump nbcore | gzip > /tmp/nbcore.gz".
2024-05-16 14:55:52  Error is error: unable to upgrade connection: container not found ("db")



% oc get pod noobaa-db-pg-0     
NAME             READY   STATUS             RESTARTS          AGE
noobaa-db-pg-0   0/1     CrashLoopBackOff   206 (4m58s ago)   17h



% oc describe pod noobaa-db-pg-0

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   157m (x177 over 17h)    kubelet  Container image "registry.redhat.io/rhel9/postgresql-15@sha256:1aeac23901c0147e4c6e9a1b8bb5f41dd6f95532b0d96adac55d609d0eed32fe" already present on machine
  Warning  BackOff  2m49s (x4742 over 17h)  kubelet  Back-off restarting failed container db in pod noobaa-db-pg-0_openshift-storage(c7d42c55-f3de-45ca-ad7a-408fe7419820)


The following error appears when retrieving the noobaa-db-pg-0 logs:

Incompatible data directory.  This container image provides
PostgreSQL '15', but data directory is of
version '12'.

This image supports automatic data directory upgrade from
'13', please _carefully_ consult image documentation
about how to use the '$POSTGRESQL_UPGRADE' startup option.
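
For triage, the three version numbers in the message above can be pulled out programmatically. The following is a minimal Python sketch, not part of ocs-ci: the function name parse_incompatible_datadir and the regular expression are hypothetical and are built only from the wording quoted above, so other image versions may phrase the message differently.

```python
import re

# Pattern derived from the message wording quoted in this report
# (assumption: other image builds may word it differently).
_INCOMPATIBLE_RE = re.compile(
    r"provides\s+PostgreSQL\s+'(?P<image>\d+)'.*?"
    r"data directory is of\s+version\s+'(?P<data>\d+)'.*?"
    r"upgrade from\s+'(?P<supported>\d+)'",
    re.DOTALL,
)


def parse_incompatible_datadir(log_text):
    """Return the versions named in the message, or None if it is absent.

    Keys: image (version the container provides), data (version of the
    on-disk data directory), supported (version the image says it can
    upgrade from automatically), auto_upgrade_possible (data == supported).
    """
    match = _INCOMPATIBLE_RE.search(log_text)
    if match is None:
        return None
    versions = {key: int(value) for key, value in match.groupdict().items()}
    versions["auto_upgrade_possible"] = versions["data"] == versions["supported"]
    return versions
```

Applied to the message in this report, the sketch reports image 15, data 12 and supported 13, so the automatic upgrade path named by the image does not apply to this data directory.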



Additional info:
ODF Must Gather - https://url.corp.redhat.com/dc96623
Live cluster details - https://url.corp.redhat.com/331727d

Comment 3 Elad 2024-05-16 14:53:20 UTC
Correction:

Steps to Reproduce:
2. Upgrade ODF to **4.16**, sequentially (4.14 to 4.15, 4.15 to 4.16)

Comment 4 Danny 2024-05-20 07:30:16 UTC
In the DB logs, it appears that the DB data directory is still in Postgres 12 format and not Postgres 15, a change that should have happened in 4.15. Was the upgrade to 4.15 successful before proceeding to 4.16?
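
One way to answer that question before starting the next hop is to inspect the operator CSVs. The sketch below is only an illustration under stated assumptions: it expects the JSON produced by `oc get csv -n openshift-storage -o json`, the helper name csv_upgrade_complete is hypothetical, and the "odf-operator" prefix is taken from the CSV name in the description. A Succeeded CSV does not by itself prove that the DB data directory was migrated, so it is a necessary check rather than a sufficient one.

```python
import json


def csv_upgrade_complete(csv_list_json, name_prefix, expected_minor):
    """Check a CSV list for a finished upgrade to the expected minor version.

    Returns True if any CSV whose metadata.name starts with name_prefix has
    a spec.version in the expected_minor stream (for example "4.15") and a
    status.phase of "Succeeded"; otherwise returns False.
    """
    items = json.loads(csv_list_json).get("items", [])
    for csv in items:
        name = csv.get("metadata", {}).get("name", "")
        if not name.startswith(name_prefix):
            continue
        version = csv.get("spec", {}).get("version", "")
        phase = csv.get("status", {}).get("phase", "")
        if version.startswith(expected_minor + ".") and phase == "Succeeded":
            return True
    return False
```

For example, calling csv_upgrade_complete(output, "odf-operator", "4.15") after the first hop and proceeding to 4.16 only when it returns True would catch an upgrade that was still installing or had failed.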

Comment 18 Sunil Kumar Acharya 2024-06-25 12:09:21 UTC
Please update the RDT flag/text appropriately.

Comment 19 errata-xmlrpc 2024-07-17 13:23:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591