Bug 1922113 - noobaa-db pod init container is crashing after OCS upgrade from OCS 4.6 to OCS 4.7
Summary: noobaa-db pod init container is crashing after OCS upgrade from OCS 4.6 to OCS 4.7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Danny
QA Contact: Petr Balogh
URL:
Whiteboard:
Duplicates: 1922114 1928509
Depends On:
Blocks:
 
Reported: 2021-01-29 09:37 UTC by suchita
Modified: 2021-06-01 08:49 UTC
CC List: 12 users

Fixed In Version: 4.7.0-306
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:18:35 UTC
Embargoed:




Links
Github noobaa noobaa-operator pull 558 (closed): Backport to 5.7 (last updated 2021-03-16 18:36:56 UTC)
Github noobaa noobaa-operator pull 578/files (last updated 2021-03-16 17:05:51 UTC)
Github noobaa noobaa-operator pull 583 (closed): Backport to 5.7 (last updated 2021-03-16 18:37:01 UTC)
Github openshift ocs-operator pull 1023 (closed): added DBType=postgres to noobaa CR (last updated 2021-03-16 17:05:52 UTC)
Github openshift ocs-operator pull 1080 (closed): Bug 1922113: [release-4.7] - added DBType=postgres to noobaa CR (last updated 2021-03-16 18:36:58 UTC)
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:19:02 UTC)

Description suchita 2021-01-29 09:37:24 UTC
Similar to BZ 1915698

Description of problem (please be as detailed as possible and provide log
snippets):

On the Azure platform, after an OCS upgrade from 4.6 to 4.7,
not all noobaa pods are in the Running state:

noobaa-db-0                                                       1/1     Running     0          14h
noobaa-db-pg-0                                                    0/1     Init:0/1    128        14h
noobaa-operator-9995644ff-b2zn7                                   1/1     Running     0          14h
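
A minimal sketch of the commands used to inspect the stuck init container (the namespace and pod name follow the listing above; the init container name is an assumption, check the describe output for the actual name):

# list the noobaa pods
oc -n openshift-storage get pods -l app=noobaa

# show events and init container state for the stuck pod
oc -n openshift-storage describe pod noobaa-db-pg-0

# fetch logs from the crashing init container ("init" is an assumed container name)
oc -n openshift-storage logs noobaa-db-pg-0 -c init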


Version of all relevant components (if applicable):
OCP: 4.7.0-fc.4
OCS: ocs-operator.v4.7.0-241.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it is blocking the upgrade automation on the Azure platform.

Is there any workaround available to the best of your knowledge?
No, I am not aware of one at the moment.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, reproduced 1 out of 1 attempts so far.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.6 and OCS 4.6
2. Upgrade OCP to 4.7 (auto)
3. Verify that OCP upgraded successfully
4. Upgrade OCS to 4.7 (a hedged sketch of the upgrade trigger is shown below)
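
A hedged sketch of how the OCS upgrade step (step 4 above) is typically triggered; the subscription name and channel are assumptions based on a default OCS installation, not values taken from this report:

# point the OCS subscription at the 4.7 channel; OLM then upgrades the operator
oc -n openshift-storage patch subscription ocs-operator --type merge -p '{"spec":{"channel":"stable-4.7"}}'

# watch the CSV roll over to the new version
oc -n openshift-storage get csv -w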


Actual results:
No noobaa-core-0 pod exists after the upgrade

Expected results:
The noobaa core pod should be running.

Additional info:

Job: 
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16799/console

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocp_must_gather/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-8099d74217f9305c717cb1a157a6a89f5e810834edd9dfd80b89484263e6cc62/namespaces/openshift-storage/pods/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_check_mon_pdb_post_upgrade_ocs_logs/

Comment 2 suchita 2021-01-29 09:40:42 UTC
*** Bug 1922114 has been marked as a duplicate of this bug. ***

Comment 5 suchita 2021-01-29 10:36:04 UTC
Please let me know if further debugging is needed; I am keeping the cluster alive today.

Comment 6 Danny 2021-01-31 17:42:24 UTC
The upgrade sequence causes the following scenario (a diagnostic sketch follows the list):

1. Before the upgrade (4.6) we have MongoDB running (noobaa-db-0) with an init image pointing to the MCG 4.6 image. The init image is the same core image that is set in the noobaa CR.
2. After the upgrade, a new version of noobaa-operator starts. That version creates a Postgres DB statefulset which also has an init container using the MCG image. Since the noobaa CR has not yet been updated with the new image (that update is done by ocs-operator), the Postgres init container tries to run with the old MCG image, which does not contain the Postgres init code.
3. The Postgres init is therefore stuck and cannot complete. In the meantime, ocs-operator updates the noobaa CR with the new image.
4. Postgres should have restarted once the init container image was updated, but for some reason that does not happen. Since the Postgres DB cannot start, the noobaa-core pod will not start until the migration is completed.
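
A minimal diagnostic sketch for the image mismatch described above (resource names follow the defaults seen in this report; the jsonpath fields are assumptions based on the NooBaa CRD and statefulset layout):

# image currently set in the noobaa CR (updated by ocs-operator)
oc -n openshift-storage get noobaa noobaa -o jsonpath='{.spec.image}{"\n"}'

# init container image actually used by the Postgres statefulset
oc -n openshift-storage get sts noobaa-db-pg -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'

# if the two differ, the init container is still running the pre-upgrade MCG image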

@sgatfane, I tried to reproduce this locally, but so far without success. If it occurs again, a live cluster would help us test a solution.

Comment 9 Danny 2021-02-02 14:55:48 UTC
This issue will be fixed by setting db_type=postgres only when the new image is already set in the CR. I added this to the ocs-operator noobaa reconciler. Once this is merged we will remove the DS patch that sets the default to postgres; this way we will not treat the deployment as Postgres before we are using an image that supports it. (A sketch of the resulting CR state follows below.)
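
A hedged illustration of the end state the fix aims for, at the CR level (field names follow the noobaa-operator CRD as I understand it; the exact values are examples, not taken from this bug):

# after the fix, the db type is only set to postgres once the 4.7 image is in place,
# so the CR should show a postgres db type together with the new image:
oc -n openshift-storage get noobaa noobaa -o jsonpath='{.spec.dbType}{" "}{.spec.image}{"\n"}'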

Comment 10 Nimrod Becker 2021-02-15 09:23:58 UTC
*** Bug 1928509 has been marked as a duplicate of this bug. ***

Comment 11 Neha Berry 2021-02-15 12:49:34 UTC
A similar issue was found on AWS as well - https://bugzilla.redhat.com/show_bug.cgi?id=1928509

That bug was marked as a duplicate because the root cause is the same, so this bug is not Azure-specific.

Comment 13 Danny 2021-02-22 12:42:30 UTC
For some reason, the init container image of the mongo statefulset pod was not updated (although the statefulset itself does have the updated image); a diagnostic sketch follows below.
I am still investigating to find the cause.
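
A rough way to check whether the pod was simply not restarted after the statefulset template changed (names follow this report; this is a diagnostic sketch, not a recommended workaround):

# init container image in the mongo statefulset template
oc -n openshift-storage get sts noobaa-db -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'

# init container image on the running pod
oc -n openshift-storage get pod noobaa-db-0 -o jsonpath='{.spec.initContainers[*].image}{"\n"}'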

@apolak, if you have a live cluster exhibiting the issue, it would help with the investigation.

Comment 15 Danny 2021-02-25 16:48:34 UTC
Aviad ran 3 attempts to reproduce, with no success (the upgrade succeeded on all 3). Once we have a live cluster we can try to identify the root cause.

Comment 21 Petr Balogh 2021-03-18 13:21:26 UTC
I think that it failed again here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/245/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j002ai3c33-uan/j002ai3c33-uan_20210318T082901/logs/failed_testcase_ocs_logs_1616059248/test_upgrade_ocs_logs/

I see:
noobaa-db-0 0/1 Init:CrashLoopBackOff 

Upgrade from 4.6 to v4.7.0-299.ci.

Based on this execution and also another one from 2 days ago:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/208/

I am failing QE verification and moving the bug back to ASSIGNED.

Comment 23 Petr Balogh 2021-03-18 13:48:13 UTC
The issue is that the build was not tagged as stable, hence our automation does not pick up this build as the default.

I manually triggered an upgrade to build 306 here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/3/

Petr

Comment 24 Petr Balogh 2021-03-25 11:53:08 UTC
I see the upgrade passed this time, so I am marking this BZ as VERIFIED.

Details about run:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/9/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c33-ua/j009ai3c33-ua_20210325T092816

Upgrade from 4.6.3 live to the 4.7 internal build quay.io/rhceph-dev/ocs-registry:4.7.0-318.ci.

Platform AWS
OCP: 4.7.0-0.nightly-2021-03-25-045200

Comment 26 errata-xmlrpc 2021-05-19 09:18:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

