Bug 1922113

Summary: noobaa-db pod init container is crashing after OCS upgrade from OCS 4.6 to OCS 4.7
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: suchita <sgatfane>
Component: Multi-Cloud Object Gateway
Assignee: Danny <dzaken>
Status: CLOSED ERRATA
QA Contact: Petr Balogh <pbalogh>
Severity: high
Priority: unspecified
Version: 4.7
CC: apolak, dzaken, ebenahar, etamir, muagarwa, musoni, nbecker, nberry, ocs-bugs, pbalogh, ratamir, sgatfane
Keywords: AutomationBackLog, UpgradeBlocker
Target Release: OCS 4.7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.7.0-306
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-05-19 09:18:35 UTC

Description suchita 2021-01-29 09:37:24 UTC
Similar to BZ 1915698

Description of problem (please be as detailed as possible and provide log
snippets):

On the Azure platform, after an OCS upgrade from 4.6 to 4.7, I see that not all noobaa pods are in the Running state; noobaa-db-pg-0 is stuck in its init container:

noobaa-db-0                                                       1/1     Running     0          14h
noobaa-db-pg-0                                                    0/1     Init:0/1    128        14h
noobaa-operator-9995644ff-b2zn7                                   1/1     Running     0          14h
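
To get more detail on the stuck init container, commands along these lines can be used (a minimal sketch; the openshift-storage namespace is assumed, and container names may differ between builds):

$ oc -n openshift-storage describe pod noobaa-db-pg-0
$ oc -n openshift-storage logs noobaa-db-pg-0 --all-containers=true
$ oc -n openshift-storage get pod noobaa-db-pg-0 -o jsonpath='{.spec.initContainers[*].image}'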


Version of all relevant components (if applicable):
OCP: 4.7.0-fc.4
OCS: ocs-operator.v4.7.0-241.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it is blocking the upgrade automation on the Azure platform.

Is there any workaround available to the best of your knowledge?
No, I am not aware of one at this time.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes; reproduced in 1 out of 1 attempts so far.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.6 and OCS 4.6.
2. Upgrade OCP to 4.7 (auto).
3. Verify that OCP upgraded successfully.
4. Upgrade OCS to 4.7 (verification commands are sketched below).
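
A minimal sketch of how steps 3 and 4 can be checked from the CLI (assuming the default openshift-storage namespace; the app=noobaa label selector and exact CSV names are assumptions and vary by build):

$ oc get clusterversion
$ oc -n openshift-storage get csv
$ oc -n openshift-storage get pods -l app=noobaa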


Actual results:
No noobaa-core-0 pod exists after the upgrade, and noobaa-db-pg-0 is stuck in its init container.

Expected results:
The noobaa-core pod is running.

Additional info:

Job: 
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16799/console

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocp_must_gather/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-8099d74217f9305c717cb1a157a6a89f5e810834edd9dfd80b89484263e6cc62/namespaces/openshift-storage/pods/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_check_mon_pdb_post_upgrade_ocs_logs/

Comment 2 suchita 2021-01-29 09:40:42 UTC
*** Bug 1922114 has been marked as a duplicate of this bug. ***

Comment 5 suchita 2021-01-29 10:36:04 UTC
Please let me know if further debugging is needed; I am keeping the cluster alive today.

Comment 6 Danny 2021-01-31 17:42:24 UTC
The upgrade sequence causes the following scenario to happen:

1. Before the upgrade (4.6), we have a mongo DB running (noobaa-db-0) with an init container image pointing to the MCG 4.6 image. The init image is the same core image that is set in the noobaa CR.
2. After the upgrade, a new version of noobaa-operator starts. In that version we start a Postgres DB StatefulSet, which also has an init container that uses the MCG image. Since the noobaa CR is not yet updated with the new image (this is done by ocs-operator), the Postgres init container tries to run with the old MCG image, which does not contain the Postgres init code.
3. The Postgres init is stuck and cannot complete. In the meantime, ocs-operator updates the noobaa CR with the new image.
4. Postgres should have restarted once the init container image was updated, but for some reason that does not happen. Since the Postgres DB cannot start, the noobaa-core pod will not start until the migration is completed. (The relevant images can be compared as sketched below.)
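
A minimal sketch of how to compare the image referenced by the noobaa CR with the init container image actually used by the Postgres StatefulSet (assuming the CR is named noobaa, the StatefulSet is noobaa-db-pg, and the openshift-storage namespace; field paths may differ between versions):

$ oc -n openshift-storage get noobaa noobaa -o jsonpath='{.spec.image}{"\n"}'
$ oc -n openshift-storage get sts noobaa-db-pg -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'

If the two values differ, the Postgres init container is still running the old MCG image described in step 2.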

@sgatfane, I tried to reproduce it locally, but so far without success. If it occurs again, a live cluster would help us test a solution.

Comment 9 Danny 2021-02-02 14:55:48 UTC
This issue will be fixed by sending db_type=postgres only when the new image is set in the CR. I added this to the ocs-operator noobaa reconciler. Once this is merged, we will remove the DS patch that sets the default to postgres. This way we will not treat the deployment as Postgres before we are using an image that supports it.
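
Once the fix is in, one hedged way to confirm the new behavior on a cluster is to watch the CR fields together (assuming the noobaa CR exposes spec.image and spec.dbType, as in the 4.7-era CRD; field names may differ):

$ oc -n openshift-storage get noobaa noobaa -o jsonpath='{.spec.image} {.spec.dbType}{"\n"}'

During the upgrade window, dbType should stay mongodb until spec.image points at the 4.7 MCG image, and only then switch to postgres.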

Comment 10 Nimrod Becker 2021-02-15 09:23:58 UTC
*** Bug 1928509 has been marked as a duplicate of this bug. ***

Comment 11 Neha Berry 2021-02-15 12:49:34 UTC
A similar issue was found on AWS as well: https://bugzilla.redhat.com/show_bug.cgi?id=1928509

That bug was marked as a duplicate because the RCA is the same. Hence this bug is not Azure-specific.

Comment 13 Danny 2021-02-22 12:42:30 UTC
For some reason, the init container image of the mongo StatefulSet pod was not updated (although the StatefulSet itself does have an updated image).
I am still investigating the cause.
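
For reference, the mismatch can be checked with something like the following (a sketch assuming the mongo StatefulSet is noobaa-db and its pod is noobaa-db-0, as in this report):

$ oc -n openshift-storage get sts noobaa-db -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'
$ oc -n openshift-storage get pod noobaa-db-0 -o jsonpath='{.spec.initContainers[*].image}{"\n"}'

Matching values mean the pod picked up the updated template; a stale value on the pod reproduces what is described above.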

@apolak, if you have a live cluster with the issue, it will help with the investigation.

Comment 15 Danny 2021-02-25 16:48:34 UTC
Aviad ran 3 reproduction attempts with no success (the upgrade succeeded in all 3). Once we have a live cluster, we can try to identify the root cause.

Comment 21 Petr Balogh 2021-03-18 13:21:26 UTC
I think it failed again here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/245/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j002ai3c33-uan/j002ai3c33-uan_20210318T082901/logs/failed_testcase_ocs_logs_1616059248/test_upgrade_ocs_logs/

I see:
noobaa-db-0 0/1 Init:CrashLoopBackOff 

Upgrade from 4.6 to v4.7.0-299.ci.

Based on this execution and another one from two days back:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/208/

I am failing QE verification and moving the bug back to ASSIGNED.

Comment 23 Petr Balogh 2021-03-18 13:48:13 UTC
The thing is that the build was not tagged as stable; hence our automation did not pick this build up as the default.

I manually triggered an upgrade to build 306 here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/3/

Petr

Comment 24 Petr Balogh 2021-03-25 11:53:08 UTC
I see the upgrade passed this time, so I am marking this BZ as verified.

Details about run:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/9/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c33-ua/j009ai3c33-ua_20210325T092816

Upgrade from 4.6.3 live to 4.7 internal build quay.io/rhceph-dev/ocs-registry:4.7.0-318.ci.

Platform AWS
OCP: 4.7.0-0.nightly-2021-03-25-045200

Comment 26 errata-xmlrpc 2021-05-19 09:18:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041