Bug 1922113
| Summary: | noobaa-db pod init container is crashing after OCS upgrade from OCS 4.6 to OCS 4.7 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | suchita <sgatfane> |
| Component: | Multi-Cloud Object Gateway | Assignee: | Danny <dzaken> |
| Status: | CLOSED ERRATA | QA Contact: | Petr Balogh <pbalogh> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | apolak, dzaken, ebenahar, etamir, muagarwa, musoni, nbecker, nberry, ocs-bugs, pbalogh, ratamir, sgatfane |
| Target Milestone: | --- | Keywords: | AutomationBackLog, UpgradeBlocker |
| Target Release: | OCS 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.7.0-306 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-19 09:18:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
suchita
2021-01-29 09:37:24 UTC
*** Bug 1922114 has been marked as a duplicate of this bug. ***

Please let me know; I am keeping the cluster alive today for further debugging if needed.

The upgrade sequence causes the following scenario:

1. Before the upgrade (4.6) we have a MongoDB pod running (noobaa-db-0) with an init image pointing to MCG 4.6. The init image is the same core image that is set in the NooBaa CR.
2. After the upgrade, a new version of noobaa-operator starts. In that version we start a Postgres DB StatefulSet, which also has an init container that uses the MCG image. Since the NooBaa CR is not yet updated with the new image (this is done by ocs-operator), the Postgres init container tries to run with the old MCG image, which does not contain the Postgres init code.
3. The Postgres init gets stuck and cannot complete. In the meantime, ocs-operator updates the NooBaa CR with the new image.
4. Postgres should restart once the init container image is updated, but for some reason that does not happen. Since the Postgres DB cannot start, the noobaa-core pod will not start until the migration is completed.

@sgatfane, I tried to reproduce it locally, but so far with no success. If it occurs again, a live cluster can help us test a solution.

This issue will be fixed by sending db_type=postgres only when the new image is set in the CR. I added this to the ocs-operator NooBaa reconciler. Once this is merged we will remove the DS patch that sets the default to postgres; this way we will not treat the deployment as Postgres before we are using an image that supports it.

*** Bug 1928509 has been marked as a duplicate of this bug. ***

A similar issue was found on AWS as well: https://bugzilla.redhat.com/show_bug.cgi?id=1928509. That bug was marked as a duplicate because the RCA is the same, so this bug is not Azure-specific.

For some reason, the init container image of the MongoDB StatefulSet pod was not updated (although the StatefulSet itself does have the updated image); see the diagnostic sketch below. I am still investigating to find the issue. @apolak, if you have a live cluster with the issue it will help with the investigation.

Aviad ran 3 reproduction attempts, with no success (the upgrade succeeded on all 3). Once we have a live cluster we can try to identify the root cause.

I think that it failed again here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/245/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j002ai3c33-uan/j002ai3c33-uan_20210318T082901/logs/failed_testcase_ocs_logs_1616059248/test_upgrade_ocs_logs/
I see: noobaa-db-0 0/1 Init:CrashLoopBackOff
This was an upgrade from 4.6 to v4.7.0-299.ci. Based on this execution, and on another one from 2 days back (https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/208/), I am failing QE and moving the bug back to ASSIGNED.

The thing is that the build was not tagged as stable, hence our automation does not take this build as the default. I manually triggered an upgrade to build 306 here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/3/

Petr, I see the upgrade passed this time, so I am marking this BZ as verified.
Details about the run: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/9/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c33-ua/j009ai3c33-ua_20210325T092816
Upgrade was from 4.6.3 live to the 4.7 internal build quay.io/rhceph-dev/ocs-registry:4.7.0-318.ci.
Platform: AWS, OCP 4.7.0-0.nightly-2021-03-25-045200.
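For reference, the following is a minimal diagnostic sketch (not taken from the bug itself) of how one might confirm the stuck state described above on a live cluster: the DB pod's init container still running the old MCG image while the StatefulSet template and the NooBaa CR already carry the 4.7 image. The namespace (openshift-storage), object names (noobaa-db, noobaa-db-0, NooBaa CR "noobaa"), and the spec.image field are assumptions based on the comments above.

```bash
# Hedged diagnostic sketch; namespace and object names are assumptions.
NS=openshift-storage

# Pod status: the failure shows up as "Init:CrashLoopBackOff" on the DB pod.
oc -n "$NS" get pods | grep noobaa-db

# Image the running pod's init container is actually using.
oc -n "$NS" get pod noobaa-db-0 \
  -o jsonpath='{.spec.initContainers[*].image}{"\n"}'

# Image the DB StatefulSet template requests (compare with the pod above).
oc -n "$NS" get statefulset noobaa-db \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}'

# Image currently set in the NooBaa CR (updated by ocs-operator during upgrade).
oc -n "$NS" get noobaa noobaa -o jsonpath='{.spec.image}{"\n"}'
```

If the pod still reports the old image while the StatefulSet template and the CR already show the new one, that matches the state described in the investigation comments above.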
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
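Step 4 of the scenario above notes that the DB pod should restart once the init container image is updated, but in the failing runs it did not. Purely as an illustrative sketch (not a workaround stated in this bug), deleting the stuck pod lets the StatefulSet controller recreate it from the current template, which by that point should reference the updated MCG image; the names and namespace are the same assumptions as in the diagnostic sketch above.

```bash
# Illustrative only; not a workaround documented in this bug.
NS=openshift-storage   # assumed namespace

# Check that the StatefulSet template already carries the updated image first
# (see the diagnostic sketch above), then let the controller recreate the pod.
oc -n "$NS" delete pod noobaa-db-0
oc -n "$NS" get pods -w | grep noobaa-db
```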