Similar to BZ 1915698

Description of problem (please be as detailed as possible and provide log snippets):
On the Azure platform, after an OCS upgrade from 4.6 to OCS 4.7, not all noobaa pods are in Running state:

noobaa-db-0                       1/1   Running    0     14h
noobaa-db-pg-0                    0/1   Init:0/1   128   14h
noobaa-operator-9995644ff-b2zn7   1/1   Running    0     14h

Version of all relevant components (if applicable):
OCP: 4.7.0-fc.4
OCS: ocs-operator.v4.7.0-241.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, it is blocking the upgrade automation on the Azure platform.

Is there any workaround available to the best of your knowledge?
No, I am not aware of one for now.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes. Reproducibility so far: 1/1

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP 4.6 and OCS 4.6
2. Upgrade OCP to 4.7 (auto)
3. Verify that OCP upgraded successfully
4. Upgrade OCS to 4.7

Actual results:
No noobaa-core-0 pod exists after the upgrade.

Expected results:
The noobaa-core pod is running.
Additional info:
Job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16799/console
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocp_must_gather/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-8099d74217f9305c717cb1a157a6a89f5e810834edd9dfd80b89484263e6cc62/namespaces/openshift-storage/pods/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-2801/sgatfane-2801_20210128T053058/logs/failed_testcase_ocs_logs_1611852615/test_check_mon_pdb_post_upgrade_ocs_logs/
*** Bug 1922114 has been marked as a duplicate of this bug. ***
Please let me know if the cluster is needed; I am keeping it alive today for further debugging.
The upgrade sequence causes the following scenario:
1. Before the upgrade (4.6), we have a mongo DB running (noobaa-db-0) with an init image pointing to MCG 4.6. The init image is the same core image that is set in the NooBaa CR.
2. After the upgrade, a new version of noobaa-operator starts. That version creates a Postgres DB statefulset, which also has an init container using the MCG image. Since the NooBaa CR is not yet updated with the new image (this is done by ocs-operator), the Postgres init container tries to run with the old MCG image, which does not contain the Postgres init code.
3. The Postgres init is stuck and cannot complete. In the meantime, ocs-operator updates the NooBaa CR with the new image.
4. Postgres should have restarted once the init container image was updated, but for some reason this does not happen. Since the Postgres DB cannot start, the noobaa-core pod will not start until the migration is completed.

@sgatfane, I tried to reproduce this locally, but so far with no success. If it occurs again, a live cluster can help us test a solution.
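The race in steps 1-3 above can be sketched as follows. This is a minimal illustrative model, not the actual operator code; the type, function names, and image tags are all hypothetical.

```go
package main

import "fmt"

// noobaaCR models the NooBaa CR; the core image is set by ocs-operator
// and is only updated after the OCS upgrade.
type noobaaCR struct {
	CoreImage string
}

// The new noobaa-operator builds the Postgres DB statefulset with an
// init container that reuses the core image from the NooBaa CR.
func postgresInitImage(cr noobaaCR) string {
	return cr.CoreImage
}

// Only the 4.7 core image ships the Postgres init code (hypothetical tag).
func canCompleteInit(image string) bool {
	return image == "mcg-core:4.7"
}

func main() {
	// CR not yet updated by ocs-operator at the time the statefulset
	// is created, so the init container gets the old image.
	cr := noobaaCR{CoreImage: "mcg-core:4.6"}
	img := postgresInitImage(cr)
	fmt.Printf("init container image: %s, init can complete: %v\n", img, canCompleteInit(img))
	// The init container is stuck; even after ocs-operator updates the
	// CR, the stuck pod does not restart with the new image (step 4).
}
```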
This issue will be fixed by sending db_type=postgres only once the new image is set in the CR. I added this to the ocs-operator NooBaa reconciler. Once this is merged, we will remove the DS patch that sets the default to postgres. This way, we will not treat the deployment as Postgres before we are using an image that supports it.
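The gating logic described above can be sketched like this. It is a hedged sketch, not the real reconciler: the function name, signature, and image tags are assumptions made for illustration.

```go
package main

import "fmt"

// desiredDBType returns the db_type the ocs-operator should request for
// the NooBaa deployment. Hypothetical sketch of the fix: postgres is
// only requested once the CR already points at an image that contains
// the Postgres init code; until then the deployment stays on mongodb,
// so the init container never runs Postgres setup with an old image.
func desiredDBType(crImage, postgresCapableImage string) string {
	if crImage == postgresCapableImage {
		return "postgres"
	}
	return "mongodb"
}

func main() {
	// Before ocs-operator updates the CR image: keep mongodb.
	fmt.Println(desiredDBType("mcg-core:4.6", "mcg-core:4.7"))
	// After the CR image is updated: safe to switch to postgres.
	fmt.Println(desiredDBType("mcg-core:4.7", "mcg-core:4.7"))
}
```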
*** Bug 1928509 has been marked as a duplicate of this bug. ***
A similar issue was found on AWS as well: https://bugzilla.redhat.com/show_bug.cgi?id=1928509. That bug was marked as a duplicate since the RCA is the same; hence this bug is not Azure-specific.
For some reason, the init container image of the mongo statefulset pod was not updated (although the STS itself does have the updated image). I am still investigating to find the cause. @apolak, if you have a live cluster with the issue, it will help the investigation.
Aviad ran 3 reproduction attempts with no success (the upgrade succeeded on all 3). Once we have a live cluster, we can try to identify the root cause.
I think that it failed again here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/245/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j002ai3c33-uan/j002ai3c33-uan_20210318T082901/logs/failed_testcase_ocs_logs_1616059248/test_upgrade_ocs_logs/

I see:
noobaa-db-0   0/1   Init:CrashLoopBackOff

Upgrade from 4.6 to v4.7.0-299.ci. Based on this execution and on another one from 2 days ago (https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/208/), I am failing QE and moving this back to ASSIGNED.
The thing is that the build was not tagged as stable, hence our automation does not pick this build by default. I manually triggered an upgrade to build 306 here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto-nightly/3/

Petr
The upgrade passed this time, so I am marking this BZ as VERIFIED.

Details about the run: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Upgrade-OCS/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/9/
Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j009ai3c33-ua/j009ai3c33-ua_20210325T092816

Upgrade from 4.6.3 live to the 4.7 internal build quay.io/rhceph-dev/ocs-registry:4.7.0-318.ci. Platform: AWS.
OCP: 4.7.0-0.nightly-2021-03-25-045200
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041