Bug 1955328

Summary: Upgrade of noobaa DB failed when upgrading OCS 4.6 to 4.7
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: Multi-Cloud Object Gateway
Reporter: Petr Balogh <pbalogh>
Assignee: Danny <dzaken>
QA Contact: Raz Tamir <ratamir>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Version: 4.7
Target Release: OCS 4.7.0
Target Milestone: ---
Keywords: Automation, Regression
CC: assingh, dzaken, ebenahar, etamir, mbukatov, muagarwa, nbecker, ocs-bugs, palshure
Hardware: Unspecified
OS: Unspecified
Fixed In Version: v4.7.0-377.ci
Doc Type: No Doc Update
Type: Bug
Clones: 1956256
Bug Blocks: 1956256
Last Closed: 2021-05-19 09:21:26 UTC

Description Petr Balogh 2021-04-29 21:06:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
This job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/649/consoleFull

failed the OCS upgrade test because the cluster did not have the expected number of pods. Looking at the must-gather, I see these pods:
noobaa-db-0                                                       1/1     Running     0          24m     10.129.2.33    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running     0          27m     10.129.2.31    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>
noobaa-operator-b8cd8767-mn7w5                                    1/1     Running     0          27m     10.131.0.94    ip-10-0-137-77.us-east-2.compute.internal    <none>           <none>
noobaa-upgrade-job-2p2tn                                          0/1     Error       0          6m21s   10.129.2.50    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-2x9n4                                          0/1     Error       0          12m     10.129.2.43    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-8j94d                                          0/1     Error       0          17m     10.129.2.38    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-p7dcl                                          0/1     Error       0          22m     10.129.2.34    ip-10-0-190-240.us-east-2.compute.internal   <none>           <none>

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018ai3c33-ua/j018ai3c33-ua_20210429T163029/logs/failed_testcase_ocs_logs_1619717324/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-0e929cb3857e60e2f154be3ce2f4a2aa2924b2e660e5cf96b9a5f64897a0d072/namespaces/openshift-storage/oc_output/pods_-owide
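The four noobaa-upgrade-job pods in Error state suggest the upgrade Job's pods were retried and failed repeatedly. A minimal triage sketch using standard oc commands against a live cluster; the Job name "noobaa-upgrade-job" is an assumption inferred from the pod names above:

# List pods in the storage namespace with node placement
oc get pods -n openshift-storage -o wide

# Inspect the upgrade Job's spec and events for the failure reason
# ("noobaa-upgrade-job" is an assumed Job name; confirm with: oc get jobs -n openshift-storage)
oc describe job noobaa-upgrade-job -n openshift-storage

# Collect logs from every pod the Job created; "job-name" is the
# standard label the Job controller stamps on its pods
oc logs -n openshift-storage -l job-name=noobaa-upgrade-job --tail=100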


Version of all relevant components (if applicable):
OCS 4.7.0-364.ci
OCP 4.7.0-0.nightly-2021-04-29-115807


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Not sure yet; this is the first time I have seen it.


Can this issue be reproduced from the UI?
Haven't tried


If this is a regression, please provide more details to justify this:
Yes


Steps to Reproduce:
1. Install OCS 4.6.4
2. Upgrade to the latest RC OCS 4.7 build (one way to trigger the upgrade is sketched below)
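For reference, a minimal sketch of one common way to trigger the operator upgrade via OLM. The Subscription name "ocs-operator" and the channel "stable-4.7" are assumptions, not taken from this report (the QE jobs here drive the upgrade through the internal ocs-registry catalog image):

# Hypothetical example: switch the operator Subscription to the 4.7 channel.
# Verify the real Subscription name first with:
#   oc get subscription -n openshift-storage
oc patch subscription ocs-operator -n openshift-storage \
  --type merge -p '{"spec":{"channel":"stable-4.7"}}'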


Actual results:
The NooBaa DB fails to upgrade; the noobaa-upgrade-job pods end in Error state.

Expected results:
The NooBaa DB is upgraded successfully.
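A quick way to assert this outcome in automation, again assuming the upgrade Job is named "noobaa-upgrade-job" (inferred from the pod names, not confirmed in this report):

# Block until the upgrade Job reports Complete, or fail after 10 minutes
# ("noobaa-upgrade-job" is an assumed name; confirm with: oc get jobs -n openshift-storage)
oc wait --for=condition=complete job/noobaa-upgrade-job \
  -n openshift-storage --timeout=600s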


Additional info:
Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/649
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018ai3c33-ua/j018ai3c33-ua_20210429T163029/logs/failed_testcase_ocs_logs_1619717324/test_upgrade_ocs_logs/

Comment 11 Petr Balogh 2021-05-06 07:13:08 UTC
We have another occurrence of this bug here:

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/687/

Using build:
quay.io/rhceph-dev/ocs-registry:4.7.0-377.ci

This time on this environment type:
AWS IPI FIPS ENCRYPTION 3AZ RHCOS 3Masters 3Workers 3Infra nodes

First one was on:
AWS IPI 3AZ RHCOS 3Masters 3Workers


http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003aife3c333-ua/j003aife3c333-ua_20210505T080105/logs/failed_testcase_ocs_logs_1620205433/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-76da8d529f412bb79d33d99fec3d180953c257b904fbbd49f102d5637b17fc04/namespaces/openshift-storage/oc_output/pods_-owide

Here I see:
noobaa-db-0                                                       1/1     Running     0          14m     10.130.2.24    ip-10-0-219-56.us-east-2.compute.internal    <none>           <none>
noobaa-db-pg-0                                                    1/1     Running     0          15m     10.130.2.23    ip-10-0-219-56.us-east-2.compute.internal    <none>           <none>
noobaa-operator-7c64ddbcb-pd7mn                                   1/1     Running     0          15m     10.129.2.23    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-5wjbk                                          0/1     Error       0          12m     10.129.2.25    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-crlrl                                          0/1     Error       0          10m     10.129.2.29    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-rz2cr                                          0/1     Error       0          11m     10.129.2.27    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-s8c6j                                          0/1     Error       0          12m     10.129.2.26    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>
noobaa-upgrade-job-wdfhq                                          0/1     Error       0          11m     10.129.2.28    ip-10-0-150-221.us-east-2.compute.internal   <none>           <none>

Full must gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003aife3c333-ua/j003aife3c333-ua_20210505T080105/logs/failed_testcase_ocs_logs_1620205433/test_upgrade_ocs_logs/
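Five Error pods again point to repeated retries of the same Job: by default, a Job's backoffLimit allows several failed pods before the Job itself is marked failed. A hedged way to confirm this from the cluster (same assumed Job name as above):

# Show the Job's recorded completions/failures
oc get jobs -n openshift-storage

# Pull events tied to the Job to see the retry/backoff messages
oc get events -n openshift-storage \
  --field-selector involvedObject.name=noobaa-upgrade-job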

Comment 18 Petr Balogh 2021-05-06 08:13:12 UTC
OK, I opened a new BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1957639

I will run a few more verification runs before marking this as verified.

Comment 19 Petr Balogh 2021-05-07 10:53:31 UTC
We haven't seen this issue in the last two RC builds, during which we ran a lot of upgrade testing.

Here is just one of the upgrade jobs from the same combination where we originally hit this issue:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/707

Log path:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j021ai3c33-ua/j021ai3c33-ua_20210506T231908

Hence, I am marking this as verified.

Comment 21 errata-xmlrpc 2021-05-19 09:21:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041