Bug 1969309
Summary: | [IBM Z] : Upgrade from 4.7 to 4.7.1 fails during the ocs-ci upgrade test | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Sravika <sbalusu>
Component: | Multi-Cloud Object Gateway | Assignee: | Nimrod Becker <nbecker>
Status: | CLOSED WORKSFORME | QA Contact: | Raz Tamir <ratamir>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | etamir, madam, muagarwa, ocs-bugs, rcyriac, svenkat
Target Milestone: | --- | Keywords: | Reopened
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-09-07 13:54:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Created attachment 1789332 [details]
ocs-ci_upgrade__test_html
Upgraded OCS 4.7.0 to the latest rc2, 4.7.1-410.ci. The ocs-operator got updated from 4.7.0 to 4.7.1. However, during the upgrade one of the workers went to "Not Ready" state, leaving the pods on that worker in Terminating/Pending state. Also, the ceph-rgw pod (rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c45c58kdqg6) has been restarted several times.

```
# oc get csv -A
NAMESPACE NAME DISPLAY VERSION REPLACES PHASE
openshift-local-storage local-storage-operator.4.7.0-202105210300.p0 Local Storage 4.7.0-202105210300.p0 Succeeded
openshift-operator-lifecycle-manager packageserver Package Server 0.17.0 Succeeded
openshift-storage ocs-operator.v4.7.1-410.ci OpenShift Container Storage 4.7.1-410.ci ocs-operator.v4.7.0 Succeeded
```

```
# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1312001ocs.lnxne.boe Ready master 4h51m v1.20.0+df9c838
master-1.m1312001ocs.lnxne.boe Ready master 4h50m v1.20.0+df9c838
master-2.m1312001ocs.lnxne.boe Ready master 4h51m v1.20.0+df9c838
worker-0.m1312001ocs.lnxne.boe NotReady worker 4h41m v1.20.0+df9c838
worker-1.m1312001ocs.lnxne.boe Ready worker 4h40m v1.20.0+df9c838
worker-2.m1312001ocs.lnxne.boe Ready worker 4h41m v1.20.0+df9c838
```

```
# oc get po -n openshift-storage -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-cephfsplugin-5tvhn 3/3 Running 0 3h33m 10.13.12.7 worker-1.m1312001ocs.lnxne.boe <none> <none>
csi-cephfsplugin-g7lgl 3/3 Running 0 3h32m 10.13.12.8 worker-2.m1312001ocs.lnxne.boe <none> <none>
csi-cephfsplugin-mgx8m 3/3 Running 0 3h33m 10.13.12.6 worker-0.m1312001ocs.lnxne.boe <none> <none>
csi-cephfsplugin-provisioner-5f668cb9df-fxhzl 6/6 Running 0 3h33m 10.129.2.40 worker-1.m1312001ocs.lnxne.boe <none> <none>
csi-cephfsplugin-provisioner-5f668cb9df-g7vhw 6/6 Running 0 78m 10.128.2.72 worker-2.m1312001ocs.lnxne.boe <none> <none>
csi-rbdplugin-9mtwp 3/3 Running 0 3h33m 10.13.12.7 worker-1.m1312001ocs.lnxne.boe <none> <none>
csi-rbdplugin-gkv6k 3/3 Running 0 3h33m 10.13.12.6 worker-0.m1312001ocs.lnxne.boe <none> <none>
csi-rbdplugin-lbmbq 3/3 Running 0 3h32m 10.13.12.8 worker-2.m1312001ocs.lnxne.boe <none> <none>
csi-rbdplugin-provisioner-846f7dddd4-gcrlb 6/6 Running 0 3h33m 10.128.2.47 worker-2.m1312001ocs.lnxne.boe <none> <none>
csi-rbdplugin-provisioner-846f7dddd4-gdw29 6/6 Running 0 78m 10.129.2.47 worker-1.m1312001ocs.lnxne.boe <none> <none>
must-gather-wmtmx-helper 1/1 Running 0 2m48s 10.128.2.108 worker-2.m1312001ocs.lnxne.boe <none> <none>
noobaa-core-0 1/1 Running 0 3h32m 10.128.2.49 worker-2.m1312001ocs.lnxne.boe <none> <none>
noobaa-db-pg-0 1/1 Terminating 0 3h32m 10.131.0.40 worker-0.m1312001ocs.lnxne.boe <none> <none>
noobaa-endpoint-5d6cbb98b5-lz4mz 1/1 Running 0 3h32m 10.129.2.41 worker-1.m1312001ocs.lnxne.boe <none> <none>
noobaa-operator-b5fddbbd-fpmmx 1/1 Running 0 3h34m 10.129.2.38 worker-1.m1312001ocs.lnxne.boe <none> <none>
ocs-metrics-exporter-795756bbc7-94bbr 1/1 Terminating 0 3h34m 10.131.0.36 worker-0.m1312001ocs.lnxne.boe <none> <none>
ocs-metrics-exporter-795756bbc7-hz7z7 1/1 Running 0 78m 10.128.2.73 worker-2.m1312001ocs.lnxne.boe <none> <none>
ocs-operator-8cd4ff878-sfb4g 1/1 Running 0 3h34m 10.128.2.44 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-5cvmxrt 1/1 Running 0 3h32m 10.129.2.43 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-78pv8xc 1/1 Running 0 3h32m 10.128.2.51 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6c584549hg267 2/2 Running 0 3h32m 10.128.2.50 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c7b66db56dcn 2/2 Running 0 3h31m 10.129.2.45 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mgr-a-d5986d946-mslpz 2/2 Running 0 78m 10.128.2.69 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mon-a-96fd4f84f-tgwlh 2/2 Running 0 3h32m 10.129.2.42 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mon-c-85d9f89d4-b9l9c 2/2 Running 0 3h31m 10.128.2.52 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-mon-d-canary-8444545c5f-w95ns 0/2 Pending 0 23s <none> <none> <none> <none>
rook-ceph-operator-6697f9489f-4rtsr 1/1 Running 0 78m 10.128.2.70 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-osd-0-759985f9b-99c5c 0/2 Pending 0 78m <none> <none> <none> <none>
rook-ceph-osd-1-654584ff5-zpgwb 2/2 Running 0 3h30m 10.128.2.53 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-osd-2-664b5bc786-k49nw 2/2 Running 0 3h19m 10.129.2.46 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-0wvsc24n 0/1 Completed 0 4h13m 10.128.2.32 worker-2.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-1jfclj55 0/1 Completed 0 4h13m 10.129.2.22 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c45c58kdqg6 2/2 Running 5 3h31m 10.129.2.44 worker-1.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-tools-76dbc6f57f-bj9dx 1/1 Terminating 0 3h33m 10.13.12.6 worker-0.m1312001ocs.lnxne.boe <none> <none>
rook-ceph-tools-76dbc6f57f-m2kft 1/1 Running 0 78m 10.13.12.8 worker-2.m1312001ocs.lnxne.boe <none> <none>
worker-1m1312001ocslnxneboe-debug 1/1 Running 0 2m47s 10.13.12.7 worker-1.m1312001ocs.lnxne.boe <none> <none>
worker-2m1312001ocslnxneboe-debug 1/1 Running 0 2m47s 10.13.12.8 worker-2.m1312001ocs.lnxne.boe <none> <none>
```

Uploading the must-gather logs and the ocs-ci upgrade test logs (upgrade_4_7_1_410_rc2_logs.zip).

Created attachment 1789382 [details]
must_gather_and_upgrade_logs
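Since a worker dropping to "Not Ready" is the recurring symptom in this report, the usual first look is standard OpenShift node triage rather than anything OCS-specific: node conditions, resource pressure, and the kubelet journal. A sketch of those checks, using the node name from the output above:

```sh
# Node conditions (MemoryPressure, DiskPressure, PIDPressure, Ready) and recent events
oc describe node worker-0.m1312001ocs.lnxne.boe

# Current CPU/memory usage per node (requires cluster monitoring/metrics to be available)
oc adm top nodes

# Kubelet journal from the affected node
oc adm node-logs worker-0.m1312001ocs.lnxne.boe -u kubelet
```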
Sravika, a few questions:

- This sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1964958. Please take a look at that BZ for comparison.
- Has anything changed in the test configuration since upgrade tests were run in the past?
- How often does this repro in the automation? Every time during the upgrade, or has it only been seen this time?
- Does this issue repro during a manual upgrade?

If a node goes to NotReady state it is commonly due to some resource issue, but it is ultimately out of the control of OCS. We would need the OCP team to look at this, which will take some time. If a manual upgrade does not repro this issue, it shouldn't block the 4.7.1 release.

@Travis, I ran the upgrade (pre_upgrade, ocs_upgrade, post_upgrade) a couple of times with ocs-ci and the worker went to "Not Ready" state each time. However, I observed that this is happening during the post_upgrade tests, specifically during "tests/ecosystem/upgrade/test_resources.py::test_pod_io", which performs IO operations on the pods.

Post Upgrade Test: tests/ecosystem/upgrade/test_resources.py::test_pod_io
Test Description: Test IO on multiple pods at the same time and finish IO on pods that were created before upgrade.

But the ocs_upgrade itself seems to have succeeded, judging from the output below.

```
# oc get csv -A
NAMESPACE NAME DISPLAY VERSION REPLACES PHASE
openshift-local-storage local-storage-operator.4.7.0-202105210300.p0 Local Storage 4.7.0-202105210300.p0 Succeeded
openshift-operator-lifecycle-manager packageserver Package Server 0.17.0 Succeeded
openshift-storage ocs-operator.v4.7.1-410.ci OpenShift Container Storage 4.7.1-410.ci ocs-operator.v4.7.0 Succeeded
```

Is there any specific test we can perform to ensure that the upgrade to OCS 4.7.1 is successful? Also, I have tried the manual upgrade by adding the 4.7.1 ocs-operator via a catalogsource, but the update has not started (tried both Automatic and Manual subscription); please let me know if I am missing something here. I followed this document for the upgrade procedure: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/updating_openshift_container_storage/updating-openshift-container-storage-in-internal-mode_rhocs

Good to hear the upgrade actually succeeded. What IO operations are being performed in that test? If there is too much IO it could raise the load on the machine and thus cause it to become unresponsive. Is this the same IO load that has been used in past upgrade tests? If so, we should understand what changed in 4.7.1 that could cause this. But if the test changed or was just added, it sounds like the IO load should be reduced.
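Regarding the manual upgrade attempt mentioned above (adding the 4.7.1 ocs-operator via a catalogsource, with the update never starting): the flow the CI automates is roughly to create a CatalogSource pointing at the 4.7.1 registry image and then re-point the ocs-operator Subscription at it. A minimal sketch, assuming the catalog name and image from this report's ocs-ci run; with a Manual approval strategy the resulting InstallPlan still has to be approved before anything moves:

```sh
# Sketch only: catalog name and image taken from the ocs-ci invocation in this report.
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/rhceph-dev/ocs-registry:latest-stable-4.7.1
  displayName: OCS 4.7.1
  publisher: Red Hat
EOF

# Re-point the existing subscription at the new catalog (same patch the CI logs show).
oc -n openshift-storage patch subscription ocs-operator --type merge \
  -p '{"spec":{"channel": "stable-4.7", "source": "ocs-catalogsource", "sourceNamespace": "openshift-marketplace"}}'

# With Manual approval, a pending InstallPlan should appear and must be approved.
oc -n openshift-storage get installplan
```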
Thanks for checking; so this is not an upgrade issue, which makes it a duplicate of BZ #1945016.

*** This bug has been marked as a duplicate of bug 1945016 ***

For System P, we upgraded from 4.7 to 4.7.1 and we are seeing pods in openshift-storage restarting:

```
[root@nx121-ahv cron]# oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-29bcw 3/3 Running 0 13h
csi-cephfsplugin-9sr9x 3/3 Running 0 13h
csi-cephfsplugin-dnzmv 3/3 Running 0 13h
csi-cephfsplugin-provisioner-5f668cb9df-njl7l 6/6 Running 0 11h
csi-cephfsplugin-provisioner-5f668cb9df-rpll6 6/6 Running 1 11h
csi-rbdplugin-d9c7g 3/3 Running 0 13h
csi-rbdplugin-provisioner-846f7dddd4-g9wtg 6/6 Running 0 11h
csi-rbdplugin-provisioner-846f7dddd4-qhtkb 6/6 Running 1 11h
csi-rbdplugin-vx5vd 3/3 Running 0 13h
csi-rbdplugin-wvzkr 3/3 Running 0 13h
noobaa-core-0 1/1 Running 0 13h
noobaa-db-pg-0 1/1 Running 0 10h
noobaa-endpoint-78db99447c-wk24v 1/1 Running 1 13h
noobaa-operator-85846f6c99-v8bx6 1/1 Running 0 11h
ocs-metrics-exporter-749dbc674-rjsj9 1/1 Running 0 11h
ocs-operator-84c759dff-qzthj 1/1 Running 1 11h
rook-ceph-crashcollector-worker-0-74d44fdf57-nrqsg 1/1 Running 0 11h
rook-ceph-crashcollector-worker-1-6ff74969c6-ns5g8 1/1 Running 0 9h
rook-ceph-crashcollector-worker-2-7c966f66c5-bxgjc 1/1 Running 0 13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7fc7b5d5ff4k6 2/2 Running 25 11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-85b5bddf64srb 2/2 Running 0 11h
rook-ceph-mgr-a-78c4cfb858-rvrfw 2/2 Running 0 11h
rook-ceph-mon-a-7cc4bdb45c-tkh9l 2/2 Running 1 14h
rook-ceph-mon-b-6ddfddf969-cb8wb 2/2 Running 0 9h
rook-ceph-mon-c-6c8f5cd67c-pq2dm 2/2 Running 0 11h
rook-ceph-operator-69d699b578-zf5kb 1/1 Running 0 11h
rook-ceph-osd-0-6f49f9dc6f-szvv5 2/2 Running 52 13h
rook-ceph-osd-1-664d79cdf5-2ttcz 2/2 Running 0 11h
rook-ceph-osd-2-89fdb689f-5swnv 2/2 Running 1 9h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d7f44dl8686 2/2 Running 76 11h
rook-ceph-tools-76dbc6f57f-hrzqg 1/1 Running 0 11h
[root@nx121-ahv cron]#
```

Ceph health:

```
[root@nx121-ahv cron]# oc rsh -n openshift-storage rook-ceph-tools-76dbc6f57f-hrzqg ceph -s
  cluster:
    id: c87b3325-5b8f-4d3a-a91f-68656bb72661
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 9h)
    mgr: a(active, since 10h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 9h), 3 in (since 15h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools: 10 pools, 272 pgs
    objects: 17.26k objects, 66 GiB
    usage: 200 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs: 272 active+clean

  io:
    client: 853 B/s rd, 7.0 KiB/s wr, 1 op/s rd, 0 op/s wr
[root@nx121-ahv cron]#
```

@Travis/Mudit: One more point to observe is that even though the upgrade itself went fine, the upgrade testcase failed with the following error:

```
14:52:58 - MainThread - ocs_ci.ocs.ocs_upgrade - INFO - Current OCS subscription source: redhat-operators
14:52:58 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage patch subscription ocs-operator -n openshift-storage --type merge -p '{"spec":{"channel": "stable-4.7", "source": "ocs-catalogsource"}}'
14:52:59 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan -n openshift-storage -o yaml
14:52:59 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 997, in __iter__
    yield self.func(*self.func_args, **self.func_kwargs)
  File "/root/ocs-ci/ocs_ci/ocs/resources/install_plan.py", line 72, in get_install_plans_for_approve
    f"No install plan for approve found in namespace {namespace}"
ocs_ci.ocs.exceptions.NoInstallPlanForApproveFoundException: No install plan for approve found in namespace openshift-storage
14:52:59 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 10 seconds before next iteration
14:53:03 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:09 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan -n openshift-storage -o yaml
14:53:09 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:10 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 997, in __iter__
    yield self.func(*self.func_args, **self.func_kwargs)
  File "/root/ocs-ci/ocs_ci/ocs/resources/install_plan.py", line 72, in get_install_plans_for_approve
    f"No install plan for approve found in namespace {namespace}"
ocs_ci.ocs.exceptions.NoInstallPlanForApproveFoundException: No install plan for approve found in namespace openshift-storage
14:53:10 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 10 seconds before next iteration
14:53:16 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:20 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan -n openshift-storage -o yaml
14:53:21 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
```
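For reference, the "No install plan for approve found" loop above is the CI polling for a pending InstallPlan to approve, because the subscription uses the Manual approval strategy. The manual equivalent, assuming standard OLM fields (the plan name below is hypothetical), looks roughly like:

```sh
# Show install plans with their approval strategy and current state
oc -n openshift-storage get installplan \
  -o custom-columns=NAME:.metadata.name,APPROVAL:.spec.approval,APPROVED:.spec.approved,CSVS:.spec.clusterServiceVersionNames

# Approve a pending plan by name (replace install-xxxxx with the real name)
oc -n openshift-storage patch installplan install-xxxxx \
  --type merge -p '{"spec":{"approved":true}}'
```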
@Travis/Mudit: Sorry, please ignore my previous comment; the upgrade is still failing on Z. Could you please reopen the bug rather than closing it?

I have changed the subscription mode of OCS 4.7.0 to "Automatic", which does not require install plan approval as in the case of a "Manual" subscription, and ran the ocs-ci upgrade tests. The upgrade test failed during the noobaa pods upgrade, and the worker also went to "Not Ready" state during the upgrade itself.
```
07:55:50 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod -n openshift-storage -o yaml
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - INFO - Found 4 pod(s) for selector: app=noobaa
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Number of found pods 4 is not as expected: 5
07:55:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/ocs4/mcg-core-rhel8@sha256:1496a3e823db8536380e01c58e39670e9fa2cc3d15229b2edc300acc56282c8c'} weren't upgraded in: noobaa-core-0!
```

I am attaching the logs of the upgrade (upgrade_4_7_1_410_rc2).

Created attachment 1790165 [details]
upgrade_4_7_1_410_rc2
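One quick way to double-check by hand whether a pod such as noobaa-core-0 is actually running the upgraded image (which is what the warning above is about) is to compare the image requested in the pod spec with the digest actually running. A sketch using the standard pod fields:

```sh
# Image requested by the pod spec, per container
oc -n openshift-storage get pod noobaa-core-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

# Image digest actually pulled and running, per container
oc -n openshift-storage get pod noobaa-core-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.imageID}{"\n"}{end}'
```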
Reopening because of the upgrade failure in https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c12.

So there are two issues:

1) upgrade failure - tracked in this BZ
2) worker nodes going down - tracked in BZ #1945016

Created attachment 1790170 [details]
upgrade_4_7_1_410_rc2
Sravika, can we please have the must-gather as well? I see a lot of errors in the upgrade log; it is difficult to say why exactly the upgrade failed without the must-gather logs.

Removing 4.8? as the upgrade is from 4.7 to 4.7.1; will retarget if this is relevant for 4.8 also.

@Mudit, the must-gather is already uploaded and is part of upgrade_4_7_1_410_rc2.zip.

Thanks Sravika, got it now. Nimrod, can you please take a look? It looks like the Noobaa pods were not upgraded and hence the CI failed. Please ignore the initial BZ discussion while looking; just check from https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c12. There is a different issue for nodes going down, and I checked that in this case all nodes are UP. Please reassign if this is not related to noobaa.

@Mudit, I have re-executed the test on Friday and this time the upgrade proceeded further and the pods got upgraded; however, the ceph health recovery did not happen in time, which failed the upgrade test. The ceph health recovery does happen after some time, and the slow ceph health performance seems to be due to an environment issue reported in #1952514. I will rerun this on a cluster which is on another CEC. One additional observation is that the ceph-rgw pod restarted 10 times during the upgrade. I am uploading the upgrade logs and must-gather logs (upgrade_4_7_1_410_rc2_new.zip).

```
# oc get po -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-bxvbj 3/3 Running 0 2d17h
csi-cephfsplugin-dnx6h 3/3 Running 0 2d17h
csi-cephfsplugin-pcvn7 3/3 Running 0 2d16h
csi-cephfsplugin-provisioner-5f668cb9df-n24hv 6/6 Running 0 2d17h
csi-cephfsplugin-provisioner-5f668cb9df-nb6l6 6/6 Running 0 2d17h
csi-rbdplugin-5jvxn 3/3 Running 0 2d17h
csi-rbdplugin-k8z5f 3/3 Running 0 2d17h
csi-rbdplugin-mcggm 3/3 Running 0 2d16h
csi-rbdplugin-provisioner-846f7dddd4-66hxz 6/6 Running 0 2d17h
csi-rbdplugin-provisioner-846f7dddd4-hklpz 6/6 Running 0 2d17h
noobaa-core-0 1/1 Running 0 2d16h
noobaa-db-pg-0 1/1 Running 0 2d16h
noobaa-endpoint-5478c6787c-zh4m6 1/1 Running 0 2d17h
noobaa-operator-6d768f64c6-95vd8 1/1 Running 0 2d17h
ocs-metrics-exporter-576564488f-9vzs8 1/1 Running 0 2d17h
ocs-operator-85bc6445fd-hlcmq 1/1 Running 0 2d17h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-66dq9fl 1/1 Running 0 2d16h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-5cgnk4l 1/1 Running 0 2d16h
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-78fn7l7 1/1 Running 0 2d16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6d7876c7pzs9s 2/2 Running 0 2d16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-8555c99dwgwrf 2/2 Running 0 2d16h
rook-ceph-mgr-a-747fb898db-pphns 2/2 Running 0 2d16h
rook-ceph-mon-a-8d7887686-7bz6q 2/2 Running 0 2d16h
rook-ceph-mon-b-7786665747-z7jss 2/2 Running 0 2d16h
rook-ceph-mon-c-785d4f9b7f-xt592 2/2 Running 0 2d16h
rook-ceph-operator-5d99cd5d7-d7nkd 1/1 Running 0 2d17h
rook-ceph-osd-0-6b56bc4cb8-jvlwg 2/2 Running 0 2d16h
rook-ceph-osd-1-658fbfb747-pt57v 2/2 Running 0 2d16h
rook-ceph-osd-2-5665989b75-c54rd 2/2 Running 0 2d16h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-04smzvx 0/1 Completed 0 2d17h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-1s6xw48 0/1 Completed 0 2d17h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-2bgnp65 0/1 Completed 0 2d17h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c4f6995kms7w 2/2 Running 10 2d16h
rook-ceph-tools-76dbc6f57f-vnmlg 1/1 Running 0 2d17
```

Uploaded the latest logs (upgrade_4_7_1_410_rc2_new.zip) to Google Drive as the size is larger than allowed:
https://drive.google.com/file/d/1wflMngza3EQDJxNB4G_E0WYEwWE23HVy/view?usp=sharing

Moving it back to Noobaa to see what happened in https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c19. The other instances (including the ceph health one) are not reliable enough as of now, because we suspect those are caused either by the worker nodes being down or by an environment issue.

In the same environment I have run the "ocs_upgrade" test in isolation, excluding the "pre_upgrade" and "post_upgrade" tests, and the upgrade test PASSED without any timeouts during the ceph health recovery. Also, there are no pod restarts observed during the "ocs_upgrade". The "pre_upgrade" disruptive test (tests/manage/monitoring/test_workload_with_distruptions.py::test_workload_with_checksum tests/manage/z_cluster/cluster_expansion/) is causing the worker node to go into "Not Ready" state (BZ #1945016), which resulted in an upgrade failure as the pods were left in "Terminating" state. Uploading both the logs with and without the pre_upgrade tests (upgrade_4_7_1_410_rc2_jun14.zip).

Created attachment 1791238 [details]
upgrade_4_7_1_410_rc2_jun14
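Since several of the runs above stalled waiting for ceph health recovery, and one cluster reported HEALTH_WARN "1 daemons have recently crashed", the usual way to inspect that is from the rook-ceph-tools pod. A sketch, assuming the usual app=rook-ceph-tools label on the tools pod:

```sh
# Resolve the tools pod by label instead of hard-coding its name
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)

# Overall health and the reason for any WARN state
oc -n openshift-storage rsh "$TOOLS" ceph health detail

# List recorded daemon crashes; archiving them clears the "recently crashed" warning
oc -n openshift-storage rsh "$TOOLS" ceph crash ls
oc -n openshift-storage rsh "$TOOLS" ceph crash archive-all
```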
This is not seen with the latest builds. Closing for now; please reopen if this is reproducible.
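To close the loop on the earlier question about how to confirm that an upgrade to 4.7.1 actually completed: the checks used throughout this report boil down to the operator CSV phase, the StorageCluster status, and the pod/image state. A minimal sketch (resource names as they appear in this report):

```sh
# The ocs-operator CSV should show the 4.7.1 version in phase Succeeded
oc -n openshift-storage get csv

# The StorageCluster should be back to a Ready/available state
oc -n openshift-storage get storagecluster -o wide

# All openshift-storage pods should be Running, with no unexpected restarts
oc -n openshift-storage get pods
```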
Created attachment 1789331 [details]
ocs-ci_upgrade _log

Description of problem (please be detailed as possible and provide log snippets):

During the ocs-ci upgrade test (4.7.0 to 4.7.1-403.ci), the upgrade failed, workers went to "Not Ready" state, and multiple pods crashed during the upgrade test, leaving the storage cluster broken. The ocs-catalogsource was in state "TRANSIENT_FAILURE!" and did not come to "READY" state during the upgrade procedure of the OCS cluster.

Version of all relevant components (if applicable):
OCP: 4.7.13
Local Storage: 4.7.0-202105210300.p0
OCS: 4.7.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the upgrade to 4.7.1 fails.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP and OCS using the local storage operator.
2. The approval strategy during installation of OCS should be Manual.
3. Upgrade OCS from 4.7.0 to 4.7.1 using ocs-ci:

```
run-ci -m 'pre_upgrade or ocs_upgrade or post_upgrade' --ocs-version 4.7 --upgrade-ocs-version 4.7.1 --upgrade-ocs-registry-image 'quay.io/rhceph-dev/ocs-registry:latest-stable-4.7.1' --ocsci-conf config.yaml --ocsci-conf conf/ocsci/manual_subscription_plan_approval.yaml --cluster-path <cluster_path>
```

Actual results:

The upgrade fails, leaving the cluster broken and worker nodes "Not Ready".

```
# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1312001ocs.lnxne.boe Ready master 18h v1.20.0+df9c838
master-1.m1312001ocs.lnxne.boe Ready master 18h v1.20.0+df9c838
master-2.m1312001ocs.lnxne.boe Ready master 18h v1.20.0+df9c838
worker-0.m1312001ocs.lnxne.boe Ready worker 18h v1.20.0+df9c838
worker-1.m1312001ocs.lnxne.boe NotReady worker 18h v1.20.0+df9c838
worker-2.m1312001ocs.lnxne.boe NotReady worker 18h v1.20.0+df9c838
```

```
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-72fbw 3/3 Running 0 17h
csi-cephfsplugin-bj4gq 3/3 Running 0 17h
csi-cephfsplugin-d5sqx 3/3 Running 0 17h
csi-cephfsplugin-provisioner-6878df594-4hd4h 0/6 Pending 0 14h
csi-cephfsplugin-provisioner-6878df594-7pk9x 6/6 Running 0 17h
csi-cephfsplugin-provisioner-6878df594-s85jt 6/6 Terminating 0 17h
csi-rbdplugin-28hz7 3/3 Running 0 17h
csi-rbdplugin-provisioner-85f54d8949-fdgnw 6/6 Running 0 15h
csi-rbdplugin-provisioner-85f54d8949-rkmsp 0/6 Pending 0 14h
csi-rbdplugin-provisioner-85f54d8949-wvtmt 6/6 Terminating 0 17h
csi-rbdplugin-sk9j7 3/3 Running 0 17h
csi-rbdplugin-v9n2c 3/3 Running 0 17h
must-gather-k424v-helper 1/1 Running 0 23m
noobaa-core-0 1/1 Terminating 0 17h
noobaa-db-pg-0 1/1 Terminating 0 17h
noobaa-endpoint-d4b7854cc-gtvsg 1/1 Running 0 14h
noobaa-endpoint-d4b7854cc-xg4m6 1/1 Terminating 0 17h
noobaa-operator-7f54bd479b-mdljh 1/1 Running 0 18h
ocs-metrics-exporter-fbbf75785-gtg6s 1/1 Terminating 0 18h
ocs-metrics-exporter-fbbf75785-x7fcb 1/1 Running 0 15h
ocs-operator-9876977fb-777sb 0/1 Running 0 18h
ocs-osd-removal-job-mq5ww 0/1 Completed 0 15h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-786jztg 1/1 Running 0 17h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-76frvl9 0/1 Pending 0 14h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-76kn7s6 1/1 Terminating 0 17h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79bdf67d7fxks 2/2 Terminating 0 17h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79bdf67dzr6xl 0/2 Pending 0 14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fbd9494x8mbd 2/2 Running 0 17h
rook-ceph-mgr-a-99fddfdd6-kf9xg 2/2 Running 0 17h
rook-ceph-mon-b-548fbd4dcc-bpvkg 2/2 Running 12 17h
rook-ceph-mon-c-5bf994d655-7vllr 0/2 Pending 0 14h
rook-ceph-mon-c-5bf994d655-dn2dw 2/2 Terminating 0 17h
rook-ceph-operator-6ff459dd8-dx2wk 1/1 Terminating 0 15h
rook-ceph-operator-6ff459dd8-qtcl4 1/1 Running 0 14h
rook-ceph-osd-0-5c9996fbdf-78jph 0/2 Pending 0 14h
rook-ceph-osd-0-5c9996fbdf-zhhmb 2/2 Terminating 0 17h
rook-ceph-osd-1-8d8bb79fc-mz75n 2/2 Running 0 15h
rook-ceph-osd-2-bcc659b64-rbtfp 2/2 Running 211 17h
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-04kmp2h9 0/1 Completed 0 17h
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-37xhwwxv 0/1 Completed 0 15h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5dd99c4bwbl2 2/2 Terminating 0 15h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5dd99c4x64jw 2/2 Running 290 14h
rook-ceph-tools-67ddbbccbc-ns5p5 1/1 Terminating 0 17h
rook-ceph-tools-67ddbbccbc-sgldf 1/1 Running 0 15h
worker-0m1312001ocslnxneboe-debug 1/1 Running 0 23m
```

Expected results:

The upgrade from 4.7.0 to 4.7.1 should succeed.

Additional info:

Worker node specs: 3 worker nodes, 64 GB memory, 16 cores, 1 TB disk.

Must-gather log collection hangs as the cluster is broken; attaching the logs of the ocs-ci upgrade test.
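For completeness, since must-gather collection came up several times in this report (and hung here because the cluster was broken): OCS logs are normally collected with the dedicated must-gather image. The image tag below is the one documented for OCS 4.7, but treat it as an assumption and adjust it to your release:

```sh
# Collect OCS-specific must-gather data into a local directory
oc adm must-gather \
  --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 \
  --dest-dir=./ocs-must-gather
```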