Bug 1969309 - [IBM Z]: Upgrade from 4.7 to 4.7.1 fails during the ocs-ci upgrade test
Summary: [IBM Z]: Upgrade from 4.7 to 4.7.1 fails during the ocs-ci upgrade test
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Nimrod Becker
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-08 07:41 UTC by Sravika
Modified: 2021-09-07 13:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-07 13:54:57 UTC
Embargoed:


Attachments
ocs-ci_upgrade _log (659.56 KB, text/plain), uploaded 2021-06-08 07:41 UTC by Sravika
ocs-ci_upgrade__test_html (73.96 KB, text/html), uploaded 2021-06-08 07:43 UTC by Sravika
must_gather_and_upgrade_logs (8.05 MB, application/zip), uploaded 2021-06-08 13:32 UTC by Sravika
upgrade_4_7_1_410_rc2 (37.48 KB, application/zip), uploaded 2021-06-11 11:07 UTC by Sravika
upgrade_4_7_1_410_rc2 (13.40 MB, application/zip), uploaded 2021-06-11 11:40 UTC by Sravika
upgrade_4_7_1_410_rc2_jun14 (86.43 KB, application/zip), uploaded 2021-06-15 11:36 UTC by Sravika

Description Sravika 2021-06-08 07:41:54 UTC
Created attachment 1789331 [details]
ocs-ci_upgrade _log

Description of problem (please be as detailed as possible and provide log
snippets):

During the ocs-ci upgrade test (4.7.0 to 4.7.1-403.ci), the upgrade failed: worker nodes went to "Not Ready" state and multiple pods crashed during the upgrade, leaving the storage cluster broken. The ocs-catalogsource was in "TRANSIENT_FAILURE!" state and never reached the "READY" state during the upgrade procedure of the OCS cluster.
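
For reference, a minimal diagnostic sketch for a stuck catalog source (assumptions: the ocs-catalogsource lives in the openshift-marketplace namespace and its registry pod carries the standard olm.catalogSource label; adjust names and namespaces as needed):

# oc -n openshift-marketplace get catalogsource ocs-catalogsource -o yaml
# oc -n openshift-marketplace get pods -l olm.catalogSource=ocs-catalogsource
# oc -n openshift-marketplace describe pod <catalog-source-pod>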


Version of all relevant components (if applicable):

OCP: 4.7.13
Local Storage : 4.7.0-202105210300.p0   
OCS: 4.7.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the upgrade to 4.7.1 fails.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP and OCS using the Local Storage Operator.
2. Set the approval strategy to Manual during the OCS installation (a manual InstallPlan approval sketch follows the run-ci command below).
3. Upgrade OCS from 4.7.0 to 4.7.1 using ocs-ci:

run-ci -m 'pre_upgrade or ocs_upgrade or post_upgrade' --ocs-version 4.7 --upgrade-ocs-version 4.7.1 --upgrade-ocs-registry-image 'quay.io/rhceph-dev/ocs-registry:latest-stable-4.7.1' --ocsci-conf config.yaml --ocsci-conf conf/ocsci/manual_subscription_plan_approval.yaml --cluster-path <cluster_path>
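
Because step 2 uses the Manual approval strategy, the new InstallPlan for 4.7.1 has to be approved before OLM rolls the operator forward (ocs-ci's upgrade flow attempts this itself, per the install plan log later in comment 10). A minimal sketch for doing it by hand, where install-xxxxx is a placeholder for the actual plan name:

# oc -n openshift-storage get installplan
# oc -n openshift-storage patch installplan install-xxxxx --type merge -p '{"spec":{"approved":true}}'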



Actual results:
The upgrade fails, leaving the cluster broken and worker nodes in "Not Ready" state.

# oc get nodes
NAME                             STATUS     ROLES    AGE   VERSION
master-0.m1312001ocs.lnxne.boe   Ready      master   18h   v1.20.0+df9c838
master-1.m1312001ocs.lnxne.boe   Ready      master   18h   v1.20.0+df9c838
master-2.m1312001ocs.lnxne.boe   Ready      master   18h   v1.20.0+df9c838
worker-0.m1312001ocs.lnxne.boe   Ready      worker   18h   v1.20.0+df9c838
worker-1.m1312001ocs.lnxne.boe   NotReady   worker   18h   v1.20.0+df9c838
worker-2.m1312001ocs.lnxne.boe   NotReady   worker   18h   v1.20.0+df9c838

# oc get po -n openshift-storage
NAME                                                              READY   STATUS        RESTARTS   AGE
csi-cephfsplugin-72fbw                                            3/3     Running       0          17h
csi-cephfsplugin-bj4gq                                            3/3     Running       0          17h
csi-cephfsplugin-d5sqx                                            3/3     Running       0          17h
csi-cephfsplugin-provisioner-6878df594-4hd4h                      0/6     Pending       0          14h
csi-cephfsplugin-provisioner-6878df594-7pk9x                      6/6     Running       0          17h
csi-cephfsplugin-provisioner-6878df594-s85jt                      6/6     Terminating   0          17h
csi-rbdplugin-28hz7                                               3/3     Running       0          17h
csi-rbdplugin-provisioner-85f54d8949-fdgnw                        6/6     Running       0          15h
csi-rbdplugin-provisioner-85f54d8949-rkmsp                        0/6     Pending       0          14h
csi-rbdplugin-provisioner-85f54d8949-wvtmt                        6/6     Terminating   0          17h
csi-rbdplugin-sk9j7                                               3/3     Running       0          17h
csi-rbdplugin-v9n2c                                               3/3     Running       0          17h
must-gather-k424v-helper                                          1/1     Running       0          23m
noobaa-core-0                                                     1/1     Terminating   0          17h
noobaa-db-pg-0                                                    1/1     Terminating   0          17h
noobaa-endpoint-d4b7854cc-gtvsg                                   1/1     Running       0          14h
noobaa-endpoint-d4b7854cc-xg4m6                                   1/1     Terminating   0          17h
noobaa-operator-7f54bd479b-mdljh                                  1/1     Running       0          18h
ocs-metrics-exporter-fbbf75785-gtg6s                              1/1     Terminating   0          18h
ocs-metrics-exporter-fbbf75785-x7fcb                              1/1     Running       0          15h
ocs-operator-9876977fb-777sb                                      0/1     Running       0          18h
ocs-osd-removal-job-mq5ww                                         0/1     Completed     0          15h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-786jztg   1/1     Running       0          17h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-76frvl9   0/1     Pending       0          14h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-76kn7s6   1/1     Terminating   0          17h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79bdf67d7fxks   2/2     Terminating   0          17h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79bdf67dzr6xl   0/2     Pending       0          14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fbd9494x8mbd   2/2     Running       0          17h
rook-ceph-mgr-a-99fddfdd6-kf9xg                                   2/2     Running       0          17h
rook-ceph-mon-b-548fbd4dcc-bpvkg                                  2/2     Running       12         17h
rook-ceph-mon-c-5bf994d655-7vllr                                  0/2     Pending       0          14h
rook-ceph-mon-c-5bf994d655-dn2dw                                  2/2     Terminating   0          17h
rook-ceph-operator-6ff459dd8-dx2wk                                1/1     Terminating   0          15h
rook-ceph-operator-6ff459dd8-qtcl4                                1/1     Running       0          14h
rook-ceph-osd-0-5c9996fbdf-78jph                                  0/2     Pending       0          14h
rook-ceph-osd-0-5c9996fbdf-zhhmb                                  2/2     Terminating   0          17h
rook-ceph-osd-1-8d8bb79fc-mz75n                                   2/2     Running       0          15h
rook-ceph-osd-2-bcc659b64-rbtfp                                   2/2     Running       211        17h
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-04kmp2h9   0/1     Completed     0          17h
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-37xhwwxv   0/1     Completed     0          15h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5dd99c4bwbl2   2/2     Terminating   0          15h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5dd99c4x64jw   2/2     Running       290        14h
rook-ceph-tools-67ddbbccbc-ns5p5                                  1/1     Terminating   0          17h
rook-ceph-tools-67ddbbccbc-sgldf                                  1/1     Running       0          15h
worker-0m1312001ocslnxneboe-debug                                 1/1     Running       0          23m

Expected results:

Upgrade from 4.7.0 to 4.7.1 should succeed.

Additional info:

Worker node specs: 3 worker nodes, each with 64 GB memory, 16 cores, and a 1 TB disk.

Must-gather log collection hangs because the cluster is broken; attaching the log of the ocs-ci upgrade test.

Comment 2 Sravika 2021-06-08 07:43:26 UTC
Created attachment 1789332 [details]
ocs-ci_upgrade__test_html

Comment 3 Sravika 2021-06-08 13:28:50 UTC
Upgraded OCS 4.7.0 to the latest rc2, 4.7.1-410.ci. The ocs-operator got updated from 4.7.0 to 4.7.1.
However, during the upgrade one of the workers went to "Not Ready" state, leaving the pods on that worker in Terminating/Pending state.
Also, the ceph-rgw pod (rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c45c58kdqg6) was restarted several times.

# oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES              PHASE
openshift-local-storage                local-storage-operator.4.7.0-202105210300.p0   Local Storage                 4.7.0-202105210300.p0                         Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                        Succeeded
openshift-storage                      ocs-operator.v4.7.1-410.ci                     OpenShift Container Storage   4.7.1-410.ci            ocs-operator.v4.7.0   Succeeded


# oc get nodes
NAME                             STATUS     ROLES    AGE     VERSION
master-0.m1312001ocs.lnxne.boe   Ready      master   4h51m   v1.20.0+df9c838
master-1.m1312001ocs.lnxne.boe   Ready      master   4h50m   v1.20.0+df9c838
master-2.m1312001ocs.lnxne.boe   Ready      master   4h51m   v1.20.0+df9c838
worker-0.m1312001ocs.lnxne.boe   NotReady   worker   4h41m   v1.20.0+df9c838
worker-1.m1312001ocs.lnxne.boe   Ready      worker   4h40m   v1.20.0+df9c838
worker-2.m1312001ocs.lnxne.boe   Ready      worker   4h41m   v1.20.0+df9c838


# oc get po -n openshift-storage -owide
NAME                                                              READY   STATUS        RESTARTS   AGE     IP             NODE                             NOMINATED NODE   READINESS GATES
csi-cephfsplugin-5tvhn                                            3/3     Running       0          3h33m   10.13.12.7     worker-1.m1312001ocs.lnxne.boe   <none>           <none>
csi-cephfsplugin-g7lgl                                            3/3     Running       0          3h32m   10.13.12.8     worker-2.m1312001ocs.lnxne.boe   <none>           <none>
csi-cephfsplugin-mgx8m                                            3/3     Running       0          3h33m   10.13.12.6     worker-0.m1312001ocs.lnxne.boe   <none>           <none>
csi-cephfsplugin-provisioner-5f668cb9df-fxhzl                     6/6     Running       0          3h33m   10.129.2.40    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
csi-cephfsplugin-provisioner-5f668cb9df-g7vhw                     6/6     Running       0          78m     10.128.2.72    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
csi-rbdplugin-9mtwp                                               3/3     Running       0          3h33m   10.13.12.7     worker-1.m1312001ocs.lnxne.boe   <none>           <none>
csi-rbdplugin-gkv6k                                               3/3     Running       0          3h33m   10.13.12.6     worker-0.m1312001ocs.lnxne.boe   <none>           <none>
csi-rbdplugin-lbmbq                                               3/3     Running       0          3h32m   10.13.12.8     worker-2.m1312001ocs.lnxne.boe   <none>           <none>
csi-rbdplugin-provisioner-846f7dddd4-gcrlb                        6/6     Running       0          3h33m   10.128.2.47    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
csi-rbdplugin-provisioner-846f7dddd4-gdw29                        6/6     Running       0          78m     10.129.2.47    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
must-gather-wmtmx-helper                                          1/1     Running       0          2m48s   10.128.2.108   worker-2.m1312001ocs.lnxne.boe   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          3h32m   10.128.2.49    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
noobaa-db-pg-0                                                    1/1     Terminating   0          3h32m   10.131.0.40    worker-0.m1312001ocs.lnxne.boe   <none>           <none>
noobaa-endpoint-5d6cbb98b5-lz4mz                                  1/1     Running       0          3h32m   10.129.2.41    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
noobaa-operator-b5fddbbd-fpmmx                                    1/1     Running       0          3h34m   10.129.2.38    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
ocs-metrics-exporter-795756bbc7-94bbr                             1/1     Terminating   0          3h34m   10.131.0.36    worker-0.m1312001ocs.lnxne.boe   <none>           <none>
ocs-metrics-exporter-795756bbc7-hz7z7                             1/1     Running       0          78m     10.128.2.73    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
ocs-operator-8cd4ff878-sfb4g                                      1/1     Running       0          3h34m   10.128.2.44    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-5cvmxrt   1/1     Running       0          3h32m   10.129.2.43    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-78pv8xc   1/1     Running       0          3h32m   10.128.2.51    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6c584549hg267   2/2     Running       0          3h32m   10.128.2.50    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c7b66db56dcn   2/2     Running       0          3h31m   10.129.2.45    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mgr-a-d5986d946-mslpz                                   2/2     Running       0          78m     10.128.2.69    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mon-a-96fd4f84f-tgwlh                                   2/2     Running       0          3h32m   10.129.2.42    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mon-c-85d9f89d4-b9l9c                                   2/2     Running       0          3h31m   10.128.2.52    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-mon-d-canary-8444545c5f-w95ns                           0/2     Pending       0          23s     <none>         <none>                           <none>           <none>
rook-ceph-operator-6697f9489f-4rtsr                               1/1     Running       0          78m     10.128.2.70    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-osd-0-759985f9b-99c5c                                   0/2     Pending       0          78m     <none>         <none>                           <none>           <none>
rook-ceph-osd-1-654584ff5-zpgwb                                   2/2     Running       0          3h30m   10.128.2.53    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-osd-2-664b5bc786-k49nw                                  2/2     Running       0          3h19m   10.129.2.46    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-0wvsc24n   0/1     Completed     0          4h13m   10.128.2.32    worker-2.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localdisksc-0-data-1jfclj55   0/1     Completed     0          4h13m   10.129.2.22    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6c45c58kdqg6   2/2     Running       5          3h31m   10.129.2.44    worker-1.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-tools-76dbc6f57f-bj9dx                                  1/1     Terminating   0          3h33m   10.13.12.6     worker-0.m1312001ocs.lnxne.boe   <none>           <none>
rook-ceph-tools-76dbc6f57f-m2kft                                  1/1     Running       0          78m     10.13.12.8     worker-2.m1312001ocs.lnxne.boe   <none>           <none>
worker-1m1312001ocslnxneboe-debug                                 1/1     Running       0          2m47s   10.13.12.7     worker-1.m1312001ocs.lnxne.boe   <none>           <none>
worker-2m1312001ocslnxneboe-debug                                 1/1     Running       0          2m47s   10.13.12.8     worker-2.m1312001ocs.lnxne.boe   <none>           <none>


Uploading the must-gather logs and the ocs-ci upgrade test logs (upgrade_4_7_1_410_rc2_logs.zip).

Comment 4 Sravika 2021-06-08 13:32:27 UTC
Created attachment 1789382 [details]
must_gather_and_upgrade_logs

Comment 5 Travis Nielsen 2021-06-08 14:03:53 UTC
Sravika, a few questions:
- This sounds very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1964958. Please take a look at that BZ for comparison. 
- Has anything changed in the test configuration since upgrade tests were run in the past?
- How often does this repro in the automation? Every time during the upgrade? Or has it only been seen this time?
- Does this issue repro during a manual upgrade?

If a node goes to NotReady state it is commonly due to some resource issue, but it is ultimately out of the control of OCS. We would need the OCP team to look at this, which will take some time. If a manual upgrade does not reproduce this issue, it shouldn't block the 4.7.1 release.
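
A minimal sketch of the usual first checks when a worker flips to NotReady (node name taken from the listing above; the last command assumes node access via oc debug works on this cluster):

# oc describe node worker-1.m1312001ocs.lnxne.boe
# oc adm top nodes
# oc get events -A --sort-by=.lastTimestamp | tail -40
# oc debug node/worker-1.m1312001ocs.lnxne.boe -- chroot /host journalctl -u kubelet --since "1 hour ago"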

Comment 6 Sravika 2021-06-09 16:03:22 UTC
@Travis,

I ran the upgrade (pre_upgrade, ocs_upgrade, post_upgrade) a couple of times with ocs-ci, and a worker went to "Not Ready" state each time.
However, I observed that this happens during the post_upgrade tests, specifically during the test "tests/ecosystem/upgrade/test_resources.py::test_pod_io", which performs IO operations on the pods.

Post Upgrade Test: tests/ecosystem/upgrade/test_resources.py::test_pod_io
Test Description: Test IO on multiple pods at the same time and finish IO on pods that were created before the upgrade.


But the ocs_upgrade itself seems to have succeeded, based on the output below:
# oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES              PHASE
openshift-local-storage                local-storage-operator.4.7.0-202105210300.p0   Local Storage                 4.7.0-202105210300.p0                         Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                        Succeeded
openshift-storage                      ocs-operator.v4.7.1-410.ci                     OpenShift Container Storage   4.7.1-410.ci            ocs-operator.v4.7.0   Succeeded

Is there any specific test we can perform to ensure that the upgrade to OCS 4.7.1 was successful?

Also, I have tried the manual upgrade by adding the 4.7.1 ocs-operator via a catalog source, but the update did not start (tried both Automatic and Manual subscription approval); please let me know if I am missing something here.
I followed this document for the upgrade procedure:
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/updating_openshift_container_storage/updating-openshift-container-storage-in-internal-mode_rhocs
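
A minimal sketch for checking whether OLM has actually resolved an upgrade from the new catalog source (installedCSV, currentCSV and state are standard Subscription status fields; if currentCSV never moves to the 4.7.1 CSV, the catalog source or channel is the likely culprit):

# oc -n openshift-storage get subscription ocs-operator -o jsonpath='{.spec.source}{" "}{.spec.channel}{" "}{.status.installedCSV}{" "}{.status.currentCSV}{" "}{.status.state}{"\n"}'
# oc -n openshift-storage get installplan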

Comment 7 Travis Nielsen 2021-06-09 17:09:17 UTC
Good to hear the upgrade actually succeeded.

What IO operations are being performed in that test? If there is too much IO it could raise the load on the machine and thus cause it to become unresponsive. Is this the same IO load that has been used in past upgrade tests? If so, we should understand what changed in 4.7.1 that could cause this. But if the test changed or was just added, it sounds like the IO load should be reduced.

Comment 8 Mudit Agarwal 2021-06-10 06:44:52 UTC
Thanks for checking. So this is not an upgrade issue, which makes it a duplicate of BZ #1945016.

*** This bug has been marked as a duplicate of bug 1945016 ***

Comment 9 Sridhar Venkat (IBM) 2021-06-10 13:29:11 UTC
For system P, we upgraded from 4.7 to 4.7.1 and we are seeing pods in openshift-storage restarting:

[root@nx121-ahv cron]# oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS   AGE
csi-cephfsplugin-29bcw                                            3/3     Running   0          13h
csi-cephfsplugin-9sr9x                                            3/3     Running   0          13h
csi-cephfsplugin-dnzmv                                            3/3     Running   0          13h
csi-cephfsplugin-provisioner-5f668cb9df-njl7l                     6/6     Running   0          11h
csi-cephfsplugin-provisioner-5f668cb9df-rpll6                     6/6     Running   1          11h
csi-rbdplugin-d9c7g                                               3/3     Running   0          13h
csi-rbdplugin-provisioner-846f7dddd4-g9wtg                        6/6     Running   0          11h
csi-rbdplugin-provisioner-846f7dddd4-qhtkb                        6/6     Running   1          11h
csi-rbdplugin-vx5vd                                               3/3     Running   0          13h
csi-rbdplugin-wvzkr                                               3/3     Running   0          13h
noobaa-core-0                                                     1/1     Running   0          13h
noobaa-db-pg-0                                                    1/1     Running   0          10h
noobaa-endpoint-78db99447c-wk24v                                  1/1     Running   1          13h
noobaa-operator-85846f6c99-v8bx6                                  1/1     Running   0          11h
ocs-metrics-exporter-749dbc674-rjsj9                              1/1     Running   0          11h
ocs-operator-84c759dff-qzthj                                      1/1     Running   1          11h
rook-ceph-crashcollector-worker-0-74d44fdf57-nrqsg                1/1     Running   0          11h
rook-ceph-crashcollector-worker-1-6ff74969c6-ns5g8                1/1     Running   0          9h
rook-ceph-crashcollector-worker-2-7c966f66c5-bxgjc                1/1     Running   0          13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7fc7b5d5ff4k6   2/2     Running   25         11h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-85b5bddf64srb   2/2     Running   0          11h
rook-ceph-mgr-a-78c4cfb858-rvrfw                                  2/2     Running   0          11h
rook-ceph-mon-a-7cc4bdb45c-tkh9l                                  2/2     Running   1          14h
rook-ceph-mon-b-6ddfddf969-cb8wb                                  2/2     Running   0          9h
rook-ceph-mon-c-6c8f5cd67c-pq2dm                                  2/2     Running   0          11h
rook-ceph-operator-69d699b578-zf5kb                               1/1     Running   0          11h
rook-ceph-osd-0-6f49f9dc6f-szvv5                                  2/2     Running   52         13h
rook-ceph-osd-1-664d79cdf5-2ttcz                                  2/2     Running   0          11h
rook-ceph-osd-2-89fdb689f-5swnv                                   2/2     Running   1          9h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d7f44dl8686   2/2     Running   76         11h
rook-ceph-tools-76dbc6f57f-hrzqg                                  1/1     Running   0          11h
[root@nx121-ahv cron]# 


Ceph health:

[root@nx121-ahv cron]# oc rsh -n openshift-storage rook-ceph-tools-76dbc6f57f-hrzqg ceph -s
  cluster:
    id:     c87b3325-5b8f-4d3a-a91f-68656bb72661
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 9h)
    mgr: a(active, since 10h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 9h), 3 in (since 15h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 272 pgs
    objects: 17.26k objects, 66 GiB
    usage:   200 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     272 active+clean
 
  io:
    client:   853 B/s rd, 7.0 KiB/s wr, 1 op/s rd, 0 op/s wr
 
[root@nx121-ahv cron]#
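
The HEALTH_WARN above ("1 daemons have recently crashed") can be inspected, and archived once understood, from the same toolbox pod; a minimal sketch:

# oc rsh -n openshift-storage rook-ceph-tools-76dbc6f57f-hrzqg ceph crash ls
# oc rsh -n openshift-storage rook-ceph-tools-76dbc6f57f-hrzqg ceph crash info <crash-id>
# oc rsh -n openshift-storage rook-ceph-tools-76dbc6f57f-hrzqg ceph crash archive-all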

Comment 10 Sravika 2021-06-10 13:38:22 UTC
@Travis/Mudit: One more point to note is that even though the upgrade itself went fine, the upgrade test case failed with the following error:

14:52:58 - MainThread - ocs_ci.ocs.ocs_upgrade - INFO - Current OCS subscription source: redhat-operators
14:52:58 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage patch subscription ocs-operator -n openshift-storage --type merge -p '{"spec":{"channel": "stable-4.7", "source": "ocs-catalogsource"}}'
14:52:59 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan  -n openshift-storage -o yaml
14:52:59 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 997, in __iter__
    yield self.func(*self.func_args, **self.func_kwargs)
  File "/root/ocs-ci/ocs_ci/ocs/resources/install_plan.py", line 72, in get_install_plans_for_approve
    f"No install plan for approve found in namespace {namespace}"
ocs_ci.ocs.exceptions.NoInstallPlanForApproveFoundException: No install plan for approve found in namespace openshift-storage
14:52:59 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 10 seconds before next iteration
14:53:03 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:09 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan  -n openshift-storage -o yaml
14:53:09 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:10 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 997, in __iter__
    yield self.func(*self.func_args, **self.func_kwargs)
  File "/root/ocs-ci/ocs_ci/ocs/resources/install_plan.py", line 72, in get_install_plans_for_approve
    f"No install plan for approve found in namespace {namespace}"
ocs_ci.ocs.exceptions.NoInstallPlanForApproveFoundException: No install plan for approve found in namespace openshift-storage
14:53:10 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 10 seconds before next iteration
14:53:16 - Thread-2 - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage rsh rook-ceph-tools-67ddbbccbc-zfh4z ceph health detail
14:53:20 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get InstallPlan  -n openshift-storage -o yaml
14:53:21 - MainThread - ocs_ci.utility.utils - ERROR - Exception raised during iteration: No install plan for approve found in namespace openshift-storage
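
The ocs-ci helper appears to be waiting for an InstallPlan that still needs approval; a minimal sketch for checking by hand whether such a plan exists at that point (the custom-columns paths are standard InstallPlan spec fields):

# oc -n openshift-storage get installplan -o custom-columns=NAME:.metadata.name,CSVS:.spec.clusterServiceVersionNames,APPROVAL:.spec.approval,APPROVED:.spec.approved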

Comment 12 Sravika 2021-06-11 11:06:26 UTC
@Travis/Mudit: Sorry, please ignore my previous comment; the upgrade is still failing on Z. Could you please reopen the bug?

I changed the subscription approval mode of OCS 4.7.0 to "Automatic", which does not require install plan approval as the "Manual" mode does, and ran the ocs-ci upgrade tests.
The upgrade test failed during the noobaa pod upgrade, and a worker also went to "Not Ready" state during the upgrade itself.

07:55:50 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod  -n openshift-storage -o yaml
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - INFO - Found 4 pod(s) for selector: app=noobaa
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Number of found pods 4 is not as expected: 5
07:55:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod noobaa-core-0 -n openshift-storage -o yaml
07:55:55 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Images: {'registry.redhat.io/ocs4/mcg-core-rhel8@sha256:1496a3e823db8536380e01c58e39670e9fa2cc3d15229b2edc300acc56282c8c'} weren't upgraded in: noobaa-core-0!

I am attaching the logs of the upgrade (upgrade_4_7_1_410_rc2)
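
A minimal sketch for confirming by hand which image noobaa-core-0 is running versus what the 4.7.1 CSV references (the mcg-core grep pattern is taken from the warning above and is only illustrative):

# oc -n openshift-storage get pod noobaa-core-0 -o jsonpath='{.spec.containers[*].image}{"\n"}'
# oc -n openshift-storage get csv ocs-operator.v4.7.1-410.ci -o yaml | grep mcg-core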

Comment 13 Sravika 2021-06-11 11:07:25 UTC
Created attachment 1790165 [details]
upgrade_4_7_1_410_rc2

Comment 14 Mudit Agarwal 2021-06-11 11:35:53 UTC
Reopening because of the upgrade failure in https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c12

So, there are two issues:
1) upgrade failure - tracked in this BZ
2) worker nodes going down - tracked in BZ #1945016

Comment 15 Sravika 2021-06-11 11:40:03 UTC
Created attachment 1790170 [details]
upgrade_4_7_1_410_rc2

Comment 16 Mudit Agarwal 2021-06-11 16:43:49 UTC
Sravika,
Can we please have the must-gather as well?

Comment 17 Mudit Agarwal 2021-06-13 05:02:25 UTC
I see a lot of errors in the upgrade log; it is difficult to say why exactly the upgrade failed without the must-gather logs.
Removing 4.8? as the upgrade is from 4.7 to 4.7.1; will retarget if this is relevant for 4.8 as well.

Comment 18 Sravika 2021-06-14 07:30:18 UTC
@Mudit,

Must-gather is already uploaded and is part of upgrade_4_7_1_410_rc2.zip.

Comment 19 Mudit Agarwal 2021-06-14 07:59:57 UTC
Thanks Sravika, got it now.

Nimrod, can you please take a look? It looks like the NooBaa pods were not upgraded and hence the CI failed.
Please ignore the initial BZ discussion while looking; just check from https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c12 onward.
There is a separate issue for the nodes going down, and I checked that in this case all nodes are UP.

Please reassign if this is not related to NooBaa.

Comment 20 Sravika 2021-06-14 08:04:45 UTC
@Mudit,

I re-executed the test on Friday, and this time the upgrade proceeded further and the pods got upgraded; however, the Ceph health recovery did not happen in time, which failed the upgrade test.
The Ceph health recovery does happen after some time, and the slow Ceph health seems to be due to an environment issue reported in BZ #1952514. I will rerun this on a cluster which is on another CEC.

One additional observation is that the ceph-rgw pod restarted 10 times during the upgrade. I am uploading the upgrade logs and must-gather logs (upgrade_4_7_1_410_rc2_new.zip).

# oc get po -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-bxvbj                                            3/3     Running     0          2d17h
csi-cephfsplugin-dnx6h                                            3/3     Running     0          2d17h
csi-cephfsplugin-pcvn7                                            3/3     Running     0          2d16h
csi-cephfsplugin-provisioner-5f668cb9df-n24hv                     6/6     Running     0          2d17h
csi-cephfsplugin-provisioner-5f668cb9df-nb6l6                     6/6     Running     0          2d17h
csi-rbdplugin-5jvxn                                               3/3     Running     0          2d17h
csi-rbdplugin-k8z5f                                               3/3     Running     0          2d17h
csi-rbdplugin-mcggm                                               3/3     Running     0          2d16h
csi-rbdplugin-provisioner-846f7dddd4-66hxz                        6/6     Running     0          2d17h
csi-rbdplugin-provisioner-846f7dddd4-hklpz                        6/6     Running     0          2d17h
noobaa-core-0                                                     1/1     Running     0          2d16h
noobaa-db-pg-0                                                    1/1     Running     0          2d16h
noobaa-endpoint-5478c6787c-zh4m6                                  1/1     Running     0          2d17h
noobaa-operator-6d768f64c6-95vd8                                  1/1     Running     0          2d17h
ocs-metrics-exporter-576564488f-9vzs8                             1/1     Running     0          2d17h
ocs-operator-85bc6445fd-hlcmq                                     1/1     Running     0          2d17h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-66dq9fl   1/1     Running     0          2d16h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-5cgnk4l   1/1     Running     0          2d16h
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-78fn7l7   1/1     Running     0          2d16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6d7876c7pzs9s   2/2     Running     0          2d16h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-8555c99dwgwrf   2/2     Running     0          2d16h
rook-ceph-mgr-a-747fb898db-pphns                                  2/2     Running     0          2d16h
rook-ceph-mon-a-8d7887686-7bz6q                                   2/2     Running     0          2d16h
rook-ceph-mon-b-7786665747-z7jss                                  2/2     Running     0          2d16h
rook-ceph-mon-c-785d4f9b7f-xt592                                  2/2     Running     0          2d16h
rook-ceph-operator-5d99cd5d7-d7nkd                                1/1     Running     0          2d17h
rook-ceph-osd-0-6b56bc4cb8-jvlwg                                  2/2     Running     0          2d16h
rook-ceph-osd-1-658fbfb747-pt57v                                  2/2     Running     0          2d16h
rook-ceph-osd-2-5665989b75-c54rd                                  2/2     Running     0          2d16h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-04smzvx   0/1     Completed   0          2d17h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-1s6xw48   0/1     Completed   0          2d17h
rook-ceph-osd-prepare-ocs-deviceset-localblocksc-0-data-2bgnp65   0/1     Completed   0          2d17h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c4f6995kms7w   2/2     Running     10         2d16h
rook-ceph-tools-76dbc6f57f-vnmlg                                  1/1     Running     0          2d17
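
A minimal sketch for watching the Ceph recovery and for pulling the previous logs of the restarting rgw pod (pod names taken from the listing above; the rgw container name is an assumption):

# oc -n openshift-storage rsh rook-ceph-tools-76dbc6f57f-vnmlg ceph -s
# oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c4f6995kms7w -c rgw --previous
# oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-c4f6995kms7w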

Comment 21 Sravika 2021-06-14 08:09:46 UTC
Uploaded the latest logs (upgrade_4_7_1_410_rc2_new.zip) to Google Drive as the size is larger than allowed:

https://drive.google.com/file/d/1wflMngza3EQDJxNB4G_E0WYEwWE23HVy/view?usp=sharing

Comment 22 Mudit Agarwal 2021-06-14 16:24:19 UTC
Moving it back to NooBaa to see what happened in https://bugzilla.redhat.com/show_bug.cgi?id=1969309#c19

The other instances (including the Ceph health issue) are not reliable enough as of now, because we suspect they are caused either by the worker nodes being down or by an environment issue.

Comment 23 Sravika 2021-06-15 11:35:40 UTC

In the same environment I ran the "ocs_upgrade" test in isolation, excluding the "pre_upgrade" and "post_upgrade" tests, and the upgrade test PASSED without any timeouts during the Ceph health recovery. No pod restarts were observed during the "ocs_upgrade" either.
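
For reference, a sketch of the isolated run, assuming the same invocation as in the bug description with the markers reduced to ocs_upgrade and the manual-approval conf file dropped (the subscription was switched to Automatic in comment 12):

run-ci -m 'ocs_upgrade' --ocs-version 4.7 --upgrade-ocs-version 4.7.1 --upgrade-ocs-registry-image 'quay.io/rhceph-dev/ocs-registry:latest-stable-4.7.1' --ocsci-conf config.yaml --cluster-path <cluster_path>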


The "pre_upgrade" disruptive test (tests/manage/monitoring/test_workload_with_distruptions.py::test_workload_with_checksum 
tests/manage/z_cluster/cluster_expansion/) is causing the worker node to go into "Not Ready" state (BZ #1945016) which resulted in a upgrade failure as the pods are in "Terminating" state.



Uploading the logs both with and without the pre_upgrade tests (upgrade_4_7_1_410_rc2_jun14.zip).

Comment 24 Sravika 2021-06-15 11:36:18 UTC
Created attachment 1791238 [details]
upgrade_4_7_1_410_rc2_jun14

Comment 25 Mudit Agarwal 2021-09-07 13:54:57 UTC
This is not seen with the latest builds.
Closing for now; please reopen if this is reproducible.

