Bug 1892623 - [ROKS] OCS cluster breaks if OCP cluster is rebooted
Summary: [ROKS] OCS cluster breaks if OCP cluster is rebooted
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jose A. Rivera
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1877812
Blocks:
 
Reported: 2020-10-29 10:54 UTC by Elvir Kuric
Modified: 2023-09-15 00:50 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-06 11:55:40 UTC
Embargoed:


Attachments

Description Elvir Kuric 2020-10-29 10:54:08 UTC
Description of problem:

A functional OCP/OCS cluster was rebooted from the IBM Cloud web console ("Reboot node(s)"). After the reboot the OCP cluster is fine (at least as far as running "oc" commands shows), but the OCS cluster does not re-form.

Version-Release number of selected component (if applicable):
OCP : 
version   4.5.6     True        True          9d      Unable to apply 4.5.13: the cluster operator openshift-samples is degraded


OCS : v4.5 

How reproducible:

Tested 2x, reproducible


Steps to Reproduce:
1. Reboot a functional OCP cluster (with OCS installed) on IBM Cloud, then check the state of the OCS cluster afterwards (see the sketch of checks below).
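
For reference, a minimal sketch of the post-reboot checks (assuming the default ocs-storagecluster resource names):

```
# Look for pods stuck in Error/NodeAffinity/CrashLoopBackOff after the reboot
oc get pods -n openshift-storage -o wide

# The StorageCluster phase should return to Ready
oc get storagecluster -n openshift-storage

# The CephCluster health column should return to HEALTH_OK
oc get cephcluster -n openshift-storage
```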

Actual results:
OCS cluster broken after OCP cluster reboot


Expected results:
OCS cluster to survive reboot 

Additional info:

# oc get pods -n openshift-storage
NAME                                                              READY   STATUS             RESTARTS   AGE
noobaa-core-0                                                     0/1     Completed          0          27h
noobaa-endpoint-f4596b5dd-sjgv4                                   0/1     Error              0          27h
rook-ceph-crashcollector-10.240.64.6-f4885b85b-qgv2r              1/1     Running            1          27h
rook-ceph-crashcollector-10.240.64.7-7f4567d4bc-xdlkl             1/1     Running            1          27h
rook-ceph-crashcollector-10.240.64.8-6c89957b86-bwzdf             1/1     Running            1          27h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-787bcffcd5w6t   0/1     NodeAffinity       0          23d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-787bcffchhdlw   1/1     Running            1          114m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-999c87f4mxggr   1/1     Running            4          23d
rook-ceph-mgr-a-796d8848cf-84f64                                  0/1     Completed          0          23d
rook-ceph-mon-a-77d6fbf6d4-b2mhd                                  0/1     NodeAffinity       0          23d
rook-ceph-mon-a-77d6fbf6d4-dqdhw                                  1/1     Running            0          114m
rook-ceph-mon-b-9ffc5fb6d-bthrh                                   1/1     Running            1          23d
rook-ceph-mon-d-7bd7b478cb-stxbk                                  1/1     Running            1          23d
rook-ceph-osd-0-6cb9d7b587-crn5w                                  0/1     Completed          0          23d
rook-ceph-osd-1-8488c87c69-2jkjm                                  0/1     NodeAffinity       0          23d
rook-ceph-osd-2-66588d5dc-qshsf                                   0/1     Error              0          23d
rook-ceph-osd-3-75bdf7bf8d-wgzcq                                  0/1     NodeAffinity       0          23d
rook-ceph-osd-4-c897dc65c-5mtll                                   0/1     Completed          0          23d
rook-ceph-osd-5-579c754749-psx9j                                  0/1     Error              0          23d
rook-ceph-osd-6-ccd75d8fd-6sjpj                                   0/1     Error              0          23d
rook-ceph-osd-7-8cdf449d6-7vkzf                                   0/1     Completed          0          23d
rook-ceph-osd-8-f87d5bcb9-ctkc7                                   0/1     NodeAffinity       0          23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-vw567-nhbfm          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-1-tlsnz-8h9mk          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-2-klk2d-pxrtk          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-4xdwx-x7fp7          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-1-nbv4g-tm5rz          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-2-4bnpc-s2z2q          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-sk7ms-728vn          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-1-28ddm-gjrn6          0/1     Completed          0          23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-2-qcm47-fswq2          0/1     Completed          0          23d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5d7595b5r2c9   0/1     CrashLoopBackOff   34         23d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-896457fzswv8   1/1     Running            34         23d



--- 
oc adm must-gather will be uploaded
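
For reference, the OCS must-gather would be collected with something like the following (image tag assumed for the installed OCS 4.5 release; adjust as needed):

```
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.5
```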

Comment 2 Jose A. Rivera 2020-10-30 14:10:08 UTC
We can't really do anything with this little information. :) Apart from the must-gather, both OCS and OCP, we need to know exactly how the reboot was done (e.g. what happened to the nodes). If everything was just shut down at once, then we probably lost the OCS labels on the storage nodes (at least that is what I would assume from the NodeAffinity messages).
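
(For reference, a quick sketch of checking whether the storage label survived and re-applying it; <node-name> is a placeholder:)

```
# Nodes that still carry the OCS storage label
oc get nodes -l cluster.ocs.openshift.io/openshift-storage

# Re-apply the label to a storage node if it was lost
oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""
```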

Comment 4 Petr Balogh 2020-11-02 08:27:22 UTC
Is it possible that it's also related to this BZ, which we saw during upgrade: https://bugzilla.redhat.com/show_bug.cgi?id=1877812

Comment 5 Jose A. Rivera 2020-11-03 13:54:42 UTC
Elvir, sorry I missed your update; I am no longer able to log in to that cluster. If it's still around, update here and let me know over chat as well.

Comment 7 Jose A. Rivera 2020-11-03 18:04:33 UTC
Offhand, the symptoms seem similar to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1884318

This is VERIFIED in the latest OCS 4.6 builds, so IBM should be able to test with it once the RC becomes available. Could you let us know how that goes?

Comment 8 Mudit Agarwal 2020-11-17 07:27:50 UTC
This is even fixed in OCS 4.5.2, which is already released. Can we test with that?

Moving it to 4.6z as it is limited to ROKS. Once we have the results of testing with OCS 4.5.2, we can decide whether this is a duplicate or needs further investigation.
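
For reference, one way to confirm which OCS build a cluster is actually running (a sketch; the exact CSV name varies by release):

```
# The installed operator CSV shows the OCS version, e.g. ocs-operator.v4.5.2
oc get csv -n openshift-storage
```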

Comment 9 Sahina Bose 2020-11-30 15:45:40 UTC
Akash, is this issue seen with OCS 4.6 clusters as well?

Comment 10 akgunjal@in.ibm.com 2020-12-01 10:22:18 UTC
Sahina, we have not used OCS 4.6 yet. We need to check with 4.6 and see the behavior.

Comment 11 Sahina Bose 2021-01-29 07:16:15 UTC
Do you see this issue with OCS 4.6? @ekuric @akgunjal.com

Comment 12 akgunjal@in.ibm.com 2021-01-29 10:01:01 UTC
@sahina: We tested with OCS 4.6 by rebooting a single worker, and after the worker reboot OCS was stable. However, we see a few pods in NodeAffinity state that appear to be stale and should ideally be cleaned up. Posting the output after the reboot (a cleanup sketch follows the listing).

```
NAME                                                              READY   STATUS         RESTARTS   AGE   IP               NODE           NOMINATED NODE   READINESS GATES
csi-cephfsplugin-5ng6d                                            3/3     Running        6          81m   10.240.128.4     10.240.128.4   <none>           <none>
csi-cephfsplugin-bcb97                                            3/3     Running        3          81m   10.240.0.4       10.240.0.4     <none>           <none>
csi-cephfsplugin-provisioner-6cd4b7ff64-5lkv5                     6/6     Running        0          41m   172.17.111.50    10.240.128.4   <none>           <none>
csi-cephfsplugin-provisioner-6cd4b7ff64-t7q8s                     6/6     Running        6          81m   172.17.67.11     10.240.64.4    <none>           <none>
csi-cephfsplugin-sjhwt                                            3/3     Running        3          81m   10.240.64.4      10.240.64.4    <none>           <none>
csi-rbdplugin-6mzxp                                               3/3     Running        3          81m   10.240.64.4      10.240.64.4    <none>           <none>
csi-rbdplugin-provisioner-779ff78f45-7fpzf                        6/6     Running        6          81m   172.17.67.47     10.240.64.4    <none>           <none>
csi-rbdplugin-provisioner-779ff78f45-ms42t                        6/6     Running        0          41m   172.17.111.52    10.240.128.4   <none>           <none>
csi-rbdplugin-s4q5d                                               3/3     Running        3          81m   10.240.0.4       10.240.0.4     <none>           <none>
csi-rbdplugin-tx25b                                               3/3     Running        6          81m   10.240.128.4     10.240.128.4   <none>           <none>
noobaa-core-0                                                     1/1     Running        0          39m   172.17.123.119   10.240.0.4     <none>           <none>
noobaa-db-0                                                       1/1     Running        0          39m   172.17.123.113   10.240.0.4     <none>           <none>
noobaa-endpoint-5c47d54889-kn4gl                                  1/1     Running        0          41m   172.17.67.33     10.240.64.4    <none>           <none>
noobaa-operator-69cc7d8fdd-q6ff5                                  1/1     Running        0          41m   172.17.67.34     10.240.64.4    <none>           <none>
ocs-metrics-exporter-66654c4fd9-d68pt                             1/1     Running        1          82m   172.17.111.40    10.240.128.4   <none>           <none>
ocs-operator-6bd85bb854-grhs4                                     1/1     Running        1          82m   172.17.67.52     10.240.64.4    <none>           <none>
rook-ceph-crashcollector-10.240.0.4-7ddd59bf6c-8pzkq              1/1     Running        0          41m   172.17.123.112   10.240.0.4     <none>           <none>
rook-ceph-crashcollector-10.240.128.4-847f7858dd-f9rzj            1/1     Running        1          80m   172.17.111.28    10.240.128.4   <none>           <none>
rook-ceph-crashcollector-10.240.64.4-7fdbb466f9-97z5v             1/1     Running        1          76m   172.17.67.54     10.240.64.4    <none>           <none>
rook-ceph-drain-canary-10.240.0.4-65546789d4-7d4mw                1/1     Running        0          41m   172.17.123.103   10.240.0.4     <none>           <none>
rook-ceph-drain-canary-10.240.128.4-69846b486d-tkwzp              1/1     Running        1          72m   172.17.111.42    10.240.128.4   <none>           <none>
rook-ceph-drain-canary-10.240.64.4-54c5f78b59-6c9wf               1/1     Running        1          72m   172.17.67.49     10.240.64.4    <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-f669c54f49gqk   1/1     Running        5          48m   172.17.111.11    10.240.128.4   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-f669c54fnvq7j   0/1     NodeAffinity   0          70m   <none>           10.240.128.4   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-559b48b4hhxx8   1/1     Running        2          70m   172.17.67.50     10.240.64.4    <none>           <none>
rook-ceph-mgr-a-76b4456967-7ztqw                                  0/1     NodeAffinity   0          73m   <none>           10.240.128.4   <none>           <none>
rook-ceph-mgr-a-76b4456967-rt79l                                  1/1     Running        0          48m   172.17.111.27    10.240.128.4   <none>           <none>
rook-ceph-mon-a-56c7df8d97-fnd7f                                  0/1     NodeAffinity   0          80m   <none>           10.240.128.4   <none>           <none>
rook-ceph-mon-a-56c7df8d97-rtdtx                                  1/1     Running        0          48m   172.17.111.34    10.240.128.4   <none>           <none>
rook-ceph-mon-b-7995498784-zvbmx                                  1/1     Running        0          41m   172.17.123.117   10.240.0.4     <none>           <none>
rook-ceph-mon-d-6546ff9dd6-9qxhw                                  1/1     Running        1          76m   172.17.67.1      10.240.64.4    <none>           <none>
rook-ceph-operator-65b5fcf74f-f7tzn                               1/1     Running        1          82m   172.17.111.20    10.240.128.4   <none>           <none>
rook-ceph-osd-0-6f948db5f8-zznh7                                  1/1     Running        1          72m   172.17.67.57     10.240.64.4    <none>           <none>
rook-ceph-osd-1-548d7f6dcf-x2chg                                  1/1     Running        2          48m   172.17.111.16    10.240.128.4   <none>           <none>
rook-ceph-osd-1-548d7f6dcf-xxg5f                                  0/1     NodeAffinity   0          72m   <none>           10.240.128.4   <none>           <none>
rook-ceph-osd-2-6bb8f76cc-rttxf                                   1/1     Running        0          41m   172.17.123.110   10.240.0.4     <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-r94sf-7q7d7          0/1     Completed      0          73m   172.17.111.45    10.240.128.4   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-mgr98-bqqgd          0/1     Completed      0          73m   172.17.67.23     10.240.64.4    <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-58b97ddbg46q   1/1     Running        1          41m   172.17.67.58     10.240.64.4    <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b947459qlvnx   1/1     Running        5          48m   172.17.111.51    10.240.128.4   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b947459rshng   0/1     NodeAffinity   0          70m   <none>           10.240.128.4   <none>           <none>
```
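
(For reference, the stale NodeAffinity entries are pods left in a Failed phase after their controllers rescheduled replacements; a sketch of one way to list and clean them up manually, assuming none of them are still needed for debugging:)

```
# List pods left in a Failed phase (NodeAffinity appears as the failure reason)
oc get pods -n openshift-storage --field-selector=status.phase=Failed

# Delete them once any needed debugging data has been collected
oc delete pods -n openshift-storage --field-selector=status.phase=Failed
```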

Comment 13 Sahina Bose 2021-01-29 15:59:35 UTC
@jrivera who can look at this?

Comment 15 Sahina Bose 2021-06-02 16:08:40 UTC
Do we see this issue on ROKS cluster on reboot?

Comment 16 Sahina Bose 2021-06-07 13:10:23 UTC
The issue is not seen by the IBM team (as confirmed by Akash on chat). Elvir, are you seeing this issue consistently on reboot of clusters, i.e. the mgr pod in CLBO (CrashLoopBackOff) state?
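
(For reference, a quick way to confirm a CLBO mgr pod and capture its logs; the label selector is assumed from the standard Rook deployment and the pod name is a placeholder:)

```
# Check the mgr pod status
oc get pods -n openshift-storage -l app=rook-ceph-mgr

# Capture logs from the previous (crashed) container
oc logs -n openshift-storage <rook-ceph-mgr-pod> --previous
```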

Comment 17 Mudit Agarwal 2021-09-06 11:55:40 UTC
Please reopen if this is seen again.

Comment 18 Red Hat Bugzilla 2023-09-15 00:50:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

