Description of problem:

A functional OCP / OCS cluster is rebooted from the IBM web console ("Reboot node(s)"). After the reboot the OCP cluster comes back fine (at least judging by "oc" command output), but the OCS cluster is not re-formed.

Version-Release number of selected component (if applicable):

OCP:
version   4.5.6   True   True   9d   Unable to apply 4.5.13: the cluster operator openshift-samples is degraded

OCS: v4.5

How reproducible:
Tested 2x, reproducible

Steps to Reproduce:
1. Reboot a functional OCP cluster (with OCS as part of it) on IBM Cloud and check the OCS cluster afterwards.

Actual results:
OCS cluster broken after the OCP cluster reboot

Expected results:
OCS cluster survives the reboot

Additional info:

# oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
noobaa-core-0   0/1   Completed   0   27h
noobaa-endpoint-f4596b5dd-sjgv4   0/1   Error   0   27h
rook-ceph-crashcollector-10.240.64.6-f4885b85b-qgv2r   1/1   Running   1   27h
rook-ceph-crashcollector-10.240.64.7-7f4567d4bc-xdlkl   1/1   Running   1   27h
rook-ceph-crashcollector-10.240.64.8-6c89957b86-bwzdf   1/1   Running   1   27h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-787bcffcd5w6t   0/1   NodeAffinity   0   23d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-787bcffchhdlw   1/1   Running   1   114m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-999c87f4mxggr   1/1   Running   4   23d
rook-ceph-mgr-a-796d8848cf-84f64   0/1   Completed   0   23d
rook-ceph-mon-a-77d6fbf6d4-b2mhd   0/1   NodeAffinity   0   23d
rook-ceph-mon-a-77d6fbf6d4-dqdhw   1/1   Running   0   114m
rook-ceph-mon-b-9ffc5fb6d-bthrh   1/1   Running   1   23d
rook-ceph-mon-d-7bd7b478cb-stxbk   1/1   Running   1   23d
rook-ceph-osd-0-6cb9d7b587-crn5w   0/1   Completed   0   23d
rook-ceph-osd-1-8488c87c69-2jkjm   0/1   NodeAffinity   0   23d
rook-ceph-osd-2-66588d5dc-qshsf   0/1   Error   0   23d
rook-ceph-osd-3-75bdf7bf8d-wgzcq   0/1   NodeAffinity   0   23d
rook-ceph-osd-4-c897dc65c-5mtll   0/1   Completed   0   23d
rook-ceph-osd-5-579c754749-psx9j   0/1   Error   0   23d
rook-ceph-osd-6-ccd75d8fd-6sjpj   0/1   Error   0   23d
rook-ceph-osd-7-8cdf449d6-7vkzf   0/1   Completed   0   23d
rook-ceph-osd-8-f87d5bcb9-ctkc7   0/1   NodeAffinity   0   23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-vw567-nhbfm   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-1-tlsnz-8h9mk   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-0-data-2-klk2d-pxrtk   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-4xdwx-x7fp7   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-1-nbv4g-tm5rz   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-1-data-2-4bnpc-s2z2q   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-sk7ms-728vn   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-1-28ddm-gjrn6   0/1   Completed   0   23d
rook-ceph-osd-prepare-ocs-deviceset-2-data-2-qcm47-fswq2   0/1   Completed   0   23d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5d7595b5r2c9   0/1   CrashLoopBackOff   34   23d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-896457fzswv8   1/1   Running   34   23d

---
oc adm must-gather will be uploaded
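For reference, a minimal sketch of how the must-gather mentioned above is typically collected; the OCS must-gather image tag below is an assumption for the 4.5 stream:

```
# OCP side: standard cluster must-gather
oc adm must-gather

# OCS side: must-gather using the dedicated OCS image
# (image tag assumed for the OCS 4.5 stream)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.5
```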
We can't really do anything with this little information. :) Apart from the must-gather (both OCS and OCP), we need to know exactly how the reboot was done (e.g. what happened to the nodes). If everything was just shut down at once, then we probably lost the OCS labels on the storage nodes (at least that is what I would assume from the NodeAffinity messages).
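A quick way to check that theory is sketched below; the label key is the one OCS uses to select storage nodes, and `<node-name>` is a placeholder:

```
# List nodes that still carry the OCS storage label
oc get nodes -l cluster.ocs.openshift.io/openshift-storage

# Re-apply the label to a worker that lost it after the reboot
# (<node-name> is a placeholder for the affected node)
oc label node <node-name> cluster.ocs.openshift.io/openshift-storage=""
```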
Is it possible that it's also related to this BZ, which we saw during upgrade: https://bugzilla.redhat.com/show_bug.cgi?id=1877812
Elvir, sorry I missed your update; I am no longer able to log in to that cluster. If it's still around, update here and let me know over Chat as well.
Offhand, the symptoms seem similar to this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1884318 This is VERIFIED in the latest OCS 4.6 builds, so IBM should be able to test with it once the RC becomes available. Could you let us know how that goes?
This is fixed even in OCS 4.5.2, which is already released. Can we test with that? Moving this to 4.6z as it is limited to ROKS; once we have the results of testing with OCS 4.5.2, we can decide whether this is a duplicate or needs further investigation.
Akash, is this issue seen with OCS 4.6 clusters as well?
Sahina, we have not used OCS 4.6 yet. We need to check with 4.6 and see the behavior.
Do you see this issue with OCS 4.6? @ekuric @akgunjal.com
@sahina: We tested with OCS 4.6 by rebooting a single worker, and after the worker reboot OCS was stable. However, we see a few pods in NodeAffinity state which appear to be stale and ideally should be cleaned up. Posting the output after the reboot.

```
NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
csi-cephfsplugin-5ng6d   3/3   Running   6   81m   10.240.128.4   10.240.128.4   <none>   <none>
csi-cephfsplugin-bcb97   3/3   Running   3   81m   10.240.0.4   10.240.0.4   <none>   <none>
csi-cephfsplugin-provisioner-6cd4b7ff64-5lkv5   6/6   Running   0   41m   172.17.111.50   10.240.128.4   <none>   <none>
csi-cephfsplugin-provisioner-6cd4b7ff64-t7q8s   6/6   Running   6   81m   172.17.67.11   10.240.64.4   <none>   <none>
csi-cephfsplugin-sjhwt   3/3   Running   3   81m   10.240.64.4   10.240.64.4   <none>   <none>
csi-rbdplugin-6mzxp   3/3   Running   3   81m   10.240.64.4   10.240.64.4   <none>   <none>
csi-rbdplugin-provisioner-779ff78f45-7fpzf   6/6   Running   6   81m   172.17.67.47   10.240.64.4   <none>   <none>
csi-rbdplugin-provisioner-779ff78f45-ms42t   6/6   Running   0   41m   172.17.111.52   10.240.128.4   <none>   <none>
csi-rbdplugin-s4q5d   3/3   Running   3   81m   10.240.0.4   10.240.0.4   <none>   <none>
csi-rbdplugin-tx25b   3/3   Running   6   81m   10.240.128.4   10.240.128.4   <none>   <none>
noobaa-core-0   1/1   Running   0   39m   172.17.123.119   10.240.0.4   <none>   <none>
noobaa-db-0   1/1   Running   0   39m   172.17.123.113   10.240.0.4   <none>   <none>
noobaa-endpoint-5c47d54889-kn4gl   1/1   Running   0   41m   172.17.67.33   10.240.64.4   <none>   <none>
noobaa-operator-69cc7d8fdd-q6ff5   1/1   Running   0   41m   172.17.67.34   10.240.64.4   <none>   <none>
ocs-metrics-exporter-66654c4fd9-d68pt   1/1   Running   1   82m   172.17.111.40   10.240.128.4   <none>   <none>
ocs-operator-6bd85bb854-grhs4   1/1   Running   1   82m   172.17.67.52   10.240.64.4   <none>   <none>
rook-ceph-crashcollector-10.240.0.4-7ddd59bf6c-8pzkq   1/1   Running   0   41m   172.17.123.112   10.240.0.4   <none>   <none>
rook-ceph-crashcollector-10.240.128.4-847f7858dd-f9rzj   1/1   Running   1   80m   172.17.111.28   10.240.128.4   <none>   <none>
rook-ceph-crashcollector-10.240.64.4-7fdbb466f9-97z5v   1/1   Running   1   76m   172.17.67.54   10.240.64.4   <none>   <none>
rook-ceph-drain-canary-10.240.0.4-65546789d4-7d4mw   1/1   Running   0   41m   172.17.123.103   10.240.0.4   <none>   <none>
rook-ceph-drain-canary-10.240.128.4-69846b486d-tkwzp   1/1   Running   1   72m   172.17.111.42   10.240.128.4   <none>   <none>
rook-ceph-drain-canary-10.240.64.4-54c5f78b59-6c9wf   1/1   Running   1   72m   172.17.67.49   10.240.64.4   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-f669c54f49gqk   1/1   Running   5   48m   172.17.111.11   10.240.128.4   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-f669c54fnvq7j   0/1   NodeAffinity   0   70m   <none>   10.240.128.4   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-559b48b4hhxx8   1/1   Running   2   70m   172.17.67.50   10.240.64.4   <none>   <none>
rook-ceph-mgr-a-76b4456967-7ztqw   0/1   NodeAffinity   0   73m   <none>   10.240.128.4   <none>   <none>
rook-ceph-mgr-a-76b4456967-rt79l   1/1   Running   0   48m   172.17.111.27   10.240.128.4   <none>   <none>
rook-ceph-mon-a-56c7df8d97-fnd7f   0/1   NodeAffinity   0   80m   <none>   10.240.128.4   <none>   <none>
rook-ceph-mon-a-56c7df8d97-rtdtx   1/1   Running   0   48m   172.17.111.34   10.240.128.4   <none>   <none>
rook-ceph-mon-b-7995498784-zvbmx   1/1   Running   0   41m   172.17.123.117   10.240.0.4   <none>   <none>
rook-ceph-mon-d-6546ff9dd6-9qxhw   1/1   Running   1   76m   172.17.67.1   10.240.64.4   <none>   <none>
rook-ceph-operator-65b5fcf74f-f7tzn   1/1   Running   1   82m   172.17.111.20   10.240.128.4   <none>   <none>
rook-ceph-osd-0-6f948db5f8-zznh7   1/1   Running   1   72m   172.17.67.57   10.240.64.4   <none>   <none>
rook-ceph-osd-1-548d7f6dcf-x2chg   1/1   Running   2   48m   172.17.111.16   10.240.128.4   <none>   <none>
rook-ceph-osd-1-548d7f6dcf-xxg5f   0/1   NodeAffinity   0   72m   <none>   10.240.128.4   <none>   <none>
rook-ceph-osd-2-6bb8f76cc-rttxf   1/1   Running   0   41m   172.17.123.110   10.240.0.4   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-r94sf-7q7d7   0/1   Completed   0   73m   172.17.111.45   10.240.128.4   <none>   <none>
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-mgr98-bqqgd   0/1   Completed   0   73m   172.17.67.23   10.240.64.4   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-58b97ddbg46q   1/1   Running   1   41m   172.17.67.58   10.240.64.4   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b947459qlvnx   1/1   Running   5   48m   172.17.111.51   10.240.128.4   <none>   <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b947459rshng   0/1   NodeAffinity   0   70m   <none>   10.240.128.4   <none>   <none>
```
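Regarding the stale NodeAffinity entries: those pods end up in the Failed phase and are not garbage-collected automatically, so a manual cleanup along these lines should be safe, assuming the Running replacements listed above are healthy:

```
# Stale NodeAffinity pods show up with phase Failed
oc get pods -n openshift-storage --field-selector=status.phase=Failed

# Remove them; the Running replacement pods are not affected
oc delete pods -n openshift-storage --field-selector=status.phase=Failed
```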
@jrivera who can look at this?
Do we see this issue on a ROKS cluster on reboot?
The issue is not seen by the IBM team (as confirmed by Akash on chat). Elvir, are you seeing this issue consistently on cluster reboot, i.e. the mgr pod in CLBO state?
Please reopen if this is seen again.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days