Bug 2159757
| Summary: | After shutting down a worker node, some of the rook ceph pods are stuck in a Terminating state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Itzhak <ikave> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | aaaggarw, akandath, amagrawa, bniver, muagarwa, ocs-bugs, odf-bz-bot, sapillai, sostapov, srai, tnielsen |
| Target Milestone: | --- | Flags: | srai: needinfo? (aaaggarw) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-23 21:42:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Attaching log file for the testcase (attachment 1944664) | | |
Description (Itzhak, 2023-01-10 15:54:36 UTC)
OCP, ODF and Ceph versions:
ODF version: 4.12.0-156
OCP version: 4.12.0-rc.6
Ceph version: 16.2.10-90.el8cp (821b516c325c19f31b81b943cd800c2190f1e685) pacific (stable)

Did you drain the worker node before shutting it down? If a node is shut down before being drained, the pods that were running on that node can easily get stuck in Terminating. You should be able to force delete the pods, but they will then likely be stuck in Pending anyway while they wait for a node to become available for them to start on again. If the node is brought back online, do the pods properly terminate and restart?

No. The goal of the test is to check the recovery of the rook ceph pods after a worker node failure. The check was added due to an old bug that was raised in the past. After the node was brought back, all the pods terminated and restarted, and Ceph health was OK. It's not a critical issue, but the expectation is that the rook ceph pods will be removed and not stuck in a Terminating state. With the vSphere platform, I didn't see this problem. Here is an example of a vSphere test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull.

(In reply to Itzhak from comment #6)
> No. The goal of the test is to check the recovery of the rook ceph pods
> after a worker node failure. The check was added due to an old bug that was
> raised in the past.
>
> After the node was brought back, all the pods terminated and restarted, and
> Ceph health was OK.
> It's not a critical issue, but the expectation is that the rook ceph pods
> will be removed and not stuck in a Terminating state.

So everything succeeded after the node came back online? Then it sounds like the goal of the test was met, right? In that case I don't understand why there is an issue. Which pods specifically were stuck in Terminating? If they are the mons and OSDs that have node affinity, it's expected that they don't move to another node anyway.

Not exactly. The test checks that the rook ceph pods do not stay stuck in a Terminating state - that behavior was implemented in the past on the rook side. I see that three pods were stuck in a Terminating state:

rook-ceph-crashcollector-lon06-worker-1.rdr-site.ibm.com-6qnx4p 1/1 Terminating 0 6h22m 10.131.0.28 lon06-
rook-ceph-mon-c-6b97765555-rq646 2/2 Terminating 0 6h27m 10.131.0.29 lon06-
rook-ceph-osd-0-579764f59-phqqk 2/2 Terminating 0

Here is the old bug I referred to: https://bugzilla.redhat.com/show_bug.cgi?id=1861021.

A few more questions.
1. Is the operator pod still running? Please share the log.
2. How long did you wait? It can take 5-10 minutes before the pods are force deleted, which allows them to move.
3. Do mon-c and osd-0 have pods stuck in Pending while waiting for these to terminate?

Sorry for the late reply.
1. I didn't run the test, so I am unsure about the operator pod log. Maybe Aaruni Aggarwal can share more details about it.
2. The test waits 10 minutes for the pods to be deleted.
3. Yes, I see that there were other pods in the Pending state:

rook-ceph-mgr-a-55654896fb-h99wm 2/2 Running 0 9h 10.129.2.159 lon06-worker-2.rdr-site.ibm.com <none> <none>
rook-ceph-mon-a-856f7d5784-65crq 2/2 Running 0 10h 10.129.2.152 lon06-worker-2.rdr-site.ibm.com <none> <none>
rook-ceph-mon-c-6b97765555-pwcv7 0/2 Pending 0 9m34s <none> <none> <none> <none>
rook-ceph-mon-c-6b97765555-rq646 2/2 Terminating 0 6h27m 10.131.0.29 lon06-worker-1.rdr-site.ibm.com <none> <none>
rook-ceph-mon-d-db9dfc74f-qxpv6 2/2 Running 0 6h52m 10.128.2.116 lon06-worker-0.rdr-site.ibm.com <none> <none>
rook-ceph-operator-65c7df8664-f9htk 1/1 Running 0 6h27m 10.128.2.137 lon06-worker-0.rdr-site.ibm.com <none> <none>
rook-ceph-osd-0-579764f59-nmpbm 0/2 Pending 0 9m34s <none> <none> <none> <none>
rook-ceph-osd-0-579764f59-phqqk 2/2 Terminating 0 6h27m 10.131.0.24 lon06-worker-1.rdr-site.ibm.com <none> <none>
rook-ceph-osd-1-59868c476c-fmt5b 2/2 Running 0 6h52m 10.128.2.113 lon06-worker-0.rdr-site.ibm.com <none> <none>
rook-ceph-osd-2-5c94d44ffc-r7zrt 2/2 Running 0 10h 10.129.2.145 lon06-worker-2.rdr-site.ibm.com <none> <none>

Here is the link to the nodes and pods status: https://github.com/red-hat-storage/ocs-ci/issues/6689#issuecomment-1376159758.

Do you have the operator pod logs?

Sorry for the delay. I didn't have the cluster, so I created a new one and reran the same test case. Attaching the must-gather logs: https://drive.google.com/file/d/1iSSB40AjuVZOkuI-bp9IpeXhM1R6Y_b-/view?usp=sharing

Which mons and OSDs are down in the latest repro? In the must-gather, it appears all the mons and OSDs are up and running.

Travis, I don't have this cluster now; I will re-run it and attach the must-gather as well as post the status of the pods.

How's the repro going, or should we close this?

Created attachment 1944664 [details]
Attaching log file for the testcase
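As an aside, the drain-then-shutdown flow and the force-delete fallback discussed above can be sketched roughly as follows. This is only an illustrative example, not part of the test or of Rook itself; the node and pod names are taken from the listing above, and the exact drain flags may differ between OCP versions.

```shell
# Cordon and drain the worker before powering it off, so its pods are evicted
# cleanly instead of being left behind in Terminating (flags vary slightly by version).
oc adm cordon lon06-worker-1.rdr-site.ibm.com
oc adm drain lon06-worker-1.rdr-site.ibm.com --ignore-daemonsets --delete-emptydir-data

# If the node was shut down without draining and pods are stuck in Terminating,
# they can be force-deleted so replacements can be scheduled; the rook-ceph operator
# is expected to do this on its own after roughly 5-10 minutes.
oc delete pod rook-ceph-mon-c-6b97765555-rq646 -n openshift-storage --grace-period=0 --force
oc delete pod rook-ceph-osd-0-579764f59-phqqk -n openshift-storage --grace-period=0 --force
```

Note that, as pointed out above, mon and OSD pods with node affinity may still sit in Pending after the force delete until the failed node (or a replacement) becomes available.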
Re-ran the testcase on the new cluster. Status of nodes and pods in openshift-storage namespace while running the testcase.

nodes:
(venv) [root@rdr-tier-test-lon06-bastion-0 ocs-ci]# oc get nodes
NAME STATUS ROLES AGE VERSION
lon06-master-0.rdr-tier-test.ibm.com Ready control-plane,master 14h v1.25.4+a34b9e9
lon06-master-1.rdr-tier-test.ibm.com Ready control-plane,master 14h v1.25.4+a34b9e9
lon06-master-2.rdr-tier-test.ibm.com Ready control-plane,master 14h v1.25.4+a34b9e9
lon06-worker-0.rdr-tier-test.ibm.com NotReady worker 14h v1.25.4+a34b9e9
lon06-worker-1.rdr-tier-test.ibm.com Ready worker 14h v1.25.4+a34b9e9
lon06-worker-2.rdr-tier-test.ibm.com Ready worker 14h v1.25.4+a34b9e9

pods:
(venv) [root@rdr-tier-test-lon06-bastion-0 ocs-ci]# oc get pods -n openshift-storage -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-addons-controller-manager-78fcdd568f-pt58p 2/2 Running 0 7m48s 10.128.2.37 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
csi-addons-controller-manager-78fcdd568f-thwqq 2/2 Terminating 0 12h 10.129.2.33 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-gtf66 2/2 Running 0 12h 192.168.0.191 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-j7prp 2/2 Running 0 12h 192.168.0.118 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-provisioner-5d549d8c69-lbb8c 5/5 Running 0 12h 10.131.0.19 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-provisioner-5d549d8c69-nfd8z 5/5 Running 0 7m48s 10.128.2.33 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-provisioner-5d549d8c69-v2f54 5/5 Terminating 0 12h 10.129.2.36 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
csi-cephfsplugin-t97sb 2/2 Running 0 12h 192.168.0.210 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-49b2d 3/3 Running 0 12h 192.168.0.210 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-fk797 3/3 Running 0 12h 192.168.0.118 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-lnjsr 3/3 Running 0 12h 192.168.0.191 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-provisioner-57bf586bdf-8lzrr 6/6 Running 0 12h 10.128.2.18 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-provisioner-57bf586bdf-hmvlr 6/6 Terminating 0 12h 10.129.2.35 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
csi-rbdplugin-provisioner-57bf586bdf-jjzd6 6/6 Running 0 7m48s 10.131.0.34 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
noobaa-core-0 1/1 Running 0 12m 10.128.2.30 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
noobaa-db-pg-0 0/1 Init:0/2 0 12m <none> lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
noobaa-endpoint-59c888797b-lx6nk 1/1 Running 0 12m 10.128.2.32 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
noobaa-operator-5d8bc99c6c-hm4ch 1/1 Running 0 12m 10.131.0.33 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
ocs-metrics-exporter-7f9d9d7b4d-fvm29 1/1 Terminating 0 12h 10.129.2.30 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
ocs-metrics-exporter-7f9d9d7b4d-mcssc 1/1 Running 0 7m48s 10.128.2.42 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
ocs-operator-75bc947494-hvpld 1/1 Terminating 0 12h 10.129.2.29 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
ocs-operator-75bc947494-jzm2d 1/1 Running 0 7m48s 10.128.2.41 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
odf-console-b58fcd554-mm2vz 1/1 Running 0 12h 10.131.0.18 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
odf-operator-controller-manager-dd7849bf5-p8v8z 2/2 Running 0 12h 10.128.2.17 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-crashcollector-lon06-worker-0.rdr-tier-test.ibm.c8k2b 1/1 Terminating 0 12h 10.129.2.39 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
rook-ceph-crashcollector-lon06-worker-0.rdr-tier-test.ibm.vgz5q 0/1 Pending 0 7m48s <none> <none> <none> <none>
rook-ceph-crashcollector-lon06-worker-1.rdr-tier-test.ibm.bkdfs 1/1 Running 0 12h 10.131.0.30 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-crashcollector-lon06-worker-2.rdr-tier-test.ibm.ztzht 1/1 Running 0 12h 10.128.2.25 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-694c449bsfgdk 2/2 Running 0 12h 10.131.0.29 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-bbc7bf9f2jtbg 2/2 Running 0 12h 10.128.2.24 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mgr-a-c58f7c4-8z6f4 2/2 Running 0 12h 10.131.0.22 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mon-a-85bddf5f88-8jxkf 2/2 Running 0 12h 10.128.2.20 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mon-b-565949775b-zx7pl 2/2 Running 0 12h 10.131.0.21 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-mon-c-6df78cddcd-6zfcb 0/2 Pending 0 7m48s <none> <none> <none> <none>
rook-ceph-mon-c-6df78cddcd-kzbm8 2/2 Terminating 0 12h 10.129.2.38 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
rook-ceph-operator-5bbff458c8-59tw8 1/1 Running 0 12h 10.128.2.16 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-osd-0-564c995c75-s6hkt 2/2 Running 0 12h 10.128.2.23 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-osd-1-59f8699f8-d8hf9 2/2 Running 0 12h 10.131.0.25 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-osd-2-79d7d8dd99-42xp4 0/2 Pending 0 7m48s <none> <none> <none> <none>
rook-ceph-osd-2-79d7d8dd99-6hd2w 2/2 Terminating 0 12h 10.129.2.41 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>
rook-ceph-osd-prepare-a62fd06e0299d66f062209ad29b67bf1-b2c5b 0/1 Completed 0 12h 10.131.0.24 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-osd-prepare-db6e1cbc6569656a625958d384d0a7d5-m29cn 0/1 Completed 0 12h 10.128.2.22 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7f7b5c7lt8r6 2/2 Running 0 12h 10.131.0.31 lon06-worker-1.rdr-tier-test.ibm.com <none> <none>
rook-ceph-tools-65ffd8b86d-jfvlb 1/1 Running 0 7m48s 10.128.2.36 lon06-worker-2.rdr-tier-test.ibm.com <none> <none>
rook-ceph-tools-65ffd8b86d-pn8wk 1/1 Terminating 0 12h 10.129.2.42 lon06-worker-0.rdr-tier-test.ibm.com <none> <none>

Must-gather logs: https://drive.google.com/file/d/1QDp-QBtZh6FOGUe4gKJeNWNLCHbp_Vq7/view?usp=sharing
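For readability, output like the above can be narrowed to just the pods that are not healthy. A minimal sketch, assuming the openshift-storage namespace and a 10-minute wait similar to what the test uses; this is illustrative only and not how ocs-ci itself implements the check.

```shell
# Show only pods that are not Running/Completed, along with the node they are bound to.
oc get pods -n openshift-storage -o wide --no-headers | grep -Ev 'Running|Completed'

# Poll for up to 10 minutes (60 x 10s) until nothing is left in Terminating.
for i in $(seq 1 60); do
  oc get pods -n openshift-storage --no-headers | grep -q Terminating || break
  sleep 10
done
```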
My apologies, due to some issues, the same information got commented twice.

(In reply to Aaruni Aggarwal from comment #20)
> Must-gather logs:
> https://drive.google.com/file/d/1QDp-QBtZh6FOGUe4gKJeNWNLCHbp_Vq7/view?usp=sharing

It's not opening for me. It shows some issue with the format.

Could you please check this one - https://drive.google.com/file/d/1DeKLPlkVRZ9CqBEwwdo7OCl77MJ0Mzdo/view?usp=sharing

Hi Aaruni, in the logs attached in comment 23, the cluster status is back to `Health Ok` and there are no pending/terminating pods. Any chance you have the must-gather logs from the cluster while it has pending/terminating pods?

Yes, once the nodes reached the Ready state, Ceph health went to HEALTH_OK, and the pods were in the Ready state.

No Santosh, I don't have the cluster with me now.
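For a future repro, a must-gather taken while the node is still NotReady and the pods are still Terminating/Pending would capture the state being asked about here. A hedged sketch; `<odf-must-gather-image>` is a placeholder for the ODF must-gather image matching the installed 4.12 build.

```shell
# Collect ODF must-gather while the pods are still stuck, before the node comes back.
oc adm must-gather --image=<odf-must-gather-image> --dest-dir=./must-gather-stuck-pods

# Capture the node and pod status at the same time for correlation.
oc get nodes > nodes-while-node-down.txt
oc get pods -n openshift-storage -o wide > pods-while-node-down.txt
```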
I looked at both of the attached must-gathers, and in both I see the pods went back to the Running state. Could you get a must-gather while the pods are stuck in a Terminating state? It would be better if we can get the cluster live. Thanks.

Closing due to inactivity. Please reopen if we can get a live repro to debug.