Bug 2159757

Summary: After shutting down a worker node, some of the rook ceph pods are stuck in a Terminating state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Itzhak <ikave>
Component: rook
Assignee: Subham Rai <srai>
Status: CLOSED INSUFFICIENT_DATA
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.12
CC: aaaggarw, akandath, amagrawa, bniver, muagarwa, ocs-bugs, odf-bz-bot, sapillai, sostapov, srai, tnielsen
Target Milestone: ---
Flags: srai: needinfo? (aaaggarw)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-23 21:42:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Attaching log file for the testcase (no flags)

Description Itzhak 2023-01-10 15:54:36 UTC
Description of problem (please be as detailed as possible and provide log snippets):

When shutting down a worker node, some of the rook ceph pods are stuck in a Terminating state instead of being removed from the cluster.

Version of all relevant components (if applicable):

IBM Z platform, OCP 4.12, ODF 4.12.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
I don't think so, but the behavior is not what we expect.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
I am not sure, but I have not seen this behavior on vSphere or AWS in the last year.

Steps to Reproduce:
1. Shut down a worker node. 
2. Check the rook ceph pods' status (see the example commands below).
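
For reference, one way to observe the state described in these steps (a rough sketch; the namespace is assumed to be the default openshift-storage):

# Confirm the node is down, then look for rook ceph pods left behind on it
oc get nodes
oc get pods -n openshift-storage -o wide | grep rook-ceph
# Pods stuck in the Terminating state can be filtered like this
oc get pods -n openshift-storage --no-headers | grep Terminating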


Actual results:
Some of the rook ceph pods are stuck in a Terminating state.

Expected results:
All the rook ceph pods in the Terminating state should be removed from the cluster. 


Additional info:

Comment 4 Aaruni Aggarwal 2023-01-13 05:26:42 UTC
OCP, ODF and ceph version: 

ODF version: 4.12.0-156

OCP version: 4.12.0-rc.6

ceph version: 16.2.10-90.el8cp (821b516c325c19f31b81b943cd800c2190f1e685) pacific (stable)

Comment 5 Travis Nielsen 2023-01-13 15:36:19 UTC
Did you drain the worker node before shutting it down? If a node is shut down before being drained, the pods that were running on that node can easily get stuck in Terminating. You should be able to force delete the pods, but then they will likely be stuck in a Pending state anyway, waiting for a node to become available so they can start up again.
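
For reference, force deleting such a stuck pod would look roughly like this (the pod name is a placeholder):

oc delete pod <stuck-pod-name> -n openshift-storage --grace-period=0 --force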

If the node is brought back online do the pods properly terminate and restart?

Comment 6 Itzhak 2023-01-15 13:55:09 UTC
No. The goal of the test is to check rook ceph pod recovery after a worker node failure.
The test was added because of an old bug that was raised in the past.

After the node was brought back, all the pods terminated and restarted, and Ceph health was OK.
It's not a critical issue, but the expectation is that the rook ceph pods will be removed rather than remaining stuck in a Terminating state.

Comment 7 Itzhak 2023-01-15 15:05:00 UTC
With the vSphere platform, I didn't see this problem. Here is an example of a vSphere test run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull.

Comment 8 Travis Nielsen 2023-01-17 17:31:47 UTC
(In reply to Itzhak from comment #6)
> No. The goal of the test is to check the rook ceph pods recovery after a
> worker node failure. 
> It was due to an old bug that was raised in the past. 
> 
> After the node was brought back, all the pods terminated and restarted, and
> Ceph health was OK.
> It's not a critical issue, but the expectation is that the rook ceph pods
> will be removed and not stuck in a Terminating state.

So everything succeeded after the node came back online? Then it sounds like the goal of the test was met, right? In that case I don't understand why there is an issue.

Which pods specifically were stuck in terminating? If they are the mons and OSDs that have node affinity, it's expected for them not to move to another node anyway.
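
For illustration, the placement constraints could be checked on the affected deployments with something along these lines (deployment names taken from the stuck pods in comment 9):

oc get deployment rook-ceph-mon-c -n openshift-storage -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}'
oc get deployment rook-ceph-osd-0 -n openshift-storage -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}'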

Comment 9 Itzhak 2023-01-18 14:26:25 UTC
Not exactly.
The test checks that the rook ceph pods do not get stuck in a Terminating state - this behavior was implemented in the past on the Rook side.

I see that three pods were stuck in a Terminating state: 
rook-ceph-crashcollector-lon06-worker-1.rdr-site.ibm.com-6qnx4p   1/1     Terminating   0          6h22m   10.131.0.28     lon06-
rook-ceph-mon-c-6b97765555-rq646                                  2/2     Terminating   0          6h27m   10.131.0.29     lon06-
rook-ceph-osd-0-579764f59-phqqk                                   2/2     Terminating   0      


Here is the old bug I referred to https://bugzilla.redhat.com/show_bug.cgi?id=1861021.

Comment 10 Travis Nielsen 2023-01-18 23:01:22 UTC
A few more questions (example commands for gathering this information follow the list):
1. Is the operator pod still running? Please share its log.
2. How long did you wait? It can take 5-10 minutes before the pods are force deleted, allowing them to move.
3. Do mon-c and osd-0 have pods stuck in Pending while waiting for these to terminate?
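
For reference, this information could be gathered roughly like this (a sketch; names and labels assume a default Rook/ODF deployment):

# 1. Check that the operator pod is running and save its log
oc get pods -n openshift-storage -l app=rook-ceph-operator
oc logs -n openshift-storage deploy/rook-ceph-operator > rook-ceph-operator.log
# 3. Look for replacement pods stuck in Pending alongside the Terminating ones
oc get pods -n openshift-storage -o wide | grep -E 'Pending|Terminating'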

Comment 11 Itzhak 2023-01-24 18:03:18 UTC
Sorry for the late reply. 

1. I didn't run the test, so I am unsure about the operator pod log. Maybe Aaruni Aggarwal can share more details about it.

2. The test waits 10 minutes for the pods to be deleted.

3. Yes, I see that there were other pods in the Pending state: 

rook-ceph-mgr-a-55654896fb-h99wm                                  2/2     Running       0          9h      10.129.2.159    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-a-856f7d5784-65crq                                  2/2     Running       0          10h     10.129.2.152    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-c-6b97765555-pwcv7                                  0/2     Pending       0          9m34s   <none>          <none>                            <none>           <none>
rook-ceph-mon-c-6b97765555-rq646                                  2/2     Terminating   0          6h27m   10.131.0.29     lon06-worker-1.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-d-db9dfc74f-qxpv6                                   2/2     Running       0          6h52m   10.128.2.116    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-operator-65c7df8664-f9htk                               1/1     Running       0          6h27m   10.128.2.137    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-0-579764f59-nmpbm                                   0/2     Pending       0          9m34s   <none>          <none>                            <none>           <none>
rook-ceph-osd-0-579764f59-phqqk                                   2/2     Terminating   0          6h27m   10.131.0.24     lon06-worker-1.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-1-59868c476c-fmt5b                                  2/2     Running       0          6h52m   10.128.2.113    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-2-5c94d44ffc-r7zrt                                  2/2     Running       0          10h     10.129.2.145    lon06-worker-2.rdr-site.ibm.com   <none>           <none>


Here is the link to the nodes and pods status: https://github.com/red-hat-storage/ocs-ci/issues/6689#issuecomment-1376159758.

Comment 12 Itzhak 2023-01-24 18:04:53 UTC
Do you have the operator pod logs?

Comment 13 Aaruni Aggarwal 2023-01-31 12:10:28 UTC
Sorry for the delay. I didn't have the cluster, so I created a new one and reran the same test case. Attaching the must-gather logs: 
https://drive.google.com/file/d/1iSSB40AjuVZOkuI-bp9IpeXhM1R6Y_b-/view?usp=sharing

Comment 15 Travis Nielsen 2023-02-02 22:25:41 UTC
Which mons and OSDs are down in the latest repro? In the must-gather, it appears all the mons and OSDs are up and running.

Comment 16 Aaruni Aggarwal 2023-02-07 04:48:06 UTC
Travis, I don't have this cluster now. I will re-run it, attach the must-gather, and post the status of the pods.

Comment 17 Travis Nielsen 2023-02-14 15:22:57 UTC
How is the repro coming along, or should we close this?

Comment 18 Aaruni Aggarwal 2023-02-17 05:58:34 UTC
Created attachment 1944664 [details]
Attaching log file for the testcase

Comment 19 Aaruni Aggarwal 2023-02-17 06:02:41 UTC
Re-ran the testcase on the new cluster. Below is the status of nodes and pods in the openshift-storage namespace while running the testcase.

nodes: 

(venv) [root@rdr-tier-test-lon06-bastion-0 ocs-ci]# oc get nodes
NAME                                   STATUS     ROLES                  AGE   VERSION
lon06-master-0.rdr-tier-test.ibm.com   Ready      control-plane,master   14h   v1.25.4+a34b9e9
lon06-master-1.rdr-tier-test.ibm.com   Ready      control-plane,master   14h   v1.25.4+a34b9e9
lon06-master-2.rdr-tier-test.ibm.com   Ready      control-plane,master   14h   v1.25.4+a34b9e9
lon06-worker-0.rdr-tier-test.ibm.com   NotReady   worker                 14h   v1.25.4+a34b9e9
lon06-worker-1.rdr-tier-test.ibm.com   Ready      worker                 14h   v1.25.4+a34b9e9
lon06-worker-2.rdr-tier-test.ibm.com   Ready      worker                 14h   v1.25.4+a34b9e9

pods: 

(venv) [root@rdr-tier-test-lon06-bastion-0 ocs-ci]# oc get pods -n openshift-storage -o wide
NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE                                   NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-78fcdd568f-pt58p                    2/2     Running       0          7m48s   10.128.2.37     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
csi-addons-controller-manager-78fcdd568f-thwqq                    2/2     Terminating   0          12h     10.129.2.33     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-gtf66                                            2/2     Running       0          12h     192.168.0.191   lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-j7prp                                            2/2     Running       0          12h     192.168.0.118   lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-5d549d8c69-lbb8c                     5/5     Running       0          12h     10.131.0.19     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-5d549d8c69-nfd8z                     5/5     Running       0          7m48s   10.128.2.33     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-5d549d8c69-v2f54                     5/5     Terminating   0          12h     10.129.2.36     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
csi-cephfsplugin-t97sb                                            2/2     Running       0          12h     192.168.0.210   lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-49b2d                                               3/3     Running       0          12h     192.168.0.210   lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-fk797                                               3/3     Running       0          12h     192.168.0.118   lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-lnjsr                                               3/3     Running       0          12h     192.168.0.191   lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-57bf586bdf-8lzrr                        6/6     Running       0          12h     10.128.2.18     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-57bf586bdf-hmvlr                        6/6     Terminating   0          12h     10.129.2.35     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-57bf586bdf-jjzd6                        6/6     Running       0          7m48s   10.131.0.34     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          12m     10.128.2.30     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
noobaa-db-pg-0                                                    0/1     Init:0/2      0          12m     <none>          lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
noobaa-endpoint-59c888797b-lx6nk                                  1/1     Running       0          12m     10.128.2.32     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
noobaa-operator-5d8bc99c6c-hm4ch                                  1/1     Running       0          12m     10.131.0.33     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
ocs-metrics-exporter-7f9d9d7b4d-fvm29                             1/1     Terminating   0          12h     10.129.2.30     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
ocs-metrics-exporter-7f9d9d7b4d-mcssc                             1/1     Running       0          7m48s   10.128.2.42     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
ocs-operator-75bc947494-hvpld                                     1/1     Terminating   0          12h     10.129.2.29     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
ocs-operator-75bc947494-jzm2d                                     1/1     Running       0          7m48s   10.128.2.41     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
odf-console-b58fcd554-mm2vz                                       1/1     Running       0          12h     10.131.0.18     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
odf-operator-controller-manager-dd7849bf5-p8v8z                   2/2     Running       0          12h     10.128.2.17     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-0.rdr-tier-test.ibm.c8k2b   1/1     Terminating   0          12h     10.129.2.39     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-0.rdr-tier-test.ibm.vgz5q   0/1     Pending       0          7m48s   <none>          <none>                                 <none>           <none>
rook-ceph-crashcollector-lon06-worker-1.rdr-tier-test.ibm.bkdfs   1/1     Running       0          12h     10.131.0.30     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-2.rdr-tier-test.ibm.ztzht   1/1     Running       0          12h     10.128.2.25     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-694c449bsfgdk   2/2     Running       0          12h     10.131.0.29     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-bbc7bf9f2jtbg   2/2     Running       0          12h     10.128.2.24     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mgr-a-c58f7c4-8z6f4                                     2/2     Running       0          12h     10.131.0.22     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mon-a-85bddf5f88-8jxkf                                  2/2     Running       0          12h     10.128.2.20     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mon-b-565949775b-zx7pl                                  2/2     Running       0          12h     10.131.0.21     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-mon-c-6df78cddcd-6zfcb                                  0/2     Pending       0          7m48s   <none>          <none>                                 <none>           <none>
rook-ceph-mon-c-6df78cddcd-kzbm8                                  2/2     Terminating   0          12h     10.129.2.38     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-operator-5bbff458c8-59tw8                               1/1     Running       0          12h     10.128.2.16     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-osd-0-564c995c75-s6hkt                                  2/2     Running       0          12h     10.128.2.23     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-osd-1-59f8699f8-d8hf9                                   2/2     Running       0          12h     10.131.0.25     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-osd-2-79d7d8dd99-42xp4                                  0/2     Pending       0          7m48s   <none>          <none>                                 <none>           <none>
rook-ceph-osd-2-79d7d8dd99-6hd2w                                  2/2     Terminating   0          12h     10.129.2.41     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-osd-prepare-a62fd06e0299d66f062209ad29b67bf1-b2c5b      0/1     Completed     0          12h     10.131.0.24     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-osd-prepare-db6e1cbc6569656a625958d384d0a7d5-m29cn      0/1     Completed     0          12h     10.128.2.22     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7f7b5c7lt8r6   2/2     Running       0          12h     10.131.0.31     lon06-worker-1.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-tools-65ffd8b86d-jfvlb                                  1/1     Running       0          7m48s   10.128.2.36     lon06-worker-2.rdr-tier-test.ibm.com   <none>           <none>
rook-ceph-tools-65ffd8b86d-pn8wk                                  1/1     Terminating   0          12h     10.129.2.42     lon06-worker-0.rdr-tier-test.ibm.com   <none>           <none>

Must-gather logs: 

https://drive.google.com/file/d/1QDp-QBtZh6FOGUe4gKJeNWNLCHbp_Vq7/view?usp=sharing

Comment 20 Aaruni Aggarwal 2023-02-17 06:03:14 UTC
(Same content as comment 19, posted twice by mistake; see comment 21.)

Comment 21 Aaruni Aggarwal 2023-02-17 06:09:32 UTC
My apologies - due to some issues, the same information got posted twice.

Comment 22 Santosh Pillai 2023-02-28 10:36:50 UTC
(In reply to Aaruni Aggarwal from comment #20)

> Must-gather logs: 
> 
> https://drive.google.com/file/d/1QDp-QBtZh6FOGUe4gKJeNWNLCHbp_Vq7/
> view?usp=sharing

It's not opening for me - it shows some issue with the format.

Comment 23 Aaruni Aggarwal 2023-03-02 04:07:20 UTC
Could you please check this one - https://drive.google.com/file/d/1DeKLPlkVRZ9CqBEwwdo7OCl77MJ0Mzdo/view?usp=sharing

Comment 25 Santosh Pillai 2023-03-27 02:36:59 UTC
Hi Aaruni

In the logs attached in comment 23, the cluster status is back to `HEALTH_OK` and there are no pending/terminating pods. Any chance you have the must-gather logs from when the cluster had pending/terminating pods?

Comment 27 Aaruni Aggarwal 2023-04-17 11:36:24 UTC
Yes, once the nodes reached the Ready state, Ceph health went to HEALTH_OK, and the pods were back in the Ready state.

No Santosh, I don't have the cluster with me now.

Comment 28 Subham Rai 2023-06-12 13:23:41 UTC
I looked at both of the attached must-gathers, and in both I see the pods went back to a Running state. Could you capture the must-gather while the pods are stuck in a Terminating state?
It would be even better if we could get access to the live cluster. Thanks.
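
For reference, capturing the ODF must-gather while the pods are still stuck would look roughly like this (the image name/tag below is an example and depends on the installed ODF release):

oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.12 --dest-dir=must-gather-terminating-pods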

Comment 29 Travis Nielsen 2023-06-23 21:42:22 UTC
Closing due to inactivity. Please reopen if we can get a live repro to debug.