Bug 1835908
Summary: If OCS node with co-located OSD+operator pod is powered off, time taken for operator pod to force delete the OSD ~45 mins (Bug 1830015#c21)
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: rook
Version: 4.4
Target Release: OCS 4.5.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Neha Berry <nberry>
Assignee: Travis Nielsen <tnielsen>
QA Contact: akarsha <akrai>
CC: akrai, jelopez, madam, ocs-bugs, ratamir, tdesala, tnielsen
Keywords: AutomationBackLog
Hardware: Unspecified
OS: Unspecified
Bug Blocks: 1848184 (view as bug list)
Type: Bug
Last Closed: 2020-09-15 10:17:01 UTC
Description
Neha Berry
2020-05-14 17:30:41 UTC
There are several enhancements that we need to get this working more smoothly:

- At operator startup, skip waiting for mons and OSDs that are stuck (so it doesn't take 45 minutes)
- Check more often (every 30-60s) instead of every 5 minutes for OSDs that are stuck
- Check for mons that are stuck, in addition to OSDs
- Start watching for stuck OSDs immediately when the operator starts, instead of after the first reconcile completes

(See the sketch after the attachment notes below for an illustration of this kind of periodic check.)

There is too much change and risk here to attempt to get it into 4.4, so moving to 4.5.

To provide additional information for the future code, tested the following scenario first.

VMware environment
- 3 Masters
- 6 Workers

Client Version: 4.3.18
Server Version: 4.4.3
Kubernetes Version: v1.17.1

Deployed 6 OSDs.

Failed worker node compute-0, which runs the following:
- mgr.a
- mon.e
- osd.2
- osd.4

Node shutdown at 11:43:30
Node reported as NotReady at 11:44:40

OSD pods were terminated correctly by the new code and restarted:
- osd.2 on compute-1
- osd.4 on compute-1

mgr.a restarted, but the old pod remains in Terminating.
mon.e restarted, but remains in Init 0/2 while the old pod remains in Terminating.
mon.f started, but remains in Pending.

Manually terminated all pods stuck in Terminating and the cluster ended up coming back on its feet with mon.f being removed.

I have attached log oper-3.txt for the trace of events.

Manual termination of all

Created attachment 1688658 [details]
rook-operator log file
After OSD pods were terminated and recovery started and completed.
Created attachment 1688659 [details]
rook-operator log file
rook operator log file after mon.f was removed.
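As a rough illustration of the periodic stuck-pod check described in the enhancement list above, here is a minimal client-go sketch. This is not the actual Rook code; the namespace, check interval, and in-cluster config handling are illustrative assumptions. It only shows the detection half: scanning for pods that are stuck terminating on nodes that are no longer Ready.

```go
// Minimal sketch, NOT the Rook implementation: periodically report pods that
// are stuck terminating on nodes that are no longer Ready.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeReady returns true only if the node reports a Ready condition of True.
func nodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	// Assumes this runs inside the cluster (e.g. in an operator pod).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Check every minute, mirroring the reduced check interval described above.
	for {
		pods, err := client.CoreV1().Pods("openshift-storage").List(context.TODO(), metav1.ListOptions{})
		if err == nil {
			for _, pod := range pods.Items {
				// A non-nil DeletionTimestamp means the pod is terminating.
				if pod.DeletionTimestamp == nil || pod.Spec.NodeName == "" {
					continue
				}
				node, err := client.CoreV1().Nodes().Get(context.TODO(), pod.Spec.NodeName, metav1.GetOptions{})
				if err == nil && !nodeReady(node) {
					fmt.Printf("pod %s appears stuck terminating on NotReady node %s\n", pod.Name, pod.Spec.NodeName)
				}
			}
		}
		time.Sleep(time.Minute)
	}
}
```

The real operator pairs a check like this with additional safety logic (for example, only acting on mon and OSD pods it manages), so treat the sketch as a conceptual outline only.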
(In reply to Jean-Charles Lopez from comment #3)
> To provide additional information for the future code, tested the following
> scenario first.
>
> VMware environment
> - 3 Masters
> - 6 Workers
>
> Client Version: 4.3.18
> Server Version: 4.4.3
> Kubernetes Version: v1.17.1
>
> Deployed 6 OSDs
>
> Failed worker node compute-0 that runs the following
> - mgr.a
> - mon.e
> - osd.2
> - osd.4
>
> Node shutdown at 11:43:30
> Node reported as NotReady at 11:44:40
>
> OSD pods were terminated correctly by the new code and restarted
> - osd.2 on compute-1
> - osd.4 on compute-1
>
> mgr.a restarted but old pod remains in terminating
>
> mon.e restarted but remains in Init 0/2 while old pod remains in Terminating
>
> mon.f started but remains in Pending
>
> Manually terminated all pods stuck in Terminating and the cluster ended up
> coming back on its feet with mon.f being removed.
>
> I have attached log oper-3.txt for the trace of events.
>
> Manual termination of all

The fact that the mon and mgr had to be terminated manually also needs to be fixed (automated), or are we going to recommend manually deleting the terminating pods?

Tested on AWS IPI environment
- 3 Masters
- 6 Workers

Version:
OCP: 4.4.0-0.nightly-2020-05-15-010811
OCS: v4.4.0-420.ci

Poweroff node: ip-10-0-145-21.us-east-2.compute.internal
Available node in the same zone: ip-10-0-158-247.us-east-2.compute.internal

Shut down worker node ip-10-0-145-21.us-east-2.compute.internal, which runs the following:
- noobaa-core-0
- noobaa-db-0
- rook-ceph operator
- mon-a
- osd-2
- rook-ceph-tools

Node shutdown at UTC: Fri May 15 05:35:12 UTC 2020
Node reported as NotReady at Fri May 15 05:36:17 UTC 2020

OSD pods were forcefully terminated at 2020-05-15 06:27:13.346650 by the operator pod, and the new osd-2 pod came up and was running at Fri May 15 06:34:25:
- osd.2 on ip-10-0-158-247.us-east-2.compute.internal

Immediately when the old operator pod went to Terminating state, a new pod came up and was running, but the old operator pod remained in Terminating state.

Fri May 15 05:41:13 UTC 2020
----------------------------
rook-ceph-operator-5f858bfb9f-6qmxz   1/1   Terminating   0   77m   10.131.0.24   ip-10-0-145-21.us-east-2.compute.internal    <none>   <none>
rook-ceph-operator-5f858bfb9f-qqcjq   1/1   Running       0   7s    10.130.2.8    ip-10-0-139-209.us-east-2.compute.internal   <none>   <none>

mon-a started but remains in Pending while the old pod remains in Terminating.

Fri May 15 05:41:13 UTC 2020
----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft    1/1   Terminating   0   75m   10.131.0.29   ip-10-0-145-21.us-east-2.compute.internal    <none>   <none>
rook-ceph-mon-a-5c7bd899f-zlzk7    0/1   Pending       0   7s    <none>        <none>                                       <none>   <none>
rook-ceph-mon-b-866cf745bb-lzknx   1/1   Running       0   75m   10.129.2.18   ip-10-0-129-136.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz   1/1   Running       0   74m   10.128.2.17   ip-10-0-169-205.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-d-5f49d654c8-kqh46   1/1   Running       0   51m   10.130.2.6    ip-10-0-139-209.us-east-2.compute.internal   <none>   <none>

mon-d, mon-e and mon-f came up on the nodes which are available. The mon-e pod came up on the available node ip-10-0-158-247.us-east-2.compute.internal. Still, the new mon-a pod remains in Pending state.
Fri May 15 06:34:17 UTC 2020
-----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft    1/1   Terminating   0   128m   10.131.0.29   ip-10-0-145-21.us-east-2.compute.internal    <none>   <none>
rook-ceph-mon-a-5c7bd899f-zlzk7    0/1   Pending       0   53m    <none>        <none>                                       <none>   <none>
rook-ceph-mon-b-866cf745bb-lzknx   1/1   Running       0   128m   10.129.2.18   ip-10-0-129-136.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz   1/1   Running       0   127m   10.128.2.17   ip-10-0-169-205.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-d-5f49d654c8-kqh46   1/1   Running       0   104m   10.130.2.6    ip-10-0-139-209.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-95dc9df-7slgt      1/1   Running       0   104m   10.128.4.7    ip-10-0-158-247.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-f-7f7d4b748-pj8zr    1/1   Running       0   25s    10.131.2.19   ip-10-0-167-137.us-east-2.compute.internal   <none>   <none>

mon-a was removed by the operator at 2020-05-15 06:34:19.095656, but the old mon-a pod remains in Terminating state only.

From the rook-ceph-operator log:

2020-05-15 06:34:19.095656 I | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/906255844
2020-05-15 06:34:27.665548 I | op-mon: Monitors in quorum: [b c d e f]
2020-05-15 06:34:27.665568 I | op-mon: ensuring removal of unhealthy monitor a
2020-05-15 06:34:27.675570 I | exec: Running command: ceph mon remove a --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/727537139
2020-05-15 06:34:28.031273 I | exec: removing mon.a at [v2:172.30.211.244:3300/0,v1:172.30.211.244:6789/0], there will be 5 monitors
2020-05-15 06:34:28.031372 I | op-mon: removed monitor a

Fri May 15 06:34:32 UTC 2020
----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft    1/1   Terminating   0   128m   10.131.0.29   ip-10-0-145-21.us-east-2.compute.internal    <none>   <none>
rook-ceph-mon-b-866cf745bb-lzknx   1/1   Running       0   128m   10.129.2.18   ip-10-0-129-136.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz   1/1   Running       0   127m   10.128.2.17   ip-10-0-169-205.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-d-5f49d654c8-kqh46   1/1   Running       0   105m   10.130.2.6    ip-10-0-139-209.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-95dc9df-7slgt      1/1   Running       0   104m   10.128.4.7    ip-10-0-158-247.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-f-7f7d4b748-pj8zr    1/1   Running       0   40s    10.131.2.19   ip-10-0-167-137.us-east-2.compute.internal   <none>   <none>

Powered on the node and manually deleted all pods stuck in Terminating state. Attached all logs in the next comment.

Ran a second test on the same environment.

VMware environment
- 3 Masters
- 6 Workers

Client Version: 4.3.18
Server Version: 4.4.3
Kubernetes Version: v1.17.1

Deployed 6 OSDs.

Failed worker node compute-0, which runs the following:
- mon.e
- osd.2
- osd.4
- rook-operator
- noobaa-core-0

It took ~65 minutes for Rook to terminate the OSD pod. It follows the logic highlighted in the other BZ: a 20-minute timeout upon rook-operator start, so 20 (mon.e) + 20 (osd.2) + 20 (osd.4) + 5-minute timer = 65.

Also note that the noobaa-core-0 pod remained in Terminating state as well.
I guess some work will be needed in the noobaa operator, similar to the code added to the rook operator.

We are seeing some issues while running https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py

Below are the steps:
====================
1) Create two Fedora-based DC app pods using node_selector.
2) Identify the node running both the DC app pod and an OSD, and increase the machineset.
3) Wait till the new node comes up and label the node with the ocs storage label.
4) Power off the node identified in step 2 from the AWS console.
5) Wait till the OCS pods on the failed node fail over to another node in the same AZ.
6) The Fedora-based DC app pod should automatically spin up on another node and reach Running state.
7) Do sanity check and health check.

Observations post node failure:
===============================
Note: The ocs-operator and ceph-operator pods are not hosted on the failed node. They were running on a healthy node.

1) The mon pod on the failed node is stuck in Pending state.

rook-ceph-mon-b-848cbc8b4d-fbdc5   1/1   Terminating   0   3h19m
rook-ceph-mon-b-848cbc8b4d-z486j   0/1   Init:0/2      0   79m

Events:
  Type     Reason              Age                  From                                                 Message
  ----     ------              ----                 ----                                                 -------
  Normal   Scheduled           <unknown>            default-scheduler                                    Successfully assigned openshift-storage/rook-ceph-mon-b-848cbc8b4d-z486j to ip-10-0-130-135.us-east-2.compute.internal
  Warning  FailedAttachVolume  109m                 attachdetach-controller                              Multi-Attach error for volume "pvc-c9f4909a-1147-4752-9124-76ccf946b7d7" Volume is already used by pod(s) rook-ceph-mon-b-848cbc8b4d-fbdc5
  Warning  FailedMount         42m (x5 over 82m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override]: timed out waiting for the condition
  Warning  FailedMount         23m (x6 over 98m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring]: timed out waiting for the condition
  Warning  FailedMount         12m (x5 over 96m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl]: timed out waiting for the condition
  Warning  FailedMount         10m (x7 over 107m)   kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log]: timed out waiting for the condition
  Warning  FailedMount         5m58s (x7 over 93m)  kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data]: timed out waiting for the condition
  Warning  FailedMount         85s (x16 over 105m)  kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash]: timed out waiting for the condition

2) The Fedora-based DC app pod is stuck in ContainerCreating state.

pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-jcwhf   0/1   ContainerCreating   0   78m
pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt   1/1   Terminating         0   89m

Events:
  Type     Reason              Age                From                                                 Message
  ----     ------              ----               ----                                                 -------
  Normal   Scheduled           <unknown>          default-scheduler                                    Successfully assigned namespace-test-75a6b474bf9c4e3980dcbd5c43c11813/pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-mstsw to ip-10-0-144-141.us-east-2.compute.internal
  Warning  FailedAttachVolume  20m                attachdetach-controller                              Multi-Attach error for volume "pvc-d64c1f43-e14b-46b7-8c40-ef0316ba8639" Volume is already used by pod(s) pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-2swbx
  Warning  FailedMount         37s (x9 over 18m)  kubelet, ip-10-0-144-141.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition

When the old pods are forcefully deleted using the commands below, both the mon and the Fedora-based DC app pod reach Running state.

oc delete pod rook-ceph-mon-b-848cbc8b4d-fbdc5 --force --grace-period=0
oc delete pod pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt --force --grace-period=0

@Travis,
This is a similar scenario but the observations are different.
Are we seeing different issues here (for the mon and DC app pod)?

Two PRs for this improvement:
- https://github.com/rook/rook/pull/5490
- https://github.com/rook/rook/pull/5556

The improvements include, as stated in comment #2:
- Start monitoring the mons and OSDs immediately when the operator is restarted
- Reduce the time delay between OSD checks down to 1 minute (instead of 5 minutes)
- Force delete the stuck mon pods if they are down (the mon health check is every 45s)

> @Travis,
> This is a similar scenario but the observations are different.
> Are we seeing different issues here (for the mon and DC app pod)?
Yes, this is the same behavior. When a node goes suddenly offline and doesn't have a chance to drain the pods, the pods will remain terminating indefinitely if they have volumes attached. K8s doesn't know if it can safely fail over the volumes if/until there is proper node fencing.
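For reference, the force deletion that the manual `oc delete pod <name> --force --grace-period=0` commands above perform (and that the op-k8sutil log messages in the later test comments show the operator doing automatically) is just a pod delete with a zero grace period. Below is a minimal client-go sketch of that call; the pod name is hypothetical and in-cluster access is assumed, and this is not the actual Rook code:

```go
// Minimal sketch, NOT the Rook implementation: force delete a pod that is
// stuck terminating, equivalent to `oc delete pod <name> --force --grace-period=0`.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// forceDeletePod issues a delete with GracePeriodSeconds=0 so the API server
// removes the pod object even though the kubelet on the dead node can never
// confirm termination. Safety checks (e.g. confirming the node is really down
// and the daemon can be restarted elsewhere) are assumed to happen first.
func forceDeletePod(client kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return client.CoreV1().Pods(namespace).Delete(context.TODO(), name,
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
}

func main() {
	// Assumes this runs inside the cluster (e.g. in an operator pod).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical pod name, for illustration only.
	if err := forceDeletePod(client, "openshift-storage", "rook-ceph-mon-a-example"); err != nil {
		panic(err)
	}
	fmt.Println("force delete issued")
}
```

Note that force deleting only removes the pod object from the API server; the volume still has to be detached from the dead node before it can attach elsewhere, which is consistent with the replacement OSD and mon pods staying in Init for roughly 8 minutes in the tests below.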
Opened a new BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1841611, for the DC app pod issue described in comment 12.

This was merged downstream with https://github.com/openshift/rook/pull/65

The proposal is to include these changes in 4.4.z. If that is approved, here is the PR ready for merge: https://github.com/openshift/rook/pull/67

Moving back to 4.4.z.

Yes actually, I'll restore this back to 4.5 and create a clone to 4.4.

Tested on AWS IPI environment
- 3 Masters
- 6 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-14-213353
OCS: ocs-operator.v4.5.0-487.ci

Poweroff node: ip-10-0-134-101.us-east-2.compute.internal
Available node in the same zone: ip-10-0-139-61.us-east-2.compute.internal

Shut down worker node ip-10-0-134-101.us-east-2.compute.internal, which runs the following pods:
- noobaa-db-0
- noobaa-endpoint
- noobaa-operator
- ocs-operator
- rook-ceph operator
- mon-c
- osd-1
- rook-ceph-tools

Node shutdown at UTC: Wed Jul 15 08:36:23 UTC 2020
osd-1 and mon-c pods went to Terminating state at UTC time Wed Jul 15 08:42:04 UTC 2020

From the rook-operator logs, the mon-c and osd-1 pods that were stuck in Terminating state were forcibly deleted:

2020-07-15 08:42:45.833169 I | op-k8sutil: force deleting pod "rook-ceph-mon-c-7b9498d6bf-d8dgv" that appears to be stuck terminating
2020-07-15 08:42:46.049325 I | op-k8sutil: pod "rook-ceph-mon-c-7b9498d6bf-d8dgv" deletion succeeded
2020-07-15 08:42:46.063502 I | op-k8sutil: deployment "rook-ceph-mon-c" did not change, nothing to update
2020-07-15 08:42:46.063523 I | op-mon: waiting for mon quorum with [a b c]
2020-07-15 08:43:03.965122 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:48} {StateName:active+undersized+degraded Count:48}]". detected drains on "zone": [us-east-2a]
2020-07-15 08:43:04.648102 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:48} {StateName:active+undersized+degraded Count:48}]". detected drains on "zone": [us-east-2a]
2020-07-15 08:43:17.138347 I | op-k8sutil: force deleting pod "rook-ceph-osd-1-855bdfb6f9-5zx7m" that appears to be stuck terminating
2020-07-15 08:43:17.165690 I | op-k8sutil: pod "rook-ceph-osd-1-855bdfb6f9-5zx7m" deletion succeeded

The new mon-c and osd-1 pods came to Running state at UTC time Wed Jul 15 08:50:33 UTC 2020:

rook-ceph-mon-a-888b9cb8c-q8vds    1/1   Running   0   132m    10.131.0.16   ip-10-0-205-141.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-b-69447bb445-lbrdb   1/1   Running   0   132m    10.128.2.16   ip-10-0-168-220.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-7b9498d6bf-hxr5n   1/1   Running   0   8m38s   10.130.2.10   ip-10-0-139-61.us-east-2.compute.internal    <none>   <none>
rook-ceph-osd-0-6ff748c449-jxwx9   1/1   Running   0   131m    10.131.0.19   ip-10-0-205-141.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-1-855bdfb6f9-pmz57   1/1   Running   0   8m39s   10.130.2.9    ip-10-0-139-61.us-east-2.compute.internal    <none>   <none>
rook-ceph-osd-2-5f84c7bff6-mqvb8   1/1   Running   0   130m    10.128.2.21   ip-10-0-168-220.us-east-2.compute.internal   <none>   <none>

Node powered on at UTC: Wed Jul 15 09:19:54 UTC 2020

Observations:
- The rook-operator pod respins on another node within a minute or two, detects the terminating OSD and MON pods, and issues force deletion.
- The newly spun-up OSD and MON pods stayed in Init state on another OCS node (in the same AZ) for ~8 mins and then moved to Running state.
- The other old pods hosted on the shut-down node remain in Terminating state unless the shut-down node is brought up.
Also observed that noobaa-endpoint keeps restarting and the ocs-operator pod is in 0/1 state. Is this expected?

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/logs-after/pods-when-node-down

noobaa-db-0                           1/1   Terminating   0   154m   10.129.2.25    ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
noobaa-endpoint-7d6b47c7d5-2ljfg      1/1   Running       5   32m    10.131.2.3     ip-10-0-181-240.us-east-2.compute.internal   <none>   <none>
noobaa-endpoint-7d6b47c7d5-929c4      1/1   Terminating   0   153m   10.129.2.28    ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
noobaa-operator-5f9d557669-ccxsg      1/1   Running       0   32m    10.128.4.4     ip-10-0-199-89.us-east-2.compute.internal    <none>   <none>
noobaa-operator-5f9d557669-v7cnp      1/1   Terminating   0   158m   10.129.2.15    ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
ocs-operator-75b4fbfbff-bdb64         1/1   Terminating   0   158m   10.129.2.13    ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
ocs-operator-75b4fbfbff-hpznp         0/1   Running       0   32m    10.130.2.3     ip-10-0-139-61.us-east-2.compute.internal    <none>   <none>
rook-ceph-operator-7cd55d84f6-8ftwk   1/1   Terminating   0   158m   10.129.2.14    ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
rook-ceph-operator-7cd55d84f6-gvd52   1/1   Running       0   32m    10.130.2.4     ip-10-0-139-61.us-east-2.compute.internal    <none>   <none>
rook-ceph-tools-5f65c6998c-9jrxw      1/1   Terminating   0   154m   10.0.134.101   ip-10-0-134-101.us-east-2.compute.internal   <none>   <none>
rook-ceph-tools-5f65c6998c-sx7n4      1/1   Running       0   32m    10.0.199.89    ip-10-0-199-89.us-east-2.compute.internal    <none>   <none>

- Once the node is up and running, all pods are in Running state.
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/logs-after/pods-when-all-nodes-up

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/

Tested on VMware environment
- 3 Masters
- 3 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-23-201307
OCS: ocs-operator.v4.5.0-494.ci

Poweroff node: compute-0

Pods hosted on node compute-0:
- noobaa-core-0
- noobaa-operator
- ocs-operator
- rook-ceph-mon-a
- rook-ceph-operator
- rook-ceph-osd-0
- rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a
- rook-ceph-tools

Once the compute-0 node went down, the rook-operator pod respun on another node within a minute and issued force deletion of the OSD and mon pods that were in Terminating state:

2020-07-24 05:36:13.037287 I | op-k8sutil: force deleting pod "rook-ceph-mon-a-c75986547-4mfkv" that appears to be stuck terminating
2020-07-24 05:36:13.253668 I | op-k8sutil: pod "rook-ceph-mon-a-c75986547-4mfkv" deletion succeeded
2020-07-24 05:36:36.519151 I | op-k8sutil: force deleting pod "rook-ceph-osd-0-68bc4674fc-gsdc4" that appears to be stuck terminating
2020-07-24 05:36:36.539129 I | op-k8sutil: pod "rook-ceph-osd-0-68bc4674fc-gsdc4" deletion succeeded

The newly spun-up OSD and MON pods stayed in Init state for ~8 mins and then moved to Running state. When the node was back up, all pods were in Running state.
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2chww                                            3/3     Running     0          50m
csi-cephfsplugin-cqnbr                                            3/3     Running     0          50m
csi-cephfsplugin-jzct8                                            3/3     Running     0          50m
csi-cephfsplugin-provisioner-657697dd75-7twmd                     5/5     Running     0          50m
csi-cephfsplugin-provisioner-657697dd75-h5r9t                     5/5     Running     0          50m
csi-rbdplugin-9krzs                                               3/3     Running     0          50m
csi-rbdplugin-gpz5z                                               3/3     Running     0          50m
csi-rbdplugin-provisioner-bfd6f845d-bx45b                         5/5     Running     0          7m35s
csi-rbdplugin-provisioner-bfd6f845d-k5w9r                         5/5     Running     0          50m
csi-rbdplugin-w72fx                                               3/3     Running     0          50m
noobaa-core-0                                                     1/1     Running     0          5m42s
noobaa-db-0                                                       1/1     Running     0          46m
noobaa-endpoint-dd9cfd7bb-cbz45                                   1/1     Running     0          43m
noobaa-operator-695884dc65-tpssm                                  1/1     Running     0          7m35s
ocs-operator-7f48b9f695-qls4d                                     1/1     Running     0          7m35s
rook-ceph-crashcollector-compute-0-57fc68d86f-lqwfb               1/1     Running     0          7m35s
rook-ceph-crashcollector-compute-1-7b8976cb46-tsbgr               1/1     Running     0          46m
rook-ceph-crashcollector-compute-2-5f69ff4cf5-h27xd               1/1     Running     0          46m
rook-ceph-drain-canary-compute-0-6f56886d99-25qjh                 1/1     Running     0          5m22s
rook-ceph-drain-canary-compute-1-59df658f74-dpbvr                 1/1     Running     0          46m
rook-ceph-drain-canary-compute-2-7cc7b64957-xsr4q                 1/1     Running     0          46m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7ccb46596lr55   1/1     Running     0          45m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-c94fdc949kqd4   1/1     Running     0          45m
rook-ceph-mgr-a-b4f6946b-pnvxn                                    1/1     Running     0          47m
rook-ceph-mon-a-c75986547-4wk46                                   1/1     Running     0          7m35s
rook-ceph-mon-b-c67c764f-mtbbl                                    1/1     Running     0          48m
rook-ceph-mon-c-77f86fbddf-5nk9g                                  1/1     Running     0          47m
rook-ceph-operator-5b587bb9db-xvxp7                               1/1     Running     0          7m34s
rook-ceph-osd-0-68bc4674fc-rvtr4                                  1/1     Running     0          7m34s
rook-ceph-osd-1-c698d9f87-hw8pg                                   1/1     Running     0          46m
rook-ceph-osd-2-6ccc5c4456-lq85n                                  1/1     Running     0          46m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-mbh8v-l8gds          0/1     Completed   0          47m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-6s9sk-j8kmc          0/1     Completed   0          47m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9f79699d5gr4   1/1     Running     0          7m35s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b6c55667fg86   1/1     Running     0          45m
rook-ceph-tools-7b6559df78-gmz7d                                  1/1     Running     0          7m35s

Moving the BZ to verified state based on Comment#25 and Comment#26.

Conclusion:
- The rook-operator pod respins on another node within a minute or two, detects the terminating OSD and MON pods, and issues force deletion.
- The newly spun-up OSD and MON pods stayed in Init state on another OCS node (in the same AZ) for ~8 mins and then moved to Running state.
- The other old pods hosted on the shut-down node remain in Terminating state unless the shut-down node is brought up or they are forcibly deleted (bz#1783961).
- The new ocs-operator pod is in 0/1 state unless the old pod is deleted manually or the node is brought up; tracked by bz#1858883.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754