Bug 1835908 - If OCS node with co-located OSD+operator pod is powered off, time taken for operator pod to force delete the OSD ~45 mins (Bug 1830015#c21)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Travis Nielsen
QA Contact: akarsha
URL:
Whiteboard:
Depends On:
Blocks: 1848184
 
Reported: 2020-05-14 17:30 UTC by Neha Berry
Modified: 2020-09-23 09:05 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1848184 (view as bug list)
Environment:
Last Closed: 2020-09-15 10:17:01 UTC
Embargoed:


Attachments
rook-operator log file (298.75 KB, text/plain)
2020-05-14 20:56 UTC, Jean-Charles Lopez
rook-operator log file (313.46 KB, text/plain)
2020-05-14 20:56 UTC, Jean-Charles Lopez


Links
Github rook/rook pull 5490 (closed): Only run the cluster monitoring goroutines once (last updated 2020-12-09 14:49:19 UTC)
Github rook/rook pull 5556 (closed): Handle node failure more gracefully (last updated 2020-12-09 14:48:52 UTC)
Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15 10:17:23 UTC)

Description Neha Berry 2020-05-14 17:30:41 UTC
Description of problem (please be detailed as possible and provide log
snippests):
----------------------------------------------------------------------
Node failure: automatically detect stuck mon and osd pods and force delete

Currently, in OCS 4.4, the fix for Bug 1833082 includes force deleting the terminating OSD pod, but it can take 10-15 mins after restart.

However, after a discussion with Travis, this BZ is raised to track the issue reported in Bug 1830015#c21: when an OCS node with co-located rook-operator and OSD pods is powered off, the OSD pod stuck in Terminating state may take up to 45 mins to get deleted. This delays the new OSD pod from reaching Running state on another node in the same AZ. See the comment in Bug 1833082#c10 as well for clarity.

Hence, we need an enhancement that automatically detects stuck mon and OSD pods and force deletes them, so that recovering a pod does not take 45 mins.
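For illustration only, the kind of automation requested here, force deleting an OSD or mon pod whose DeletionTimestamp is set while its node is no longer Ready, could look roughly like the client-go sketch below. This is a minimal sketch with assumed names (isNodeReady, forceDeleteStuckPods, the label selector); it is not Rook's actual implementation.

package stuckpods

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isNodeReady reports whether the node hosting a pod still reports Ready.
func isNodeReady(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}

// forceDeleteStuckPods force deletes pods matching the selector that are
// stuck terminating (DeletionTimestamp set) on a node that is not Ready.
func forceDeleteStuckPods(ctx context.Context, cs kubernetes.Interface, namespace, selector string) error {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	zero := int64(0)
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // pod is not terminating
		}
		ready, err := isNodeReady(ctx, cs, pod.Spec.NodeName)
		if err != nil || ready {
			continue // only act when the pod's node itself is down
		}
		log.Printf("force deleting pod %q that appears to be stuck terminating", pod.Name)
		if err := cs.CoreV1().Pods(namespace).Delete(ctx, pod.Name,
			metav1.DeleteOptions{GracePeriodSeconds: &zero}); err != nil {
			return err
		}
	}
	return nil
}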


Version of all relevant components (if applicable):
----------------------------------------------------------------------
Tested on OCP 4.3.18
OCS 4.4.0 RC5


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
----------------------------------------------------------------------
Yes

Is there any workaround available to the best of your knowledge?
----------------------------------------------------------------------
The manual workaround of force deleting the pod is not acceptable, hence it was automated.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------------
3

Can this issue reproducible?
----------------------------------------------------------------------
Yes

Can this issue reproduce from the UI?
----------------------------------------------------------------------
No 

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------------

Steps to Reproduce:
----------------------------------------------------------------------
1. Deploy OCS on 3 masters, 6 workers or 3M,3W
2. Power off a node which has the rook-ceph-operator, OSD, and MON pods running on it
3. The OSD and MON pods get stuck in Terminating state, and it takes ~45 mins for the operator pod to delete the Terminating OSD.

For more details, see Additional Info


Actual results:
----------------------------------------------------------------------
If the operator was on the same node as the OSD and MON pods and was restarted when the node went offline, the operator force deletes the stuck Terminating pod only after ~45 mins (too long). As a result, the OSD takes >45 mins to recover on another node.

Expected results:
----------------------------------------------------------------------
It should not take 45 mins for the operator pod to force delete an OSD when the powered-off node had co-located operator/OSD/MON pods.

Additional info:
----------------------------------------------------------------------

Test setup details from https://bugzilla.redhat.com/show_bug.cgi?id=1830015#c21 added here for reference

Logs attached here - https://bugzilla.redhat.com/show_bug.cgi?id=1830015#c22


Deployed OCS 4.4 RC5 with the following
3 masters (m5.2xlarge)
6 workers (i3.4xlarge)

Deployed OCS on 3 nodes

Failed one node from AWS console (running mon.c, osd.0 and rook-operator)
mon.c remains in terminating (old pod) + mon.c in init 0/4 (new pod)
osd.0 remains in terminating (old pod) + osd.0 in init 0/4 (new pod)
rook-operator remains in terminating (old pod) + rook-operator in Running (new pod)

Both new pods show the multi-attach error.

After 45 minutes, the terminating osd.0 pod was effectively deleted by rook-ceph-operator.
New rook-operator pod start of log file
2020-05-11 17:59:29.727976 I | rookcmd: starting Rook 4.4-13.3a9bf923.release_4.4 with arguments '/usr/local/bin/rook ceph operator'
2020-05-11 17:59:29.728101 I | rookcmd: flag values: --add_dir_header=false, --alsologtostderr=false, --csi-attacher-image=registry.redhat.io/openshift4/ose-csi-external-attacher@sha256:fb9f73ed22b4241eba25e71b63aa6729da

New rook-operator pod message when finally detecting the stuck osd.0
2020-05-11 18:45:19.962313 I | op-osd: force deleting pod "rook-ceph-osd-0-54c6c7ff5b-t5bht" that appears to be stuck terminating
2020-05-11 18:45:19.976498 I | op-osd: pod "rook-ceph-osd-0-54c6c7ff5b-t5bht" deletion succeeded

From that point, osd.0 goes to Running state and the cluster recovers, but it still reports mon.c as DOWN. Manually deleting the terminating mon.c pod finally fixed the cluster.

From Travis's analysis, the operator got stuck for 20 minutes waiting for the failed MON (mon.c) to respond, another 20 minutes waiting for the failed OSD (osd.0) to respond, and then the expected 5-minute timer to take action.

Restarting the failed node had no effect and the cluster remained healthy.

Comment 2 Travis Nielsen 2020-05-14 18:27:25 UTC
There are several enhancements that we need to get this working more smoothly.
- At operator startup, skip waiting for mons and osds that are stuck (so it doesn't take 45 minutes)
- Check more often (every 30-60s) instead of every 5 minutes for OSDs that are stuck
- Check for mons that are stuck, in addition to OSDs
- Start watching for stuck OSDs immediately when the operator starts instead of after the first reconcile completes

There is too much change and risk here to attempt to get it into 4.4, so moving to 4.5.
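A rough sketch of the periodic check described above, reusing the forceDeleteStuckPods helper sketched under the bug description (the 60-second interval and the app label selector are illustrative assumptions, not the exact values Rook ships with; add "time" to the imports):

// startStuckPodMonitor is started once, right at operator startup, so that
// stuck pod detection does not wait for the first full reconcile to finish.
func startStuckPodMonitor(ctx context.Context, cs kubernetes.Interface, namespace string) {
	go func() {
		ticker := time.NewTicker(60 * time.Second) // check every 30-60s instead of every 5 minutes
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// check mons as well as OSDs
				sel := "app in (rook-ceph-mon,rook-ceph-osd)"
				if err := forceDeleteStuckPods(ctx, cs, namespace, sel); err != nil {
					log.Printf("stuck pod check failed: %v", err)
				}
			}
		}
	}()
}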

Comment 3 Jean-Charles Lopez 2020-05-14 20:54:57 UTC
To provide additional information for the future code, I tested the following scenario first.

VMWare environment
- 3 Masters
- 6 Workers

Client Version: 4.3.18
Server Version: 4.4.3
Kubernetes Version: v1.17.1

Deployed 6 OSDs

Failed worker node compute-0 that runs the following:
- mgr.a
- mon.e
- osd.2
- osd.4

Node shutdown at 11:43:30
Node reported as NotReady at 11:44:40

OSD pods were terminated correctly by the new code and restarted
- osd.2 on compute-1
- osd.4 on compute-1

mgr.a restarted but old pod remains in terminating

mon.e restarted but remains in Init 0/2 while old pod remains in Terminating

mon.f started but remains in Pending

Manually terminated all pods stuck in Terminating, and the cluster ended up coming back on its feet with mon.f being removed.

I have attached log oper-3.txt for the trace of events.


Comment 4 Jean-Charles Lopez 2020-05-14 20:56:22 UTC
Created attachment 1688658 [details]
rook-operator log file

After OSD pods were terminated and recovery started and completed.

Comment 5 Jean-Charles Lopez 2020-05-14 20:56:59 UTC
Created attachment 1688659 [details]
rook-operator log file

rook operator log file after mon.f was removed.

Comment 6 Neha Berry 2020-05-15 08:24:08 UTC
(In reply to Jean-Charles Lopez from comment #3)
> To provide additional information for the future code tested the following
> scenario first
> 
> VMWare environment
> - 3 Masters
> - 6 Workers
> 
> Client Version: 4.3.18
> Server Version: 4.4.3
> Kubernetes Version: v1.17.1
> 
> Deployed 6 OSDs
> 
> Failed worker node compute-0 tha runs the following
> - mgr.a
> - mon.e
> - osd.2
> - osd.4
> 
> Node shutdown at 11:43:30
> Node reported as NotReady at 11:44:40
> 
> OSD pods were terminated correctly by the new code and restarted
> - osd.2 on compute-1
> - osd.4 on compute-1
> 
> mgr.a restarted but old pod remains in terminating
> 
> mon.e restarted but remains in Init 0/2 while old pod remains in Terminating
> 
> mon.f started but remains in Pending
> 
> Manually terinated all pods stuck in Terminating and the cluster ended up
> coming back on its feet with mon.f being removed.
> 
> I have attached log oper-3.txt for the trace of event.
> 
> Manual termination of all

The fact that the mon and mgr pods had to be terminated manually also needs to be fixed (automated), or are we going to recommend manually deleting the terminating pods?

Comment 7 akarsha 2020-05-15 13:56:52 UTC
Tested on AWS IPI environment
- 3 Masters
- 6 Workers

Version:

OCP: 4.4.0-0.nightly-2020-05-15-010811
OCS: v4.4.0-420.ci

Poweroff node: ip-10-0-145-21.us-east-2.compute.internal
Available node on same zone: ip-10-0-158-247.us-east-2.compute.internal

Shut down worker node ip-10-0-145-21.us-east-2.compute.internal, which runs the following pods:

- noobaa-core-0
- noobaa-db-0
- rook-ceph operator
- mon-a 
- osd-2 
- rook-ceph-tools


Node shutdown at utc: Fri May 15 05:35:12 UTC 2020
Node reported as NotReady at Fri May 15 05:36:17 UTC 2020

OSD pods were forcefully terminated at 2020-05-15 06:27:13.346650 by the operator pod,
and the new osd-2 pod came up and reached Running at Fri May 15 06:34:25
- osd.2 on ip-10-0-158-247.us-east-2.compute.internal

As soon as the old operator pod went into Terminating state, the new pod came up and was Running. But the old operator pod remained in Terminating state.
Fri May 15 05:41:13 UTC 2020
----------------------------
rook-ceph-operator-5f858bfb9f-6qmxz                               1/1     Terminating         0          77m   10.131.0.24    ip-10-0-145-21.us-east-2.compute.internal    <none>           <none>
rook-ceph-operator-5f858bfb9f-qqcjq                               1/1     Running             0          7s    10.130.2.8     ip-10-0-139-209.us-east-2.compute.internal   <none>           <none>

The new mon-a pod started but remains in Pending, while the old pod remains in Terminating.
Fri May 15 05:41:13 UTC 2020
----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft                                   1/1     Terminating         0          75m   10.131.0.29    ip-10-0-145-21.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-a-5c7bd899f-zlzk7                                   0/1     Pending             0          7s    <none>         <none>                                       <none>           <none>
rook-ceph-mon-b-866cf745bb-lzknx                                  1/1     Running             0          75m   10.129.2.18    ip-10-0-129-136.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz                                  1/1     Running             0          74m   10.128.2.17    ip-10-0-169-205.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-d-5f49d654c8-kqh46                                  1/1     Running             0          51m   10.130.2.6     ip-10-0-139-209.us-east-2.compute.internal   <none>           <none>

mon-d, mon-e, and mon-f came up on the nodes which are available.
The mon-e pod came up on the available node ip-10-0-158-247.us-east-2.compute.internal. The new mon-a pod still remains in Pending state.
Fri May 15 06:34:17 UTC 2020
-----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft                                   1/1     Terminating   0          128m   10.131.0.29    ip-10-0-145-21.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-a-5c7bd899f-zlzk7                                   0/1     Pending       0          53m    <none>         <none>                                       <none>           <none>
rook-ceph-mon-b-866cf745bb-lzknx                                  1/1     Running       0          128m   10.129.2.18    ip-10-0-129-136.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz                                  1/1     Running       0          127m   10.128.2.17    ip-10-0-169-205.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-d-5f49d654c8-kqh46                                  1/1     Running       0          104m   10.130.2.6     ip-10-0-139-209.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-e-95dc9df-7slgt                                     1/1     Running       0          104m   10.128.4.7     ip-10-0-158-247.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-f-7f7d4b748-pj8zr                                   1/1     Running       0          25s    10.131.2.19    ip-10-0-167-137.us-east-2.compute.internal   <none>           <none>

mon-a was removed by the operator at 2020-05-15 06:34:19.095656, but the old mon-a pod still remains in Terminating state.

from rook-ceph-operator log:

2020-05-15 06:34:19.095656 I | exec: Running command: ceph quorum_status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/906255844
2020-05-15 06:34:27.665548 I | op-mon: Monitors in quorum: [b c d e f]
2020-05-15 06:34:27.665568 I | op-mon: ensuring removal of unhealthy monitor a
2020-05-15 06:34:27.675570 I | exec: Running command: ceph mon remove a --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/727537139
2020-05-15 06:34:28.031273 I | exec: removing mon.a at [v2:172.30.211.244:3300/0,v1:172.30.211.244:6789/0], there will be 5 monitors
2020-05-15 06:34:28.031372 I | op-mon: removed monitor a

Fri May 15 06:34:32 UTC 2020
----------------------------
rook-ceph-mon-a-5c7bd899f-mj2ft                                   1/1     Terminating   0          128m   10.131.0.29    ip-10-0-145-21.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-b-866cf745bb-lzknx                                  1/1     Running       0          128m   10.129.2.18    ip-10-0-129-136.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-85ccc5d4cf-w5ctz                                  1/1     Running       0          127m   10.128.2.17    ip-10-0-169-205.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-d-5f49d654c8-kqh46                                  1/1     Running       0          105m   10.130.2.6     ip-10-0-139-209.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-e-95dc9df-7slgt                                     1/1     Running       0          104m   10.128.4.7     ip-10-0-158-247.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-f-7f7d4b748-pj8zr                                   1/1     Running       0          40s    10.131.2.19    ip-10-0-167-137.us-east-2.compute.internal   <none>           <none>

Powered on the node and manually deleted all pods stuck in Terminating state.

Attached all logs in next comment.

Comment 9 Jean-Charles Lopez 2020-05-15 14:37:00 UTC
Ran a second test on the same environment.

VMWare environment
- 3 Masters
- 6 Workers

Client Version: 4.3.18
Server Version: 4.4.3
Kubernetes Version: v1.17.1

Deployed 6 OSDs

Failed worker node compute-0 that runs the following:
- mon.e
- osd.2
- osd.4
- rook-operator
- noobaa-core-0

It took ~65 minutes for Rook to terminate the OSD pods. It follows the logic highlighted in the other BZ: a 20-minute timeout per daemon upon rook-operator start, so 20 (mon.e) + 20 (osd.2) + 20 (osd.4) + the 5-minute timer = 65.

Also note that the noobaa-core-0 pod remained in Terminating state as well. I guess some work will be needed in the noobaa operator, similar to the code added to the rook operator.

Comment 12 Prasad Desala 2020-05-27 09:43:56 UTC
We are seeing some issue while running https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py

Below are the steps:
====================
1) Create two fedora based DC app pods using node_selector.
2) Identify the node running both the DC app pod and an OSD, and increase the machineset.
3) Wait till the new node comes up and label the node with ocs storage label.
4) Power off the identified node in step-2 from AWS console.
5) Wait till the OCS pods on the failed node fail over to another node in the same AZ
6) Fedora based DC app pod should automatically spin on another nodes and reach running state.
7) Do sanity check and health check.

Observations post node failure:
===============================
Note: the ocs-operator and ceph-operator pods were not hosted on the failed node. They were running on a healthy node.

1) Mon pod on the failed node stuck at pending state.

rook-ceph-mon-b-848cbc8b4d-fbdc5                                  1/1     Terminating   0          3h19m
rook-ceph-mon-b-848cbc8b4d-z486j                                  0/1     Init:0/2      0          79m

Events:
  Type     Reason              Age                  From                                                 Message
  ----     ------              ----                 ----                                                 -------
  Normal   Scheduled           <unknown>            default-scheduler                                    Successfully assigned openshift-storage/rook-ceph-mon-b-848cbc8b4d-z486j to ip-10-0-130-135.us-east-2.compute.internal
  Warning  FailedAttachVolume  109m                 attachdetach-controller                              Multi-Attach error for volume "pvc-c9f4909a-1147-4752-9124-76ccf946b7d7" Volume is already used by pod(s) rook-ceph-mon-b-848cbc8b4d-fbdc5
  Warning  FailedMount         42m (x5 over 82m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override]: timed out waiting for the condition
  Warning  FailedMount         23m (x6 over 98m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring]: timed out waiting for the condition
  Warning  FailedMount         12m (x5 over 96m)    kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data default-token-hhsfl]: timed out waiting for the condition
  Warning  FailedMount         10m (x7 over 107m)   kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-crash ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log]: timed out waiting for the condition
  Warning  FailedMount         5m58s (x7 over 93m)  kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash ceph-daemon-data]: timed out waiting for the condition
  Warning  FailedMount         85s (x16 over 105m)  kubelet, ip-10-0-130-135.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[ceph-daemon-data default-token-hhsfl rook-config-override rook-ceph-mons-keyring rook-ceph-log rook-ceph-crash]: timed out waiting for the condition


2) Fedora based DC app pod stuck at ContainerCreating state.

pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-jcwhf    0/1     ContainerCreating   0          78m
pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt    1/1     Terminating         0          89m

Events:
  Type     Reason              Age                From                                                 Message
  ----     ------              ----               ----                                                 -------
  Normal   Scheduled           <unknown>          default-scheduler                                    Successfully assigned namespace-test-75a6b474bf9c4e3980dcbd5c43c11813/pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-mstsw to ip-10-0-144-141.us-east-2.compute.internal
  Warning  FailedAttachVolume  20m                attachdetach-controller                              Multi-Attach error for volume "pvc-d64c1f43-e14b-46b7-8c40-ef0316ba8639" Volume is already used by pod(s) pod-test-rbd-9cb5a95ea8114995883f664d7dfdc5c1-1-2swbx
  Warning  FailedMount         37s (x9 over 18m)  kubelet, ip-10-0-144-141.us-east-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition


When the old pods were forcefully deleted using the commands below, both the mon and the Fedora-based DC app pods reached Running state.

oc delete pod rook-ceph-mon-b-848cbc8b4d-fbdc5 --force --grace-period=0
oc delete pod pod-test-rbd-abfb49d50f214d849f5db0f062ba49b0-1-tgjwt  --force --grace-period=0


@Travis,
This is a similar scenario but the observations are different.
Are we seeing different issues here (for the mon and DC app pod)?

Comment 14 Travis Nielsen 2020-05-28 17:42:39 UTC
Two PRs for this improvement:
- https://github.com/rook/rook/pull/5490
- https://github.com/rook/rook/pull/5556

The improvements, as stated in comment #2, include:
- Start monitoring the mons and osds immediately when the operator is restarted
- Reduce the time delay between OSD checks down to 1 minute (instead of 5 minutes)
- Force delete the stuck mon pods if they are down (the mon health check runs every 45s)

Comment 15 Travis Nielsen 2020-05-28 18:32:20 UTC
> @Travis,
> This is a similar scenario but the observations are different. 
> Are we seeing a different issues here (for mon and DC app pod)?

Yes, this is the same behavior. When a node suddenly goes offline and doesn't have a chance to drain its pods, the pods will remain terminating indefinitely if they have volumes attached. K8s doesn't know whether it can safely fail over the volumes if/until there is proper node fencing.
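For context on the Multi-Attach errors quoted earlier: until the node is fenced or the stuck pod is force deleted, the attach/detach controller keeps the volume attached to the unreachable node, which is visible as VolumeAttachment objects (roughly what `oc get volumeattachment` lists). A minimal client-go sketch with an assumed helper name, sharing the imports of the earlier sketch plus "fmt":

// listAttachmentsOnNode prints the PVs that the attach/detach controller
// still considers attached to the given (possibly dead) node; until they
// are detached, replacement pods on other nodes hit the Multi-Attach error.
func listAttachmentsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	vas, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, va := range vas.Items {
		if va.Spec.NodeName == nodeName && va.Spec.Source.PersistentVolumeName != nil {
			fmt.Printf("%s: PV %s still attached to %s\n", va.Name, *va.Spec.Source.PersistentVolumeName, nodeName)
		}
	}
	return nil
}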

Comment 16 Prasad Desala 2020-05-29 14:32:19 UTC
Opened a new BZ https://bugzilla.redhat.com/show_bug.cgi?id=1841611 for the DC app pod issue described in comment 12.

Comment 17 Travis Nielsen 2020-06-01 19:32:20 UTC
This was merged downstream with https://github.com/openshift/rook/pull/65

Comment 18 Travis Nielsen 2020-06-08 20:03:36 UTC
The proposal is to include these changes in 4.4.z. If that is approved, here is the PR ready for merge:
https://github.com/openshift/rook/pull/67

Comment 21 Travis Nielsen 2020-06-16 15:10:37 UTC
Moving back to 4.4.z

Comment 23 Travis Nielsen 2020-06-17 21:03:11 UTC
Yes actually, I'll restore this back to 4.5 and create a clone to 4.4

Comment 25 akarsha 2020-07-15 14:35:25 UTC
Tested on AWS IPI environment

- 3 Masters
- 6 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-14-213353
OCS: ocs-operator.v4.5.0-487.ci

Poweroff node: ip-10-0-134-101.us-east-2.compute.internal
Available node on same zone: ip-10-0-139-61.us-east-2.compute.internal

Shut down worker node ip-10-0-134-101.us-east-2.compute.internal, which runs the following pods:
- noobaa-db-0
- noobaa-endpoint
- noobaa-operator
- ocs-operator
- rook-ceph operator
- mon-c 
- osd-1 
- rook-ceph-tools

Node shutdown at utc: Wed Jul 15 08:36:23 UTC 2020

The osd-1 and mon-c pods went into Terminating state at Wed Jul 15 08:42:04 UTC 2020

From the rook-operator logs: the mon-c and osd-1 pods that were stuck in Terminating state were forcibly deleted

2020-07-15 08:42:45.833169 I | op-k8sutil: force deleting pod "rook-ceph-mon-c-7b9498d6bf-d8dgv" that appears to be stuck terminating
2020-07-15 08:42:46.049325 I | op-k8sutil: pod "rook-ceph-mon-c-7b9498d6bf-d8dgv" deletion succeeded
2020-07-15 08:42:46.063502 I | op-k8sutil: deployment "rook-ceph-mon-c" did not change, nothing to update
2020-07-15 08:42:46.063523 I | op-mon: waiting for mon quorum with [a b c]


2020-07-15 08:43:03.965122 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:48} {StateName:active+undersized+degraded Count:48}]". detected drains on "zone": [us-east-2a]
2020-07-15 08:43:04.648102 I | clusterdisruption-controller: pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized Count:48} {StateName:active+undersized+degraded Count:48}]". detected drains on "zone": [us-east-2a]
2020-07-15 08:43:17.138347 I | op-k8sutil: force deleting pod "rook-ceph-osd-1-855bdfb6f9-5zx7m" that appears to be stuck terminating
2020-07-15 08:43:17.165690 I | op-k8sutil: pod "rook-ceph-osd-1-855bdfb6f9-5zx7m" deletion succeeded

The new mon-c and osd-1 pods came to Running state at Wed Jul 15 08:50:33 UTC 2020

rook-ceph-mon-a-888b9cb8c-q8vds                                   1/1     Running       0          132m    10.131.0.16    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-69447bb445-lbrdb                                  1/1     Running       0          132m    10.128.2.16    ip-10-0-168-220.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-7b9498d6bf-hxr5n                                  1/1     Running       0          8m38s   10.130.2.10    ip-10-0-139-61.us-east-2.compute.internal    <none>           <none>

rook-ceph-osd-0-6ff748c449-jxwx9                                  1/1     Running       0          131m    10.131.0.19    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-855bdfb6f9-pmz57                                  1/1     Running       0          8m39s   10.130.2.9     ip-10-0-139-61.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-2-5f84c7bff6-mqvb8                                  1/1     Running       0          130m    10.128.2.21    ip-10-0-168-220.us-east-2.compute.internal   <none>           <none>

Node poweron at utc: Wed Jul 15 09:19:54 UTC 2020


Observation:

- The rook-operator pod respins on another node within a minute or two; it detects the terminating OSD and MON pods and issues force deletion.
- The newly spun OSD and MON pods stayed in Init state on another OCS node (in the same AZ) for ~8 mins and then moved to Running state.
- The other old pods hosted on the shut-down node remain in Terminating state unless the shut-down node is brought up. Also observed that noobaa-endpoint keeps restarting and the ocs-operator pod is in 0/1 state.
  Is this expected?
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/logs-after/pods-when-node-down
  noobaa-db-0                                                       1/1     Terminating   0          154m   10.129.2.25    ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  noobaa-endpoint-7d6b47c7d5-2ljfg                                  1/1     Running       5          32m    10.131.2.3     ip-10-0-181-240.us-east-2.compute.internal   <none>           <none>
  noobaa-endpoint-7d6b47c7d5-929c4                                  1/1     Terminating   0          153m   10.129.2.28    ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  noobaa-operator-5f9d557669-ccxsg                                  1/1     Running       0          32m    10.128.4.4     ip-10-0-199-89.us-east-2.compute.internal    <none>           <none>
  noobaa-operator-5f9d557669-v7cnp                                  1/1     Terminating   0          158m   10.129.2.15    ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  ocs-operator-75b4fbfbff-bdb64                                     1/1     Terminating   0          158m   10.129.2.13    ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  ocs-operator-75b4fbfbff-hpznp                                     0/1     Running       0          32m    10.130.2.3     ip-10-0-139-61.us-east-2.compute.internal    <none>           <none>
  rook-ceph-operator-7cd55d84f6-8ftwk                               1/1     Terminating   0          158m   10.129.2.14    ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  rook-ceph-operator-7cd55d84f6-gvd52                               1/1     Running       0          32m    10.130.2.4     ip-10-0-139-61.us-east-2.compute.internal    <none>           <none>
  rook-ceph-tools-5f65c6998c-9jrxw                                  1/1     Terminating   0          154m   10.0.134.101   ip-10-0-134-101.us-east-2.compute.internal   <none>           <none>
  rook-ceph-tools-5f65c6998c-sx7n4                                  1/1     Running       0          32m    10.0.199.89    ip-10-0-199-89.us-east-2.compute.internal    <none>           <none>
- Once the node is up and running, all pods are in Running state.
  http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/logs-after/pods-when-all-nodes-up


Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1835908/bz1835908/

Comment 26 akarsha 2020-07-24 06:27:43 UTC
Tested on VMWARE environment

- 3 Masters
- 3 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-23-201307
OCS: ocs-operator.v4.5.0-494.ci

Poweroff node: compute-0

Pods hosted on node compute-0:
- noobaa-core-0                                              
- noobaa-operator
- ocs-operator
- rook-ceph-mon-a
- rook-ceph-operator
- rook-ceph-osd-0
- rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a
- rook-ceph-tools

Once the compute-0 node was down, the rook-operator pod respun on another node within a minute and issued force deletion of the OSD and mon pods which were in Terminating state.

2020-07-24 05:36:13.037287 I | op-k8sutil: force deleting pod "rook-ceph-mon-a-c75986547-4mfkv" that appears to be stuck terminating
2020-07-24 05:36:13.253668 I | op-k8sutil: pod "rook-ceph-mon-a-c75986547-4mfkv" deletion succeeded

2020-07-24 05:36:36.519151 I | op-k8sutil: force deleting pod "rook-ceph-osd-0-68bc4674fc-gsdc4" that appears to be stuck terminating
2020-07-24 05:36:36.539129 I | op-k8sutil: pod "rook-ceph-osd-0-68bc4674fc-gsdc4" deletion succeeded

The newly spun OSD and MON pods stayed in Init state for ~8 mins and then moved to Running state.

When the node was up, all pods were in Running state.

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2chww                                            3/3     Running     0          50m
csi-cephfsplugin-cqnbr                                            3/3     Running     0          50m
csi-cephfsplugin-jzct8                                            3/3     Running     0          50m
csi-cephfsplugin-provisioner-657697dd75-7twmd                     5/5     Running     0          50m
csi-cephfsplugin-provisioner-657697dd75-h5r9t                     5/5     Running     0          50m
csi-rbdplugin-9krzs                                               3/3     Running     0          50m
csi-rbdplugin-gpz5z                                               3/3     Running     0          50m
csi-rbdplugin-provisioner-bfd6f845d-bx45b                         5/5     Running     0          7m35s
csi-rbdplugin-provisioner-bfd6f845d-k5w9r                         5/5     Running     0          50m
csi-rbdplugin-w72fx                                               3/3     Running     0          50m
noobaa-core-0                                                     1/1     Running     0          5m42s
noobaa-db-0                                                       1/1     Running     0          46m
noobaa-endpoint-dd9cfd7bb-cbz45                                   1/1     Running     0          43m
noobaa-operator-695884dc65-tpssm                                  1/1     Running     0          7m35s
ocs-operator-7f48b9f695-qls4d                                     1/1     Running     0          7m35s
rook-ceph-crashcollector-compute-0-57fc68d86f-lqwfb               1/1     Running     0          7m35s
rook-ceph-crashcollector-compute-1-7b8976cb46-tsbgr               1/1     Running     0          46m
rook-ceph-crashcollector-compute-2-5f69ff4cf5-h27xd               1/1     Running     0          46m
rook-ceph-drain-canary-compute-0-6f56886d99-25qjh                 1/1     Running     0          5m22s
rook-ceph-drain-canary-compute-1-59df658f74-dpbvr                 1/1     Running     0          46m
rook-ceph-drain-canary-compute-2-7cc7b64957-xsr4q                 1/1     Running     0          46m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7ccb46596lr55   1/1     Running     0          45m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-c94fdc949kqd4   1/1     Running     0          45m
rook-ceph-mgr-a-b4f6946b-pnvxn                                    1/1     Running     0          47m
rook-ceph-mon-a-c75986547-4wk46                                   1/1     Running     0          7m35s
rook-ceph-mon-b-c67c764f-mtbbl                                    1/1     Running     0          48m
rook-ceph-mon-c-77f86fbddf-5nk9g                                  1/1     Running     0          47m
rook-ceph-operator-5b587bb9db-xvxp7                               1/1     Running     0          7m34s
rook-ceph-osd-0-68bc4674fc-rvtr4                                  1/1     Running     0          7m34s
rook-ceph-osd-1-c698d9f87-hw8pg                                   1/1     Running     0          46m
rook-ceph-osd-2-6ccc5c4456-lq85n                                  1/1     Running     0          46m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-mbh8v-l8gds          0/1     Completed   0          47m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-6s9sk-j8kmc          0/1     Completed   0          47m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9f79699d5gr4   1/1     Running     0          7m35s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-b6c55667fg86   1/1     Running     0          45m
rook-ceph-tools-7b6559df78-gmz7d                                  1/1     Running     0          7m35s

Comment 27 akarsha 2020-07-24 06:34:20 UTC
Moving the BZ to Verified state based on comment #25 and comment #26.

Conclusion:

- The rook-operator pod respins on another node within a minute or two; it detects the terminating OSD and MON pods and issues force deletion.
- The newly spun OSD and MON pods stayed in Init state on another OCS node (in the same AZ) for ~8 mins and then moved to Running state.
- The other old pods hosted on the shut-down node remain in Terminating state unless the shut-down node is brought up or they are forcibly deleted (bz#1783961).
- The new ocs-operator pod is in 0/1 state unless the old pod is deleted manually or the node is brought up; this is tracked by bz#1858883.

Comment 29 errata-xmlrpc 2020-09-15 10:17:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

