Bug 2159781
| Summary: | ODF 4.12 Temporary Disk Loss causes OSD pod to be stuck in CLBO state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Keith Valin <kvalin> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | madam, muagarwa, ocs-bugs, odf-bz-bot, shberry |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-24 15:15:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Keith Valin
2023-01-10 17:47:44 UTC
Travis Nielsen (assignee)

A few questions:

1. What is the state of the PV while the disk is detached?
2. What is the state of the PV after the disk is reattached?
3. The PVC stays bound through the entire detach and reattach operation, right?

OSD pods can move between nodes when running in a dynamic environment like AWS, which means that the PVs detach and reattach to the other node without any issues. This likely means the manual detach and reattach is having other side effects that are causing havoc on the OSD.

Keith Valin (reporter)

Ran a quick check on OCP v4.12 using the RBD StorageClass:

The state of the PV is Bound during the interval when the disk is detached and when it is reattached. The PVC is Bound throughout the entire operation as well, correct.

This is the case for all worker nodes, and I did an oc delete rook-ceph-osd on the pod in CLBO state between testing each node to reset the cluster state back to HEALTH_OK. This is the same for the ocs-deviceset PVs and PVCs.

Travis Nielsen (assignee)

(In reply to Keith Valin from comment #3)
> Ran a quick check on OCP v4.12 using the RBD StorageClass:
>
> The state of the PV is Bound during the interval when the disk is detached
> and when it is reattached.

Were you looking at the PVs/PVCs backing the OSDs, or was this with RBD? I thought the issue to investigate was with OSDs, not RBD. Can you do the test again on the OSD PVs?

> The PVC is Bound throughout the entire operation as well, correct.
>
> This is the case for all worker nodes, and I did an oc delete rook-ceph-osd
> on the pod in CLBO state between testing each node to reset the cluster
> state back to HEALTH_OK.

After deleting the OSD pod, you're saying the OSD comes back online without any more failures? Effectively, this means that remounting the volume must fix the issue. If we tried to automate this, the operator would have to watch for OSD pods in this state and delete the OSD pod, but this is quite intrusive and doesn't really seem like Rook's responsibility.

Can you elaborate on the repro steps? EBS volumes just don't typically get detached like this. If it's just a test scenario, I don't see the need to automate the recovery.

Keith Valin (reporter)

This is just a test scenario, with the purpose of seeing how the cluster performs (and how it recovers) when various things go wrong. This does seem like the sort of problem ODF should be able to recover from.

I misread your earlier request, and RBD has nothing to do with this bug; apologies for the confusion. The OSD PVs and PVCs are Bound throughout the test. This is the case for all worker nodes, and I did an oc delete rook-ceph-osd on the pod in CLBO state between testing each node to reset the cluster state back to HEALTH_OK.

Another thing I noticed during this: when performing an oc delete on an OSD pod, that pod is stuck in the Terminating state.

Travis Nielsen (assignee)

But how is the volume detached? This is an important detail. If this wouldn't be hit in the real world, my recommendation is that it's not supported to automatically recover, since there is a simple workaround.

Keith Valin (reporter)

Disks were detached using the AWS web portal/command line (i.e. aws ec2 detach-volume --volume-id <volume id>).

Travis Nielsen (assignee)

OK, per the previous comment, closing as not supported: there is a workaround (restart the OSD pod), and it is not known why customers would hit this.
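For readers reproducing the checks described in the thread, here is a minimal sketch of the verification steps and the manual workaround. It assumes the default `openshift-storage` namespace and the standard `app=rook-ceph-osd` pod label; the pod name is a placeholder.

```bash
# Confirm the OSD-backing PVCs/PVs (the ocs-deviceset claims) stay Bound
# while the disk is detached and after it is reattached.
oc get pvc -n openshift-storage | grep ocs-deviceset
oc get pv | grep ocs-deviceset

# Identify the OSD pod stuck in CrashLoopBackOff (CLBO).
oc get pods -n openshift-storage -l app=rook-ceph-osd

# Manual workaround from the thread: delete the affected OSD pod so its
# deployment recreates it and the volume is remounted.
oc delete pod <rook-ceph-osd-pod-name> -n openshift-storage

# If the pod hangs in Terminating (also observed above), a forced delete is a
# common last resort; use with care.
oc delete pod <rook-ceph-osd-pod-name> -n openshift-storage --grace-period=0 --force
```

Cluster health can then be rechecked (for example with `ceph status` via the toolbox pod, if it is enabled) to confirm the cluster returns to HEALTH_OK.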
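The reproduction itself was driven from the AWS side. A rough sketch of that step with the AWS CLI follows; the instance ID, volume ID, and device name are placeholders to be filled in for the worker node under test.

```bash
# Find the EBS volumes attached to the worker node's EC2 instance.
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'Volumes[].{ID:VolumeId,Device:Attachments[0].Device}'

# Detach the volume backing the OSD (the step referenced in the thread).
aws ec2 detach-volume --volume-id <volume-id>

# Reattach it to the same instance and device after the test interval.
aws ec2 attach-volume --volume-id <volume-id> \
  --instance-id <instance-id> --device <device-name>
```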
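To illustrate why the proposed automation was judged intrusive: a hypothetical external watcher (not anything Rook or ODF ships) would essentially have to poll for crash-looping OSD pods and delete them, as in the sketch below. The namespace and label are the usual ODF defaults, and the polling interval is arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical watcher that automates the manual workaround: find OSD pods
# waiting in CrashLoopBackOff and delete them so the volume is remounted.
NS=openshift-storage

while true; do
  oc get pods -n "$NS" -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' |
    awk '$2 == "CrashLoopBackOff" {print $1}' |
    while read -r pod; do
      echo "Deleting crash-looping OSD pod: $pod"
      oc delete pod -n "$NS" "$pod"
    done
  sleep 60
done
```

Blindly restarting OSDs like this can mask real failures, which is consistent with the decision in the thread to treat the manual restart as the supported workaround rather than automating recovery.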