Bug 2159781
| Summary: | ODF 4.12 Temporary Disk Loss causes OSD pod to be stuck in CLBO state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Keith Valin <kvalin> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.12 | CC: | madam, muagarwa, ocs-bugs, odf-bz-bot, shberry |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-24 15:15:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Keith Valin
2023-01-10 17:47:44 UTC
Travis Nielsen (assignee)

A few questions:

1. What is the state of the PV while the disk is detached?
2. What is the state of the PV after the disk is reattached?
3. The PVC stays bound through the entire detach and reattach operation, right?

OSD pods can move between nodes when running in a dynamic environment like AWS, which means that the PVs detach and reattach to the other node without any issues. This likely means the manual detach and reattach is having other side effects that are causing havoc on the OSD.

Keith Valin (reporter)

Ran a quick check on OCP v4.12 using the RBD StorageClass:

The state of the PV is Bound during the interval when the disk is detached and when it is reattached. The PVC is Bound throughout the entire operation as well, correct.

This is the case for all worker nodes, and I did an oc delete rook-ceph-osd on the pod in CLBO state between testing each node to reset the cluster state back to HEALTH_OK. This is the same for the ocs-deviceset PVs and PVCs.

Travis Nielsen (assignee)

(In reply to Keith Valin from comment #3)
> Ran a quick check on OCP v4.12 using the RBD StorageClass:
>
> The state of the PV is Bound during the interval when the disk is detached
> and when it is reattached.

Were you looking at the PVs/PVCs backing the OSDs, or was this with RBD? I thought the issue to investigate was with OSDs, not RBD. Can you do the test again on the OSD PVs?

> The PVC is Bound throughout the entire operation as well, correct.
>
> This is the case for all worker nodes, and I did an oc delete rook-ceph-osd
> on the pod in CLBO state between testing each node to reset the cluster
> state back to HEALTH_OK.

After deleting the OSD pod, you're saying the OSD comes back online without any more failures? Effectively, this means that remounting the volume must fix the issue. If we tried to automate this, the operator would have to watch for OSD pods in this state and delete the OSD pod, but this is quite intrusive and doesn't really seem like Rook's responsibility.

Can you elaborate on the repro steps? EBS volumes just don't typically get detached like this. If it's just a test scenario, I don't see the need to automate the recovery.

Keith Valin (reporter)

This is just a test scenario, with the purpose of seeing how the cluster performs (and how it recovers) when various things go wrong. This does seem like the sort of problem ODF should be able to recover from.

I misread your earlier request, and RBD has nothing to do with this bug; apologies for the confusion. The OSD PVs and PVCs are Bound throughout the test. This is the case for all worker nodes, and I did an oc delete rook-ceph-osd on the pod in CLBO state between testing each node to reset the cluster state back to HEALTH_OK.

Another thing I noticed during this: when performing an oc delete on an OSD pod, that pod is stuck in the Terminating state.

Travis Nielsen (assignee)

But how is the volume detached? This is an important detail. If this wouldn't be hit in the real world, my recommendation is that it's not supported to automatically recover, since there is a simple workaround.

Keith Valin (reporter)

Disks were detached using the AWS web portal/command line (i.e. aws ec2 detach-volume --volume-id <volume id>).

Travis Nielsen (assignee)

OK, per the previous comment, closing as not supported: there is a workaround (restart the OSD pod), and it is not known why customers would hit this.
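For readers reproducing the checks described in the thread, here is a minimal sketch of the verification steps and the manual workaround. It assumes the default `openshift-storage` namespace and the standard `app=rook-ceph-osd` pod label; the pod name is a placeholder.

```bash
# Confirm the OSD-backing PVCs/PVs (the ocs-deviceset claims) stay Bound
# while the disk is detached and after it is reattached.
oc get pvc -n openshift-storage | grep ocs-deviceset
oc get pv | grep ocs-deviceset

# Identify the OSD pod stuck in CrashLoopBackOff (CLBO).
oc get pods -n openshift-storage -l app=rook-ceph-osd

# Manual workaround from the thread: delete the affected OSD pod so its
# deployment recreates it and the volume is remounted.
oc delete pod <rook-ceph-osd-pod-name> -n openshift-storage

# If the pod hangs in Terminating (also observed above), a forced delete is a
# common last resort; use with care.
oc delete pod <rook-ceph-osd-pod-name> -n openshift-storage --grace-period=0 --force
```

Cluster health can then be rechecked (for example with `ceph status` via the toolbox pod, if it is enabled) to confirm the cluster returns to HEALTH_OK.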
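The reproduction itself was driven from the AWS side. A rough sketch of that step with the AWS CLI follows; the instance ID, volume ID, and device name are placeholders to be filled in for the worker node under test.

```bash
# Find the EBS volumes attached to the worker node's EC2 instance.
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'Volumes[].{ID:VolumeId,Device:Attachments[0].Device}'

# Detach the volume backing the OSD (the step referenced in the thread).
aws ec2 detach-volume --volume-id <volume-id>

# Reattach it to the same instance and device after the test interval.
aws ec2 attach-volume --volume-id <volume-id> \
  --instance-id <instance-id> --device <device-name>
```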
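To illustrate why the proposed automation was judged intrusive: a hypothetical external watcher (not anything Rook or ODF ships) would essentially have to poll for crash-looping OSD pods and delete them, as in the sketch below. The namespace and label are the usual ODF defaults, and the polling interval is arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical watcher that automates the manual workaround: find OSD pods
# waiting in CrashLoopBackOff and delete them so the volume is remounted.
NS=openshift-storage

while true; do
  oc get pods -n "$NS" -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' |
    awk '$2 == "CrashLoopBackOff" {print $1}' |
    while read -r pod; do
      echo "Deleting crash-looping OSD pod: $pod"
      oc delete pod -n "$NS" "$pod"
    done
  sleep 60
done
```

Blindly restarting OSDs like this can mask real failures, which is consistent with the decision in the thread to treat the manual restart as the supported workaround rather than automating recovery.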