Description of problem (please be as detailed as possible and provide log snippets):

A temporary loss of an OSD causes its corresponding OSD pod to enter a CrashLoopBackOff state, even well after the OSD is brought back.

Relevant logs:

bluestore(/var/lib/ceph/osd/ceph-2/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-2/block: (6) No such device or address
** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-2: (2) No such file or directory

Version of all relevant components (if applicable):
OCP: v4.12.0-rc6
ODF: v4.12.0-162

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
If a disk is temporarily lost without causing the worker node to crash, the ODF cluster will run in a degraded state.

Is there any workaround available to the best of your knowledge?
Either force the OSD pod to redeploy (delete the pod and let the Kubernetes deployment scale it back up) or reboot the worker node.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Always reproducible.

Can this issue be reproduced from the UI?
No.

Steps to Reproduce (see the CLI sketch under Additional info below):
1. Deploy ODF on an AWS OCP cluster using gp2/gp3 volumes (using IPI via openshift-install)
2. Detach a disk used by ODF from AWS
3. Wait 1 minute
4. Reattach the detached disk to its proper instance

Actual results:
The rook-ceph-osd pod for the affected OSD never exits the CrashLoopBackOff state.

Expected results:
Eventually the pod should recover and resume running, allowing the OSD to rejoin the pool.

Additional info:
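A rough CLI sketch of the reproduction, assuming the affected EBS volume ID, instance ID, and device name are already known (all values below are placeholders):

# Detach the EBS volume that backs one of the OSDs (placeholder volume ID):
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# Wait roughly a minute before reattaching:
sleep 60
# Reattach the volume to its original instance (placeholder instance ID and device name):
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device /dev/xvdbz
# Observe the affected OSD pod staying in CrashLoopBackOff:
oc get pods -n openshift-storage | grep rook-ceph-osd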
A few questions:

1. What is the state of the PV while the disk is detached?
2. What is the state of the PV after the disk is reattached?
3. The PVC stays bound through the entire detach and reattach operation, right?

OSD pods can move between nodes when running in a dynamic environment like AWS, which means the PVs detach and reattach to another node without any issues. This likely means the manual detach and reattach is having other side effects that are causing havoc on the OSD.
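For reference, these states can be checked from the CLI while the disk is detached and again after it is reattached; the openshift-storage namespace and the ocs-deviceset naming are assumptions based on a default ODF install:

# PVC phase for the OSD-backing PVCs (default ODF naming assumed):
oc get pvc -n openshift-storage | grep ocs-deviceset
# Corresponding PV phase:
oc get pv | grep ocs-deviceset
# Or query a single PV's phase directly (placeholder PV name):
oc get pv <pv-name> -o jsonpath='{.status.phase}'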
Ran a quick check on OCP v4.12 using the RBD StorageClass:

The state of the PV is Bound during the interval between when the disk is detached and when it is reattached. The PVC is Bound throughout the entire operation as well, correct.

This is the case for all worker nodes, and I did an oc delete on the rook-ceph-osd pod in the CLBO state between testing each node to reset the cluster state back to HEALTH_OK.
This is the same for the ocs-deviceset PVs and PVCs.
(In reply to Keith Valin from comment #3)

> Ran a quick check on OCP v4.12 using the RBD StorageClass:
>
> The state of the PV is Bound during the interval between when the disk is
> detached and when it is reattached.

Were you looking at the PVCs/PVs backing the OSDs, or was this with RBD? I thought the issue to investigate was with the OSDs, not RBD. Can you run the test again on the OSD PVs?

> The PVC is Bound throughout the entire operation as well, correct.
>
> This is the case for all worker nodes, and I did an oc delete on the
> rook-ceph-osd pod in the CLBO state between testing each node to reset the
> cluster state back to HEALTH_OK.

After deleting the OSD pod, you're saying the OSD comes back online without any more failures? Effectively, that means remounting the volume fixes the issue. If we tried to automate this, the operator would have to watch for OSD pods in this state and delete them, but that is quite intrusive and doesn't really seem like Rook's responsibility.

Can you elaborate on the repro steps? EBS volumes just don't typically get detached like this. If it's just a test scenario, I don't see the need to automate the recovery.
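To make the automation being discussed concrete (it is only discussed here, not something Rook does), a rough sketch of an external watch-and-restart loop; the openshift-storage namespace and the app=rook-ceph-osd pod label are assumptions based on a default ODF install:

# Hypothetical external remediation loop -- not Rook behavior, just a sketch.
while true; do
  oc get pods -n openshift-storage -l app=rook-ceph-osd --no-headers |
    awk '$3 == "CrashLoopBackOff" {print $1}' |
    while read -r pod; do
      echo "Restarting ${pod}"
      oc delete pod "${pod}" -n openshift-storage
    done
  sleep 60
done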
This is just a test scenario, with the purpose of seeing how the cluster performs (and how it recovers) when various things go wrong. It does seem like the sort of problem ODF should be able to recover from, though.

I misread your earlier request; RBD has nothing to do with this bug, apologies for the confusion. The OSD PVs and PVCs are Bound throughout the test. This is the case for all worker nodes, and I did an oc delete on the rook-ceph-osd pod in the CLBO state between testing each node to reset the cluster state back to HEALTH_OK.

Another thing I noticed during this: when performing an oc delete on an OSD pod, that pod gets stuck in the Terminating state (see the force-delete sketch below).
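If the deleted pod hangs in Terminating, a force delete usually clears it; the pod name below is a placeholder and the default openshift-storage namespace is assumed:

oc delete pod rook-ceph-osd-2-xxxxx -n openshift-storage --grace-period=0 --force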
But how is the volume detached? This is an important detail. If this wouldn't be hit in the real world, my recommendation is that automatic recovery is not supported, since there is a simple workaround.
Disks were detached using the AWS web console/command line (i.e. aws ec2 detach-volume --volume-id <volume id>).
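For anyone repeating the test, the EBS volume ID backing a given OSD PV can be read from the PV object before detaching it; which field holds it depends on whether the in-tree or CSI driver provisioned the volume, and the PV name below is a placeholder:

# CSI-provisioned PVs (ebs.csi.aws.com) carry the volume ID in volumeHandle:
oc get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'
# In-tree gp2 PVs carry it in the awsElasticBlockStore source instead:
oc get pv <pv-name> -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'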
OK, per the previous comment I will close this as not supported: there is a workaround (restart the OSD pod), and it is not known why customers would hit this scenario.