Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1962222

Summary: OCS 4.6 UI replacement: Replaced OSD in CLBO as PV got created on the orphaned symlink
Product: OpenShift Container Platform Reporter: Neha Berry <nberry>
Component: Storage Assignee: Rohan CJ <rojoseph>
Storage sub component: Local Storage Operator QA Contact: Wei Duan <wduan>
Status: CLOSED WONTFIX Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, assingh, jsafrane, madam, ocs-bugs, sdudhgao, shan
Version: 4.6   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-03 11:53:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Neha Berry 2021-05-19 14:24:07 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
====================================================================
Followed the disk replacement process for VMware LSO using the User Interface, per the steps from doc [1].

Due to the presence of an old symlink in the /mnt/local-storage/localblock folder, rook created a new PV and PVC on the orphaned symlink (device), and hence the OSD-2 pod is now in CLBO (CrashLoopBackOff) state.

The newly added replacement disk is still in the Available state and was not used by rook to create a PVC and PV for the replaced OSD.


$ oc describe pod rook-ceph-osd-2-6954d6594f-q6gvv (shows the orphaned PV is used)
Events:
  Type     Reason                 Age                     From               Message
  ----     ------                 ----                    ----               -------
  Normal   Scheduled              129m                    default-scheduler  Successfully assigned openshift-storage/rook-ceph-osd-2-6954d6594f-q6gvv to compute-2
  Normal   SuccessfulMountVolume  129m                    kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-5cbeb665" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-5cbeb665"
  Normal   SuccessfulMountVolume  129m                    kubelet            MapVolume.MapPodDevice succeeded for volume "local-pv-5cbeb665" volumeMapPath "/var/lib/kubelet/pods/a3cb97cc-f45c-404e-934f-7635c1e250b1/volumeDevices/kubernetes.io~local-volume"


Events which led to the reproducer
====================================
Once the disk replacement was initiated:

- PVC deleted
- PV deleted (because reclaimPolicy is Delete)
- PV recreated (because of the stale symlink entry)
- new OSD provisioned on the new, bad PV
- a new PV created on the newly added disk, but it stays in the Available state
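The trigger in the sequence above is a symlink whose backing disk no longer exists. This can be demonstrated without a cluster; the sketch below is a simulation, not the operator's actual code: a temp directory stands in for /mnt/local-storage/localblock, the device names are hypothetical, and GNU find is assumed.

```shell
# Stand-in for /mnt/local-storage/localblock on the node (hypothetical layout)
dir=$(mktemp -d)

# A healthy symlink to an existing target, and a stale one left behind
# after its backing disk (e.g. the old /dev/sdb) was removed.
touch "$dir/sdc"
ln -s "$dir/sdc" "$dir/link-sdc"
ln -s "$dir/sdb" "$dir/link-sdb"   # target does not exist -> orphaned

# -xtype l matches only symlinks whose target cannot be resolved
find "$dir" -xtype l
```

Only the orphaned link-sdb entry is printed; per this report, an entry like that is what rook turned into the new, bad PV.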


[1] - https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/replacing_devices/index?lb_target=preview#replacing-failed-storage-devices-on-vmware-and-bare-metal-infrastructures_rhocs


Version of all relevant components (if applicable):
=====================================================

LSO = local-storage-operator.4.6.0-202103010126.p0
OCS= ocs-operator.v4.6.4
OCP = 4.7.0-0.nightly-2021-05-17-040457


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
============================================================
Yes. The OSD is in CLBO, and there is no documentation update yet explaining what the next step should be.

Is there any workaround available to the best of your knowledge?
==================================================================
Delete the stale symlink and repeat the replacement process.
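A sketch of that workaround (an assumption based on this report, not an officially documented procedure: the cleanup would run on the affected node, e.g. via oc debug node/<node>, against /mnt/local-storage/localblock; a temp directory stands in for it here):

```shell
# Stand-in for /mnt/local-storage/localblock on the affected node
dir=$(mktemp -d)
ln -s "$dir/old-device" "$dir/stale-link"   # orphaned symlink to the removed disk

# Delete only broken symlinks (those whose targets no longer exist),
# then repeat the UI-driven replacement procedure.
find "$dir" -xtype l -delete

ls -A "$dir"   # directory is now empty
```

Deleting by `-xtype l` rather than by name avoids touching healthy symlinks that still back in-use PVs.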


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
===========================================================================
4


Is this issue reproducible?
=================================
Probably, but it is a race condition.


As per engineering, here is the scenario where it does not reproduce:
- PVC deleted
- OSD provisioned on another PV
- PV deleted (because reclaimPolicy is Delete)
- PV recreated (because of the stale symlink entry)


Can this issue be reproduced from the UI?
======================================
Yes, the disk replacement process was UI-based.

If this is a regression, please provide more details to justify this:
===================================================================
Not a regression

Steps to Reproduce:
=======================
1. Scale down an OSD, say osd-2, to induce a disk failure in OCS:

oc scale -n openshift-storage deployment rook-ceph-osd-2 --replicas=0


2. Follow the steps in the docs:

a) Click Home → Overview → Persistent Storage from the left navigation bar of the OpenShift Web Console.
b) Click Troubleshoot in the Disk <disk1> not responding or the Disk <disk1> not accessible alert.
c) On the Disks page, From the Action (⋮) menu of the failed disk, click Start Disk Replacement.

d) Waited for the OpenShift Container Storage Status field to change to ReplacementReady.
e) Once the status was ReplacementReady, removed the disk in vCenter (for me, it pointed to /dev/sdb).
f) After ~5 mins, added a new disk of the same size (discovered as /dev/sdc).

3. Checked the pod and PV status:

oc get pods
oc get pv
oc describe pv <pv name>





Actual results:
=================
a) Observed that the ocs-osd-removal pod was in the Completed state.
b) A new PV and PVC with the same name got created but used the old symlink; hence the resulting OSD pod was in the CrashLoopBackOff state.

c) Observed that a PV was created on the new disk but is not used.

Expected results:
=====================
Once the template job deletes the old PVC (which in turn deletes the PV), rook should create the new PVC on the newly added disk and not on the old symlink.

Comment 5 Sébastien Han 2021-05-24 16:08:05 UTC
Why is it under Rook if it's a doc bug? Thanks!

Comment 8 Jan Safranek 2021-05-25 14:24:40 UTC
Not sure how to fix it correctly. Can we stop creating a PV when a symlink points to a non-existing disk? In addition, should we delete the symlink if there is no PV for it? This may be tricky in 4.6.

Comment 9 Rohan CJ 2021-09-03 11:53:42 UTC
Closing this since the bug is low incidence and low consequence. Backporting wouldn't be possible. We'd have to either sync all previous releases with the latest (technically possible, but risky) or create a large tailored fix for the 4.6 branch.