Bug 1887574

Summary: Deleting LocalVolumeSet+StorageClass+PVs doesn't delete associated symlinks
Product: OpenShift Container Platform
Component: Storage
Sub component: Local Storage Operator
Reporter: Jeremy Poulin <jpoulin>
Assignee: aos-storage-staff <aos-storage-staff>
QA Contact: Qin Ping <piqin>
Status: CLOSED DEFERRED
Severity: low
Priority: unspecified
CC: aos-bugs, hekumar, Holger.Wolf, jsafrane, lakshmi.ravichandran1, muagarwa, nbziouec, rtalur, sabose, sorth, tdale
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-13 14:43:26 UTC

Attachments:
oc describe node master-03.txt (flags: none)

Description Jeremy Poulin 2020-10-12 20:12:14 UTC
Description of problem:
If you create a LocalVolumeSet (for example, one named local-ext4) and it discovers one or more disks, it will successfully match the desired nodes and provision the PVs. However, if you then delete the LocalVolumeSet, PersistentVolumes, and StorageClasses from your cluster, the symlinks created in /mnt/local-storage/local-ext4 do not get deleted.

Ultimately, this means that if you were to create a new LocalVolumeSet, say local-blk, future nodes don't get provisioned.
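For reference, a minimal LocalVolumeSet of the kind described above might look like the sketch below (the storage class name, node name, and device filter are illustrative assumptions, not values from the affected cluster):

apiVersion: local.storage.openshift.io/v1alpha1
kind: LocalVolumeSet
metadata:
  name: local-ext4
  namespace: local-storage
spec:
  # StorageClass under which the discovered disks are published as PVs;
  # the symlinks end up under /mnt/local-storage/<storageClassName> on each node
  storageClassName: local-ext4
  volumeMode: Filesystem
  fsType: ext4
  # only pick up whole disks (not partitions) on the selected nodes
  deviceInclusionSpec:
    deviceTypes:
      - disk
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - worker-001.example.com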

Version-Release number of selected component (if applicable):
4.6.0-202010061132.p0

Expected results:
When a LocalVolumeSet is removed, the symlinks should be removed from /mnt/local-storage/<name>.
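A quick way to confirm that the symlinks are still present after the deletion (node name is a placeholder) is to check from a debug pod on the node:

$ oc debug node/<node-name>
# chroot /host
# ls -l /mnt/local-storage/local-ext4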

Comment 2 Jan Safranek 2020-10-13 14:36:13 UTC
*** Bug 1887468 has been marked as a duplicate of this bug. ***

Comment 3 Jan Safranek 2020-10-13 14:43:26 UTC
We're planning to solve it as part of https://issues.redhat.com/browse/STOR-390; however, the number of requested changes is quite large and we need to redesign how the whole LSO daemonset works.

Comment 4 Lakshmi Ravichandran 2020-10-14 07:46:57 UTC
Hi, I came across the document section "Deleting the Local Storage Operator's resources -> Removing a local volume" in the 4.5 OpenShift documentation,
where it is recommended to delete the symlinks manually, by logging in to the node as the root user, when deleting the local volumes.

Would that play a role here, or do we expect any changes to these instructions as a result of this bug?
How would this look in the OpenShift documentation for 4.6?

reference: https://docs.openshift.com/container-platform/4.5/storage/persistent_storage/persistent-storage-local.html#deleting-the-local-storage-operators-resources

Document section is as below:
--------------------------------
Edit the previously created LocalVolume to remove any unwanted disks.

Edit the cluster resource:
$ oc edit localvolume <name> -n local-storage
Navigate to the lines under devicePaths, and delete any representing unwanted disks.
Delete any PersistentVolumes created.

$ oc delete pv <pv-name>
Delete any symlinks on the node.
Create a debug pod on the node:

$ oc debug node/<node-name>
Change your root directory to the host:

$ chroot /host
Navigate to the directory containing the local volume symlinks.

$ cd /mnt/local-storage/<sc-name>
(<sc-name> is the name of the StorageClass used to create the local volumes.)
Delete the symlink belonging to the removed device.

$ rm <symlink>

Thanks in advance !!

Comment 5 Jan Safranek 2020-10-14 08:53:12 UTC
4.6 is the same as 4.5 in this regard: symlinks still have to be deleted manually.

Comment 6 Lakshmi Ravichandran 2020-10-14 15:25:04 UTC
Hi, in OCP 4.6 I tried manually deleting the symlinks before re-creating the LocalVolume with another storageClassName.
The PV on the new node added to the LocalVolume definition is not picked up by the cluster.
It always displays the same old list of PVs from the first LocalVolume deployment.

Deleting the entire /mnt/local-storage directory on the nodes didn't help either.

I followed the steps below:

oc version
Client Version: 4.6.0-0.nightly-s390x-2020-10-06-145952
Server Version: 4.6.0-0.nightly-s390x-2020-10-10-041058
Kubernetes Version: v1.19.0+d59ce34

Added secondary DASD disks to the worker-001, worker-002, and master-03 nodes.

Installed the Local Storage Operator using iib-pub-pending:v4.6:

apiVersion: v1
kind: Namespace
metadata:
  name: local-storage
---
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: local-storage
spec:
  targetNamespaces:
    - local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: local-storage
spec:
  channel: "4.6"
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators-multi-arch-qe
  sourceNamespace: openshift-marketplace

oc get csv -A
NAMESPACE                              NAME                                           DISPLAY          VERSION                 REPLACES   PHASE
local-storage                          local-storage-operator.4.6.0-202010061132.p0   Local Storage    4.6.0-202010061132.p0              Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server   0.16.1                             Succeeded

Created the LocalVolume:

apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-dasd-worker-master"
  namespace: "local-storage"
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-001.ocp-t8359015.lnxne.boe
          - worker-002.ocp-t8359015.lnxne.boe
  storageClassDevices:
    - storageClassName: "local-sc-worker-master"
      volumeMode: Filesystem
      fsType: ext4
      devicePaths:
        - /dev/dasdb1

# oc get sc -A
NAME                     PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-sc-worker-master   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  45s

# oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS             REASON   AGE
local-pv-205a7997   20Gi       RWO            Delete           Available           local-sc-worker-master            42s
local-pv-7d75daac   20Gi       RWO            Delete           Available           local-sc-worker-master            47s

Now, delete the LocalVolume and PVs, and
remove the symlinks by logging in to the worker nodes as the root user.
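A minimal sketch of these cleanup steps, using the names from the manifests above and a debug pod to reach the node as root:

$ oc delete localvolume local-dasd-worker-master -n local-storage
$ oc delete pv local-pv-205a7997 local-pv-7d75daac
$ oc debug node/worker-001.ocp-t8359015.lnxne.boe
# chroot /host
# rm /mnt/local-storage/local-sc-worker-master/*
(repeat the symlink removal on worker-002)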


Create the LocalVolume again, now also including the device path from master-03:

apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-dasd-worker-master"
  namespace: "local-storage"
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-001.ocp-t8359015.lnxne.boe
          - worker-002.ocp-t8359015.lnxne.boe
          - master-03.ocp-t8359015.lnxne.boe
  storageClassDevices:
    - storageClassName: "local-sc-worker-master"
      volumeMode: Filesystem
      fsType: ext4
      devicePaths:
        - /dev/dasdb1

Create the LocalVolume and see the same list of PVs as before.
The PV from master-03 is not discovered and included in the list, although ideally it should be.

Neither creating the LocalVolume with another storage class name nor deleting the LSO and redeploying the LocalVolume helped.
I also tried installing the LSO in the openshift-local-storage namespace, but it didn't help.

Comment 7 Hemant Kumar 2020-10-14 15:30:37 UTC
You likely have the LSO not running on the master node, because that requires a special toleration.
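A quick way to check is to see whether the diskmaker/provisioner pods (and their daemonsets) were scheduled on master-03 at all, assuming the local-storage namespace used above:

$ oc get pods -n local-storage -o wide
$ oc get ds -n local-storage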

Comment 8 Lakshmi Ravichandran 2020-10-14 16:15:35 UTC
At this point, the node master-03 has only one taint, as below:

Taints:             node-role.kubernetes.io/master:NoSchedule

No other taints or tolerations are set on the master-03 node.

I am attaching the output of oc describe node master-03.ocp-t8359015.lnxne.boe for more clarity.

Comment 9 Lakshmi Ravichandran 2020-10-14 16:16:56 UTC
Created attachment 1721529 [details]
oc describe node master-03.txt

Comment 10 Hemant Kumar 2020-10-14 17:26:52 UTC
Yes, that is what I was talking about. You need to add a toleration for that in the LocalVolume object. Please see - https://docs.openshift.com/container-platform/4.5/storage/persistent_storage/persistent-storage-local.html#local-tolerations_persistent-storage-local
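For example, the LocalVolume from comment 6 could carry the toleration roughly like this (a sketch based on the linked docs; only the tolerations block is added, the rest of the spec is taken unchanged from comment 6):

apiVersion: "local.storage.openshift.io/v1"
kind: "LocalVolume"
metadata:
  name: "local-dasd-worker-master"
  namespace: "local-storage"
spec:
  # allow the diskmaker/provisioner pods onto nodes tainted with
  # node-role.kubernetes.io/master:NoSchedule
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-001.ocp-t8359015.lnxne.boe
          - worker-002.ocp-t8359015.lnxne.boe
          - master-03.ocp-t8359015.lnxne.boe
  storageClassDevices:
    - storageClassName: "local-sc-worker-master"
      volumeMode: Filesystem
      fsType: ext4
      devicePaths:
        - /dev/dasdb1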

Comment 11 Lakshmi Ravichandran 2020-10-15 18:07:16 UTC
The section at https://docs.openshift.com/container-platform/4.5/storage/persistent_storage/persistent-storage-local.html#local-tolerations_persistent-storage-local 
talks about including tolerations in general to LocalVolume.

--------------------------------------------------------------------------------------------------------------
I find the documentation fails to provide clarity on having local PVs on master nodes, in two main aspects:
--------------------------------------------------------------------------------------------------------------

Aspect-I:
---------

In order to have a local PV on a master node, the LocalVolume definition should explicitly include the toleration below:

- key: node-role.kubernetes.io/master
  operator: Exists

If it is not included, the diskmaker and provisioner pods are not scheduled on that particular master, and consequently the underlying PV is not located.

Since a default OCP cluster's master nodes have mastersSchedulable=false, the user inevitably has to include this toleration.
But the documentation does not include any note on the toleration that needs to be set in the LocalVolume to have local PVs on master nodes.

--------------------------------------------------------------------------------------------------------------

Aspect-II:
----------

The section "2. Optional: Allow local storage creation on master and infrastructure nodes."
https://docs.openshift.com/container-platform/4.3/storage/persistent_storage/persistent-storage-local.html#local-storage-install_persistent-storage-local

The position of these instructions in the docs is misleading, as it asks the user to apply the patch too early, at a stage where the Local Storage Operator is not even installed.

"oc get ds" at this point returns no results, so one cannot apply these patches.

And as I observed,
the ds resources are created only when a LocalVolume is deployed, not when the LSO operator is installed.
In general, applying the patches at that point helps the diskmaker and provisioner pods get scheduled on master nodes, so the local PV is identified.

However, the patches do not have an effect when the LocalVolume already tolerates node-role.kubernetes.io/master with operator Exists.
I could see the local PVs on master nodes being displayed even before the patch was applied.
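For illustration, the ordering described above can be verified with a plain daemonset listing before and after deploying a LocalVolume (localvolume.yaml is a placeholder name for a manifest such as the one in comment 6):

$ oc get ds -n local-storage          # empty right after the operator install
$ oc create -f localvolume.yaml
$ oc get ds -n local-storage          # diskmaker/provisioner daemonsets appear only now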


Expected Results:
-----------------
1. The documentation should include clearer notes on the tolerations that need to be set to have local PVs on master nodes.
2. The section "2. Optional: Allow local storage creation on master and infrastructure nodes." should be moved from its current place and included later, near the section "Using tolerations with Local Storage Operator Pods".

If needed, I shall create a separate Bugzilla entry for this.

Thank you!!

Comment 12 Hemant Kumar 2020-10-15 21:04:16 UTC
Yes, I think documentation about taints for LocalVolume CRs can be clarified further. Please open a doc bug for that.

Comment 13 Sahina Bose 2020-11-23 10:36:07 UTC
(In reply to Jan Safranek from comment #3)
> We're planning to solve it as part of
> https://issues.redhat.com/browse/STOR-390, however, the number of requested
> changes is quite big and we need to redesign how the whole LSO daemonset
> works.

Now that STOR-390 is being worked on, can we consider this for 4.7?

Comment 14 Mudit Agarwal 2021-01-22 12:09:11 UTC
https://issues.redhat.com/browse/STOR-390 is feature complete; should we assume this issue is fixed?

Comment 15 Jan Safranek 2021-01-26 15:50:30 UTC
Sorry for the late reply. I think https://issues.redhat.com/browse/STOR-390 does not solve the cleanup of the symlinks; that's https://issues.redhat.com/browse/STOR-455, and that one is not available in 4.7.