Bug 1870061 - [RHEL][IBM] OCS un-install should make the devices raw
Summary: [RHEL][IBM] OCS un-install should make the devices raw
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Raghavendra Talur
QA Contact: Anna Sandler
URL:
Whiteboard:
Depends On: 1885648
Blocks:
Reported: 2020-08-19 09:44 UTC by akgunjal@in.ibm.com
Modified: 2020-12-17 06:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1887468
Environment:
Last Closed: 2020-12-17 06:23:47 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5605 0 None None None 2020-12-17 06:24:01 UTC

Description akgunjal@in.ibm.com 2020-08-19 09:44:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When we uninstall OCS from our cluster, only the Kubernetes resources on the cluster are deleted. If local volumes backed by raw devices or partitions were used, the local volume paths on the workers are not deleted, and the devices are not erased and converted back to raw devices.

Version of all relevant components (if applicable):
OCS 4.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
This impacts our ability to re-use the same cluster and local devices to install a fresh OCS.

Is there any workaround available to the best of your knowledge?
(1) Today we manually log in to each worker node and remove the local volume path "/mnt/local-storage".
(2) For each local device (example: /dev/sdc), we wipe the disk using the command "sgdisk -Z /dev/sdc".
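
For illustration, the two manual steps above could be scripted per worker node along these lines. This is only a minimal sketch and not part of OCS: the helper name build_cleanup_commands is hypothetical, the use of "rm -rf" to remove the path is an assumption (the workaround only says the path is removed), and the device list must be supplied by the administrator. The sketch builds and prints the commands as a dry run; it does not execute anything.

```python
import shlex

LOCAL_STORAGE_PATH = "/mnt/local-storage"  # path named in the workaround


def build_cleanup_commands(devices, local_path=LOCAL_STORAGE_PATH):
    """Return the cleanup commands for one worker node without running them.

    Hypothetical helper for illustration only.
    """
    commands = []
    # Step (1): remove the local volume path ("rm -rf" is an assumption).
    commands.append(["rm", "-rf", local_path])
    # Step (2): wipe each device with the sgdisk command from the workaround.
    for dev in devices:
        if not dev.startswith("/dev/"):
            raise ValueError(f"refusing non-device path: {dev!r}")
        commands.append(["sgdisk", "-Z", dev])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before use.
    for cmd in build_cleanup_commands(["/dev/sdc"]):
        print(shlex.join(cmd))
```

Printing rather than executing is deliberate, since both commands are destructive and the device names differ per node.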

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS using local volumes with local SSD disks on bare metal workers.
2. Uninstall OCS.
3. Log in to a worker node and check the path "/mnt/local-storage", which still contains the volumes. The disk on the worker has not been wiped.


Actual results:
The path "/mnt/local-storage" is still present on the workers and the disk is not wiped.

Expected results:
The path "/mnt/local-storage" should be deleted and the disks should be returned to the state they were in when the OCS install was initiated.

Additional info:

Comment 2 Neha Berry 2020-08-20 09:39:02 UTC
Hi,

Were the uninstall steps from the official 4.4 docs followed to clean up the cluster? We have explicit steps for wiping the disks and deleting the dataDir on the worker nodes.


Doc link - https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.4/html-single/deploying_openshift_container_storage/index#assembly_uninstalling-openshift-container-storage_aws-vmware

Steps for cleaning up on the Node side:

7. Clean up the storage operator artifacts on each node.

8. Delete the local volume created during the deployment and for each of the local volumes listed in step 4.

9. Wipe the disks for each of the local volumes listed in step 4 so that they can be reused.


P.S.: With 4.5, some of these steps are automated as part of StorageCluster deletion, provided the correct label for the cleanup policy is used. This feature is available only from OCS 4.5 onwards.

Comment 3 akgunjal@in.ibm.com 2020-08-20 11:05:50 UTC
Hi,

We use the uninstall doc link today, and we remove the path and wipe the disks as described in the workaround of this issue. My request is that these steps be automated when OCS is uninstalled. If that is supported in 4.5 then we are fine. Please point me to the doc that describes this support in version 4.5.

The removal of local volume path and wipe of disk needs to be automated.

Comment 6 Elad 2020-10-07 09:25:28 UTC
In OCS 4.6, such a flow should be handled automatically; hence, proposing as a blocker.

Comment 7 Jose A. Rivera 2020-10-07 14:18:48 UTC
I don't think this qualifies as a blocker. Best I can tell this is not part of any MVP epic that was accepted during the planning phase. Just because it "should" work that way is not a reason to block the release.

It is entirely possible that this will not be achievable. Indeed, handling anything with the local devices other than the data on them is likely outside the scope of OCS.

Comment 8 Mudit Agarwal 2020-10-08 05:08:10 UTC
@Elad, as explained by Jose, this is something we should not associate with the OCS uninstall feature, and it can be taken up separately. I agree that we have to proceed manually in such setups, but this is kind of an exception and should be treated in that manner. Looks like we need more time for this and it can't be done in the 4.6 timeframe.

@Talur, do you want to add anything here?

Comment 9 Raghavendra Talur 2020-10-08 05:30:30 UTC
 
> @Talur, do you want to add anything here?

There are two requirements in the title.

1. Make the devices RAW
DONE

2. Removal of the local volume paths
As Jose also mentioned, this is probably outside the scope of the OCS components. We tested it recently and the paths are still left behind after uninstall, even in the case where LSO LocalVolumeSets are used.

Basically, if the install requires manual steps then the uninstall would require them too. (Install steps for local-storage - https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.5/html-single/deploying_openshift_container_storage_using_amazon_web_services/index#creating-openshift-container-storage-cluster-on-amazon-ec2_local-storage)

Comment 10 akgunjal@in.ibm.com 2020-10-08 07:01:27 UTC
@talur: My understanding is that the devices are now made RAW upon uninstall, so the devices are zapped.

But the local volume paths on the nodes are not removed. These paths were not created manually before the install of OCS. They are created as part of the OCS install, and since OCS needs them to be empty, it should remove the directories under /mnt/local-storage that were created automatically. Maybe the OCS uninstall can clean up any contents in those paths, as they contain OCS-related data.

Comment 11 Sahina Bose 2020-10-08 09:20:45 UTC
2. Removal of the local volume paths
Is this something LSO should handle then?  On removal of PV or uninstall of LSO?

Comment 12 Raghavendra Talur 2020-10-08 15:23:46 UTC
(In reply to Sahina Bose from comment #11)
> 2. Removal of the local volume paths
> Is this something LSO should handle then?  On removal of PV or uninstall of
> LSO?

IMO, yes. It is LSO that creates it, so LSO can delete it. Sahina, who can help with that?

Comment 13 Santosh Pillai 2020-10-08 17:23:43 UTC
The local volume paths referred here are just symlinks that LocalVolume/LocalVolumeSet creates. The provisioner (sig-storage-local-storage-provisioner) picks up these symlinks and provisions PVs out of them.

I'm also of the opinion that these symlinks (/mnt/local-storage/<storageclassname>/<symlink>) should be deleted by LSO and not by OCS.

OCS is not directly controlling the localvolumeset/localvolume and thus not deleting the localVolumeSet/localVolume on its deletion. So if OCS decides to delete the symlinks (/mnt/local-storage/<storageclassname>/<symlink>) or the entire storageclass directory (/mnt/local-storage/<storageclassname>), then these will get created again because localVolume/localVolumeSet daemons are still running.
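
To see what is left behind on a node, the symlink layout described above (/mnt/local-storage/<storageclassname>/<symlink>) can be inspected with a short script. This is a minimal sketch under the assumption that the layout is exactly two levels deep as described; the function name list_local_volume_symlinks is hypothetical and is not part of LSO or OCS. It only reads the directory tree and deletes nothing.

```python
import os


def list_local_volume_symlinks(root="/mnt/local-storage"):
    """Map each symlink under root/<storageclassname>/ to its target.

    Hypothetical helper for illustration; read-only.
    """
    found = {}
    if not os.path.isdir(root):
        return found
    for storage_class in sorted(os.listdir(root)):
        class_dir = os.path.join(root, storage_class)
        if not os.path.isdir(class_dir):
            continue
        for name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, name)
            # Only symlinks are reported; regular files are skipped.
            if os.path.islink(path):
                found[path] = os.readlink(path)
    return found


if __name__ == "__main__":
    for link, target in list_local_volume_symlinks().items():
        print(f"{link} -> {target}")
```

As noted above, removing these entries by hand while the LocalVolume/LocalVolumeSet daemons are still running would only cause them to be recreated, which is why the sketch stops at listing.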

Comment 14 Raghavendra Talur 2020-10-08 17:29:24 UTC
(In reply to Santosh Pillai from comment #13)
> The local volume paths referred here are just symlinks that
> LocalVolume/LocalVolumeSet creates. The provisioner
> (sig-storage-local-storage-provisioner) picks up these symlinks and
> provisions PVs out of them.
> 
> I'm also of the opinion that these symlinks
> (/mnt/local-storage/<storageclassname>/<symlink>) should be deleted by LSO
> and not by OCS.  
> 
> OCS is not directly controlling the localvolumeset/localvolume and thus not
> deleting the localVolumeSet/localVolume on its deletion. So if OCS decides
> to delete the symlinks (/mnt/local-storage/<storageclassname>/<symlink>) or
> the entire storageclass directory (/mnt/local-storage/<storageclassname>),
> then these will get created again because localVolume/localVolumeSet daemons
> are still running.


Based on this information, I suggest that we split this bug into two.

1. Clean up the disks and make them RAW
The component remains the same (ocs-operator) and it is fixed.

2. File a new bug on LSO asking for the local volume paths to be removed.


I will do this tomorrow after waiting for a day to see if there are any objections.

Comment 15 Jose A. Rivera 2020-10-09 13:13:55 UTC
No complaints on my part!

Comment 16 Raghavendra Talur 2020-10-12 14:48:38 UTC
I have created a new bug to track the LSO changes required to remove the symlinks under /mnt/local-storage : https://bugzilla.redhat.com/show_bug.cgi?id=1887468

Renaming this bug to track only the cleanup of the disks.

Comment 17 Raghavendra Talur 2020-10-12 14:55:27 UTC
The PR to clean up the disks was merged in rook in 4.5: https://github.com/rook/rook/pull/5545
The PR to make cleanup the default was merged in ocs-operator with the first build of 4.6: https://github.com/openshift/ocs-operator/pull/731

Comment 21 Neha Berry 2020-11-26 12:49:14 UTC
Hi talur,

IIUC, wipefs as part of storagecluster deletion is the actual fix for this bug, and it is independent of platform.

But this BZ was raised in OCS 4.4 on IBM, hence I wanted to confirm whether this BZ needs to be verified on IBM as well, or whether any platform will do.

@akgunjal.com, at least for the IBM platform, can we request you to also verify from your end, if possible?

Comment 22 Anna Sandler 2020-12-08 12:22:16 UTC
Moving to verified. Tested on AWS+LSO; devices become raw after uninstall.
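
One way such a check could be scripted is by parsing the JSON output of "lsblk --json -o NAME,FSTYPE,TYPE". This is a minimal sketch and not the verification procedure used for this bug: the function name is_raw is hypothetical, and "raw" is assumed here to mean that the disk shows no filesystem signature and no child partitions.

```python
import json


def is_raw(lsblk_json, name):
    """Return True if the named device has no fstype and no children.

    Hypothetical helper; expects the output of
    `lsblk --json -o NAME,FSTYPE,TYPE`.
    """
    data = json.loads(lsblk_json)
    for dev in data.get("blockdevices", []):
        if dev.get("name") == name:
            return not dev.get("fstype") and not dev.get("children")
    raise KeyError(name)


if __name__ == "__main__":
    # Sample input written by hand for illustration, not captured from a node.
    sample = json.dumps({
        "blockdevices": [
            {"name": "sdc", "fstype": None, "type": "disk"},
        ]
    })
    print(is_raw(sample, "sdc"))
```

On a real node the JSON string would come from running lsblk on the worker after the uninstall.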

Comment 24 errata-xmlrpc 2020-12-17 06:23:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

