Bug 1885692 - GCE Disks for Ceph are not removed when the whole cluster is destroyed
Summary: GCE Disks for Ceph are not removed when the whole cluster is destroyed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: documentation
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kusuma
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-06 18:28 UTC by Martin Bukatovic
Modified: 2021-08-25 14:55 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 14:55:06 UTC
Embargoed:


Attachments
screenshot #1: list of GCE Disks while the OCP/OCS cluster is running (249.24 KB, image/png), 2020-10-06 18:32 UTC, Martin Bukatovic
screenshot #2: list of GCE Disks after the cluster was destroyed (115.35 KB, image/png), 2020-10-06 18:32 UTC, Martin Bukatovic

Description Martin Bukatovic 2020-10-06 18:28:44 UTC
Description of problem
======================

When I remove the OCP cluster using `openshift-install destroy cluster --dir=...`,
I see that the GCE Disks which were provisioned during StorageCluster installation
to host data for the Ceph OSD and MON components are still there.

Version-Release number of selected component
============================================

OCP 4.6.0-0.nightly-2020-10-03-051134
OCS 4.6.0-583.ci

How reproducible
================

4/4

Steps to Reproduce
==================

1. Install an OCP cluster on the GCP platform
2. Install OCS (following the docs, using the "faster" SSD storage class)
3. Destroy the whole cluster via `openshift-install destroy cluster --dir=...`
4. Inspect the GCP project where the cluster was installed (see the gcloud sketch below)
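
For step 4, besides the GCP Console, leftover disks can also be spotted with the
gcloud CLI. A minimal sketch, assuming gcloud is authenticated against the project;
the project ID placeholder and the `pvc` name pattern of dynamically provisioned
gce-pd disks are assumptions, not something captured in this bug:

```
# List all Compute Engine disks in the project.
$ gcloud compute disks list --project <gcp-project-id>

# Disks dynamically provisioned by the in-tree gce-pd provisioner usually carry
# "pvc" in their name, so a filter narrows the output to likely leftovers.
$ gcloud compute disks list --project <gcp-project-id> --filter="name~'pvc'"
```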

Actual results
==============

When I go to the "Disks" page of the "Compute Engine" section for the GCP project
where the cluster was installed, I still see a Google Compute Engine Disk for
each OSD and MON the cluster was using.

See attached screenshot.

Expected results
================

All GCE Disks are removed after the cluster is destroyed.

Comment 2 Martin Bukatovic 2020-10-06 18:32:02 UTC
Created attachment 1719474 [details]
screenshot #1: list of GCE Disks while the OCP/OCS cluster is running

Comment 3 Martin Bukatovic 2020-10-06 18:32:55 UTC
Created attachment 1719475 [details]
screenshot #2: list of GCE Disks after the cluster was destroyed

Comment 4 Sébastien Han 2020-10-07 08:50:25 UTC
What's the SC's reclaim policy on deletion? Typically, we do not remove the OSD disks, but we do remove the mons.
Moving to ocs-op

Comment 5 Martin Bukatovic 2020-10-07 09:10:14 UTC
The storage class used for the MON and OSD devices is manually created prior to OCS installation to allow OCS to use GCE SSD disks:

```
$ cat storageclass.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
 name: faster
provisioner: kubernetes.io/gce-pd
parameters:
 type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
```

This process is described in the "Deploying and managing OpenShift Container Storage using Google Cloud" documentation, section 1.2, "Creating an OpenShift Container Storage Cluster Service in internal mode":

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.5/html/deploying_and_managing_openshift_container_storage_using_google_cloud/deploying-openshift-container-storage-on-google-cloud_gcp#creating-an-openshift-container-storage-service_gcp

Comment 6 Jose A. Rivera 2020-10-12 14:52:36 UTC
This is not a blocker for OCS 4.6, moving to OCS 4.7.

As Seb said, the OSD devices won't be deleted; that has to be done by the admin. I'm not aware of what Rook is expected to do when the mons are removed. OCS Operator has never dealt with removing any devices. Seb?

Comment 7 Sébastien Han 2020-10-12 16:39:21 UTC
It's all about the reclaim policy of the SC.
If ocs-op does not create the SC, then maybe it's a doc improvement?

@Jose nothing special :).

Comment 8 Travis Nielsen 2020-10-12 17:01:03 UTC
Agreed, if the reclaim policy of the storage class is Delete, then we expect them to be deleted. The storage class (or the default) must have "Retain" set, so this is not an OCS or Rook issue, except perhaps for documentation.
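
For reference, the reclaim policy a live cluster actually applied can be checked directly; a small sketch, assuming the `faster` storage class name used in the docs:

```
# Reclaim policy recorded on the storage class (the API server fills in a
# default when the field is omitted from the YAML).
$ oc get storageclass faster -o jsonpath='{.reclaimPolicy}{"\n"}'

# Reclaim policy each PV ended up with, together with its storage class.
$ oc get pv -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,RECLAIM:.spec.persistentVolumeReclaimPolicy
```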

Comment 9 Martin Bukatovic 2020-10-26 18:15:58 UTC
I can confirm that when I use the predefined "standard" storage class instead of the custom SSD one currently described in the OCS docs, the Google disks are removed as expected during cluster teardown.

Comment 10 Martin Bukatovic 2020-10-26 18:18:49 UTC
Based on the dev evaluation and QE confirmation, moving to the documentation component. The custom storage class we instruct admins to create in our docs is to blame here.

Comment 11 Martin Bukatovic 2020-10-26 18:21:46 UTC
Could we get a dev-approved fix for the "faster" storage class as currently listed in the docs? Do I read it right that we should set `reclaimPolicy` to `Delete`?

Comment 12 Travis Nielsen 2020-10-26 18:38:10 UTC
Yes, I would expect the reclaim policy to be delete. If you delete the CephCluster CR, it's going to be very difficult to recover your cluster at that point anyway.

Comment 13 Martin Bukatovic 2020-11-24 16:04:53 UTC
Validation of the proposed fix: I installed a CI build of OCS 4.6 manually, adding
a `reclaimPolicy: Delete` line to the YAML definition of the SSD storage class
(as defined in our documentation [1]):

```
$ cat gcp-sc.bz-1885692.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
 name: faster
provisioner: kubernetes.io/gce-pd
parameters:
 type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
$ diff gcp-sc.46.yaml gcp-sc.bz-1885692.yaml
8a9
> reclaimPolicy: Delete
$ oc create -f gcp-sc.bz-1885692.yaml
```

And I see that it has the expected effect: after cluster teardown, there are no
leftover Google disks.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/deploying_and_managing_openshift_container_storage_using_google_cloud/index?lb_target=preview
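
For completeness, the check for leftover Google disks above can be done with the same kind of gcloud query as in the reproduction steps; a sketch, assuming an authenticated gcloud CLI and the usual `pvc` naming of dynamically provisioned disks:

```
# After `openshift-install destroy cluster`, this should return nothing.
$ gcloud compute disks list --project <gcp-project-id> --filter="name~'pvc'"
```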

Comment 14 Martin Bukatovic 2020-11-24 16:20:07 UTC
FYI pending change in ocs-ci deployment automation: https://github.com/red-hat-storage/ocs-ci/pull/3397

Comment 16 Martin Bukatovic 2021-02-09 20:01:06 UTC
The referenced preview of the "Deploying and managing OpenShift Container Storage using Google Cloud" guide contains the ssd-storageclass.yaml example with reclaimPolicy set to Delete, as expected.

Verified.

