Bug 2263441 - Customer Managed to Create Two StorageSystems w/the Original StorageSystem Now in a "Terminating" State (Data Loss Prevention)
Summary: Customer Managed to Create Two StorageSystems w/the Original StorageSystem No...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nitin Goyal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-08 22:35 UTC by Craig Wayman
Modified: 2024-03-04 13:43 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-04 09:01:01 UTC
Embargoed:


Attachments

Description Craig Wayman 2024-02-08 22:35:48 UTC
Created attachment 2015969
Two StorageSystems

Description of problem (please be detailed as possible and provide log snippets):

  In its current state, ODF is still able to provision PVCs and appears to be operating normally. However, the customer had a mirroring issue in their disconnected environment, and upon further inspection they believed the root cause was the presence of two StorageSystems (one of them deleting, held up by a NooBaa finalizer). After some tests, ODF Support confirmed that this was not the case and that ODF is functioning, but the situation still needs to be remedied.

  ODF Support is opening this Bugzilla for two reasons. First, the customer believes this to be a bug: they state firmly that the creation of the second StorageSystem and the deletion of the initial one were automated and were not executed by their staff. ODF Support’s research disagrees with this assertion, but it is worth pursuing in case their statements are true.

  Second, this first/initial StorageSystem will need to be reconciled/deleted/uninstalled. ODF Support is comfortable patching out the finalizers to delete/purge the StorageSystem that is stuck deleting; however, there are some unknowns as to whether this will cause data loss, so we are seeking Engineering’s guidance to facilitate the process and green-light Support’s steps.
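For reference, the finalizer patch Support has in mind could look like the sketch below. This is a dry run only: the script echoes each command instead of executing it, because running them requires access to the affected cluster and Engineering sign-off. The object name `ocs-storagecluster-storagesystem` and namespace `openshift-storage` are assumptions (typical ODF defaults) and must be replaced with the actual stuck object’s name.

```shell
#!/bin/sh
# Dry-run sketch: run() echoes each command instead of executing it.
# Swap the echo for "$@" only after Engineering approves the procedure.
run() { echo "+ $*"; }

# Hypothetical names -- replace with the real stuck StorageSystem.
SS="ocs-storagecluster-storagesystem"
NS="openshift-storage"

# Clearing the finalizers lets the already-deleted object be
# garbage-collected instead of hanging in "Terminating".
run oc patch storagesystem "$SS" -n "$NS" --type=merge \
    -p '{"metadata":{"finalizers":[]}}'
```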

  For a more detailed analysis of the problem, along with log snippets, see ODF Support’s conclusions from the must-gathers/logs in the private comment of this BZ. The same analysis was shared with the customer as a case comment.


Version of all relevant components (if applicable):


OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.27   True        False         137d    Cluster version is 4.12.27


ODF:
NAME                                    DISPLAY                       VERSION        REPLACES                                PHASE
mcg-operator.v4.12.7-rhodf              NooBaa Operator               4.12.7-rhodf   mcg-operator.v4.12.6-rhodf              Succeeded
ocs-operator.v4.12.7-rhodf              OpenShift Container Storage   4.12.7-rhodf   ocs-operator.v4.12.6-rhodf              Succeeded
odf-csi-addons-operator.v4.12.7-rhodf   CSI Addons                    4.12.7-rhodf   odf-csi-addons-operator.v4.12.6-rhodf   Succeeded
odf-operator.v4.12.7-rhodf              OpenShift Data Foundation     4.12.7-rhodf   odf-operator.v4.12.6-rhodf              Succeeded
quay-operator.v3.8.11                   Red Hat Quay                  3.8.11         quay-operator.v3.8.10                   Succeeded

Ceph:
{
    "mon": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 6
    },
    "mds": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.10-187.el8cp (5d6355e2bccd18b5c6457a34cb666d773f21823d) pacific (stable)": 12
    }
}


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


The issue does not appear to have any functional impact; the concern is potential data loss when removing the first StorageSystem.



Is there any workaround available to the best of your knowledge?

Deleting the StorageSystem that is stuck in the “Terminating” state; it is unclear whether this will cause data loss.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Can this issue be reproducible?

Yes

Can this issue reproduce from the UI?

No


Steps to Reproduce:

1. With the first StorageSystem deployed and healthy, create a second StorageSystem using the steps outlined in: https://access.redhat.com/articles/5692201#create-cluster-11

2. Note: this CR needs to have the EXACT same configuration as the first StorageSystem, but a different name.

3. Create the StorageSystem with $ oc create -f storagecluster.yaml

4. This places the newly created StorageSystem in a Progressing/Creating state, as it cannot be created while a StorageSystem/StorageCluster already exists.

5. Delete the first/original StorageSystem.

6. The first/original StorageSystem will most likely get stuck in a “Terminating”/“Deleting” state, held up by a finalizer/admission webhook (e.g. NooBaa).

7. Because the first/original StorageSystem is stuck in a “Terminating” state, ODF then allows the second/newly created StorageSystem to transition from “Progressing/Creating” to provisioned.
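The stuck state in steps 6–7 can be confirmed by inspecting the objects’ deletion timestamps and finalizers. The sketch below is a dry run (it echoes the commands rather than executing them, since they need access to the affected cluster); the namespace is an assumption based on the default ODF install.

```shell
#!/bin/sh
# Dry-run sketch: run() prints each inspection command instead of
# executing it, since they require access to the affected cluster.
run() { echo "+ $*"; }

NS="openshift-storage"   # default ODF namespace (assumption)

# List both StorageSystems; the original should show a deletionTimestamp.
run oc get storagesystem -n "$NS"

# Show which finalizers are holding the Terminating object.
run oc get storagesystem -n "$NS" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}{end}'
```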

