Bug 1872755

Summary: [RFE][External mode] Re-sync OCS and RHCS for any changes on the RHCS cluster that will affect the OCS cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Rachael <rgeorge>
Component: ocs-operator
Assignee: Mudit Agarwal <muagarwa>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: muagarwa, nberry, ocs-bugs, odf-bz-bot, owasserm, sostapov
Target Milestone: ---
Keywords: AutomationBackLog, FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-08-01 13:24:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rachael 2020-08-26 14:50:03 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

An external mode OCS cluster depends on resources made available by the RHCS cluster, such as the CephBlockPool, CephFS pools, and the RGW endpoint. The storage classes in OCS are created based on the JSON output that contains these details. If any of these resources change, for example due to a disruption on the RHCS side, the existing OCS resources that reference them are affected. Some of these changes can be:

  - Addition/deletion of CephFS pools
  - Addition/deletion of the CephBlockPool being used by OCS
  - Addition/deletion of RGW endpoints

The request here is to reconfigure OCS resources, such as the storage classes, to use the updated parameters from the RHCS cluster without having to redeploy OCS.
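The description above says the storage classes are built from a JSON export of the RHCS cluster details. A re-sync as requested would first have to detect drift between the snapshot OCS was deployed from and a fresh export. A minimal sketch of such a drift check follows; the JSON structure, entry names, and values are illustrative assumptions, not taken from the actual exporter output:

```python
import json

# Hypothetical old and new snapshots of the external cluster details JSON
# (modeled loosely on a list of {"name", "kind", "data"} entries; the exact
# exporter format is an assumption made for this sketch).
OLD = json.loads("""
[
  {"name": "ceph-rbd", "kind": "StorageClass", "data": {"pool": "rbd-pool"}},
  {"name": "ceph-rgw", "kind": "StorageClass", "data": {"endpoint": "10.0.0.1:8080"}}
]
""")
NEW = json.loads("""
[
  {"name": "ceph-rbd", "kind": "StorageClass", "data": {"pool": "rbd-pool-new"}},
  {"name": "ceph-rgw", "kind": "StorageClass", "data": {"endpoint": "10.0.0.1:8080"}}
]
""")

def diff_external_details(old, new):
    """Return (name, key, old_value, new_value) for every changed field
    between two exporter snapshots. A changed 'pool' or 'endpoint' value is
    exactly the kind of drift this RFE asks the operator to reconcile."""
    changes = []
    old_by_name = {entry["name"]: entry["data"] for entry in old}
    for entry in new:
        before = old_by_name.get(entry["name"], {})
        for key, value in entry["data"].items():
            if before.get(key) != value:
                changes.append((entry["name"], key, before.get(key), value))
    return changes

print(diff_external_details(OLD, NEW))
# → [('ceph-rbd', 'pool', 'rbd-pool', 'rbd-pool-new')]
```

An operator could run such a comparison on each reconcile loop and, for each changed field, decide whether the dependent OCS resource (e.g. a storage class) can be safely re-created.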


Version of all relevant components (if applicable):
OCS version: 4.5.0-526.ci
OCP version: 4.5.0-0.nightly-2020-08-20-051434

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Additional info:
As a disruptive test case, the CephBlockPool was deleted and re-created with the same name. Although the old PVCs were not accessible, we were able to provision new PVCs.

Comment 3 Travis Nielsen 2020-08-26 17:33:26 UTC
OCS does not control the external cluster. If the admin of the external cluster does destructive actions like deleting a pool, there is nothing OCS can do to recover the PVCs using that pool. As you noticed, new volumes can at least be created. 

Moving the RFE out to 4.7 for discussion. But this is a very broad request. We will really need to evaluate individual scenarios to see if they can even be supported.

In some scenarios, the admin could delete the OCS initialization CR and the issue could be resolved by re-creating the storage class. But in other cases there may be nothing we can do. We need specific scenarios. 

In some scenarios, the Rook CRs can be updated. In other cases, they may be OCS CRs to be updated. Either way, in OCS this would be driven by the OCS operator. 

Moving to the OCS component in case Jose sees a more general solution, but I would recommend this general BZ be closed and instead track more specific scenarios individually.

Comment 4 Raz Tamir 2020-08-26 18:23:18 UTC
Maybe I'm missing the point of this RFE, but if there are changes in the cluster resources that are not reflected on the OCS side, isn't that the most important thing to handle in external mode?

Comment 5 Neha Berry 2020-08-26 18:44:02 UTC
(In reply to Travis Nielsen from comment #3)
> OCS does not control the external cluster. If the admin of the external
> cluster does destructive actions like deleting a pool, there is nothing OCS
> can do to recover the PVCs using that pool. As you noticed, new volumes can
> at least be created. 
> 
New volumes can be created only if the re-created pool has the same name as the old one. Otherwise, we cannot create new PVCs either. So we should have a way to reconfigure/re-initialize the storage class with the new pool name (in case the pool name differs from the original) to enable new PVC creation.

As regards old volumes, we agree that we cannot expect to use them if the underlying pool is deleted. But at least for the noobaa-db PV, there should be a way to recover NooBaa without an OCS re-install.

The same issue will be seen in internal mode too, in case the noobaa-db PV has problems. We should have a way to recover NooBaa (even if old data is lost, new operations should work; that can only happen if we have a way to recover NooBaa when its DB is lost).
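Because StorageClass parameters are immutable in Kubernetes, the reconfiguration asked for here would in practice mean deleting and re-creating the storage class pointing at the new pool. A sketch of what the re-created class could look like; the storage class name, provisioner, cluster ID, and pool name below are illustrative assumptions, not values from this bug:

```yaml
# Sketch only: re-created RBD storage class referencing the renamed pool.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-external-storagecluster-ceph-rbd
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: replicated-pool-new   # assumed new name of the re-created pool
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Existing PVCs keep their old provisioning parameters, so this only helps new PVC creation, which matches the behavior observed in the description.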


> Moving the RFE out to 4.7 for discussion. But this is a very broad request.
> We will really need to evaluate individual scenarios to see if they can even
> be supported.
> 
> In some scenarios, the admin could delete the OCS initialization CR and the
> issue could be resolved by re-creating the storage class. But in other cases
> there may be nothing we can do. We need specific scenarios. 
> 
> In some scenarios, the Rook CRs can be updated. In other cases, they may be
> OCS CRs to be updated. Either way, in OCS this would be driven by the OCS
> operator. 
> 
> Moving to the OCS component in case Jose sees a more general solution, but I
> would recommend this general BZ be closed and instead track more specific
> scenarios individually.

Comment 24 Mudit Agarwal 2023-08-01 13:24:13 UTC
I don't think this will ever be prioritized.