2182375 – [MDR] Not able to fence DR clusters

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	csi-addons
Sub Component:
Version:	4.13
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.13.0
Assignee:	Niels de Vos
QA Contact:	Parikshith
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-03-28 13:12 UTC by Parikshith
Modified:	2023-08-09 16:37 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-06-21 15:25:01 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	csi-addons kubernetes-csi-addons pull 333	None	Merged	NetworkFence: correct check in validating webhook	2023-03-28 15:14:28 UTC
Github	red-hat-storage kubernetes-csi-addons pull 79	None	Merged	BUG 2182375: NetworkFence: correct check in validating webhook	2023-03-28 15:15:41 UTC
Red Hat Product Errata	RHBA-2023:3742	None	None	None	2023-06-21 15:25:26 UTC

Description Parikshith 2023-03-28 13:12:11 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
On a 4.13 MDR setup, the DR clusters cannot be fenced. The cluster gets stuck in the 'fencing' state.

oc describe networkfence network-fence-pbyregow-c1
  Message:  failed to add finalizer (csiaddons.openshift.io/network-fence) to NetworkFence resource (network-fence-pbyregow-c1): admission webhook "vnetworkfence.kb.io" denied the request: NetworkFence.csiaddons.openshift.io "network-fence-pbyregow-c1" is invalid: spec.parameters: Invalid value: map[string]string{"clusterID":"openshift-storage"}: parameters cannot be changed

Version of all relevant components (if applicable):
ocp: 4.13.0-0.nightly-2023-03-23-204038
odf: 4.13.0-110
acm: 2.7.2 
mco: 4.13.0-110

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, failover is not possible without fencing

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
2/2 times 
First noticed on 4.13.0-108
Reproduced on 4.13.0-110

If this is a regression, please provide more details to justify this:
Yes, works on 4.12.1 and 4.12.2 MDR configs

Steps to Reproduce:
1. Create a Metro-DR cluster with 3 OCP clusters, i.e. hub, c1, and c2
2. Configure the DR policy and fencing
3. Create an application on the managed cluster, c1
4. Fence c1

Steps 1-4 are done by following doc [1]

[1] https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/index?lb_target=preview#configure-drclusters-for-fencing-automation

Actual results:
The c1 cluster stays stuck in the fencing state

Expected results:
The cluster should move to the fenced state

Additional info:

Comment 4 Niels de Vos 2023-03-28 14:30:10 UTC

The error comes from https://github.com/csi-addons/kubernetes-csi-addons/blob/main/apis/csiaddons/v1alpha1/networkfence_webhook.go#L64-L66

```
    if reflect.DeepEqual(n.Spec.Parameters, oldNetworkFence.Spec.Parameters) {
        allErrs = append(allErrs, field.Invalid(field.NewPath("spec").Child("parameters"), n.Spec.Parameters, "parameters cannot be changed"))
    }
```

So the error is returned when reflect.DeepEqual() returns true, that is, when the parameters have not changed? I think the condition is missing a "!".
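To illustrate the inverted condition, here is a minimal standalone sketch of what the corrected check would presumably look like. It is not the actual webhook code: fenceSpec, validateUpdate and errParametersChanged are hypothetical names used only for this example, and the real webhook collects field errors rather than returning a plain error. Adding a finalizer is an update that leaves spec.parameters untouched, so with the "!" in place such an update should pass validation.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// fenceSpec is a hypothetical, trimmed-down stand-in for the NetworkFence spec.
// Only the parameters map matters for this illustration.
type fenceSpec struct {
	Parameters map[string]string
}

// errParametersChanged mirrors the message seen in the webhook denial.
var errParametersChanged = errors.New("parameters cannot be changed")

// validateUpdate rejects an update only when the parameters differ between
// the old and the new object. Note the "!" in front of reflect.DeepEqual.
func validateUpdate(oldSpec, newSpec fenceSpec) error {
	if !reflect.DeepEqual(newSpec.Parameters, oldSpec.Parameters) {
		return errParametersChanged
	}
	return nil
}

func main() {
	old := fenceSpec{Parameters: map[string]string{"clusterID": "openshift-storage"}}

	// An update that keeps the parameters as they are, such as adding a finalizer.
	same := fenceSpec{Parameters: map[string]string{"clusterID": "openshift-storage"}}

	// An update that modifies the parameters; the value "other" is made up.
	changed := fenceSpec{Parameters: map[string]string{"clusterID": "other"}}

	fmt.Println("unchanged parameters:", validateUpdate(old, same))
	fmt.Println("changed parameters:", validateUpdate(old, changed))
}
```

Without the "!", the first call would be the one that fails, which matches the denial seen when the finalizer is added.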

Comment 7 Sravika 2023-03-29 07:40:56 UTC

Also observed this on IBM Z, fencing of the DR cluster was not successful.

  - lastTransitionTime: "2023-03-28T19:56:13Z"
    message: fencing operation not successful
    observedGeneration: 5
    reason: FenceError
    status: "False"
    type: Fenced
  - lastTransitionTime: "2023-03-28T19:56:13Z"
    message: fencing operation not successful
    observedGeneration: 5
    reason: FenceError
    status: "True"
    type: Clean




2023-03-28T19:58:58.950Z        INFO    controllers.DRCluster   controllers/drcluster_controller.go:290 Nothing to update {Phase:Fencing Conditions:[{Type:Fenced Status:False ObservedGeneration:5 LastTransitionTime:2023-03-28 19:56:13 +0000 UTC Reason:FenceError Message:fencing operation not successful} {Type:Clean Status:True ObservedGeneration:5 LastTransitionTime:2023-03-28 19:56:13 +0000 UTC Reason:FenceError Message:fencing operation not successful} {Type:Validated Status:True ObservedGeneration:5 LastTransitionTime:2023-03-28 19:56:13 +0000 UTC Reason:Succeeded Message:Validated the cluster}]}  {"name": "ocsm4205001", "rid": "b81d5c34-1531-4c46-a9f3-fa9ffe1aed39"}
2023-03-28T19:58:58.950Z        INFO    controllers.DRCluster   controllers/drcluster_controller.go:149 reconcile exit  {"name": "ocsm4205001", "rid": "b81d5c34-1531-4c46-a9f3-fa9ffe1aed39"}
2023-03-28T19:58:58.950Z        ERROR   controller/controller.go:326    Reconciler error        {"controller": "drcluster", "controllerGroup": "ramendr.openshift.io", "controllerKind": "DRCluster", "DRCluster": {"name":"ocsm4205001"}, "namespace": "", "name": "ocsm4205001", "reconcileID": "64d82e9d-5b37-4ad8-965d-d519732d9bad", "error": "failed to handle cluster fencing: fencing operation result not successful"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.1/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.1/pkg/internal/controller/controller.go:234

Comment 8 Madhu Rajanna 2023-03-29 08:08:57 UTC

@Sravika, in which ODF version have you seen this issue? Is it ODF 4.13?

Comment 9 Sravika 2023-03-29 08:28:21 UTC

@mrajanna: It's ODF version v4.13.0-110.stable, same as mentioned in the BZ

Comment 10 Niels de Vos 2023-03-29 13:51:09 UTC

@sbalusu do you get the same error when you 'oc describe' the networkfence CR?

A fix for this should be included in the next ODF build.

Comment 11 Mudit Agarwal 2023-03-29 13:57:14 UTC

Fixed with 4.13.0-114

Comment 17 errata-xmlrpc 2023-06-21 15:25:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742
