
Who When What Removed Added
Shyamsundar 2022-08-09 02:45:41 UTC Status NEW ASSIGNED
Depends On 2097511
Ilya Dryomov 2022-08-09 13:51:55 UTC CC idryomov
Mudit Agarwal 2022-08-20 02:59:13 UTC Blocks 2119932
Mudit Agarwal 2022-08-20 03:00:23 UTC Status ASSIGNED POST
Link ID Github RamenDR/ramen/pull/525
Shyamsundar 2022-08-23 13:59:25 UTC Doc Text Cause: Due to a bug in the DR reconciler, during deletion of the internal VolumeReplicationGroup resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a PVC. The resulting cleanup operation does not complete, and the PeerReady condition on the DRPlacementControl for the application is reported as false.

Consequence: An application that was failed over or relocated cannot be relocated or failed over again, because the DRPlacementControl resource reports its PeerReady condition as false.

Workaround (if any):
Before applying the workaround, determine whether the cause is a PVC being protected during VolumeReplicationGroup deletion, as follows:
- Ensure the VolumeReplicationGroup resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:
- VRG metadata.deletionTimestamp is non-zero
- VRG spec.replicationState is "Secondary"
- List the VolumeReplication resources in the same workload namespace, and ensure each resource has the following values:
- metadata.generation is 1
- spec.replicationState is "Secondary"
- The VolumeReplication resource reports no status
- For each VolumeReplication resource in the above state, the corresponding PVC resource (as referenced in the VolumeReplication spec.dataSource field) should have the following values:
- metadata.deletionTimestamp is non-zero

To recover,
- Remove the finalizer "volumereplicationgroups.ramendr.openshift.io/vrg-protection" from the VRG resource
- Remove the finalizer "volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection" from the respective PVC resources

Result: DRPlacementControl at the hub cluster reports the PeerReady condition as "true" and enables further workload relocation or failover actions
Mudit Agarwal 2022-08-23 14:20:20 UTC Blocks 2094357
Olive Lakra 2022-08-23 14:55:43 UTC Doc Text Cause: Due to a bug in the DR reconciler, during deletion of the internal VolumeReplicationGroup resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a PVC. The resulting cleanup operation does not complete, and the PeerReady condition on the DRPlacementControl for the application is reported as false.

Consequence: An application that was failed over or relocated cannot be relocated or failed over again, because the DRPlacementControl resource reports its PeerReady condition as false.

Workaround (if any):
Before applying the workaround, determine whether the cause is a PVC being protected during VolumeReplicationGroup deletion, as follows:
- Ensure the VolumeReplicationGroup resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:
- VRG metadata.deletionTimestamp is non-zero
- VRG spec.replicationState is "Secondary"
- List the VolumeReplication resources in the same workload namespace, and ensure each resource has the following values:
- metadata.generation is 1
- spec.replicationState is "Secondary"
- The VolumeReplication resource reports no status
- For each VolumeReplication resource in the above state, the corresponding PVC resource (as referenced in the VolumeReplication spec.dataSource field) should have the following values:
- metadata.deletionTimestamp is non-zero

To recover,
- Remove the finalizer "volumereplicationgroups.ramendr.openshift.io/vrg-protection" from the VRG resource
- Remove the finalizer "volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection" from the respective PVC resources

Result: DRPlacementControl at the hub cluster reports the PeerReady condition as "true" and enables further workload relocation or failover actions
.Volume replication group deletion is stuck on a fresh volume replication created during deletion, which in turn is stuck because the persistent volume claim cannot be updated with a finalizer

Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal `VolumeReplicationGroup` resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the `PeerReady` condition on the `DRPlacementControl` for the application as `false`.

As a result, the application that was failed over or relocated cannot be relocated or failed over again, because the `DRPlacementControl` resource reports its `PeerReady` condition as `false`.

Workaround:
Before applying the workaround, determine whether the cause is a PVC being protected during `VolumeReplicationGroup` deletion, as follows:

. Ensure the `VolumeReplicationGroup` resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:

- VRG `metadata.deletionTimestamp` is `non-zero`
- VRG `spec.replicationState` is `Secondary`

. List the `VolumeReplication` resources in the same workload namespace, and ensure each resource has the following values:
- `metadata.generation` is `1`
- `spec.replicationState` is `Secondary`
- The VolumeReplication resource reports no status

. For each `VolumeReplication` resource in the above state, the corresponding PVC resource (as referenced in the `VolumeReplication` `spec.dataSource` field) should have a non-zero `metadata.deletionTimestamp` value

. To recover, remove the following finalizers:
- `volumereplicationgroups.ramendr.openshift.io/vrg-protection` from the VRG resource
- `volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection` from the respective PVC resources

Result: `DRPlacementControl` at the hub cluster reports the `PeerReady` condition as `true` and enables further workload relocation or failover actions.
CC olakra
Olive Lakra 2022-08-24 13:52:28 UTC Doc Text .Volume replication group deletion is stuck on a fresh volume replication created during deletion, which in turn is stuck because the persistent volume claim cannot be updated with a finalizer

Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal `VolumeReplicationGroup` resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the `PeerReady` condition on the `DRPlacementControl` for the application as `false`.

As a result, the application that was failed over or relocated cannot be relocated or failed over again, because the `DRPlacementControl` resource reports its `PeerReady` condition as `false`.

Workaround:
Before applying the workaround, determine whether the cause is a PVC being protected during `VolumeReplicationGroup` deletion, as follows:

. Ensure the `VolumeReplicationGroup` resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:

- VRG `metadata.deletionTimestamp` is `non-zero`
- VRG `spec.replicationState` is `Secondary`

. List the `VolumeReplication` resources in the same workload namespace, and ensure each resource has the following values:
- `metadata.generation` is `1`
- `spec.replicationState` is `Secondary`
- The VolumeReplication resource reports no status

. For each `VolumeReplication` resource in the above state, the corresponding PVC resource (as referenced in the `VolumeReplication` `spec.dataSource` field) should have a non-zero `metadata.deletionTimestamp` value

. To recover, remove the following finalizers:
- `volumereplicationgroups.ramendr.openshift.io/vrg-protection` from the VRG resource
- `volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection` from the respective PVC resources

Result: `DRPlacementControl` at the hub cluster reports the `PeerReady` condition as `true` and enables further workload relocation or failover actions.
.Volume replication group deletion is stuck on a fresh volume replication created during deletion, which in turn is stuck because the persistent volume claim cannot be updated with a finalizer

Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal `VolumeReplicationGroup` resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the `PeerReady` condition on the `DRPlacementControl` for the application as `false`.

As a result, the application that was failed over or relocated cannot be relocated or failed over again, because the `DRPlacementControl` resource reports its `PeerReady` condition as `false`.

Workaround:
Before applying the workaround, determine whether the cause is a PVC being protected during `VolumeReplicationGroup` deletion, as follows:

. Ensure the `VolumeReplicationGroup` resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:

- VRG `metadata.deletionTimestamp` is `non-zero`
- VRG `spec.replicationState` is `Secondary`

. List the `VolumeReplication` resources in the same workload namespace, and ensure each resource has the following values:
- `metadata.generation` is `1`
- `spec.replicationState` is `Secondary`
- The VolumeReplication resource reports no status

. For each `VolumeReplication` resource in the above state, the corresponding PVC resource (as referenced in the `VolumeReplication` `spec.dataSource` field) should have a non-zero `metadata.deletionTimestamp` value

. To recover, remove the following finalizers:
- `volumereplicationgroups.ramendr.openshift.io/vrg-protection` from the VRG resource
- `volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection` from the respective PVC resources

Result: `DRPlacementControl` at the hub cluster reports the `PeerReady` condition as `true` and enables further workload relocation or failover actions.
Mudit Agarwal 2022-09-26 23:22:38 UTC Doc Type If docs needed, set a value Bug Fix
Doc Text .Volume replication group deletion is stuck on a fresh volume replication created during deletion, which in turn is stuck because the persistent volume claim cannot be updated with a finalizer

Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal `VolumeReplicationGroup` resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the `PeerReady` condition on the `DRPlacementControl` for the application as `false`.

As a result, the application that was failed over or relocated cannot be relocated or failed over again, because the `DRPlacementControl` resource reports its `PeerReady` condition as `false`.

Workaround:
Before applying the workaround, determine whether the cause is a PVC being protected during `VolumeReplicationGroup` deletion, as follows (a sketch of example `oc` commands for these checks and the recovery steps appears after the procedure):

. Ensure the `VolumeReplicationGroup` resource in the workload namespace, on the managed cluster from which the workload was relocated or failed over, has the following values:

- VRG `metadata.deletionTimestamp` is `non-zero`
- VRG `spec.replicationState` is `Secondary`

. List the `VolumeReplication` resources in the same workload namespace, and ensure each resource has the following values:
- `metadata.generation` is `1`
- `spec.replicationState` is `Secondary`
- The VolumeReplication resource reports no status

. For each `VolumeReplication` resource in the above state, the corresponding PVC resource (as referenced in the `VolumeReplication` `spec.dataSource` field) should have a non-zero `metadata.deletionTimestamp` value

. To recover, remove the following finalizers:
- `volumereplicationgroups.ramendr.openshift.io/vrg-protection` from the VRG resource
- `volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection` from the respective PVC resources
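
The following is a minimal sketch of the checks and recovery steps above, expressed as `oc` commands. The namespace, resource names, and finalizer index shown as `<...>` are placeholders to substitute for the actual workload; the `VolumeReplicationGroup` resource is addressed through its full CRD name, and the `volumereplication` kind is assumed to resolve on the managed cluster.

[source,terminal]
----
# Verify the VRG in the workload namespace is being deleted and is Secondary
oc get volumereplicationgroups.ramendr.openshift.io <vrg-name> -n <workload-namespace> \
  -o jsonpath='deletionTimestamp: {.metadata.deletionTimestamp}{"\n"}replicationState: {.spec.replicationState}{"\n"}'

# List the VolumeReplication resources and check generation, state, and status
oc get volumereplication -n <workload-namespace> \
  -o custom-columns='NAME:.metadata.name,GENERATION:.metadata.generation,STATE:.spec.replicationState,STATUS:.status'

# For each such VolumeReplication, find its PVC and confirm it is being deleted
oc get volumereplication <vr-name> -n <workload-namespace> -o jsonpath='{.spec.dataSource.name}{"\n"}'
oc get pvc <pvc-name> -n <workload-namespace> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
----

To recover, the finalizers can be removed with a JSON patch; `<index>` is the position of the finalizer in the `metadata.finalizers` list (inspect the list first, for example with `-o jsonpath='{.metadata.finalizers}'`).

[source,terminal]
----
# Remove the vrg-protection finalizer from the VRG
oc patch volumereplicationgroups.ramendr.openshift.io <vrg-name> -n <workload-namespace> \
  --type=json -p '[{"op": "remove", "path": "/metadata/finalizers/<index>"}]'

# Remove the pvc-vr-protection finalizer from each affected PVC
oc patch pvc <pvc-name> -n <workload-namespace> \
  --type=json -p '[{"op": "remove", "path": "/metadata/finalizers/<index>"}]'
----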

Result: `DRPlacementControl` at the hub cluster reports the `PeerReady` condition as `true` and enables further workload relocation or failover actions.
Flags needinfo?(kramdoss)
CC kramdoss
Status POST ON_QA
krishnaram Karthick 2022-10-12 07:53:02 UTC QA Contact kramdoss prsurve
RHEL Program Management 2022-10-12 07:53:12 UTC Target Release --- ODF 4.12.0
Sidhant Agrawal 2022-11-08 12:57:55 UTC Blocks 2115507
Sunil Kumar Acharya 2022-12-08 12:55:57 UTC Flags needinfo?(srangana)
Shyamsundar 2022-12-08 16:20:54 UTC Flags needinfo?(srangana) needinfo?(olakra)
Pratik Surve 2022-12-12 06:02:53 UTC QA Contact prsurve sagrawal
Sidhant Agrawal 2022-12-26 12:09:08 UTC Status ON_QA VERIFIED
Red Hat Bugzilla 2022-12-31 19:21:12 UTC QA Contact sagrawal kramdoss
Red Hat Bugzilla 2022-12-31 19:32:31 UTC CC pdhiran
Red Hat Bugzilla 2022-12-31 19:59:56 UTC CC sseshasa
Red Hat Bugzilla 2022-12-31 20:00:24 UTC CC olakra
Red Hat Bugzilla 2022-12-31 20:04:21 UTC CC amagrawa
Red Hat Bugzilla 2022-12-31 22:37:11 UTC CC ebenahar
Red Hat Bugzilla 2023-01-01 05:47:47 UTC CC srangana
Assignee srangana nobody
Red Hat Bugzilla 2023-01-01 06:02:19 UTC CC bniver
Red Hat Bugzilla 2023-01-01 08:30:05 UTC CC bmekhiss
Red Hat Bugzilla 2023-01-01 08:31:58 UTC CC kramdoss
QA Contact kramdoss
Red Hat Bugzilla 2023-01-01 08:38:26 UTC CC nojha
Red Hat Bugzilla 2023-01-01 08:49:48 UTC CC vumrao
Alasdair Kergon 2023-01-04 04:42:51 UTC CC amagrawa
Alasdair Kergon 2023-01-04 04:47:42 UTC QA Contact sagrawal
Alasdair Kergon 2023-01-04 04:48:40 UTC CC bmekhiss
Alasdair Kergon 2023-01-04 04:52:56 UTC Assignee nobody srangana
Alasdair Kergon 2023-01-04 05:07:00 UTC CC kramdoss
Alasdair Kergon 2023-01-04 05:21:38 UTC CC nojha
Alasdair Kergon 2023-01-04 05:25:54 UTC CC olakra
Alasdair Kergon 2023-01-04 05:30:13 UTC CC pdhiran
Alasdair Kergon 2023-01-04 05:46:39 UTC CC srangana
Alasdair Kergon 2023-01-04 05:59:30 UTC CC vumrao
Alasdair Kergon 2023-01-04 06:11:25 UTC CC bniver
Alasdair Kergon 2023-01-04 06:41:59 UTC CC ebenahar
Alasdair Kergon 2023-01-04 06:56:31 UTC CC sseshasa
Erin Donnelly 2023-01-06 18:49:21 UTC Blocks 2107226
CC edonnell
Flags needinfo?(srangana)
Shyamsundar 2023-01-10 18:14:16 UTC Doc Text Cause: Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal VolumeReplicationGroup resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the PeerReady condition on the DRPlacementControl for the application as False.

Consequence: The application that was failed over or relocated cannot be relocated or failed over again, because the DRPlacementControl resource reports its PeerReady condition as false.

Fix: With this update, during deletion of the internal VolumeReplicationGroup resource, no attempt is made to protect a PVC again, thereby avoiding a stalled cleanup.

Result: DRPlacementControl reports PeerReady as True after the cleanup completes automatically
Flags needinfo?(kramdoss) needinfo?(olakra) needinfo?(srangana)
Erin Donnelly 2023-01-12 17:20:29 UTC Doc Text Cause: Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal VolumeReplicationGroup resource on the managed cluster from which a workload was failed over or relocated, an attempt is made to protect a persistent volume claim (PVC). The resulting cleanup operation does not complete and reports the PeerReady condition on the DRPlacementControl for the application as False.

Consequence: The application that was failed over or relocated cannot be relocated or failed over again, because the DRPlacementControl resource reports its PeerReady condition as false.

Fix: With this update, during deletion of the internal VolumeReplicationGroup resource, no attempt is made to protect a PVC again, thereby avoiding a stalled cleanup.

Result: DRPlacementControl reports PeerReady as True after the cleanup completes automatically
.Deleting the internal `VolumeReplicationGroup` resource on the cluster from which a workload was failed over or relocated no longer causes errors

Due to a bug in the disaster recovery (DR) reconciler, during deletion of the internal `VolumeReplicationGroup` resource on the managed cluster from which a workload was failed over or relocated, an attempt was made to protect a persistent volume claim (PVC). The resulting cleanup operation did not complete and would report the `PeerReady` condition on the `DRPlacementControl` for the application as `False`. This meant that the application that was failed over or relocated could not be relocated or failed over again, because the `DRPlacementControl` resource was reporting its `PeerReady` condition as `False`.

With this update, during deletion of the internal `VolumeReplicationGroup` resource, no attempt is made to protect a PVC again, thereby avoiding a stalled cleanup. As a result, `DRPlacementControl` reports `PeerReady` as `True` after the cleanup completes automatically.
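
To confirm the corrected behaviour, the `PeerReady` condition can be read directly from the `DRPlacementControl` resource on the hub cluster. This is a minimal sketch; `<drpc-name>` and `<drpc-namespace>` are placeholders for the actual DRPlacementControl name and namespace, and the `drpc` short name is assumed to resolve on the hub.

[source,terminal]
----
# Print the status of the PeerReady condition; "True" allows further relocation or failover
oc get drpc <drpc-name> -n <drpc-namespace> \
  -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}{"\n"}'
----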
Red Hat Bugzilla 2023-01-31 23:38:23 UTC CC madam
Rejy M Cyriac 2023-02-08 14:06:28 UTC Resolution --- CURRENTRELEASE
Status VERIFIED CLOSED
Last Closed 2023-02-08 14:06:28 UTC
Elad 2023-08-09 17:00:43 UTC CC odf-bz-bot
