Bug 2244842 - [RDR] Volume replication health shows healthy post failover while older primary cluster is still down and raises VolumeSynchronizationDelay alert for FailedOver apps
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: management-console
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.4
Assignee: gowtham
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-10-18 13:56 UTC by Aman Agrawal
Modified: 2024-05-22 04:25 UTC (History)

Fixed In Version: 4.14.4-2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-01-22 10:52:51 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-console pull 1129 0 None open Bug 2244842: [release-4.14] Fix Volume replication health shows healthy post failover 2023-12-21 06:13:19 UTC
Github red-hat-storage odf-console pull 1130 0 None open Bug 2244842: [release-4.14-compatibility] Fix Volume replication health shows healthy post failover 2023-12-14 09:12:33 UTC
Red Hat Product Errata RHBA-2024:0315 0 None None None 2024-01-22 10:52:53 UTC

Description Aman Agrawal 2023-10-18 13:56:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ODF 4.14.0-150.stable
ACM 2.9.0-DOWNSTREAM-2023-10-12-14-53-11
advanced-cluster-management.v2.9.0-187



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ApplicationSet-based workloads on an RDR setup with the DR monitoring dashboard configured.
2. Run I/O and check that sync is working.
3. Bring the primary cluster down and fail over the workload.
4. Check the Volume replication health status for the failed-over workload. The VolumeSynchronizationDelay alert keeps firing for this workload because the older primary cluster is still down and data sync is interrupted. However, Volume replication health shows healthy after failover for all VRs associated with this failed-over application (the older primary cluster remains down).

Actual results: Volume replication health shows healthy post failover while the older primary cluster is still down, even though the VolumeSynchronizationDelay alert is raised for FailedOver apps


VRG YAML for the failed-over app, taken from the new primary cluster after failover (it was the secondary earlier); the older primary cluster remains down:


amagrawa:~$ oc get vrg -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: VolumeReplicationGroup
  metadata:
    creationTimestamp: "2023-10-18T11:57:12Z"
    finalizers:
    - volumereplicationgroups.ramendr.openshift.io/vrg-protection
    generation: 1
    name: busybox-workloads-2-placement-drpc
    namespace: busybox-workloads-2
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 1391f6e15d7df48686076b110a7dea069a2484056ed94fbce4b2b1bf0562e8a6-busybox-workloads-2-placement-drpc-busybox-workloads-2-vrg-mw
      uid: 5e3b7c81-9396-4927-8d7f-bc4ebb124430
    resourceVersion: "3433435"
    uid: 255eeafd-0511-46d6-95ad-e00023fa56b5
  spec:
    action: Failover
    async:
      replicationClassSelector: {}
      schedulingInterval: 15m
      volumeSnapshotClassSelector: {}
    pvcSelector:
      matchLabels:
        appname: busybox
    replicationState: primary
    s3Profiles:
    - s3profile-amagrawa-c1-14oct-ocs-storagecluster
    - s3profile-amagrawa-c2-14oct-ocs-storagecluster
    volSync: {}
  status:
    conditions:
    - lastTransitionTime: "2023-10-18T11:58:12Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2023-10-18T11:57:17Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2023-10-18T11:57:13Z"
      message: Restored cluster data
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2023-10-18T11:57:30Z"
      message: Cluster data of all PVs are protected
      observedGeneration: 1
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    kubeObjectProtection: {}
    lastUpdateTime: "2023-10-18T12:57:21Z"
    observedGeneration: 1
    protectedPVCs:
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:17Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-1
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 117Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:17Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-2
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 143Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:58:10Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:30Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:29Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-3
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 134Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:18Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:20Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-4
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 106Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:58:12Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:29Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:29Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-5
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 115Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:57:25Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:18Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:25Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-6
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 129Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2023-10-18T11:58:10Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: DataReady
      - lastTransitionTime: "2023-10-18T11:57:29Z"
        message: 'Done uploading PV/PVC cluster data to 2 of 2 S3 profile(s): [s3profile-amagrawa-c1-14oct-ocs-storagecluster
          s3profile-amagrawa-c2-14oct-ocs-storagecluster]'
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2023-10-18T11:57:28Z"
        message: PVC in the VolumeReplicationGroup is ready for use
        observedGeneration: 1
        reason: Replicating
        status: "False"
        type: DataProtected
      csiProvisioner: openshift-storage.rbd.csi.ceph.com
      labels:
        app.kubernetes.io/instance: busybox-workloads-2-amagrawa-c2-14oct
        appname: busybox
      name: dd-io-pvc-7
      replicationID:
        id: 433f6b6b47ccde08274e4a6ae1af38e44f2a435
        modes:
        - Failover
      resources:
        requests:
          storage: 149Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      storageID:
        id: 4b746a22-e8d8-4a9c-8f3e-cc1d50e6c64f
    state: Primary
kind: List
metadata:
  resourceVersion: ""
amagrawa:~$ oc get vrg -o yaml | grep name
    name: busybox-workloads-2-placement-drpc
    namespace: busybox-workloads-2
      name: 1391f6e15d7df48686076b110a7dea069a2484056ed94fbce4b2b1bf0562e8a6-busybox-workloads-2-placement-drpc-busybox-workloads-2-vrg-mw
        appname: busybox
        appname: busybox
      name: dd-io-pvc-1
        appname: busybox
      name: dd-io-pvc-2
        appname: busybox
      name: dd-io-pvc-3
        appname: busybox
      name: dd-io-pvc-4
        appname: busybox
      name: dd-io-pvc-5
        appname: busybox
      name: dd-io-pvc-6
        appname: busybox
      name: dd-io-pvc-7
amagrawa:~$ oc get vrg -o yaml | grep sync
    async:
amagrawa:~$ oc get vrg -o yaml | grep Sync
    volSync: {}
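
The per-PVC conditions in the dump above are easier to compare when reduced to one line per PVC. The following is a minimal sketch, not part of the product: it assumes the same data is fetched as JSON (`oc get vrg -o json`) and uses a trimmed, hand-written sample that keeps only the fields it reads. The helper name `data_protected_summary` is hypothetical.

```python
import json


def data_protected_summary(vrg_list: dict) -> dict:
    """Map each protected PVC name to the (status, reason) of its DataProtected condition."""
    summary = {}
    for vrg in vrg_list.get("items", []):
        for pvc in vrg.get("status", {}).get("protectedPVCs", []):
            result = ("Unknown", "")  # used when no DataProtected condition is present
            for cond in pvc.get("conditions", []):
                if cond.get("type") == "DataProtected":
                    result = (cond.get("status", "Unknown"), cond.get("reason", ""))
            summary[pvc["name"]] = result
    return summary


# Trimmed stand-in for `oc get vrg -o json`; only two of the seven PVCs are kept.
SAMPLE = json.loads("""
{
  "items": [{
    "status": {
      "protectedPVCs": [
        {"name": "dd-io-pvc-1",
         "conditions": [
           {"type": "DataReady", "status": "True", "reason": "Ready"},
           {"type": "DataProtected", "status": "False", "reason": "Replicating"}
         ]},
        {"name": "dd-io-pvc-2",
         "conditions": [
           {"type": "DataReady", "status": "True", "reason": "Ready"},
           {"type": "DataProtected", "status": "False", "reason": "Replicating"}
         ]}
      ]
    }
  }]
}
""")

if __name__ == "__main__":
    for name, (status, reason) in data_protected_summary(SAMPLE).items():
        print(f"{name}: DataProtected={status} ({reason})")
```

For the VRG shown above, every PVC reports DataProtected as "False" with reason Replicating, which matches the interrupted sync described in the reproduction steps.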


Expected results: Volume replication health should show Critical post failover while the older primary cluster is still down and the VolumeSynchronizationDelay alert is raised for FailedOver apps
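
To make the expected behaviour concrete, here is a rough sketch of a health classification that would report Critical in this scenario. It is an illustration only and is not taken from the odf-console fix: the function names, the threshold multiplier of 3, the Warning band, and the assumption that the interval is written as a number followed by m, h, or d (like the `15m` in the VRG spec) are all hypothetical. The one point it is meant to show is that a missing last-sync timestamp (the VRG output above contains none; `grep sync` matches only `async:`) is treated as Critical rather than Healthy.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold: number of scheduling intervals without a sync
# after which replication is considered critical.
CRITICAL_MULTIPLIER = 3


def parse_interval(interval: str) -> timedelta:
    """Parse an interval such as '15m', '1h' or '2d' (format is an assumption)."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    value, unit = int(interval[:-1]), interval[-1]
    return timedelta(**{units[unit]: value})


def replication_health(last_sync: Optional[datetime], interval: str, now: datetime) -> str:
    """Classify replication health from the time elapsed since the last sync."""
    if last_sync is None:
        # No sync has been recorded, so do not report Healthy by default.
        return "Critical"
    missed = (now - last_sync) / parse_interval(interval)
    if missed >= CRITICAL_MULTIPLIER:
        return "Critical"
    if missed >= CRITICAL_MULTIPLIER - 1:
        return "Warning"
    return "Healthy"


if __name__ == "__main__":
    # `now` reuses lastUpdateTime from the VRG status above.
    now = datetime(2023, 10, 18, 12, 57, 21, tzinfo=timezone.utc)
    print(replication_health(None, "15m", now))
    print(replication_health(now - timedelta(minutes=7), "15m", now))
```

With a 15m scheduling interval, this sketch reports Healthy while the last sync is under 30 minutes old, Warning from 30 to 45 minutes, and Critical from 45 minutes onward or when no sync time is available.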


Additional info:

Comment 9 krishnaram Karthick 2023-12-15 06:06:36 UTC
Moving the bug to 4.14.4, as we are doing a quick 4.14.3 release to include a critical RGW fix (2254303) before the shutdown.

Comment 18 errata-xmlrpc 2024-01-22 10:52:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.4 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0315

Comment 19 Red Hat Bugzilla 2024-05-22 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

