Bug 2247542 - [RDR] [Hub-recovery] Failover didn't succeed for cephfs backed workloads
Summary: [RDR] [Hub-recovery] Failover didn't succeed for cephfs backed workloads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Shyamsundar
QA Contact: Aman Agrawal
URL:
Whiteboard: verification-blocked
Depends On:
Blocks:
 
Reported: 2023-11-01 21:53 UTC by Aman Agrawal
Modified: 2024-03-19 15:28 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:28:14 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:28:15 UTC)

Description Aman Agrawal 2023-11-01 21:53:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-30-170011
advanced-cluster-management.v2.9.0-188 
ODF 4.14.0-157
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Submariner brew.registry.redhat.io/rh-osbs/iib:607438


Does this issue impact your ability to continue to work with the product
(please explain the user impact in detail)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On a hub recovery RDR setup, ensure backups are being created on both the active and passive hub clusters. Fail over and relocate different workloads so that they are running on the primary managed cluster after the failover and relocate operations complete. Ensure the latest backups are taken and that no action on any of the workloads (CephFS or RBD, AppSet or Subscription type) is in progress.
2. Collect the drpc status (see the note on the drpc shorthand right after these steps). Bring the primary managed cluster down, and then bring the active hub down.
3. Ensure the secondary managed cluster is properly imported on the passive hub and that the DRPolicy gets validated.
4. Check the drpc status from the passive hub and compare it with the output taken from the active hub while it was up. We notice that post hub recovery a sanity check is run for all the workloads that were failed over or relocated: the same action that was performed from the active hub is performed again, which marks PEER READY as false for those workloads.
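
Note: the drpc shorthand used in the outputs below is a shell alias on the test system; a plausible equivalent, assuming the printer columns defined by the DRPC CRD (the wide-priority columns include PROGRESSION and PEER READY), is:

$ oc get drpc --all-namespaces -o wide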

From active hub-

NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T17:54:21Z   30.282249722s        True
busybox-workloads-5   subscription-rbd1-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T13:57:37Z   47m3.364814169s      True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T14:16:28Z   3h17m50.318760845s   True
openshift-gitops      appset-cephfs-placement-drpc           9h    amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     Completed     2023-11-01T13:20:45Z   5m59.4021061s        True
openshift-gitops      appset-rbd1-placement-drpc             9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T14:15:30Z   41m2.588884417s      True
openshift-gitops      appset-rbd2-placement-drpc             9h    amagrawa-passivee                                      Deployed       Completed                                                 True


From passive hub-

amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                           START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   57m   amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocating                                           2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-01T18:59:36Z              False
busybox-workloads-6   subscription-rbd2-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-passivee   Relocate                                                                                              True
openshift-gitops      appset-cephfs-placement-drpc           57m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                                    True
openshift-gitops      appset-rbd1-placement-drpc             57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    FailingOverToCluster                  2023-11-01T18:59:36Z              False
openshift-gitops      appset-rbd2-placement-drpc             57m   amagrawa-passivee                                      Deployed       Completed                                                               True


Since PEER READY is now marked as false due to the sanity check, subscription-cephfs-placement-1-drpc, subscription-rbd1-placement-1-drpc, and appset-rbd1-placement-drpc cannot be failed over in this example.

This sanity check is needed as per Kubernetes recommended guidelines, and we should not back up the CURRENTSTATE of the workloads (as confirmed by @bmekhiss), so the issue will always persist.

As of now, the only option is to trigger a failover by editing the drpc YAML (which would be addressed by BZ2247537).
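
For completeness, a minimal sketch of triggering that failover from the CLI, using the spec.action and spec.failoverCluster fields shown in the DRPC YAML further below (running oc edit on the same fields is equivalent):

$ oc patch drpc subscription-cephfs-placement-1-drpc -n busybox-workloads-2 \
    --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-passivee"}}'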

So all these apps were failed over via the CLI to the available secondary managed cluster, but the failover didn't succeed for the RBD backed workloads because the VolumeReplicationClass was not backed up (or got deleted).

Benamar tried a workaround (WA) that created the VolumeReplicationClass on the available secondary managed cluster.
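
A minimal sketch of such a VolumeReplicationClass, assuming the usual ODF RBD provisioner and secret names (the exact values used in the workaround are not captured in this report):

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass    # illustrative name
spec:
  provisioner: openshift-storage.rbd.csi.ceph.com
  parameters:
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: openshift-storage
    schedulingInterval: 5m            # must match the DRPolicy scheduling interval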

This helped the failover proceed and created the workload pods, but not the VRs for the RBD backed workloads, so the VRG CURRENTSTATE couldn't be marked as Primary. We need VRs to be created for RBD backed workloads, so the workaround didn't work as expected, which is noted in https://bugzilla.redhat.com/show_bug.cgi?id=2246084#c8.
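
For illustration, whether the VRs were created can be confirmed on the failover cluster with the following; one VR per protected PVC is expected once the VRG reaches Primary:

$ oc get volumereplication -A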

Since a VolumeReplicationClass is not needed for CephFS based workloads, the failover there didn't succeed for some other reason, which is under RCA; hence this separate BZ is being opened for it.
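
(CephFS workloads are protected via VolSync rather than VolumeReplication, so no VolumeReplicationClass or VR objects are involved. As an illustrative check of their sync state on the managed cluster, assuming the standard VolSync CRDs:

$ oc get replicationsources,replicationdestinations -n busybox-workloads-2
)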


From passive hub after triggering failover from CLI-

amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailingOver    WaitingForResourceRestore   2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T20:12:09Z              True
openshift-gitops      appset-cephfs-placement-drpc           3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                          True
openshift-gitops      appset-rbd1-placement-drpc             3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
openshift-gitops      appset-rbd2-placement-drpc             3h21m   amagrawa-passivee                                      Deployed       Completed                                                     True


From the available secondary managed cluster to which the failover was triggered-

amagrawa:~$ busybox-2
Already on project "busybox-workloads-2" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".
NAME                                                                               DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-cephfs-placement-1-drpc   primary     

amagrawa:~$ oc describe vrg
Events:
  Type     Reason           Age                 From                               Message
  ----     ------           ----                ----                               -------
  Warning  VrgUploadFailed  11s (x33 over 91m)  controller_VolumeReplicationGroup  failed to upload data of odrbucket-93489a7b9ef9:busybox-workloads-2/subscription-cephfs-placement-1-drpc/v1alpha1.VolumeReplicationGroup/a, RequestError: send request failed
caused by: Put "https://s3-openshift-storage.apps.amagrawa-31o-prim.qe.rh-ocs.com/odrbucket-93489a7b9ef9/busybox-workloads-2/subscription-cephfs-placement-1-drpc/v1alpha1.VolumeReplicationGroup/a": dial tcp 10.19.98.14:443: i/o timeout
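
The endpoint in this error is the s3 store on the down primary cluster (amagrawa-31o-prim), so the i/o timeout is expected while that cluster is unreachable. For illustration, a quick reachability check of that endpoint:

$ curl -k -m 10 https://s3-openshift-storage.apps.amagrawa-31o-prim.qe.rh-ocs.com

The s3 profiles the VRG uploads to are defined in the Ramen operator config on the managed cluster; assuming the default operator namespace, they can be inspected with:

$ oc get configmap ramen-dr-cluster-operator-config -n openshift-dr-system -o yaml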


From passive hub during this time-

amagrawa:~$ oc get drpc -o yaml -n busybox-workloads-2
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPlacementControl
  metadata:
    annotations:
      drplacementcontrol.ramendr.openshift.io/app-namespace: busybox-workloads-2
      drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-31o-prim
    creationTimestamp: "2023-11-01T18:05:45Z"
    finalizers:
    - drpc.ramendr.openshift.io/finalizer
    generation: 2
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231101180053
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231101180053
    name: subscription-cephfs-placement-1-drpc
    namespace: busybox-workloads-2
    ownerReferences:
    - apiVersion: cluster.open-cluster-management.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Placement
      name: subscription-cephfs-placement-1
      uid: 0ca2e6ac-5942-43f9-8c55-272b1b70a919
    resourceVersion: "1079179"
    uid: b7024c21-2bdf-4f43-8577-db3505a89104
  spec:
    action: Failover
    drPolicyRef:
      apiVersion: ramendr.openshift.io/v1alpha1
      kind: DRPolicy
      name: my-drpolicy-10
    failoverCluster: amagrawa-passivee
    placementRef:
      apiVersion: cluster.open-cluster-management.io/v1beta1
      kind: Placement
      name: subscription-cephfs-placement-1
      namespace: busybox-workloads-2
    preferredCluster: amagrawa-31o-prim
    pvcSelector:
      matchLabels:
        appname: busybox_app1_cephfs
  status:
    actionStartTime: "2023-11-01T18:59:35Z"
    conditions:
    - lastTransitionTime: "2023-11-01T20:10:24Z"
      message: Waiting for App resources to be restored...)
      observedGeneration: 2
      reason: FailingOver
      status: "False"
      type: Available
    - lastTransitionTime: "2023-11-01T20:10:24Z"
      message: Started failover to cluster "amagrawa-passivee"
      observedGeneration: 2
      reason: NotStarted
      status: "False"
      type: PeerReady
    lastUpdateTime: "2023-11-01T20:22:55Z"
    phase: FailingOver
    preferredDecision:
      clusterName: amagrawa-31o-prim
      clusterNamespace: amagrawa-31o-prim
    progression: WaitingForResourceRestore
    resourceConditions:
      resourceMeta:
        generation: 0
        kind: ""
        name: ""
        namespace: ""
kind: List
metadata:
  resourceVersion: ""
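
The PeerReady and Available conditions above are what gate further DR actions. To watch just these while the failover progresses, an illustrative jsonpath query (standard oc syntax, not specific to this setup):

$ oc get drpc subscription-cephfs-placement-1-drpc -n busybox-workloads-2 \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'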



Actual results: Failover didn't succeed for cephfs backed workloads


Logs are being uploaded here (collected a few hours after triggering failover from CLI)-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-2/

Expected results: Failover should complete without cleanup while the older primary cluster is still down; it should eventually clean up and mark the VRG as Secondary when the older primary cluster becomes reachable, and data sync should resume as expected.


Additional info:

Comment 5 Mudit Agarwal 2023-11-07 11:43:06 UTC
Moving hub recovery issues out to 4.15 based on offline discussion.

Comment 11 Mudit Agarwal 2024-01-25 04:14:01 UTC
BZ2258351 is ON_QA now, please retry

Comment 17 errata-xmlrpc 2024-03-19 15:28:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

