Bug 2250152

Summary: [RDR] [Hub recovery] Sync for all cephfs workloads stopped post hub recovery
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: odf-dr Assignee: Benamar Mekhissi <bmekhiss>
odf-dr sub component: ramen QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: bmekhiss, muagarwa
Version: 4.14   
Target Milestone: ---   
Target Release: ODF 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.15.0-102 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2251205 Environment:
Last Closed: 2024-03-19 15:29:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2251205    

Description Aman Agrawal 2023-11-16 19:05:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2        
odf-multicluster-orchestrator.v4.14.1-rhodf  
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

**Active hub at neutral site**

1. Deployed multiple rbd and cephfs backed workloads of both appset and subscription types.
2. Failed over and relocated them so that they end up running on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster). A few of them are exceptions; check the drpc -o wide status in Step 3.
3. Ensure that we have the workloads in distinct states like deployed, failedover, relocated etc.

Here amagrawa-10n-1 is the C1 primary managed cluster:

From the active hub:

amagrawa:hub$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    7h18m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:54:29Z   5m59.196575462s   True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    7h17m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T12:12:36Z   5m58.842880173s   True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    7h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed     2023-11-16T08:29:07Z   3m19.098202668s   True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        7h35m   amagrawa-10n-1                                      Deployed       Completed                                              True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        7h34m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:53:38Z   9m59.85663627s    True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        7h32m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:21:05Z   4m13.272955733s   True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   7h22m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:15:50Z   3m22.540081438s   True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   7h20m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:00:32Z   5m38.794985745s   True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    7h24m   amagrawa-10n-2                       Relocate       Relocated      Completed     2023-11-16T08:28:59Z   8m47.541429779s   True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       7h43m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:16:14Z   8m31.330049487s   True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       7h42m   amagrawa-10n-1                       Relocate       Relocated      Completed     2023-11-16T08:16:28Z   7m59.477897296s   True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       7h41m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:27:18Z   7m4.760183798s    True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       7h39m   amagrawa-10n-1                                      Deployed       Completed                                              True

4. Let at least 1 or 2 new backups be taken (one every hour) covering all the different workload states (when progression is Completed and no action is in progress on any workload). Also ensure that sync is working for all the workloads while on the active hub and that the cluster is healthy. Note the drpc -o wide output and lastGroupSyncTime, download the backups from S3, etc.

amagrawa:hub$ group|grep SyncTime
    lastGroupSyncTime: "2023-11-16T14:01:32Z"
    lastGroupSyncTime: "2023-11-16T14:06:09Z"
    lastGroupSyncTime: "2023-11-16T14:01:03Z"
    lastGroupSyncTime: "2023-11-16T13:45:09Z"
    lastGroupSyncTime: "2023-11-16T13:50:51Z"
    lastGroupSyncTime: "2023-11-16T13:50:40Z"
    lastGroupSyncTime: "2023-11-16T14:00:51Z"
    lastGroupSyncTime: "2023-11-16T14:06:12Z"
    lastGroupSyncTime: "2023-11-16T13:01:45Z"
    lastGroupSyncTime: "2023-11-16T13:50:36Z"
    lastGroupSyncTime: "2023-11-16T13:45:16Z"
    lastGroupSyncTime: "2023-11-16T13:56:22Z"
    lastGroupSyncTime: "2023-11-16T13:45:11Z"


amagrawa:hub$ date -u
Thursday 16 November 2023 02:12:11 PM UTC
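
For reference, the sync lag implied by the output above can be worked out with a small script. This is only a minimal Python sketch and was not part of the original test run; the timestamps and the `date -u` value are copied from the output above, while the helper names (`parse_utc`, `sync_ages`) are made up for illustration.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp such as 2023-11-16T14:01:32Z into an aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sync_ages(stamps, now):
    """Return (timestamp, age) pairs, with the oldest sync first."""
    pairs = [(stamp, now - parse_utc(stamp)) for stamp in stamps]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# lastGroupSyncTime values copied from the active hub output above.
LAST_GROUP_SYNC_TIMES = [
    "2023-11-16T14:01:32Z", "2023-11-16T14:06:09Z", "2023-11-16T14:01:03Z",
    "2023-11-16T13:45:09Z", "2023-11-16T13:50:51Z", "2023-11-16T13:50:40Z",
    "2023-11-16T14:00:51Z", "2023-11-16T14:06:12Z", "2023-11-16T13:01:45Z",
    "2023-11-16T13:50:36Z", "2023-11-16T13:45:16Z", "2023-11-16T13:56:22Z",
    "2023-11-16T13:45:11Z",
]

# `date -u` above printed 02:12:11 PM UTC on 16 November 2023.
NOW = parse_utc("2023-11-16T14:12:11Z")

for stamp, age in sync_ages(LAST_GROUP_SYNC_TIMES, NOW):
    print(f"{stamp}  last synced {age} ago")
```

Sorting by age puts the workload that has gone longest without a sync at the top, which is the one to watch before taking the hub down.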

5. Bring the active hub completely down and move to the passive hub. Restore the backups and ensure the velero backup reports successful restoration. Make sure both the managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for the drpc to be restored, then check whether all the workloads are in their last backed-up state.

They seem to have retained the last state that was backed up. So everything is fine so far.

amagrawa:~$ drpc
NAMESPACE              NAME                                                AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME   DURATION   PEER READY
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-13   cephfs-sub-busybox-workloads-13-placement-1-drpc    4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
busybox-workloads-14   cephfs-sub-busybox-workloads-14-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-2    Failover       FailedOver     Completed                             True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        4h16m   amagrawa-10n-1                                      Deployed       Completed                             True
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-10-placement-drpc   4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-11-placement-drpc   4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       cephfs-appset-busybox-workloads-9-placement-drpc    4h16m   amagrawa-10n-2                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       4h16m   amagrawa-10n-1                       Relocate       Relocated      Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       4h16m   amagrawa-10n-2     amagrawa-10n-1    Failover       FailedOver     Completed                             True
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       4h16m   amagrawa-10n-1                                      Deployed       Completed                             True
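
The comparison in step 6 was done by eye. A minimal Python sketch of the same check is shown below, assuming the `drpc` output from each hub has been saved as text. It is not part of the original report. The helper names are hypothetical, and only two rows from each table above are included to keep it short. The CURRENTSTATE column is picked out by matching its known values, because empty columns make a plain whitespace split unreliable.

```python
import re

# CURRENTSTATE values seen in the tables above; DESIREDSTATE uses the
# different spellings "Failover" and "Relocate", so these do not collide.
STATE_RE = re.compile(r"\b(Deployed|FailedOver|Relocated)\b")


def current_states(table_text):
    """Map DRPC name -> CURRENTSTATE from drpc-style table output."""
    states = {}
    for line in table_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "NAMESPACE":
            continue  # skip the header row and blank lines
        match = STATE_RE.search(line)
        if match:
            states[fields[1]] = match.group(1)
    return states


def changed_states(before, after):
    """Return DRPCs whose state differs, or that are missing after restore."""
    return {
        name: (state, after.get(name))
        for name, state in before.items()
        if after.get(name) != state
    }


# Two rows copied from the active hub output (before recovery).
BEFORE = """
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    7h18m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed     2023-11-16T08:54:29Z   5m59.196575462s   True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        7h35m   amagrawa-10n-1                                      Deployed       Completed                                              True
"""

# The same two rows copied from the passive hub output (after recovery).
AFTER = """
busybox-workloads-12   cephfs-sub-busybox-workloads-12-placement-1-drpc    4h16m   amagrawa-10n-1     amagrawa-10n-1    Failover       FailedOver     Completed                             True
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        4h16m   amagrawa-10n-1                                      Deployed       Completed                             True
"""

print(changed_states(current_states(BEFORE), current_states(AFTER)))
```

An empty result means every DRPC kept its backed-up state, which matches the observation above.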

7. Let IOs continue for a few hours. We observed that data sync for the rbd based workloads was progressing just fine, but sync stopped for all the cephfs based workloads, whether of subscription or appset type.
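
One way to spot a stalled sync like this is to sample lastGroupSyncTime twice, some hours apart, and flag every DRPC whose value did not move forward. The following is a minimal Python sketch of that idea, not the method used in this report. The DRPC names and all timestamps in it are hypothetical placeholders, since the post-recovery values were not recorded here.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp such as 2023-11-16T14:01:32Z into an aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stalled(earlier, later):
    """Return names whose lastGroupSyncTime did not advance between samples."""
    return sorted(
        name
        for name, stamp in earlier.items()
        if name not in later or parse_utc(later[name]) <= parse_utc(stamp)
    )


# Hypothetical samples taken a few hours apart (placeholder names and times).
SAMPLE_1 = {
    "cephfs-example-drpc": "2023-11-16T14:06:09Z",
    "rbd-example-drpc": "2023-11-16T14:01:32Z",
}
SAMPLE_2 = {
    "cephfs-example-drpc": "2023-11-16T14:06:09Z",  # unchanged: sync stalled
    "rbd-example-drpc": "2023-11-16T16:01:40Z",     # advanced: sync healthy
}

print(stalled(SAMPLE_1, SAMPLE_2))
```

A DRPC that is missing from the second sample is also reported, since a value that disappears is no evidence of a completed sync.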


Actual results: Sync for all cephfs workloads stopped post hub recovery.

Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/16nov23/logs/

The VolumeSynchronizationDelay alert fires on the passive hub for all cephfs workloads when the monitoring label is applied.

Expected results: Sync for all cephfs workloads should continue without any issues post hub recovery.


Additional info:

Comment 12 errata-xmlrpc 2024-03-19 15:29:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

Comment 13 Red Hat Bugzilla 2024-07-18 04:25:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days