Bug 2258560 - [RDR] [Hub recovery] [Co-situated] MCO didn’t create VRCs after hub recovery which hinders failover operation [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Vineet
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-01-16 08:25 UTC by Aman Agrawal
Modified: 2024-03-19 15:31 UTC (History)

Fixed In Version: 4.15.0-130
Doc Type: Known Issue
Doc Text:
Cause: ClusterClaims are not refreshed after hub recovery.
Consequence: The FSID is not present in the ManagedCluster CRs after hub recovery.
Workaround (if any): Restart the OCS operator and MCO on the hub after hub recovery.
Result: The VRCs are created once both have been restarted.
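
A minimal sketch of how the consequence described in the Doc Text could be checked, assuming the ManagedCluster CRs have already been fetched and loaded as dictionaries. It relies on the `status.clusterClaims` list of `name`/`value` entries; the claim name `cephfsid.odf.openshift.io` and the sample values are assumptions made for illustration and are not taken from this bug.

```python
# Assumed claim name; adjust it to whatever claim carries the Ceph FSID in your setup.
ASSUMED_FSID_CLAIM = "cephfsid.odf.openshift.io"


def clusters_missing_fsid(managed_clusters, claim_name=ASSUMED_FSID_CLAIM):
    """Return the names of ManagedCluster objects that lack a non-empty FSID claim."""
    missing = []
    for mc in managed_clusters:
        claims = mc.get("status", {}).get("clusterClaims") or []
        has_fsid = any(
            claim.get("name") == claim_name and claim.get("value")
            for claim in claims
        )
        if not has_fsid:
            missing.append(mc["metadata"]["name"])
    return missing


# Hypothetical objects: one cluster reports the claim, the other does not.
SAMPLE_CLUSTERS = [
    {
        "metadata": {"name": "cluster-a"},
        "status": {"clusterClaims": [{"name": ASSUMED_FSID_CLAIM, "value": "example-fsid"}]},
    },
    {"metadata": {"name": "cluster-b"}, "status": {"clusterClaims": []}},
]

print(clusters_missing_fsid(SAMPLE_CLUSTERS))
```

A non-empty result after hub recovery would match the condition the workaround is meant to clear.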
Clone Of:
Environment:
Last Closed: 2024-03-19 15:31:26 UTC
Embargoed:
sheggodu: needinfo? (vbadrina)


Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-multicluster-orchestrator pull 191 0 None open Bug 2258560: [release-4.15] Fetches cluster fsid through secret and not mc 2024-02-01 06:27:22 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:31:27 UTC

Description Aman Agrawal 2024-01-16 08:25:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-01-03-015912
ACM GA'ed 2.9.1
ODF 4.15.0-104
ceph version 17.2.6-167.el9cp (5ef1496ea3e9daaa9788809a172bd5a1c3192cf7) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Active hub co-situated with primary managed cluster

1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Fail over and relocate different workloads so that they are finally running on the primary managed cluster once the failover and relocate operations complete. Ensure the latest backups are taken and that no action is in progress on any of the workloads (CephFS and RBD, appset and subscription types, each in a distinct state such as Deployed, FailedOver, and Relocated).
Also have a few workloads running on the secondary managed cluster.
2. Collect the drpc status from the active hub:

amagrawa:hub$ drpc
NAMESPACE              NAME                                     AGE     PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc     8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:18:52Z   2m55.286783085s   True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc       7h11m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:17:04Z   2.056054748s      True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc       7h10m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:17:49Z   21.043985378s     True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc       7h8m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T09:19:19Z   88.077165ms       True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    7h4m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T09:23:37Z   52.119470621s     True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc        8d      amagrawa-c1-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:19:42Z   6m5.284279162s    True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        4d5h    amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:19:49Z   2m58.152248084s   True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc     7h18m   amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:18:59Z   2m53.478551336s   True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc     7h16m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:11:18Z   32.120615573s     True
openshift-gitops       cephfs-appset-busybox16-placement-drpc   7h3m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:19:33Z   2m37.50568738s    True
openshift-gitops       cephfs-appset-busybox3-placement-drpc    4d6h    amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Completed     2024-01-12T12:22:54Z   2m36.257186541s   True
openshift-gitops       cephfs-appset-busybox8-placement-drpc    7h14m   amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:19:20Z   4m10.339668753s   True
openshift-gitops       cephfs-appset-busybox9-placement-drpc    7h13m   amagrawa-c1-3jan                                     Deployed       Completed     2024-01-12T09:15:06Z   32.175780774s     True
openshift-gitops       rbd-appset-busybox13-placement-drpc      7h6m    amagrawa-c1-3jan                      Relocate       Relocated      Completed     2024-01-12T12:20:02Z   6m7.188328151s    True
openshift-gitops       rbd-appset-busybox14-placement-drpc      7h5m    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:20:12Z   5m29.864938194s   True
openshift-gitops       rbd-appset-busybox17-placement-drpc      4h3m    amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T12:24:43Z   15.046600381s     True
openshift-gitops       rbd-appset-busybox4-placement-drpc       4d6h    amagrawa-c2-3jan   amagrawa-c1-3jan   Failover       FailedOver     Completed     2024-01-12T12:01:19Z   8m27.272353624s   True

Ensure data sync is progressing well. Perform a site failure, that is, bring down the primary managed cluster along with the active hub.
3. Ensure the secondary managed cluster is properly imported on the passive hub and wait for the DRPolicy to be validated.
4. On checking the drpc from the passive hub, the Progression state was found to differ between workloads, depending on the last known state of each individual workload.

amagrawa:acm$ drpc
NAMESPACE              NAME                                     AGE   PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc     39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc       39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc       39m   amagrawa-c1-3jan                                                    Paused                                                       True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc       39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    39m   amagrawa-c2-3jan                                     Deployed       EnsuringVolSyncSetup   2024-01-12T16:53:31Z                  True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc        39m   amagrawa-c1-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc     39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc     39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       cephfs-appset-busybox16-placement-drpc   39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox3-placement-drpc    39m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up                                                  True
openshift-gitops       cephfs-appset-busybox8-placement-drpc    39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       cephfs-appset-busybox9-placement-drpc    39m   amagrawa-c1-3jan                                                    Paused                                                       True
openshift-gitops       rbd-appset-busybox13-placement-drpc      39m   amagrawa-c1-3jan                      Relocate                      Paused                                                       True
openshift-gitops       rbd-appset-busybox14-placement-drpc      39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox17-placement-drpc      39m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:32Z   1.263215882s   True
openshift-gitops       rbd-appset-busybox4-placement-drpc       39m   amagrawa-c2-3jan   amagrawa-c1-3jan   Failover                      Paused                                                       True

5. Since the primary managed cluster is still down, data sync can't progress. Now perform failover of all the workloads that were running on the down cluster to the secondary (failover) cluster and track its progress.

After failover from C1 to C2:

amagrawa:acm$ drpc
NAMESPACE              NAME                                     AGE    PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION            START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc     146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up            2024-01-12T18:31:43Z                  False
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc       146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc       146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc       146m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    146m   amagrawa-c2-3jan                                     Deployed       EnsuringVolSyncSetup   2024-01-12T16:53:31Z                  True
busybox-workloads-2    rbd-sub-busybox2-placement-1-drpc        146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-5    rbd-sub-busybox5-placement-1-drpc        146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc     146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
busybox-workloads-7    cephfs-sub-busybox7-placement-1-drpc     146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       cephfs-appset-busybox16-placement-drpc   146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up            2024-01-12T18:32:17Z                  False
openshift-gitops       cephfs-appset-busybox3-placement-drpc    146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up                                                  True
openshift-gitops       cephfs-appset-busybox8-placement-drpc    146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       cephfs-appset-busybox9-placement-drpc    146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       False
openshift-gitops       rbd-appset-busybox13-placement-drpc      146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox14-placement-drpc      146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True
openshift-gitops       rbd-appset-busybox17-placement-drpc      146m   amagrawa-c2-3jan                                     Deployed       Completed              2024-01-12T16:52:32Z   1.263215882s   True
openshift-gitops       rbd-appset-busybox4-placement-drpc       146m   amagrawa-c2-3jan   amagrawa-c2-3jan   Failover                      Paused                                                       True



Cluster C2, named amagrawa-c2-3jan, is the secondary (failover) cluster, which is up and running, while C1 (from the logs) is down.

Actual results: From the above drpc output, failover did not even start for any of the RBD workloads. It did start for a few CephFS workloads, but not for the CephFS workloads under NS busybox-workloads-7, 8 and 9.
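
The reading of the drpc output above can be sketched as a small script. This is a rough heuristic, not part of the original report: it assumes the plain-text table format shown above and treats a row as stalled when it contains the token `Failover` (the desired state) but not `FailedOver` (the current state). The function name and the abbreviated sample rows are illustrative.

```python
def stalled_failovers(drpc_output: str) -> list[str]:
    """List namespace/name of DRPCs that request Failover but never reached FailedOver."""
    stalled = []
    for line in drpc_output.splitlines():
        tokens = line.split()
        # Skip blank lines and the header row.
        if len(tokens) < 2 or tokens[0] == "NAMESPACE":
            continue
        # Empty cells vanish when splitting on whitespace, so test for tokens, not positions.
        if "Failover" in tokens and "FailedOver" not in tokens:
            stalled.append(f"{tokens[0]}/{tokens[1]}")
    return stalled


# Four rows abbreviated from the post-failover output above.
SAMPLE = """\
NAMESPACE              NAME                                   AGE    PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION       PEER READY
busybox-workloads-1    cephfs-sub-busybox1-placement-1-drpc   146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover       FailedOver     Cleaning Up   2024-01-12T18:31:43Z                  False
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc     146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                              True
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc     146m   amagrawa-c2-3jan                                     Deployed       Completed     2024-01-12T16:52:33Z   963.058163ms   True
busybox-workloads-6    cephfs-sub-busybox6-placement-1-drpc   146m   amagrawa-c1-3jan   amagrawa-c2-3jan   Failover                      Paused                                              False
"""

for name in stalled_failovers(SAMPLE):
    print(name)
```

Token matching is used instead of column offsets because empty cells make positional parsing fragile when the output is pasted into a report.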


Logs from before performing hub recovery are kept here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/12jan24-active-415/

Logs from the passive hub after performing failover: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/13jan24-after-failover/



Expected results: Failover should progress, all the workload pods should be up and running on the failover cluster, and both VRG states should be marked as Primary.
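
The VRG part of the expected result could be verified along these lines. This is a sketch under assumptions: it takes VRG objects already loaded as dictionaries and assumes the two states live in `spec.replicationState` (desired) and `status.state` (current), compared case-insensitively. Those field paths and the sample objects are assumptions and are not confirmed by this report.

```python
def vrgs_not_primary(vrgs):
    """Return names of VRGs whose desired or current state is not Primary."""
    offenders = []
    for vrg in vrgs:
        # Assumed field paths; compared in lowercase to tolerate casing differences.
        desired = str(vrg.get("spec", {}).get("replicationState", "")).lower()
        current = str(vrg.get("status", {}).get("state", "")).lower()
        if desired != "primary" or current != "primary":
            offenders.append(vrg["metadata"]["name"])
    return offenders


# Hypothetical objects: the first is fully Primary, the second has no current state yet.
SAMPLE_VRGS = [
    {
        "metadata": {"name": "vrg-ok"},
        "spec": {"replicationState": "primary"},
        "status": {"state": "Primary"},
    },
    {
        "metadata": {"name": "vrg-pending"},
        "spec": {"replicationState": "primary"},
        "status": {},
    },
]

print(vrgs_not_primary(SAMPLE_VRGS))
```

An empty list on the failover cluster would correspond to the expected result stated above.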


Additional info:

Comment 11 errata-xmlrpc 2024-03-19 15:31:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

