Bug 2252116 - [4.14 clone][RDR] [Hub recovery] Failover of rbd workloads didn't proceed after drpc reporting WaitForStorageMaintenanceActivation
Summary: [4.14 clone][RDR] [Hub recovery] Failover of rbd workloads didn't proceed after drpc reporting WaitForStorageMaintenanceActivation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.1
Assignee: umanga
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 2251022
Blocks:
 
Reported: 2023-11-29 15:25 UTC by umanga
Modified: 2023-12-07 13:21 UTC
8 users

Fixed In Version: 4.14.1-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2251022
Environment:
Last Closed: 2023-12-07 13:21:40 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-multicluster-orchestrator pull 186 0 None Merged Bug 2252116: [release-4.14] validate if cluster FSID is empty 2023-11-30 07:39:16 UTC
Red Hat Product Errata RHBA-2023:7696 0 None None None 2023-12-07 13:21:42 UTC

Description umanga 2023-11-29 15:25:43 UTC
+++ This bug was initially created as a clone of Bug #2251022 +++

Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2        
odf-multicluster-orchestrator.v4.14.1-rhodf  
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

**Active hub at neutral site**

We move to the passive hub after the active hub goes down, continue running IOs for a few days, and then another disaster occurs in which the primary managed cluster goes down. We therefore need to fail over those workloads to the failover cluster using the passive hub.

1. Deploy multiple rbd- and cephfs-backed workloads of both appset and subscription types.
2. Fail over and relocate them in such a way that they finally run on the primary managed cluster (which is expected to host all the workloads and may be hit by the disaster).
3. Ensure that the workloads are in distinct states like Deployed, FailedOver, Relocated, etc.
4. Let at least 1 or 2 of the latest backups be taken (one every hour) for all the different states of the workloads (when progression is completed and no action is running on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc.
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the velero backup reports successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for drpc to be restored and check whether all the workloads are in their last backed-up state. They seem to have retained the last state that was backed up, so everything is fine so far.
7. Let IOs continue for a few days (3-4 days in this case). Data sync for rbd-based workloads was progressing just fine. Bring down the primary managed cluster (shut down all nodes) and wait for the cluster status to change on the RHACM console.
8. Trigger failover for all rbd workloads to the secondary failover cluster via the ACM UI of the passive hub and check the progress (see the monitoring sketch after these steps).

(Older hub remains down forever and is completely unreachable).
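
For reference, a rough way to monitor this from the CLI (the drpc/mm/mmyaml/pods outputs below presumably come from shell aliases that are not shown in this report, so the exact commands here are an assumption; the rbd-mirror deployment name and namespace are inferred from the pod listing further down):

# From the (passive) hub: failover progression per workload
$ oc get drpc -A -o wide

# From the failover managed cluster C2: MaintenanceMode activation and the rbd-mirror deployment
$ oc get maintenancemodes.ramendr.openshift.io
$ oc get maintenancemodes.ramendr.openshift.io -o yaml
$ oc -n openshift-storage get deployment rook-ceph-rbd-mirror-a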


Actual results: Failover of rbd workloads didn't proceed after drpc reporting WaitForStorageMaintenanceActivation

From passive hub-

amagrawa:~$ date -u
Wednesday 22 November 2023 10:44:15 AM UTC


amagrawa:~$ drpc|grep rbd
busybox-workloads-6    rbd-sub-busybox-workloads-6-placement-1-drpc        5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:42:01Z              False
busybox-workloads-7    rbd-sub-busybox-workloads-7-placement-1-drpc        5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:42:30Z              False
busybox-workloads-8    rbd-sub-busybox-workloads-8-placement-1-drpc        5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:42:46Z              False
openshift-gitops       rbd-appset-busybox-workloads-1-placement-drpc       5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:42:53Z              False
openshift-gitops       rbd-appset-busybox-workloads-2-placement-drpc       5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:43:01Z              False
openshift-gitops       rbd-appset-busybox-workloads-3-placement-drpc       5d20h   amagrawa-10n-2     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:43:09Z              False
openshift-gitops       rbd-appset-busybox-workloads-4-placement-drpc       5d20h   amagrawa-10n-1     amagrawa-10n-2    Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-22T10:43:17Z              False


C2 (Failover cluster)-

amagrawa:c2$ mm
NAME                                      AGE
cf83e1357eefb8bdf1542850d66d8007d620e40   8s


amagrawa:c2$ mmyaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: MaintenanceMode
  metadata:
    creationTimestamp: "2023-11-22T10:42:13Z"
    generation: 1
    name: cf83e1357eefb8bdf1542850d66d8007d620e40
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 12468cfb98c25f5699e4c121971044ca37b147bb769b8512ab330cdc5a7c53d2-cf83e1357eefb8bdf1542850d66d8007d620e40-mmode-mw
      uid: 189dbc54-c4d4-4f17-8d31-94202eea7569
    resourceVersion: "18019820"
    uid: 42d80af3-e49c-41cc-9f4f-016de7086cb5
  spec:
    modes:
    - Failover
    storageProvisioner: openshift-storage.rbd.csi.ceph.com
    targetID: cf83e1357eefb8bdf1542850d66d8007d620e40
kind: List
metadata:
  resourceVersion: ""
amagrawa:c2$ mm
NAME                                      AGE
cf83e1357eefb8bdf1542850d66d8007d620e40   4m19s
amagrawa:c2$ pods|grep mirror
rook-ceph-rbd-mirror-a-777755497d-7x65q                           2/2     Running   1 (19h ago)     20h     10.128.2.38    compute-2   <none>           <none>



The rbd-mirror deployment wasn't automatically scaled down on the failover cluster C2.

amagrawa:c2$ pods|grep mirror
rook-ceph-rbd-mirror-a-777755497d-7x65q                           2/2     Running   1 (20h ago)     20h     10.128.2.38    compute-2   <none>           <none>


Took the output again from C2; the observations remain the same. Failover didn't even start.


amagrawa:c2$ date -u
Wednesday 22 November 2023 11:25:01 AM UTC


amagrawa:c2$ mmyaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: MaintenanceMode
  metadata:
    creationTimestamp: "2023-11-22T10:42:13Z"
    generation: 1
    name: cf83e1357eefb8bdf1542850d66d8007d620e40
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 12468cfb98c25f5699e4c121971044ca37b147bb769b8512ab330cdc5a7c53d2-cf83e1357eefb8bdf1542850d66d8007d620e40-mmode-mw
      uid: 189dbc54-c4d4-4f17-8d31-94202eea7569
    resourceVersion: "18019820"
    uid: 42d80af3-e49c-41cc-9f4f-016de7086cb5
  spec:
    modes:
    - Failover
    storageProvisioner: openshift-storage.rbd.csi.ceph.com
    targetID: cf83e1357eefb8bdf1542850d66d8007d620e40
kind: List
metadata:
  resourceVersion: ""


(C1 cluster remains down)

This leads to application unavailability, as the primary managed cluster C1 is down after the disaster and the workloads couldn't be failed over to C2.

Expected results: Failover of rbd workloads should proceed and they should be accessible on the failover cluster.


Additional info:

--- Additional comment from RHEL Program Management on 2023-11-22 17:01:35 IST ---

This bug previously had no release flag set; the release flag 'odf-4.15.0' has now been set to '?', so the bug is proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-11-22 17:01:35 IST ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Benamar Mekhissi on 2023-11-23 00:54:17 IST ---

The maintenance mode was stuck due to a mismatch between the replicationId label on the VRC and the replicationId label on the CephCluster. These two values should match.
old active hub
==============
vrc label: ramendr.openshift.io/replicationid: a99df9fc6c52c7ef44222ab38657a0b15628a14

new active hub
==============
vrc label: ramendr.openshift.io/replicationid: cf83e1357eefb8bdf1542850d66d8007d620e40

@vbadrina I am assigning this to you.
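
A rough way to compare the two labels on a managed cluster (a sketch under assumptions: "VRC" above is the VolumeReplicationClass, the CephCluster lives in the usual openshift-storage namespace, and both resources carry the ramendr.openshift.io/replicationid label as described in this comment):

# replicationid label on the VolumeReplicationClass(es)
$ oc get volumereplicationclass -o yaml | grep ramendr.openshift.io/replicationid

# replicationid label on the CephCluster
$ oc -n openshift-storage get cephcluster -o yaml | grep ramendr.openshift.io/replicationid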

--- Additional comment from Eran Tamir on 2023-11-27 14:09:03 IST ---

Do we have a workaround for that? If not, why don't we consider it a blocker? 
@

--- Additional comment from Aman Agrawal on 2023-11-27 15:23:10 IST ---

Today, Umanga wanted us to try restarting the odfmo-controller-manager-xxxxx pod in the openshift-operators namespace on the passive hub, but the setup is no longer available due to datacenter issues: the hosts are down and the cluster shows as disconnected, so it couldn't be tested (the ecosystem team is aware of this issue).

And yes, it is a hub recovery blocker bug. 

Relevant thread- https://chat.google.com/room/AAAAqWkMm2s/CzNW3bY-Q_U

--- Additional comment from Shyamsundar on 2023-11-28 19:19:06 IST ---

Poked around some older data and here is what is happening:

old active hub (correct)
vrc: ramendr.openshift.io/replicationid: a99df9fc6c52c7ef44222ab38657a0b15628a14

new active hub (incorrect)
vrc: ramendr.openshift.io/replicationid: cf83e1357eefb8bdf1542850d66d8007d620e40

fsid 1: 7e252ee3-abd9-4c54-a4ff-a2fdce8931a0
fsid 2: aacfbd7e-5ced-42a5-bdc2-483fcbe5a29d

Correct hash generation
$ echo -n "7e252ee3-abd9-4c54-a4ff-a2fdce8931a0-aacfbd7e-5ced-42a5-bdc2-483fcbe5a29d" | sha512sum 
a99df9fc6c52c7ef44222ab38657a0b15628a14507417e8443111e17fb9623b0194b8a84c145db0a1bdabafe573fc9b0eeb6139e356748e0bd7c533e3cb423bb  -

Incorrect hash generation when the fsids are empty (this tallies with the incorrect values on the new active hub)
$ echo -n "" | sha512sum 
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  -

The issue seems to be that MCO does not check whether it has valid fsid values, generates a hash from an empty string, and (re)labels the classes with that value.
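
A minimal shell sketch of the intended guard, using the FSIDs from this report (hypothetical illustration only; the actual fix lands in the MCO code, see the linked PR "validate if cluster FSID is empty"):

# Never derive a replicationid from empty cluster FSIDs; the empty-string
# hash above is exactly the stale value seen on the new active hub.
fsid1="7e252ee3-abd9-4c54-a4ff-a2fdce8931a0"   # cluster 1 FSID (from this comment)
fsid2="aacfbd7e-5ced-42a5-bdc2-483fcbe5a29d"   # cluster 2 FSID (from this comment)

if [ -z "$fsid1" ] || [ -z "$fsid2" ]; then
    echo "refusing to generate replicationid: empty cluster FSID" >&2
    exit 1
fi
echo -n "${fsid1}-${fsid2}" | sha512sum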

--- Additional comment from Aman Agrawal on 2023-11-29 18:09:06 IST ---

As updated in the thread, restarting the odfmo-controller-manager-xxxxx pod inside openshift-operators on the passive hub didn't seem to help. The failover progression remains stuck at WaitForStorageMaintenanceActivation.
mmode was activated on the failover cluster, but it didn't scale down the rbd-mirror daemon deployment and failover doesn't proceed. mmode remains activated in the same state forever.

--- Additional comment from umanga on 2023-11-29 20:53:47 IST ---

Issue is identified and fix is available. Providing devel_ack+.

Comment 12 errata-xmlrpc 2023-12-07 13:21:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696

