2258861 – [RDR] [Hub recovery] Both graphs are empty for one of the managed clusters due to missing metrics

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.16.0
Assignee:	Travis Nielsen
QA Contact:	Aman Agrawal
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2253429

Reported:	2024-01-17 18:36 UTC by Aman Agrawal
Modified:	2024-11-15 04:25 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-07-17 13:12:07 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage rook pull 565	None	open	Bug 2258861: exporter: Don't delete exporter service on daemon deletion	2024-02-01 20:55:55 UTC
Github	rook rook pull 13653	None	Merged	exporter: Don't delete exporter service on daemon deletion	2024-02-01 17:22:05 UTC
Github	rook rook pull 13664	None	open	core: Prevent unintentional deletion of service and service monitor during node reconciliation	2024-02-01 06:35:38 UTC
Red Hat Product Errata	RHSA-2024:4591	None	None	None	2024-07-17 13:12:16 UTC

Description Aman Agrawal 2024-01-17 18:36:27 UTC

Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ACM GA'ed 2.9.1
OCP 4.14.0-0.nightly-2024-01-04-154216
ODF 4.14.4-2
ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
**Active hub being at neutral site**

1. On an RDR setup, hub recovery was performed by bringing the active hub down and moving to the passive hub. Before hub recovery, all combinations of cephfs- and rbd-backed workloads, of both appset and subscription types, were running on the primary managed clusters in various current states such as Deployed, FailedOver, and Relocated.
2. After hub recovery when the passive hub became active, IOs were run for a few hours and data sync was progressing fine.
3. Then all the workloads with CURRENTSTATE Relocated and running on C1 were relocated to cluster C2.

busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    4d     amagrawa-c1-414                      Relocate       Relocated      Completed     2024-01-14T17:02:03Z   4m52.838865995s 
busybox-workloads-7    rbd-sub-busybox7-placement-1-drpc        4d2h   amagrawa-c1-414                      Relocate       Relocated      Completed     2024-01-14T17:02:13Z   2m29.195973432s      True
openshift-gitops       cephfs-appset-busybox11-placement-drpc   4d     amagrawa-c1-414                      Relocate       Relocated      Completed     2024-01-14T17:02:23Z   7m8.402872148s       True
openshift-gitops       rbd-appset-busybox2-placement-drpc       4d2h   amagrawa-c1-414                      Relocate       Relocated      Completed     2024-01-14T17:02:30Z   3m10.270388074s      True

4. After the relocate completed, all the workloads with CURRENTSTATE FailedOver were failed over to cluster C2 by bringing the C1 cluster down.

amagrawa:~$ drpc|grep Failover
busybox-workloads-14   cephfs-sub-busybox14-placement-1-drpc    2d3h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Cleaning Up   2024-01-17T16:23:22Z                         False
busybox-workloads-6    rbd-sub-busybox6-placement-1-drpc        2d3h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Cleaning Up   2024-01-17T16:23:39Z                         False
openshift-gitops       cephfs-appset-busybox10-placement-drpc   2d3h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Cleaning Up   2024-01-17T16:23:55Z                         False
openshift-gitops       rbd-appset-busybox1-placement-drpc       2d3h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Cleaning Up   2024-01-17T16:24:15Z                         False

5. After failover, C1 was brought back up and cleanup completed. When the DR monitoring dashboard on the RHACM console was then checked, even after 1-2 hours, neither graph had values for cluster C1. However, a few RBD workloads are still running on both the C1 and C2 clusters with sync intervals of 5m and 15m, and data sync is progressing well for all the workloads.

Actual results: Neither graph on the DR monitoring dashboard of the RHACM console shows values for cluster C1.

Refer to the attached screencast.

From the passive hub:

amagrawa:~$ drpc
NAMESPACE              NAME                                     AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION              PEER READY
busybox-workloads-14   cephfs-sub-busybox14-placement-1-drpc    2d5h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Completed     2024-01-17T16:23:22Z   26m9.126140435s       True
busybox-workloads-15   cephfs-sub-busybox15-placement-1-drpc    2d5h   amagrawa-c2-414                      Relocate       Relocated      Completed     2024-01-15T16:34:18Z   20h22m21.324413874s   True
busybox-workloads-16   cephfs-sub-busybox16-placement-1-drpc    2d5h   amagrawa-c1-414                                     Deployed       Completed                                                  True
busybox-workloads-17   cephfs-sub-busybox17-placement-1-drpc    2d5h   amagrawa-c2-414                                     Deployed       Completed                                                  True
busybox-workloads-6    rbd-sub-busybox6-placement-1-drpc        2d5h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Completed     2024-01-17T16:23:39Z   31m28.159282201s      True
busybox-workloads-7    rbd-sub-busybox7-placement-1-drpc        2d5h   amagrawa-c2-414                      Relocate       Relocated      Completed     2024-01-15T16:34:28Z   20h24m21.589457723s   True
busybox-workloads-8    rbd-sub-busybox8-placement-1-drpc        2d5h   amagrawa-c1-414                                     Deployed       Completed                                                  True
busybox-workloads-9    rbd-sub-busybox9-placement-1-drpc        2d5h   amagrawa-c2-414                                     Deployed       Completed                                                  True
openshift-gitops       cephfs-appset-busybox10-placement-drpc   2d5h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Completed     2024-01-17T16:23:55Z   25m39.391822676s      True
openshift-gitops       cephfs-appset-busybox11-placement-drpc   2d5h   amagrawa-c2-414                      Relocate       Relocated      Completed     2024-01-15T16:34:39Z   46h31m44.658735338s   True
openshift-gitops       cephfs-appset-busybox12-placement-drpc   2d5h   amagrawa-c1-414                                     Deployed       Completed                                                  True
openshift-gitops       cephfs-appset-busybox13-placement-drpc   2d5h   amagrawa-c2-414                                     Deployed       Completed                                                  True
openshift-gitops       rbd-appset-busybox1-placement-drpc       2d5h   amagrawa-c1-414    amagrawa-c2-414   Failover       FailedOver     Completed     2024-01-17T16:24:15Z   30m48.94505452s       True
openshift-gitops       rbd-appset-busybox2-placement-drpc       2d5h   amagrawa-c2-414                      Relocate       Relocated      Completed     2024-01-15T16:34:52Z   46h47m28.295172267s   True
openshift-gitops       rbd-appset-busybox3-placement-drpc       2d5h   amagrawa-c1-414                                     Deployed       Completed                                                  True
openshift-gitops       rbd-appset-busybox4-placement-drpc       2d5h   amagrawa-c1-414                                     Deployed       Completed                                                  True
openshift-gitops       rbd-appset-busybox5-placement-drpc       2d5h   amagrawa-c2-414                                     Deployed       Completed                                                  True
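
When comparing listings like the one above before and after a failover, it can help to load the table into records. The following is only a minimal sketch, assuming the output is column-aligned under its header row as shown here; the function names are hypothetical and are not part of any ODF or ACM tool. Column boundaries are taken from the header so that empty cells (for example a blank FAILOVERCLUSTER) are preserved.

```python
import re
from collections import defaultdict


def parse_fixed_width(text):
    """Parse column-aligned table text into a list of dicts keyed by header name.

    Assumption: every row is aligned under the header row, and header names
    are separated by two or more spaces (names such as "START TIME" contain
    at most single spaces).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    # Record each header name together with the offset where its column starts.
    cols = [(m.group(0), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header)]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


def group_by_state(rows):
    """Group DRPC names by their CURRENTSTATE value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["CURRENTSTATE"]].append(row["NAME"])
    return dict(groups)
```

Slicing by header offsets rather than splitting on whitespace is a deliberate choice here: rows for Deployed workloads leave several columns empty, and a whitespace split would shift the remaining values into the wrong columns.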


Logs from the active hub before performing hub recovery: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/15jan24-414-from-1st-active-hub/

Expected results: Both graphs on the DR monitoring dashboard of the RHACM console should show values for cluster C1 as well in this scenario.

Additional info:

Comment 5 Divyansh Kamboj 2024-01-18 08:57:39 UTC

The C1 cluster doesn't have rook-ceph-exporter listed in the targets because the service and service monitor do not exist. Not sure why they're not present; looking into that.
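
A check like the one described in this comment could be scripted roughly as follows. This is a sketch only: it assumes a response already fetched from the Prometheus HTTP API `/api/v1/targets` endpoint and parsed from JSON, and it assumes the exporter's scrape job label contains the string `rook-ceph-exporter`, which should be verified against the actual cluster. The function name is hypothetical.

```python
def find_exporter_targets(targets_response, job_substring="rook-ceph-exporter"):
    """Return the active scrape targets whose job label contains job_substring.

    targets_response is the parsed JSON body of a Prometheus
    /api/v1/targets call. The default job_substring is an assumption
    about how the exporter job is named, not a confirmed value.
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        target
        for target in active
        if job_substring in target.get("labels", {}).get("job", "")
    ]
```

An empty result for one managed cluster, while the other cluster returns a target, would match the symptom reported here.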

Comment 6 Divyansh Kamboj 2024-02-01 06:37:58 UTC

The problem was in Rook's node reconciliation: if a node didn't have Ceph pods, Rook would delete the exporter service and service monitor as well. So, depending on the order in which the nodes were reconciled, we would either end up with a ceph-exporter service and service monitor or not, which led to a lot of randomness.
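
The order dependence described above can be illustrated with a toy model. This is not Rook code; it is a simplified sketch in which a single shared service (and its service monitor) is represented by one boolean, and both function names are hypothetical.

```python
def reconcile_buggy(node_order, nodes_with_ceph_pods):
    """Model of the behaviour described in this comment.

    Returns True if the shared exporter service exists after all nodes
    have been reconciled in the given order.
    """
    service_exists = False
    for node in node_order:
        if node in nodes_with_ceph_pods:
            # A node running Ceph pods ensures the service is present.
            service_exists = True
        else:
            # Problem: cleaning up a node without Ceph pods also removes
            # the shared service and service monitor.
            service_exists = False
    return service_exists


def reconcile_fixed(node_order, nodes_with_ceph_pods):
    """Same loop, but a node without Ceph pods leaves shared objects alone."""
    service_exists = False
    for node in node_order:
        if node in nodes_with_ceph_pods:
            service_exists = True
    return service_exists
```

In this model, with one node that runs Ceph pods and one that does not, the buggy variant ends with the service present only when the node without Ceph pods happens to be reconciled first, while the fixed variant gives the same result for either order.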

This issue can be reproduced every time by performing an upgrade from OCP 4.14 to OCP 4.15.

A fix is up on the Rook side.

Comment 7 Travis Nielsen 2024-02-01 17:22:05 UTC

Interesting that an upstream user just opened the same issue a few days ago and the PR was already in progress and now merged:
https://github.com/rook/rook/pull/13653

Will also clone for 4.14...

Comment 8 Travis Nielsen 2024-02-01 20:55:56 UTC

Neha, could you ack this blocker for 4.15?

Comment 22 errata-xmlrpc 2024-07-17 13:12:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 23 Red Hat Bugzilla 2024-11-15 04:25:15 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
