Description of problem (please be detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
ACM GA'ed 2.9.1
OCP 4.14.0-0.nightly-2024-01-04-154216
ODF 4.14.4-2
ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

**Active hub being at neutral site**

1. On an RDR setup, hub recovery was performed by bringing the active hub down and moving to the passive hub. Before hub recovery, all combinations of cephfs and rbd backed workloads, of both appset and subscription types, were running on the primary managed clusters in various current states such as Deployed, FailedOver and Relocated.

2. After hub recovery, when the passive hub became active, IOs were run for a few hours and data sync was progressing fine.

3. Then all the workloads with CURRENTSTATE Relocated and running on C1 were relocated to cluster C2.

busybox-workloads-15  cephfs-sub-busybox15-placement-1-drpc   4d    amagrawa-c1-414  Relocate  Relocated  Completed  2024-01-14T17:02:03Z  4m52.838865995s
busybox-workloads-7   rbd-sub-busybox7-placement-1-drpc       4d2h  amagrawa-c1-414  Relocate  Relocated  Completed  2024-01-14T17:02:13Z  2m29.195973432s  True
openshift-gitops      cephfs-appset-busybox11-placement-drpc  4d    amagrawa-c1-414  Relocate  Relocated  Completed  2024-01-14T17:02:23Z  7m8.402872148s   True
openshift-gitops      rbd-appset-busybox2-placement-drpc      4d2h  amagrawa-c1-414  Relocate  Relocated  Completed  2024-01-14T17:02:30Z  3m10.270388074s  True

4. After the relocate completed, all the workloads with CURRENTSTATE FailedOver were failed over to cluster C2 by bringing the C1 cluster down.

amagrawa:~$ drpc|grep Failover
busybox-workloads-14  cephfs-sub-busybox14-placement-1-drpc   2d3h  amagrawa-c1-414  amagrawa-c2-414  Failover  FailedOver  Cleaning Up  2024-01-17T16:23:22Z  False
busybox-workloads-6   rbd-sub-busybox6-placement-1-drpc       2d3h  amagrawa-c1-414  amagrawa-c2-414  Failover  FailedOver  Cleaning Up  2024-01-17T16:23:39Z  False
openshift-gitops      cephfs-appset-busybox10-placement-drpc  2d3h  amagrawa-c1-414  amagrawa-c2-414  Failover  FailedOver  Cleaning Up  2024-01-17T16:23:55Z  False
openshift-gitops      rbd-appset-busybox1-placement-drpc      2d3h  amagrawa-c1-414  amagrawa-c2-414  Failover  FailedOver  Cleaning Up  2024-01-17T16:24:15Z  False

5. After failover, C1 was brought up and cleanup completed. When the DR monitoring dashboard on the RHACM console was then checked, even after 1-2 hours, neither graph had values for cluster C1. However, we still have a few RBD workloads running on both the C1 and C2 clusters with sync intervals of 5m and 15min, and data sync is progressing well for all the workloads.

Actual results: Neither graph on the DR monitoring dashboard of the RHACM console shows values for cluster C1. Refer to the attached screencast.
From passive hub:

amagrawa:~$ drpc
NAMESPACE             NAME                                    AGE   PREFERREDCLUSTER  FAILOVERCLUSTER  DESIREDSTATE  CURRENTSTATE  PROGRESSION  START TIME            DURATION             PEER READY
busybox-workloads-14  cephfs-sub-busybox14-placement-1-drpc   2d5h  amagrawa-c1-414   amagrawa-c2-414  Failover      FailedOver    Completed    2024-01-17T16:23:22Z  26m9.126140435s      True
busybox-workloads-15  cephfs-sub-busybox15-placement-1-drpc   2d5h  amagrawa-c2-414                    Relocate      Relocated     Completed    2024-01-15T16:34:18Z  20h22m21.324413874s  True
busybox-workloads-16  cephfs-sub-busybox16-placement-1-drpc   2d5h  amagrawa-c1-414                                  Deployed      Completed                                               True
busybox-workloads-17  cephfs-sub-busybox17-placement-1-drpc   2d5h  amagrawa-c2-414                                  Deployed      Completed                                               True
busybox-workloads-6   rbd-sub-busybox6-placement-1-drpc       2d5h  amagrawa-c1-414   amagrawa-c2-414  Failover      FailedOver    Completed    2024-01-17T16:23:39Z  31m28.159282201s     True
busybox-workloads-7   rbd-sub-busybox7-placement-1-drpc       2d5h  amagrawa-c2-414                    Relocate      Relocated     Completed    2024-01-15T16:34:28Z  20h24m21.589457723s  True
busybox-workloads-8   rbd-sub-busybox8-placement-1-drpc       2d5h  amagrawa-c1-414                                  Deployed      Completed                                               True
busybox-workloads-9   rbd-sub-busybox9-placement-1-drpc       2d5h  amagrawa-c2-414                                  Deployed      Completed                                               True
openshift-gitops      cephfs-appset-busybox10-placement-drpc  2d5h  amagrawa-c1-414   amagrawa-c2-414  Failover      FailedOver    Completed    2024-01-17T16:23:55Z  25m39.391822676s     True
openshift-gitops      cephfs-appset-busybox11-placement-drpc  2d5h  amagrawa-c2-414                    Relocate      Relocated     Completed    2024-01-15T16:34:39Z  46h31m44.658735338s  True
openshift-gitops      cephfs-appset-busybox12-placement-drpc  2d5h  amagrawa-c1-414                                  Deployed      Completed                                               True
openshift-gitops      cephfs-appset-busybox13-placement-drpc  2d5h  amagrawa-c2-414                                  Deployed      Completed                                               True
openshift-gitops      rbd-appset-busybox1-placement-drpc      2d5h  amagrawa-c1-414   amagrawa-c2-414  Failover      FailedOver    Completed    2024-01-17T16:24:15Z  30m48.94505452s      True
openshift-gitops      rbd-appset-busybox2-placement-drpc      2d5h  amagrawa-c2-414                    Relocate      Relocated     Completed    2024-01-15T16:34:52Z  46h47m28.295172267s  True
openshift-gitops      rbd-appset-busybox3-placement-drpc      2d5h  amagrawa-c1-414                                  Deployed      Completed                                               True
openshift-gitops      rbd-appset-busybox4-placement-drpc      2d5h  amagrawa-c1-414                                  Deployed      Completed                                               True
openshift-gitops      rbd-appset-busybox5-placement-drpc      2d5h  amagrawa-c2-414                                  Deployed      Completed                                               True

Logs from active hub before performing hub recovery: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/15jan24-414-from-1st-active-hub/

Expected results: Both the graphs on DR monitoring dashboard of the RHACM console should show values for cluster C1 as well for this scenario.

Additional info:
The C1 cluster doesn't have rook-ceph-exporter listed in the targets because the service and service monitor do not exist. Not sure yet why they're not present; looking into that.
The problem was with the reconciliation of nodes in Rook: if a node didn't have Ceph pods, Rook would delete the ceph-exporter service and service monitor as well. So, depending on the order in which the nodes were reconciled, we would either end up with a ceph-exporter service and service monitor or not, which led to a lot of randomness. This issue can be reproduced every time by upgrading from OCP 4.14 to OCP 4.15. The fix is up on the Rook side.
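To make the order dependence concrete, here is a minimal Python sketch of the behaviour described above. It is not Rook's code (Rook is written in Go): the function names, the resource labels and the two-node layout are all hypothetical, and the "order-independent" variant is just one plausible shape for a fix, not necessarily what the Rook change does.

```python
# Hypothetical labels for the shared, cluster-wide exporter resources.
EXPORTER_RESOURCES = frozenset({
    "service/rook-ceph-exporter",
    "servicemonitor/rook-ceph-exporter",
})


def reconcile_buggy(nodes, order):
    """Model of the reported bug: each node reconcile touches shared resources.

    nodes maps a node name to True if the node runs Ceph pods.
    order is the sequence in which nodes get reconciled.
    """
    resources = set()
    for name in order:
        if nodes[name]:
            # A node with Ceph pods (re)creates the shared resources.
            resources |= EXPORTER_RESOURCES
        else:
            # A node without Ceph pods deletes them, even though other
            # nodes may still need them. The last node reconciled wins.
            resources -= EXPORTER_RESOURCES
    return resources


def reconcile_order_independent(nodes, order):
    """Assumed alternative: decide on shared resources once, for the cluster."""
    for name in order:
        pass  # per-node work would happen here, leaving shared resources alone
    if any(nodes.values()):
        return set(EXPORTER_RESOURCES)
    return set()


if __name__ == "__main__":
    nodes = {"node-a": True, "node-b": False}  # hypothetical cluster
    for order in (["node-a", "node-b"], ["node-b", "node-a"]):
        print(order, sorted(reconcile_buggy(nodes, order)))
```

With the buggy variant, reconciling `node-b` last leaves the cluster with no exporter service or service monitor, while reconciling `node-a` last leaves both in place, which matches the randomness seen on C1.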
Interesting: an upstream user opened the same issue just a few days ago, and the PR that was already in progress has now merged: https://github.com/rook/rook/pull/13653 Will also clone for 4.14...
Neha, could you ack this blocker for 4.15?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days