Bug 2248666

Summary: [RDR] [Hub recovery] DRCluster post secret creation goes to exponential backoff and takes too long to get validated
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: odf-dr Assignee: Elena Gershkovich <egershko>
odf-dr sub component: ramen QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: egershko, kseeger, muagarwa, nsoffer, rtalur, srangana
Version: 4.14   
Target Milestone: ---   
Target Release: ODF 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.15.0-117 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-03-19 15:28:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Aman Agrawal 2023-11-08 08:57:42 UTC
Description of problem (please be as detailed as possible and provide log
snippets): The active hub was located at a neutral site.


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-06-203803
advanced-cluster-management.v2.9.0-204
ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
Submariner brew.registry.redhat.io/rh-osbs/iib:615928
ODF 4.14.0-161
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup, configure it for hub recovery with some DR-protected workloads running on it.
2. Do all pre-checks such as drpolicy status, sync status, volumereplicationclass, ceph health, mirror status, lastGroupSyncTime, managedclusters -o wide status, alerts, odf pods, drpc yaml, drpc -o wide, etc.
3. After the latest backup is taken, bring the active hub down.
4. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
5. Wait for the DRPolicy to get validated. Check the outputs of
oc get managedcluster -o wide -A
oc get drcluster -o yaml
oc get secrets -n openshift-operators
and note how long it takes for the DRPolicy to get validated (it takes 15-20 minutes in most cases, which is too long); example monitoring commands are shown below.
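
As a quicker way to watch the validation state in step 5, the Validated condition can be queried directly. The jsonpath queries below are only an illustrative way to do that; the field names match the YAML outputs shown further down.

oc get drpolicy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Validated")].status}{"\t"}{.status.conditions[?(@.type=="Validated")].message}{"\n"}{end}'
oc get drcluster -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Validated")].reason}{"\n"}{end}'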



Actual results: DRCluster reconciliation goes into exponential backoff after the secret is created, so the DRCluster takes too long (15-20 minutes) to get validated.

amagrawa:~$ drpolicy
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-10
    resourceVersion: "420339"
    uid: 11516fd4-d3c1-4b01-b460-76d13be479a3
  spec:
    drClusters:
    - amagrawa-m1-7nov
    - amagrawa-m2-7nov
    replicationClassSelector: {}
    schedulingInterval: 10m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: none of the DRClusters are validated ([amagrawa-m1-7nov amagrawa-m2-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-15
    resourceVersion: "420371"
    uid: d6eb4cea-69ba-408e-a1c3-c4a5df83506b
  spec:
    drClusters:
    - amagrawa-m2-7nov
    - amagrawa-m1-7nov
    replicationClassSelector: {}
    schedulingInterval: 15m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:29Z"
      message: none of the DRClusters are validated ([amagrawa-m2-7nov amagrawa-m1-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-5
    resourceVersion: "420387"
    uid: 583e9ae0-7b30-4fa9-a732-e75f18d30df8
  spec:
    drClusters:
    - amagrawa-m1-7nov
    - amagrawa-m2-7nov
    replicationClassSelector: {}
    schedulingInterval: 5m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:29Z"
      message: none of the DRClusters are validated ([amagrawa-m1-7nov amagrawa-m2-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
kind: List
metadata:
  resourceVersion: ""


amagrawa:~$ oc get managedcluster -o wide -A
NAME               HUB ACCEPTED   MANAGED CLUSTER URLS                               JOINED   AVAILABLE   AGE
amagrawa-m1-7nov   true           https://api.amagrawa-m1-7nov.qe.rh-ocs.com:6443    True     True        15m
amagrawa-m2-7nov   true           https://api.amagrawa-m2-7nov.qe.rh-ocs.com:6443    True     True        15m
local-cluster      true           https://api.amagrawa-hub2-7no.qe.rh-ocs.com:6443   True     True        3h56m


amagrawa:~$ oc get secrets -n openshift-operators
NAME                                       TYPE                                  DATA   AGE
7202c2e08afe31dc279a9730d94413a6a112650    Opaque                                2      5m22s
79f592bb6c31b5aced91d62883e7a80b6f3661f    Opaque                                2      5m23s
builder-dockercfg-6xz9h                    kubernetes.io/dockercfg               1      14h
builder-token-tvvhz                        kubernetes.io/service-account-token   4      14h
default-dockercfg-grg24                    kubernetes.io/dockercfg               1      14h
default-token-bgnr5                        kubernetes.io/service-account-token   4      14h
deployer-dockercfg-zzmz4                   kubernetes.io/dockercfg               1      14h
deployer-token-qldjm                       kubernetes.io/service-account-token   4      14h
odf-multicluster-console-serving-cert      kubernetes.io/tls                     2      99m
odfmo-controller-manager-dockercfg-qmscg   kubernetes.io/dockercfg               1      99m
odfmo-controller-manager-service-cert      kubernetes.io/tls                     3      99m
odfmo-controller-manager-token-d5xtw       kubernetes.io/service-account-token   4      99m
ramen-hub-operator-dockercfg-995qn         kubernetes.io/dockercfg               1      99m
ramen-hub-operator-service-cert            kubernetes.io/tls                     3      99m
ramen-hub-operator-token-4pm8d             kubernetes.io/service-account-token   4      99m
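

Note: the two Opaque secrets at the top of this listing are the same names that the DRCluster Validated conditions in the next output report as missing, and they are only ~5 minutes old even though the DRClusters were restored well before that, which is consistent with the first validation attempts failing before the secrets had been propagated. One way to cross-check the missing secret name against what is actually present (cluster and secret names taken from this report):

oc get drcluster amagrawa-m1-7nov -o jsonpath='{.status.conditions[?(@.type=="Validated")].message}{"\n"}'
oc get secret -n openshift-operators 79f592bb6c31b5aced91d62883e7a80b6f3661f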


amagrawa:~$ oc get drcluster -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    creationTimestamp: "2023-11-07T19:31:27Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: amagrawa-m1-7nov
    resourceVersion: "420218"
    uid: ea93e3db-febb-4ecf-a2ad-ce16bdd4259b
  spec:
    region: 9d3d039e-0ea7-4a59-a90a-b2b0ca7a4ac1
    s3ProfileName: s3profile-amagrawa-m1-7nov-ocs-storagecluster
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: 's3profile-amagrawa-m1-7nov-ocs-storagecluster: failed to get secret
        {79f592bb6c31b5aced91d62883e7a80b6f3661f } for caller drpolicy validation,
        failed to get secret {79f592bb6c31b5aced91d62883e7a80b6f3661f }, secrets "79f592bb6c31b5aced91d62883e7a80b6f3661f"
        not found'
      observedGeneration: 1
      reason: s3ConnectionFailed
      status: "False"
      type: Validated
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    creationTimestamp: "2023-11-07T19:31:27Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: amagrawa-m2-7nov
    resourceVersion: "420116"
    uid: 11aecd34-79d5-4ed1-8fe9-52564fa1623a
  spec:
    region: 9be5464c-cbbd-4dce-b3e6-ab4780d9aa0b
    s3ProfileName: s3profile-amagrawa-m2-7nov-ocs-storagecluster
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: 's3profile-amagrawa-m2-7nov-ocs-storagecluster: failed to get secret
        {7202c2e08afe31dc279a9730d94413a6a112650 } for caller drpolicy validation,
        failed to get secret {7202c2e08afe31dc279a9730d94413a6a112650 }, secrets "7202c2e08afe31dc279a9730d94413a6a112650"
        not found'
      observedGeneration: 1
      reason: s3ConnectionFailed
      status: "False"
      type: Validated
    - lastTransitionTime: "2023-11-07T19:31:27Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-07T19:31:27Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    phase: Available
kind: List
metadata:
  resourceVersion: ""
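

For context on the 15-20 minute delay: the "exponential backoff" in the summary is consistent with a controller requeueing a failed reconcile using client-go's default item exponential failure rate limiter (5 ms base delay, 1000 s cap). Whether the ramen hub controller actually uses those defaults is an assumption here, not something confirmed from these outputs, but under that assumption the per-item requeue delay doubles on every consecutive failure:

# Hypothetical illustration only (assumes client-go defaults): per-item requeue
# delay after n consecutive reconcile failures is min(0.005 s * 2^n, 1000 s).
awk 'BEGIN { for (n = 13; n <= 17; n++) printf "after %d consecutive failures: %.2f s\n", n, 0.005 * 2^n }'

Under that assumption, once the reconcile has failed on the missing secret a dozen or more times, the next retry alone can be 5-15 minutes away even after the secret has been created, which would line up with the validation times observed above.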



Expected results: DRCluster validation should complete within a few minutes, which in turn validates the DRPolicy on the passive hub.


Additional info:

Comment 3 Nir Soffer 2023-11-13 14:05:47 UTC
Aman, can you share the ramen configmap from the managed clusters and hub?

Comment 4 Aman Agrawal 2023-11-14 06:54:03 UTC
Hi Nir,

Must-gather logs can be found here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/08nov23/

I don't have the setup anymore to collect the specifics. Hope this helps.

Comment 13 errata-xmlrpc 2024-03-19 15:28:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383