Bug 2248666 - [RDR] [Hub recovery] DRCluster post secret creation goes to exponential backoff and takes too long to get validated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Elena Gershkovich
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-11-08 08:57 UTC by Aman Agrawal
Modified: 2024-03-19 15:28 UTC

Fixed In Version: 4.15.0-117
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:28:32 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage ramen pull 171 | 0 | None | open | Bug 2248666: Add watching on Secrets in drcluster_controller | 2024-01-16 09:34:51 UTC
Red Hat Product Errata | RHSA-2024:1383 | 0 | None | None | None | 2024-03-19 15:28:34 UTC

Description Aman Agrawal 2023-11-08 08:57:42 UTC
Description of problem (please be as detailed as possible and provide log
snippets): The active hub was located at a neutral site.


Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-06-203803
advanced-cluster-management.v2.9.0-204
ACM 2.9.0-DOWNSTREAM-2023-11-03-14-27-40
Submariner brew.registry.redhat.io/rh-osbs/iib:615928
ODF 4.14.0-161
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Latency 50ms RTT


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On an RDR setup with some DR-protected workloads running, configure hub recovery.
2. Run all pre-checks, such as drpolicy status, sync status, volumereplicationclass, ceph health, mirror status, lastGroupSyncTime, managedclusters -o wide status, alerts, odf pods, drpc yaml, drpc -o wide, etc.
3. After the latest backup is taken, bring the active hub down.
4. Restore the backup on the passive hub and ensure both managed clusters are successfully imported.
7. Wait for DRPolicy to get validated. Check the output of:
oc get managedcluster -o wide -A
oc get drcluster -o yaml
oc get secrets -n openshift-operators
and note how long it takes for DRPolicy to get validated (15-20 minutes in most cases, which is too long).
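
The wait in the last step can be timed rather than estimated by eye. A minimal Python sketch, assuming `oc get drpolicy -o json` returns the same list that the `drpolicy` command prints in this report (that command looks like a local alias), and that a DRPolicy counts as validated once its `Validated` condition has status "True". The function names, polling interval, and timeout are illustrative and are not part of Ramen or `oc`.

```python
import json
import subprocess
import time
from typing import Callable, Optional


def all_validated(resource_list: dict) -> bool:
    """True when the list is non-empty and every item has Validated == "True"."""
    items = resource_list.get("items", [])
    if not items:
        return False
    for item in items:
        conditions = item.get("status", {}).get("conditions", [])
        if not any(
            c.get("type") == "Validated" and c.get("status") == "True"
            for c in conditions
        ):
            return False
    return True


def fetch_drpolicies() -> dict:
    """Read DRPolicies from the current cluster context (assumed command)."""
    result = subprocess.run(
        ["oc", "get", "drpolicy", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def seconds_until_validated(
    fetch: Callable[[], dict] = fetch_drpolicies,
    interval: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll until all DRPolicies are validated; return elapsed seconds or None on timeout."""
    start = clock()
    while True:
        if all_validated(fetch()):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Calling `seconds_until_validated()` on the passive hub right after the managed clusters are imported gives a number to compare against the 15-20 minutes observed here. The `fetch`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a cluster.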



Actual results: After the secrets are created, DRCluster reconciliation goes into exponential backoff and takes too long to get validated.

amagrawa:~$ drpolicy
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-10
    resourceVersion: "420339"
    uid: 11516fd4-d3c1-4b01-b460-76d13be479a3
  spec:
    drClusters:
    - amagrawa-m1-7nov
    - amagrawa-m2-7nov
    replicationClassSelector: {}
    schedulingInterval: 10m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: none of the DRClusters are validated ([amagrawa-m1-7nov amagrawa-m2-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-15
    resourceVersion: "420371"
    uid: d6eb4cea-69ba-408e-a1c3-c4a5df83506b
  spec:
    drClusters:
    - amagrawa-m2-7nov
    - amagrawa-m1-7nov
    replicationClassSelector: {}
    schedulingInterval: 15m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:29Z"
      message: none of the DRClusters are validated ([amagrawa-m2-7nov amagrawa-m1-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRPolicy
  metadata:
    creationTimestamp: "2023-11-07T19:31:28Z"
    finalizers:
    - drpolicies.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: my-drpolicy-5
    resourceVersion: "420387"
    uid: 583e9ae0-7b30-4fa9-a732-e75f18d30df8
  spec:
    drClusters:
    - amagrawa-m1-7nov
    - amagrawa-m2-7nov
    replicationClassSelector: {}
    schedulingInterval: 5m
    volumeSnapshotClassSelector: {}
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:29Z"
      message: none of the DRClusters are validated ([amagrawa-m1-7nov amagrawa-m2-7nov])
      observedGeneration: 1
      reason: DRClustersUnavailable
      status: "False"
      type: Validated
kind: List
metadata:
  resourceVersion: ""


amagrawa:~$ oc get managedcluster -o wide -A
NAME               HUB ACCEPTED   MANAGED CLUSTER URLS                               JOINED   AVAILABLE   AGE
amagrawa-m1-7nov   true           https://api.amagrawa-m1-7nov.qe.rh-ocs.com:6443    True     True        15m
amagrawa-m2-7nov   true           https://api.amagrawa-m2-7nov.qe.rh-ocs.com:6443    True     True        15m
local-cluster      true           https://api.amagrawa-hub2-7no.qe.rh-ocs.com:6443   True     True        3h56m


amagrawa:~$ oc get secrets -n openshift-operators
NAME                                       TYPE                                  DATA   AGE
7202c2e08afe31dc279a9730d94413a6a112650    Opaque                                2      5m22s
79f592bb6c31b5aced91d62883e7a80b6f3661f    Opaque                                2      5m23s
builder-dockercfg-6xz9h                    kubernetes.io/dockercfg               1      14h
builder-token-tvvhz                        kubernetes.io/service-account-token   4      14h
default-dockercfg-grg24                    kubernetes.io/dockercfg               1      14h
default-token-bgnr5                        kubernetes.io/service-account-token   4      14h
deployer-dockercfg-zzmz4                   kubernetes.io/dockercfg               1      14h
deployer-token-qldjm                       kubernetes.io/service-account-token   4      14h
odf-multicluster-console-serving-cert      kubernetes.io/tls                     2      99m
odfmo-controller-manager-dockercfg-qmscg   kubernetes.io/dockercfg               1      99m
odfmo-controller-manager-service-cert      kubernetes.io/tls                     3      99m
odfmo-controller-manager-token-d5xtw       kubernetes.io/service-account-token   4      99m
ramen-hub-operator-dockercfg-995qn         kubernetes.io/dockercfg               1      99m
ramen-hub-operator-service-cert            kubernetes.io/tls                     3      99m
ramen-hub-operator-token-4pm8d             kubernetes.io/service-account-token   4      99m


amagrawa:~$ oc get drcluster -o yaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    creationTimestamp: "2023-11-07T19:31:27Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: amagrawa-m1-7nov
    resourceVersion: "420218"
    uid: ea93e3db-febb-4ecf-a2ad-ce16bdd4259b
  spec:
    region: 9d3d039e-0ea7-4a59-a90a-b2b0ca7a4ac1
    s3ProfileName: s3profile-amagrawa-m1-7nov-ocs-storagecluster
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: 's3profile-amagrawa-m1-7nov-ocs-storagecluster: failed to get secret
        {79f592bb6c31b5aced91d62883e7a80b6f3661f } for caller drpolicy validation,
        failed to get secret {79f592bb6c31b5aced91d62883e7a80b6f3661f }, secrets "79f592bb6c31b5aced91d62883e7a80b6f3661f"
        not found'
      observedGeneration: 1
      reason: s3ConnectionFailed
      status: "False"
      type: Validated
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    phase: Available
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: DRCluster
  metadata:
    creationTimestamp: "2023-11-07T19:31:27Z"
    finalizers:
    - drclusters.ramendr.openshift.io/ramen
    generation: 1
    labels:
      cluster.open-cluster-management.io/backup: resource
      velero.io/backup-name: acm-resources-generic-schedule-20231107190047
      velero.io/restore-name: restore-acm-acm-resources-generic-schedule-20231107190047
    name: amagrawa-m2-7nov
    resourceVersion: "420116"
    uid: 11aecd34-79d5-4ed1-8fe9-52564fa1623a
  spec:
    region: 9be5464c-cbbd-4dce-b3e6-ab4780d9aa0b
    s3ProfileName: s3profile-amagrawa-m2-7nov-ocs-storagecluster
  status:
    conditions:
    - lastTransitionTime: "2023-11-07T19:31:28Z"
      message: 's3profile-amagrawa-m2-7nov-ocs-storagecluster: failed to get secret
        {7202c2e08afe31dc279a9730d94413a6a112650 } for caller drpolicy validation,
        failed to get secret {7202c2e08afe31dc279a9730d94413a6a112650 }, secrets "7202c2e08afe31dc279a9730d94413a6a112650"
        not found'
      observedGeneration: 1
      reason: s3ConnectionFailed
      status: "False"
      type: Validated
    - lastTransitionTime: "2023-11-07T19:31:27Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "False"
      type: Fenced
    - lastTransitionTime: "2023-11-07T19:31:27Z"
      message: Cluster Clean
      observedGeneration: 1
      reason: Clean
      status: "True"
      type: Clean
    phase: Available
kind: List
metadata:
  resourceVersion: ""
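
The outputs above show the symptom directly: both Opaque secrets are already about 5 minutes old, yet each DRCluster still carries a `Validated` condition saying its secret was not found. A rough Python sketch of that cross-check, assuming the DRCluster list has been loaded from `oc get drcluster -o json` and the secret names collected from `oc get secrets -n openshift-operators`. The helper name `stale_secret_errors` is hypothetical, and the message pattern it matches is taken only from the condition text in this report.

```python
import re
from typing import Dict, Set

# Pattern copied from the condition message seen in this report;
# other Ramen versions may word the error differently (assumption).
_NOT_FOUND = re.compile(r'secrets "([^"]+)" not found')


def stale_secret_errors(drclusters: dict, existing_secrets: Set[str]) -> Dict[str, str]:
    """Map DRCluster name -> secret name for clusters whose Validated
    condition still reports a missing secret that now exists."""
    stale: Dict[str, str] = {}
    for item in drclusters.get("items", []):
        name = item.get("metadata", {}).get("name", "")
        for cond in item.get("status", {}).get("conditions", []):
            # Only unvalidated DRClusters are of interest.
            if cond.get("type") != "Validated" or cond.get("status") == "True":
                continue
            match = _NOT_FOUND.search(cond.get("message", ""))
            if match and match.group(1) in existing_secrets:
                stale[name] = match.group(1)
    return stale
```

A non-empty result means the DRCluster status is lagging behind the cluster state, which is the condition this bug describes.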



Expected results: DRCluster validation shouldn't take more than a few minutes, after which the DRPolicy should be validated on the passive hub.


Additional info:
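
One plausible reading of the 15-20 minute delay, offered as an assumption rather than a confirmed root cause: if nothing triggers a reconcile when the Secret appears, the DRCluster is only retried when its backoff timer fires. The Python sketch below models a per-item exponential backoff using the defaults of client-go's `DefaultControllerRateLimiter` (5 ms base delay, doubling per failure, capped at 1000 s). Whether Ramen runs with these defaults is not stated in this report, and the model ignores reconcile duration and the global token-bucket limiter.

```python
from typing import Iterator


def retry_times(base_s: float = 0.005, cap_s: float = 1000.0) -> Iterator[float]:
    """Yield the time of each retry, in seconds since the first failure.

    Defaults mirror client-go's per-item limiter (assumed, not confirmed
    for Ramen): the delay doubles after every failure up to cap_s.
    """
    if base_s <= 0 or cap_s <= 0:
        raise ValueError("base_s and cap_s must be positive")
    delay = min(base_s, cap_s)
    elapsed = 0.0
    while True:
        elapsed += delay
        yield elapsed
        delay = min(delay * 2, cap_s)


def next_retry_at(event_s: float, base_s: float = 0.005, cap_s: float = 1000.0) -> float:
    """Time of the first retry at or after event_s (e.g. when the Secret appears)."""
    for t in retry_times(base_s, cap_s):
        if t >= event_s:
            return t
    raise RuntimeError("unreachable: retry_times is infinite")
```

Under these assumed defaults, a Secret that appears 700 s after the first failed reconcile would not be picked up until about 1310 s (roughly 22 minutes). A watch on Secrets, as named in the title of the linked pull request, avoids that gap by triggering a reconcile on the event itself.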

Comment 3 Nir Soffer 2023-11-13 14:05:47 UTC
Aman, can you share the ramen configmap from the managed clusters and hub?

Comment 4 Aman Agrawal 2023-11-14 06:54:03 UTC
Hi Nir,

Must-gather logs can be found here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/08nov23/

I don't have the setup to collect specifics. Hope this helps.

Comment 13 errata-xmlrpc 2024-03-19 15:28:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

