Bug 2186659 - External ceph storage integration failing with error message: Reconcile failed [NEEDINFO]
Summary: External ceph storage integration failing with error message: Reconcile failed
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: arun kumar mohan
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 2210378
Blocks:
 
Reported: 2023-04-14 04:01 UTC by Anjali
Modified: 2023-08-09 17:03 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-23 21:04:30 UTC
Embargoed:
amenon: needinfo? (nthomas)


Attachments

Description Anjali 2023-04-14 04:01:14 UTC
Description of problem:
- This is a new installation and the Cu is trying to integrate an external Ceph storage system as per https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_in_external_mode/index#creating-an-openshift-data-foundation-cluster-service-for-external-storage_ceph-external

- but the integration is failing with the below message:

status:
    conditions:
    - lastHeartbeatTime: "2023-03-29T13:04:55Z"
      lastTransitionTime: "2023-03-27T12:23:51Z"
      message: 'Error while reconciling: dial tcp 10.220.144.11:9283: i/o timeout'
      reason: ReconcileFailed
      status: "False"

- The issue is that the Monitoring validation is failing from the ocs-operator pod:

2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0288594,"logger":"controllers.StorageCluster","msg":"Monitoring validation failed","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsCephCluster).ensureCreated\n\t/remote-source/app/controllers/storagecluster/cephcluster.go:124\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0289674,"logger":"controllers.StorageCluster","msg":"Could not connect to the Monitoring Endpoints.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephCluster":"openshift-storage/ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.039201118Z {"level":"error","ts":1680095095.039101,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-external-storagecluster","namespace":"openshift-storage","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
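
To confirm whether the operator's network path is the problem, the same endpoint can be probed from inside the ocs-operator pod itself (a sketch; curl or even bash may not be present in the minimal operator image, so the /dev/tcp fallback is only a best effort):

$ oc -n openshift-storage exec deploy/ocs-operator -- curl -sv --max-time 5 http://10.220.144.11:9283
$ oc -n openshift-storage exec deploy/ocs-operator -- timeout 5 bash -c 'echo > /dev/tcp/10.220.144.11/9283' && echo reachable || echo unreachable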

- The Cu is using the below command to extract the Ceph cluster details. They do not specify --monitoring-endpoint xxx.xxx.xxx.xxx or --monitoring-endpoint-port xxxx in the command:

# python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt --rbd-data-pool-name tocptsatt1_rbd --run-as-user client.openshift
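
For reference, the exporter script also accepts the flags mentioned above, in case the Cu ever needs to pin the monitoring endpoint explicitly (a sketch; the values simply mirror what the script auto-detects, as shown in the output below):

# python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt --rbd-data-pool-name tocptsatt1_rbd --run-as-user client.openshift --monitoring-endpoint 10.220.144.11 --monitoring-endpoint-port 9283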

- From the attached JSON output, it is picking up the correct "MonitoringEndpoint": "10.220.144.11" and "MonitoringPort": "9283".

[amenon@supportshell-1 03469922]$ cat 0110-output_temelli_tocptsatt1_13042023.json
[{"name": "rook-ceph-mon-endpoints", "kind": "ConfigMap", "data": {"data": "nethis01=10.220.144.11:6789", "maxMonId": "0", "mapping": "{}"}}, {"name": "rook-ceph-mon", "kind": "Secret", "data": {"admin-secret": "admin-secret", "fsid": "5bec7060-5c41-11ed-96cf-d4f5ef0ea320", "mon-secret": "mon-secret"}}, {"name": "rook-ceph-operator-creds", "kind": "Secret", "data": {"userID": "client.openshift", "userKey": "AQDDZ4RjgN8lDBAALK3Zu2Da44kM2Ic8hQ7yxA=="}}, {"name": "monitoring-endpoint", "kind": "CephCluster", "data": {"MonitoringEndpoint": "10.220.144.11", "MonitoringPort": "9283"}}, {"name": "ceph-rbd", "kind": "StorageClass", "data": {"pool": "tocptsatt1_rbd"}}, {"name": "rook-csi-rbd-node", "kind": "Secret", "data": {"userID": "csi-rbd-node", "userKey": "AQD2aIRjuXrSFRAAFG3VLrA14ajTiFJQttU0JQ=="}}, {"name": "rook-csi-rbd-provisioner", "kind": "Secret", "data": {"userID": "csi-rbd-provisioner", "userKey": "AQD2aIRjEqoXFhAAjNd6bxK05rKz4D7RnJmP9A=="}}, {"name": "rook-csi-cephfs-provisioner", "kind": "Secret", "data": {"adminID": "csi-cephfs-provisioner", "adminKey": "AQD2aIRjQyiQFhAA+z/9ZlPDJqlNDZWzC1i3Bg=="}}, {"name": "rook-csi-cephfs-node", "kind": "Secret", "data": {"adminID": "csi-cephfs-node", "adminKey": "AQD2aIRj5ldYFhAAPwabKIRY1Q/eiJjcnKR0hg=="}}, {"name": "rook-ceph-dashboard-link", "kind": "Secret", "data": {"userID": "ceph-dashboard-link", "userKey": "https://10.220.144.11:8443/"}}, {"name": "cephfs", "kind": "StorageClass", "data": {"fsName": "openshift_mds", "pool": "cephfs_data"}}]
Version-Release number of selected component (if applicable):

- The Ceph cluster is healthy:
[amenon@supportshell-1 ceph]$ cat ceph_status
  cluster:
    id:     5bec7060-5c41-11ed-96cf-d4f5ef0ea320
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum nethis01,nethis04,nethis07 (age 3d) <------------------------------------
    mgr: nethis01.uqavdj(active, since 2d), standbys: nethis04.qwikoq, nethis07.wkgamz
    mds: 1/1 daemons up, 1 standby
    osd: 30 osds: 30 up (since 4h), 30 in (since 4h)<------------------------------------
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 577 pgs
    objects: 12.05k objects, 26 GiB
    usage:   95 GiB used, 44 TiB / 44 TiB avail
    pgs:     577 active+clean
 
  io:
    client:   511 B/s rd, 199 KiB/s wr, 0 op/s rd, 20 op/s wr
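
The monitoring endpoint on port 9283 is the prometheus module of the active mgr, so it is worth cross-checking on the Ceph side where that module is actually listening (a sketch, to be run against the external cluster):

# ceph mgr module ls | grep -i prometheus
# ceph mgr services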

- The Cu is able to curl to 10.220.144.11:9283.

Actual results:

- External Ceph storage integration is failing.

Expected results:
- External Ceph storage integration should be successful.

Additional info:
- All logs and other information are available on supportshell under ~/03469922

Comment 15 arun kumar mohan 2023-05-24 09:42:43 UTC
It seems like we are getting all the details correctly, and from ocs-operator we are trying to ping/reach the correct <IP>:<PORT> combination.
@amenon, meanwhile can you check with the Cu:
a. whether the same endpoint '10.220.144.11:9283' is reachable from the ocs-operator pod?
b. if it is reachable, whether there is a lag above 5 sec?

If the answer to (a) is no, then the issue is genuine.

If the endpoint is reachable but there is a huge network lag of more than 5 sec (we currently time out at 5 sec max), then we also have a genuine network issue on the client side (not related to ocs-operator).

The third case is a real bug, where we can reach the monitoring endpoint externally (from the ocs-operator pod) but are unable to do so through the ocs-operator code (that is, we still see the error message).
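
For (a) and (b), a quick way to measure both reachability and the lag from inside the ocs-operator pod (a sketch; assumes curl is available in the image, and the 10 sec cap is only there so that a lag above 5 sec is still visible):

$ oc -n openshift-storage exec deploy/ocs-operator -- curl -s -o /dev/null -w 'connect=%{time_connect}s total=%{time_total}s\n' --max-time 10 http://10.220.144.11:9283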

