Description of problem:

- This is a new installation. The customer is trying to integrate an external Ceph storage system as per
  https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_in_external_mode/index#creating-an-openshift-data-foundation-cluster-service-for-external-storage_ceph-external
  but it is failing with the status below:

  status:
    conditions:
    - lastHeartbeatTime: "2023-03-29T13:04:55Z"
      lastTransitionTime: "2023-03-27T12:23:51Z"
      message: 'Error while reconciling: dial tcp 10.220.144.11:9283: i/o timeout'
      reason: ReconcileFailed
      status: "False"

- The issue is that the monitoring validation is failing in the ocs-operator pod:

2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0288594,"logger":"controllers.StorageCluster","msg":"Monitoring validation failed","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsCephCluster).ensureCreated\n\t/remote-source/app/controllers/storagecluster/cephcluster.go:124\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0289674,"logger":"controllers.StorageCluster","msg":"Could not connect to the Monitoring Endpoints.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephCluster":"openshift-storage/ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.039201118Z {"level":"error","ts":1680095095.039101,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-external-storagecluster","namespace":"openshift-storage","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}

- The customer is using the command below to extract the Ceph cluster details. They do not pass --monitoring-endpoint xxx.xxx.xxx.xxx or --monitoring-endpoint-port xxxx on the command line (a sketch of an explicit invocation follows after the JSON output):

  # python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt --rbd-data-pool-name tocptsatt1_rbd --run-as-user client.openshift

- From the attached JSON output, the script picks up the correct "MonitoringEndpoint": "10.220.144.11" and "MonitoringPort": "9283":

[amenon@supportshell-1 03469922]$ cat 0110-output_temelli_tocptsatt1_13042023.json
[{"name": "rook-ceph-mon-endpoints", "kind": "ConfigMap", "data": {"data": "nethis01=10.220.144.11:6789", "maxMonId": "0", "mapping": "{}"}}, {"name": "rook-ceph-mon", "kind": "Secret", "data": {"admin-secret": "admin-secret", "fsid": "5bec7060-5c41-11ed-96cf-d4f5ef0ea320", "mon-secret": "mon-secret"}}, {"name": "rook-ceph-operator-creds", "kind": "Secret", "data": {"userID": "client.openshift", "userKey": "AQDDZ4RjgN8lDBAALK3Zu2Da44kM2Ic8hQ7yxA=="}}, {"name": "monitoring-endpoint", "kind": "CephCluster", "data": {"MonitoringEndpoint": "10.220.144.11", "MonitoringPort": "9283"}}, {"name": "ceph-rbd", "kind": "StorageClass", "data": {"pool": "tocptsatt1_rbd"}}, {"name": "rook-csi-rbd-node", "kind": "Secret", "data": {"userID": "csi-rbd-node", "userKey": "AQD2aIRjuXrSFRAAFG3VLrA14ajTiFJQttU0JQ=="}}, {"name": "rook-csi-rbd-provisioner", "kind": "Secret", "data": {"userID": "csi-rbd-provisioner", "userKey": "AQD2aIRjEqoXFhAAjNd6bxK05rKz4D7RnJmP9A=="}}, {"name": "rook-csi-cephfs-provisioner", "kind": "Secret", "data": {"adminID": "csi-cephfs-provisioner", "adminKey": "AQD2aIRjQyiQFhAA+z/9ZlPDJqlNDZWzC1i3Bg=="}}, {"name": "rook-csi-cephfs-node", "kind": "Secret", "data": {"adminID": "csi-cephfs-node", "adminKey": "AQD2aIRj5ldYFhAAPwabKIRY1Q/eiJjcnKR0hg=="}}, {"name": "rook-ceph-dashboard-link", "kind": "Secret", "data": {"userID": "ceph-dashboard-link", "userKey": "https://10.220.144.11:8443/"}}, {"name": "cephfs", "kind": "StorageClass", "data": {"fsName": "openshift_mds", "pool": "cephfs_data"}}]
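For completeness, the monitoring endpoint can also be pinned explicitly instead of being autodetected, using the flags mentioned above. A minimal sketch reusing the values from this case (the script file name is the customer's local copy):

  # Re-run the exporter with the monitoring endpoint and port passed explicitly
  # rather than autodetected from the active mgr
  python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt \
      --rbd-data-pool-name tocptsatt1_rbd \
      --run-as-user client.openshift \
      --monitoring-endpoint 10.220.144.11 \
      --monitoring-endpoint-port 9283

In this case it should make no difference, since the autodetected values in the JSON above are already correct.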
Version-Release number of selected component (if applicable):

- The Ceph cluster is healthy:

  [amenon@supportshell-1 ceph]$ cat ceph_status
    cluster:
      id:     5bec7060-5c41-11ed-96cf-d4f5ef0ea320
      health: HEALTH_OK

    services:
      mon: 3 daemons, quorum nethis01,nethis04,nethis07 (age 3d)   <------------------------------------
      mgr: nethis01.uqavdj(active, since 2d), standbys: nethis04.qwikoq, nethis07.wkgamz
      mds: 1/1 daemons up, 1 standby
      osd: 30 osds: 30 up (since 4h), 30 in (since 4h)             <------------------------------------
      rgw: 2 daemons active (2 hosts, 1 zones)

    data:
      volumes: 1/1 healthy
      pools:   12 pools, 577 pgs
      objects: 12.05k objects, 26 GiB
      usage:   95 GiB used, 44 TiB / 44 TiB avail
      pgs:     577 active+clean

    io:
      client:   511 B/s rd, 199 KiB/s wr, 0 op/s rd, 20 op/s wr

- The customer is able to curl 10.220.144.11:9283 (see the cross-check sketch below).

Actual results:

- External Ceph storage integration is failing.

Expected results:

- External Ceph storage integration should succeed.

Additional info:

- All logs and other information are available on supportshell under ~/03469922.
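For reference, the monitoring endpoint detected by the exporter can be cross-checked against what the active mgr actually advertises. A minimal sketch using standard Ceph commands (run on a node with a suitable keyring; the expected URL is taken from the JSON above):

  # The prometheus mgr module endpoint should show up as http://10.220.144.11:9283/
  ceph mgr services
  # Confirm the exporter answers on that address
  curl -sS http://10.220.144.11:9283/metrics | head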
It looks like we are getting all the details correctly, and the ocs-operator is trying to reach the correct <IP>:<PORT> combination.

@amenon, meanwhile can you check with the customer:

a. whether the same endpoint '10.220.144.11:9283' is reachable from the ocs-operator pod?
b. if it is reachable, whether the lag is above 5 seconds? (a command sketch for both checks follows below)

If the answer to <a> is no, then the issue is genuine. If the endpoint is reachable but there is a network lag of more than 5 seconds (we currently time out at 5 seconds max), then we also have a genuine network issue on the client side (not related to ocs-operator). The third case would be a real bug: we can reach the monitoring endpoint externally from the ocs-operator pod, but not through the ocs-operator code (i.e. we still see the error message).
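One possible way to answer both <a> and <b> from the cluster side is sketched below. It assumes curl is available in the image the debug pod runs (if not, oc debug accepts an --image override pointing at a tools image); everything except the namespace and the endpoint is illustrative:

  # a. Is 10.220.144.11:9283 reachable from a pod in the openshift-storage namespace?
  # b. If reachable, does it answer within the 5 s timeout the operator uses?
  #    (add --image=<tools image> if the operator image has no curl)
  oc -n openshift-storage debug deploy/ocs-operator -- \
      curl -sS -o /dev/null --max-time 5 \
           -w 'connect=%{time_connect}s total=%{time_total}s\n' \
           http://10.220.144.11:9283/metrics

A curl exit code of 28 (timed out) or a total time close to 5 s would point at the network-lag case; a connection refused/unreachable error would point at case <a>.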