Description of problem:
- This is a new installation, and the customer is trying to integrate an external Ceph storage system as per https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.10/html-single/deploying_openshift_data_foundation_in_external_mode/index#creating-an-openshift-data-foundation-cluster-service-for-external-storage_ceph-external
- but it is failing with the message below:
status:
  conditions:
  - lastHeartbeatTime: "2023-03-29T13:04:55Z"
    lastTransitionTime: "2023-03-27T12:23:51Z"
    message: 'Error while reconciling: dial tcp 10.220.144.11:9283: i/o timeout'
    reason: ReconcileFailed
    status: "False"
- The issue is that the monitoring-endpoint validation is failing in the ocs-operator pod:
2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0288594,"logger":"controllers.StorageCluster","msg":"Monitoring validation failed","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsCephCluster).ensureCreated\n\t/remote-source/app/controllers/storagecluster/cephcluster.go:124\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.029009085Z {"level":"error","ts":1680095095.0289674,"logger":"controllers.StorageCluster","msg":"Could not connect to the Monitoring Endpoints.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephCluster":"openshift-storage/ocs-external-storagecluster","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases\n\t/remote-source/app/controllers/storagecluster/reconcile.go:402\ngithub.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile\n\t/remote-source/app/controllers/storagecluster/reconcile.go:161\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
2023-03-29T13:04:55.039201118Z {"level":"error","ts":1680095095.039101,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"ocs-external-storagecluster","namespace":"openshift-storage","error":"dial tcp 10.220.144.11:9283: i/o timeout","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
- The customer is using the command below to extract the Ceph cluster details. They do not pass --monitoring-endpoint xxx.xxx.xxx.xxx --monitoring-endpoint-port xxxx in the command.
# python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt --rbd-data-pool-name tocptsatt1_rbd --run-as-user client.openshift
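(For reference, the same exporter script also accepts the monitoring endpoint explicitly via the flags mentioned above; an illustrative invocation with the values from the attached JSON would be:)
# python3 ceph-external-cluster-details-exporter-tocptsatt1.py.txt --rbd-data-pool-name tocptsatt1_rbd --run-as-user client.openshift --monitoring-endpoint 10.220.144.11 --monitoring-endpoint-port 9283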
- From the attached JSON output, the script picks up the correct "MonitoringEndpoint": "10.220.144.11" and "MonitoringPort": "9283".
[amenon@supportshell-1 03469922]$ cat 0110-output_temelli_tocptsatt1_13042023.json
[{"name": "rook-ceph-mon-endpoints", "kind": "ConfigMap", "data": {"data": "nethis01=10.220.144.11:6789", "maxMonId": "0", "mapping": "{}"}}, {"name": "rook-ceph-mon", "kind": "Secret", "data": {"admin-secret": "admin-secret", "fsid": "5bec7060-5c41-11ed-96cf-d4f5ef0ea320", "mon-secret": "mon-secret"}}, {"name": "rook-ceph-operator-creds", "kind": "Secret", "data": {"userID": "client.openshift", "userKey": "AQDDZ4RjgN8lDBAALK3Zu2Da44kM2Ic8hQ7yxA=="}}, {"name": "monitoring-endpoint", "kind": "CephCluster", "data": {"MonitoringEndpoint": "10.220.144.11", "MonitoringPort": "9283"}}, {"name": "ceph-rbd", "kind": "StorageClass", "data": {"pool": "tocptsatt1_rbd"}}, {"name": "rook-csi-rbd-node", "kind": "Secret", "data": {"userID": "csi-rbd-node", "userKey": "AQD2aIRjuXrSFRAAFG3VLrA14ajTiFJQttU0JQ=="}}, {"name": "rook-csi-rbd-provisioner", "kind": "Secret", "data": {"userID": "csi-rbd-provisioner", "userKey": "AQD2aIRjEqoXFhAAjNd6bxK05rKz4D7RnJmP9A=="}}, {"name": "rook-csi-cephfs-provisioner", "kind": "Secret", "data": {"adminID": "csi-cephfs-provisioner", "adminKey": "AQD2aIRjQyiQFhAA+z/9ZlPDJqlNDZWzC1i3Bg=="}}, {"name": "rook-csi-cephfs-node", "kind": "Secret", "data": {"adminID": "csi-cephfs-node", "adminKey": "AQD2aIRj5ldYFhAAPwabKIRY1Q/eiJjcnKR0hg=="}}, {"name": "rook-ceph-dashboard-link", "kind": "Secret", "data": {"userID": "ceph-dashboard-link", "userKey": "https://10.220.144.11:8443/"}}, {"name": "cephfs", "kind": "StorageClass", "data": {"fsName": "openshift_mds", "pool": "cephfs_data"}}]
Version-Release number of selected component (if applicable):
- The Ceph cluster is healthy.
[amenon@supportshell-1 ceph]$ cat ceph_status
  cluster:
    id:     5bec7060-5c41-11ed-96cf-d4f5ef0ea320
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum nethis01,nethis04,nethis07 (age 3d) <------------------------------------
    mgr: nethis01.uqavdj(active, since 2d), standbys: nethis04.qwikoq, nethis07.wkgamz
    mds: 1/1 daemons up, 1 standby
    osd: 30 osds: 30 up (since 4h), 30 in (since 4h) <------------------------------------
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 577 pgs
    objects: 12.05k objects, 26 GiB
    usage:   95 GiB used, 44 TiB / 44 TiB avail
    pgs:     577 active+clean

  io:
    client:   511 B/s rd, 199 KiB/s wr, 0 op/s rd, 20 op/s wr
- The customer is able to curl 10.220.144.11:9283.
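To bound the latency as well as confirm reachability, the same curl can be capped and timed; an illustrative invocation (assuming the Ceph mgr Prometheus module answers plain HTTP on this port, and using a 5-second cap to mirror the operator's dial timeout mentioned in comment 15):
$ curl -sS --max-time 5 -o /dev/null -w 'http_code=%{http_code} time_total=%{time_total}s\n' http://10.220.144.11:9283/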
Actual results:
- External Ceph storage integration is failing.
Expected results:
- External Ceph storage integration should succeed.
Additional info:
- All logs and other information are available on supportshell under ~/03469922
Comment 15 - arun kumar mohan
2023-05-24 09:42:43 UTC
It seems we are getting all the details correctly, and from the ocs-operator we are trying to reach the correct <IP>:<PORT> combination.
@amenon, meanwhile can you check with the customer:
a. whether the same endpoint '10.220.144.11:9283' is reachable from the ocs-operator pod?
b. if it is reachable, whether there is a lag of more than 5 seconds? (A sketch of such a check is shown at the end of this comment.)
If the answer to <a> is no, then the issue is genuine.
If the endpoint is reachable but there is a large network lag of more than 5 seconds (currently we time out at 5 seconds max), then we also have a genuine network issue on the client side (not related to the ocs-operator).
The third case is a real bug: we can reach the monitoring endpoint externally (from the ocs-operator pod), but not through the ocs-operator code (that is, we still see the error message).
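For completeness, one possible way to run checks <a> and <b> from inside the ocs-operator pod could look like this (a sketch only; it assumes curl, bash, and timeout are available in the operator image and that `oc exec deploy/ocs-operator` resolves to the right pod):
$ oc -n openshift-storage exec deploy/ocs-operator -- curl -sS --max-time 5 -o /dev/null -w 'time_total=%{time_total}s\n' http://10.220.144.11:9283/
$ # plain TCP fallback if curl is not available in the image
$ oc -n openshift-storage exec deploy/ocs-operator -- timeout 5 bash -c 'echo > /dev/tcp/10.220.144.11/9283 && echo reachable'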