Bug 2270654 - [CephFS-Mirror][RFE] - Provide metrics support for the Target Cluster Disconnection status
Summary: [CephFS-Mirror][RFE] - Provide metrics support for the Target Cluster Disconn...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 8.0
Assignee: Jos Collin
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-03-21 10:08 UTC by Hemanth Kumar
Modified: 2024-04-19 05:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 65364 0 None None None 2024-04-08 05:54:11 UTC
Red Hat Issue Tracker RHCEPH-8596 0 None None None 2024-03-21 10:13:34 UTC

Description Hemanth Kumar 2024-03-21 10:08:23 UTC
Description of problem:
-----------------------

Currently there is no metrics support that can alert the user when the remote cluster is unreachable or down.

=======

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# wget http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso .
--2024-03-20 06:35:58--  http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# ls
RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso  RHEL-8.6.0-20220420.3-x86_64-dvd1.iso  c51fe7a5-a10f-4d54-9a28-d1d97b440a46  hello_kernel

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# mkdir .snap/snap_k3                                                                                                                      

While the sync was in progress, I brought down the network on all MON nodes of the remote cluster:

[root@ceph2-hk-m-zy7rnm-node2 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


[root@ceph2-hk-m-zy7rnm-node3 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


[root@ceph2-hk-m-zy7rnm-node1 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


====== 

No alerts are raised for the user on such disconnections.

Provide a metric that can alert the admin when the target is not reachable.

Comment 5 Venky Shankar 2024-04-19 05:19:44 UTC
I had a chat about this with Greg. Unfortunately, the messenger layer isn't the most appropriate place to look. The primary reason is that there is no concrete perf counter at that layer that could hint at the remote possibly being unavailable (think MDS failover and the connection getting re-established).

So, I've changed my mind. It's better to build this at the mirror daemon layer. It requires heuristics built on MON connectivity, MDS and OSD availability, and operations making progress (even if slowly).
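
To make the idea concrete, here is a rough sketch of what such a heuristic might look like. This is not cephfs-mirror code: `PeerSample`, its fields, `peer_disconnected` and the `min_samples` threshold are all hypothetical names chosen for illustration, and the real daemon would have to source these signals itself. The assumption is that the daemon periodically records whether the remote MONs, MDS and OSDs look available and how many bytes have been synced, and reports a disconnection only after several consecutive bad observations so that a short blip (such as an MDS failover) is not flagged.

```python
from dataclasses import dataclass


@dataclass
class PeerSample:
    """One periodic observation of the remote peer (all fields hypothetical)."""
    mon_reachable: bool   # could any remote MON be reached?
    mds_available: bool   # is an active remote MDS available?
    osds_available: bool  # are the remote OSDs serving I/O?
    bytes_synced: int     # cumulative bytes synced to the peer so far


def peer_disconnected(samples, min_samples=3):
    """Return True when the last `min_samples` observations all look bad.

    An observation looks bad only when some service check fails AND no
    sync progress was made since the previous observation. Any progress,
    even slow, counts as healthy.
    """
    if len(samples) < min_samples + 1:
        return False  # not enough history to decide
    recent = samples[-(min_samples + 1):]
    for prev, cur in zip(recent, recent[1:]):
        services_ok = (cur.mon_reachable and cur.mds_available
                       and cur.osds_available)
        progressing = cur.bytes_synced > prev.bytes_synced
        if services_ok or progressing:
            return False  # at least one recent observation was healthy
    return True
```

A boolean like this could then back a gauge-style metric that an alerting rule watches. The threshold trades detection latency against false alarms and would need tuning against real failover times.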

