2270654 – [CephFS-Mirror][RFE] - Provide metrics support for the Target Cluster Disconnection status

Keywords:
Status:	NEW
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	8.0
Assignee:	Jos Collin
QA Contact:	Hemanth Kumar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-03-21 10:08 UTC by Hemanth Kumar
Modified:	2024-04-19 05:19 UTC
CC List:	5 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	65364	0	None	None	None	2024-04-08 05:54:11 UTC
Red Hat Issue Tracker	RHCEPH-8596	0	None	None	None	2024-03-21 10:13:34 UTC

Description Hemanth Kumar 2024-03-21 10:08:23 UTC

Description of problem:
-----------------------

Currently there is no metrics support that can alert the user when the remote cluster is unreachable or down.

=======

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# wget http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso .
--2024-03-20 06:35:58--  http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# ls
RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso  RHEL-8.6.0-20220420.3-x86_64-dvd1.iso  c51fe7a5-a10f-4d54-9a28-d1d97b440a46  hello_kernel

[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# mkdir .snap/snap_k3                                                                                                                      

While the sync was in progress, I brought down the network on all MON nodes of the remote cluster:

[root@ceph2-hk-m-zy7rnm-node2 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


[root@ceph2-hk-m-zy7rnm-node3 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


[root@ceph2-hk-m-zy7rnm-node1 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up


====== 

No alerts are raised to the user for such disconnections.

Provide a metric that can alert the admin when the target cluster is not reachable.

Comment 5 Venky Shankar 2024-04-19 05:19:44 UTC

I had a chat about this with Greg. Unfortunately, the messenger layer isn't the most appropriate place to look. The primary reason is that there is no concrete perf counter at that layer that could hint at the remote possibly being unavailable (think of an MDS failover and the connection getting re-established).

So, I've changed my mind. It's better to build this at the mirror daemon layer. It requires heuristics to be built based on MON connectivity, MDS and OSD availability, and operations making progress (even if slowly).
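
To make the idea concrete, here is a rough Python sketch of the kind of heuristic described above. It is only an illustration, not the mirror daemon's code: the class names, the sample fields, and the 30-second grace period are all assumptions made for the example. The remote is reported as disconnected only when MON, MDS and OSD checks fail and no sync progress has been seen for longer than the grace period, so a short MDS failover does not trip the metric.

```python
from dataclasses import dataclass


@dataclass
class RemoteSample:
    """One observation of the remote cluster (hypothetical fields)."""
    timestamp: float      # seconds on a monotonic clock
    mon_reachable: bool   # could we talk to a remote MON?
    mds_active: bool      # is there an active remote MDS?
    osds_up: int          # number of remote OSDs reported up
    bytes_synced: int     # cumulative bytes synced so far


class RemoteHealthHeuristic:
    """Derives a 0/1 'remote disconnected' gauge from periodic samples."""

    def __init__(self, grace_seconds: float = 30.0):
        self.grace = grace_seconds   # assumed value, tolerates short blips
        self._last_ok = None         # timestamp of the last healthy sample
        self._last_bytes = None      # bytes_synced from the previous sample

    def observe(self, s: RemoteSample) -> int:
        """Return 1 if the remote is considered disconnected, else 0."""
        progressed = (self._last_bytes is not None
                      and s.bytes_synced > self._last_bytes)
        self._last_bytes = s.bytes_synced

        services_ok = s.mon_reachable and s.mds_active and s.osds_up > 0
        # Slow progress still counts as healthy.
        if services_ok or progressed:
            self._last_ok = s.timestamp
            return 0

        if self._last_ok is None:
            # First sample is unhealthy: start the grace window here.
            self._last_ok = s.timestamp
            return 0

        return 1 if s.timestamp - self._last_ok > self.grace else 0
```

The returned value could then be exported as a gauge that an alerting rule watches. How the samples are actually collected (MON session state, MDS map, OSD map, sync counters) is left open here.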

Note You need to log in before you can comment on or make changes to this bug.