Description of problem:
-----------------------
Currently there is no metric that can alert the user when the remote cluster is unreachable or down.

=======
[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# wget http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso .
--2024-03-20 06:35:58-- http://download.eng.bos.redhat.com/rhel-8/composes/RHEL-8/RHEL-8.5.0-20220221.d.3/compose/BaseOS/x86_64/iso/RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso
[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# ls
RHEL-8.5.0-20220221.d.3-x86_64-dvd1.iso RHEL-8.6.0-20220420.3-x86_64-dvd1.iso c51fe7a5-a10f-4d54-9a28-d1d97b440a46 hello_kernel
[root@ceph1-hk-m-zy7rnm-node8 subvol_1]# mkdir .snap/snap_k3

While the sync was in progress, brought down the network on all MON nodes of the remote cluster...

[root@ceph2-hk-m-zy7rnm-node2 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up
[root@ceph2-hk-m-zy7rnm-node3 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up
[root@ceph2-hk-m-zy7rnm-node1 ~]# ifconfig eth0 down ; sleep 60 ; ifconfig eth0 up
======

No alerts are raised for the user on such disconnections. Provide a metric that can alert the admin when the target is not reachable.
I had a chat about this with Greg. Unfortunately, the messenger layer isn't the most appropriate place to look, primarily because there is no concrete perf counter at that layer that could hint at the remote being unavailable (think of an MDS failover and the connection getting re-established). So, I've changed my mind: it's better to build this at the mirror daemon layer. That requires heuristics based on MON connectivity, MDS and OSD availability, and operations making progress (even if slowly).
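
To make the idea concrete, here is a rough sketch of what such a heuristic could look like. It is not the mirror daemon's actual code: `RemoteSample`, `RemoteHealthTracker`, the field names and the 90-second grace period are all hypothetical, and the sketch assumes the daemon can periodically sample MON reachability, MDS availability, the number of OSDs that are up, and a synced-bytes counter for the remote. The remote is flagged only after it has looked unhealthy, with no sync progress at all, for longer than the grace period, so a short MDS failover or a slow but advancing sync does not raise an alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteSample:
    """One periodic observation of the remote cluster (all fields hypothetical)."""
    timestamp: float      # seconds, from any monotonic clock
    mon_reachable: bool   # could we reach a remote MON?
    mds_available: bool   # is there an active remote MDS?
    osds_up: int          # number of remote OSDs reported up
    bytes_synced: int     # cumulative bytes synced to the remote


class RemoteHealthTracker:
    """Flags the remote as unreachable after a sustained unhealthy period."""

    def __init__(self, grace_secs: float = 90.0) -> None:
        self.grace_secs = grace_secs
        self._last_ok: Optional[float] = None    # last time the remote looked fine
        self._last_bytes: Optional[int] = None   # synced-bytes counter at last sample

    def update(self, sample: RemoteSample) -> int:
        """Return 1 if the remote looks unreachable, else 0 (gauge-style value)."""
        # Any forward movement of the counter counts as progress, however slow.
        progressed = (
            self._last_bytes is not None
            and sample.bytes_synced > self._last_bytes
        )
        self._last_bytes = sample.bytes_synced

        services_ok = (
            sample.mon_reachable
            and sample.mds_available
            and sample.osds_up > 0
        )

        # Healthy services or observed progress reset the grace window.
        # The very first sample only starts the window.
        if services_ok or progressed or self._last_ok is None:
            self._last_ok = sample.timestamp

        return 1 if sample.timestamp - self._last_ok > self.grace_secs else 0
```

The returned 0/1 value could then be exported as a gauge-style metric for an alert rule to watch. How the samples are actually collected, and what a sensible grace period is, are left open here.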