+++ This bug was initially created as a clone of Bug #2141003 +++
+++ This bug was initially created as a clone of Bug #2118627 +++

<huge snip>

The core problem was twofold:

1. Incorrect network configuration, which resulted in RBD mirror failing to establish a healthy connection with its peers.
   - From a troubleshooting point of view, this is resolved using rook krew plugins and RH Insights rules to diagnose the issue faster.
2. Intermittent network connectivity issues, which MAY result in mirroring delays and schedule lags.
   - Schedule lags are covered by scraping rbd mirror image status information and presenting it to the user as a lastSyncTime for a volume (or a group of volumes).
   - This BZ requests an additional health metric from Ceph for peer connectivity, to help observe and accelerate troubleshooting of these intermittent network issues (by focusing on the network stack rather than on RBD peer warning states).

The part of the discussion relevant to this is pasted from the original BZ below:

(In reply to Shyamsundar from comment #24)
> Thanks Annette!
>
> @idryomov we would potentially need a health metric per peer from
> the RBD mirror daemon that can demonstrate peer connectivity. For example,
> if peer connectivity has hiccups over time, or is down for a longer period,
> it is relevant to look at this health metric and determine whether
> connectivity to a peer was active or had issues. The health check should also
> reflect actual Ceph connectivity, not a generic ping.
>
> Should we open a separate BZ for this and track it as a dependency here?

Hi Shyam,

Sure, feel free to open an RHCS BZ.

As discussed in the meeting, this isn't going to be trivial to implement, but we may be able to slice up the problem space and derive something from secondary indicators such as the startup time or, post-startup, things like overall throughput. That said, any such derivation would be in direct contradiction with the ask ...

(In reply to Shyamsundar from comment #25)
> This hence potentially brings this down to the RBD mirror daemon reporting a
> per-peer connectivity health metric. Technically the current mirror pool
> health shows up as a WARNING, but a finer distinction of the WARNING as
> applicable would help in such cases.

... because that finer distinction is exactly the tricky part here.
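For context, a hedged illustration of why the existing surface is too coarse: today the per-peer state is observed via `rbd mirror pool status`, which rolls everything up into a single health field. The pool name "mypool" below is a placeholder, not taken from this BZ.

```
# Inspect the current, coarse-grained mirroring health for a pool
# ("mypool" is a placeholder pool name):
rbd mirror pool status mypool --verbose

# The rolled-up "health" field can report WARNING without distinguishing
# a misconfigured peer from an intermittently flaky network link -- the
# finer distinction this BZ asks Ceph to provide.
```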
--- Additional comment from Mudit Agarwal on 2022-11-07 19:29:10 EST ---

Not a TP blocker. Shyam, please open an RHCS BZ.

--- Additional comment from Ilya Dryomov on 2023-03-13 20:29:55 UTC ---

Hi JuanMi,

I assume Pere's https://github.com/ceph/ceph/pull/50393 is the first step towards this?

--- Additional comment from Juan Miguel Olmo on 2023-05-11 15:54:53 UTC ---

Yes, Ilya: https://github.com/ceph/ceph/pull/50393 is the first step toward solving this.

--- Additional comment from Greg Farnum on 2023-05-11 16:04:55 UTC ---

There's nothing about this bz or upstream PR that merits it being in POST.

--- Additional comment from Mudit Agarwal on 2023-09-12 10:26:43 UTC ---

AFAIK this is just a backport; is there any reason we are not taking this in 6.1z2? This is a blocker for Regional DR.

--- Additional comment from Juan Miguel Olmo on 2023-09-12 10:34:36 UTC ---

Backport to reef ongoing: https://github.com/ceph/ceph/pull/53033

--- Additional comment from Ilya Dryomov on 2023-09-12 12:44:58 UTC ---

(In reply to Juan Miguel Olmo from comment #22)
> Backport to reef ongoing:
> https://github.com/ceph/ceph/pull/53033

Hi JuanMi,

This is 6.1, so the upstream backport to reef is irrelevant. This would be a downstream-only backport to quincy. If there is no other BZ you were planning to use (I'm completely lost in monitoring/metrics BZs), let's move this one back to 6.1z2.
--- Additional comment from Scott Ostapovicz on 2023-09-12 12:50:33 UTC ---

Retargeted to 6.1 z2.

--- Additional comment from Sunil Angadi on 2023-09-14 10:39:29 UTC ---

Hi Pere,

When can we expect this bz to be ON_QA? Also, can you please provide the steps to verify this bz? What actions cause the metrics counter values to change? Are only metrics covered, or alerts as well?

--- Additional comment from Pere Diaz Bou on 2023-09-15 08:51:12 UTC ---

Hi Sunil,

Currently these metrics are exposed only by ceph-exporter, not by the prometheus mgr module.

You can easily find them by deploying a cluster with ceph-exporter and then looking for the metrics "ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts" and "ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts".

```
# on a vstart.sh setup; on a cephadm setup, let cephadm deploy ceph-exporter
bin/ceph-exporter --sock-dir $PWD/asok --port 9999 --conf ceph.conf --addrs 127.0.0.1

# find the new metrics
curl localhost:9999/metrics | grep ceph_AsyncMessenger_Worker_msgr_
```

You will find something like:

```
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="1"} 0
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="2"} 0
# HELP ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts Number of not yet ready connections declared as dead
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="0"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="1"} 0
```

--- Additional comment from Sunil Angadi on 2023-09-20 05:50:36 UTC ---

Hi Build team,

Can you please get a build for this bz to ON_QA soon?

--- Additional comment from Thomas Serlin on 2023-09-20 06:01:27 UTC ---

(In reply to Sunil Angadi from comment #27)
> Hi Build team,
> can you please get a build for this bz to ON_QA soon?

Sorry, I missed this, but are the commits even downstream on ceph-6.1-rhel-patches?

https://gitlab.cee.redhat.com/ceph/ceph/-/commits/ceph-6.1-rhel-patches

Thomas

--- Additional comment from Ilya Dryomov on 2023-09-20 06:41:04 UTC ---

(In reply to tserlin from comment #28)
> Sorry, I missed this, but are the commits even downstream on
> ceph-6.1-rhel-patches?
>
> https://gitlab.cee.redhat.com/ceph/ceph/-/commits/ceph-6.1-rhel-patches

Hi Thomas,

No, the downstream MR is still being reviewed by Radoslaw (the component on this BZ is misleading -- it originated as a Regional DR requirement, hence RBD-Mirror, but the change ended up being entirely in RADOS).

--- Additional comment from Pere Diaz Bou on 2023-09-21 10:03:28 UTC ---

As an update: while testing https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/379?commit_id=fa70882a9b477e74233fe8ce2fed0218e83ec09c (the MR for this bz), we found that important labels like `ceph_daemon` were missing from labeled perfcounters. It should be OK to merge into 6.1-patches, but note that it will be unusable until a fix for that is merged and backported: https://github.com/ceph/ceph/pull/53523.

--- Additional comment from Ilya Dryomov on 2023-09-21 12:53:16 UTC ---

Pushed to ceph-6.1-rhel-patches based on Radoslaw's approval in https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/379.

--- Additional comment from errata-xmlrpc on 2023-09-21 17:17:46 UTC ---

Bug report changed to ON_QA status by Errata System.
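Since these are plain counters, one hedged way to see them actually move under induced network trouble is to poll the endpoint and print only changed samples. A minimal sketch, assuming the vstart-style exporter on localhost:9999 from the comment above; the loop itself is illustrative, not part of the fix:

```
# Illustrative only: poll the new connection-timeout counters every 30s
# and print a timestamped snapshot whenever any value changes.
# Assumes a vstart-style ceph-exporter on localhost:9999 (see above).
prev=""
while true; do
    cur=$(curl -s localhost:9999/metrics | grep '^ceph_AsyncMessenger_Worker_msgr_connection')
    if [ "$cur" != "$prev" ]; then
        date
        echo "$cur"
    fi
    prev="$cur"
    sleep 30
done
```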
A QE request has been submitted for advisory RHSA-2023:118540-05

https://errata.devel.redhat.com/advisory/118540

--- Additional comment from errata-xmlrpc on 2023-09-21 17:17:53 UTC ---

This bug has been added to advisory RHSA-2023:118540 by Thomas Serlin (tserlin)

--- Additional comment from Sunil Angadi on 2023-09-22 07:21:53 UTC ---

(In reply to Pere Diaz Bou from comment #26)
> Currently these metrics are exposed only by ceph-exporter, not by the
> prometheus mgr module.
>
> <snip verification steps quoted from comment #26>

Tested using ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable).

Hi Pere,

After deploying the ceph-exporter service, it shows no port in `ceph orch ps`:

```
[ceph: root@ceph-rbd1-sbz-i8apyw-node1-installer /]# ceph orch ps
NAME                                                HOST                                  PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION           IMAGE ID      CONTAINER ID
alertmanager.ceph-rbd1-sbz-i8apyw-node4             ceph-rbd1-sbz-i8apyw-node4            *:9093,9094  running (28m)  100s ago   29m    27.3M        -  0.24.0            1616eb1ba330  293193d384a7
ceph-exporter.ceph-rbd1-sbz-i8apyw-node1-installer  ceph-rbd1-sbz-i8apyw-node1-installer               running (22m)  100s ago   22m    6089k        -  17.2.6-146.el9cp  e2affc76d032  4895a43f30b7
grafana.ceph-rbd1-sbz-i8apyw-node1-installer        ceph-rbd1-sbz-i8apyw-node1-installer  *:3000       running (28m)  100s ago   28m    86.8M        -  9.4.7             75cd3ecd64ca  c763681e1b08
mgr.ceph-rbd1-sbz-i8apyw-node1-installer.hksbsa     ceph-rbd1-sbz-i8apyw-node1-installer  *:9283       running (88m)  100s ago   88m     487M        -  17.2.6-146.el9cp  e2affc76d032  df120c8228a0
mgr.ceph-rbd1-sbz-i8apyw-node3.iltfsf               ceph-rbd1-sbz-i8apyw-node3            *:8443       running (86m)  8m ago     86m     407M        -  17.2.6-146.el9cp  e2affc76d032  8d30677637d3
```

ceph-exporter is running on the admin node, i.e. node1, but it is not reachable on port 9999:

```
[ceph: root@ceph-rbd1-sbz-i8apyw-node1-installer /]# curl http://10.0.206.210:9999/metrics
curl: (7) Failed to connect to 10.0.206.210 port 9999: Connection refused
```

So I am not able to get any of the above-mentioned metrics. Can you please check?
--- Additional comment from Sunil Angadi on 2023-09-25 07:24:38 UTC ---

Checked with Pere and Nizam; the port ceph-exporter is actually listening on is 9926:

```
[root@ceph-rbd1-sbz-i8apyw-node1-installer cephuser]# curl -s http://10.0.206.210:9926/metrics | grep ceph_AsyncMessenger_Worker_msgr_
# HELP ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts Number of connections closed due to idleness
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="0"} 65
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="1"} 133
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="2"} 0
# HELP ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts Number of not yet ready connections declared as dead
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="0"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="1"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="2"} 0
```

Since this PR exposes these metrics only via ceph-exporter, I am able to see the network connection metrics mentioned in the PR.

Verified using ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable).
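For anyone hitting the same port confusion: a hedged sketch, using standard tooling rather than anything specific to this fix, to confirm where a cephadm-deployed ceph-exporter is actually listening before querying it. The host IP below is the one from the comments above and is environment-specific:

```
# Confirm the port the cephadm-deployed ceph-exporter is listening on
# (9926 here, not the 9999 used in the vstart example):
ss -tlnp | grep ceph-exporter

# Then query the metrics endpoint on that port:
curl -s http://10.0.206.210:9926/metrics | grep ceph_AsyncMessenger_Worker_msgr_
```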
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7780