Bug 2240766 - [RFE] RBD mirroring (and related processes) needs a connection health metric to its peers
Summary: [RFE] RBD mirroring (and related processes) needs a connection health metric to its peers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0
Assignee: Pere Diaz Bou
QA Contact: Sunil Angadi
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-09-26 11:29 UTC by Sunil Angadi
Modified: 2024-04-12 04:25 UTC
CC: 22 users

Fixed In Version: ceph-18.2.0-50.el9cp
Doc Type: Enhancement
Doc Text:
Clone Of: 2141003
Environment:
Last Closed: 2023-12-13 15:24:06 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7549 0 None None None 2023-09-26 11:30:39 UTC
Red Hat Product Errata RHBA-2023:7780 0 None None None 2023-12-13 15:24:09 UTC

Description Sunil Angadi 2023-09-26 11:29:44 UTC
+++ This bug was initially created as a clone of Bug #2141003 +++

+++ This bug was initially created as a clone of Bug #2118627 +++

<huge snip>

The core problem was twofold:
1. Incorrect network configuration, which resulted in the RBD mirror daemon failing to establish a healthy connection with its peers:
  - This is addressed from a troubleshooting point of view by using the Rook krew plugin and Red Hat Insights rules to diagnose the issue faster

2. Intermittent network connectivity issues, which MAY result in mirroring delays and schedule lags
  - Schedule lags are covered by scraping rbd mirror image status information and presenting it to the user as a lastSyncTime per volume (or group of volumes); see the sketch after this list
  - This BZ requests an additional health metric from Ceph for peer connectivity, to help observe and accelerate troubleshooting of these intermittent network issues (by focusing on the network stack rather than on RBD peer warning states)
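
For context, a minimal sketch of the existing commands that the schedule-lag scraping above draws on (pool and image names are placeholders, not from this BZ):

```
# Per-image mirroring status; the per-peer entries (state, description,
# last_update) carry the kind of information scraped for lastSyncTime.
rbd mirror image status mypool/myimage

# Pool-level summary, including per-peer health (OK / WARNING / ERROR).
rbd mirror pool status mypool --verbose
```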

Part of the discussion relevant to this is pasted from the original BZ below:

(In reply to Shyamsundar from comment #24)
> Thanks Annette!
> 
> @idryomov we would potentially need a health metric per peer from
> the RBD mirror daemon that can demonstrate peer connectivity. For example
> over time if peer connectivity has hiccups or is down for a longer period of
> time etc. it is relevant to look at this health metric and ensure
> connectivity to a peer was active or had issues. The health should also
> include actual ceph connectivity and not the generic ping.
> 
> Should we open a separate BZ for this and track it as a dependency here?

Hi Shyam,

Sure, feel free to open an RHCS BZ.  As discussed in the meeting, this isn't going to be trivial to implement but we may be able to slice up the problem space and derive something from secondary indicators such as the startup time or, post startup, things like overall throughput.

That said, any such derivation would be in direct contradiction with the ask

(In reply to Shyamsundar from comment #25)
> This hence potentially brings this down to RBD mirror daemon to report a
> per-peer connectivity health metric. Technically the current mirror pool
> health is showing up a WARNING but finer distinction of the WARNING as
> applicable would help in such cases.

... because that finer distinction is exactly the tricky part here.

--- Additional comment from Mudit Agarwal on 2022-11-07 19:29:10 EST ---

Not a TP blocker. Shyam, please open an RHCS BZ.

--- Additional comment from Ilya Dryomov on 2023-03-13 20:29:55 UTC ---

Hi JuanMi,

I assume Pere's https://github.com/ceph/ceph/pull/50393 is the first step towards this?

--- Additional comment from Ilya Dryomov on 2023-04-21 09:09:35 UTC ---



--- Additional comment from Juan Miguel Olmo on 2023-05-11 15:54:53 UTC ---

Yes, Ilya:
https://github.com/ceph/ceph/pull/50393 is the first step towards solving this.

--- Additional comment from Greg Farnum on 2023-05-11 16:04:55 UTC ---

There's nothing about this bz or upstream PR that merits it being in POST.

--- Additional comment from Mudit Agarwal on 2023-09-12 10:26:43 UTC ---

AFAIK this is just a backport; is there any reason we are not taking this in 6.1z2? This is a blocker for Regional DR.

--- Additional comment from Juan Miguel Olmo on 2023-09-12 10:34:36 UTC ---

Backport to reef is ongoing:
https://github.com/ceph/ceph/pull/53033

--- Additional comment from Ilya Dryomov on 2023-09-12 12:44:58 UTC ---

(In reply to Juan Miguel Olmo from comment #22)
> Backport to reef on going:
> https://github.com/ceph/ceph/pull/53033

Hi JuanMi,

This is 6.1, so upstream backport to reef is irrelevant.  This would be a downstream-only backport to quincy.

If there is no other BZ you were planning to use (I'm completely lost in monitoring/metrics BZs), let's move this one back to 6.1z2.

--- Additional comment from Scott Ostapovicz on 2023-09-12 12:50:33 UTC ---

Retargeted to 6.1 z2

--- Additional comment from Sunil Angadi on 2023-09-14 10:39:29 UTC ---

Hi Pere,
When can we expect this bz to move to ON_QA?

Also, can you please provide the steps to verify this bz?
What actions cause the metric counter values to be incremented?
Are only metrics covered, or alerts as well?

--- Additional comment from Pere Diaz Bou on 2023-09-15 08:51:12 UTC ---

Hi Sunil,

Currently these metrics are only exposed by ceph-exporter, not by the prometheus mgr module.

You can easily find them by deploying a cluster with ceph-exporter and then looking for the metrics "ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts" and "ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts".

```
# on vstart.sh setup. On cephadm setup let cephadm deploy ceph-exporter
bin/ceph-exporter --sock-dir $PWD/asok --port 9999 --conf ceph.conf --addrs 127.0.0.1

# find new metrics
curl localhost:9999/metrics | grep ceph_AsyncMessenger_Worker_msgr_
``` 

you will find something like:
```
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="1"} 0
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="2"} 0
# HELP ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts Number of not yet ready connections declared as dead
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="0"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="1"} 0
```
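
To watch for intermittent connectivity over time, a minimal sketch (assuming the same local exporter endpoint as in the example above) is to sample the counters twice and compare:

```
# Sample the timeout counters, wait, sample again, and report any change.
# Counters that increase between samples indicate connections declared dead
# or closed for idleness during that window.
before=$(curl -s localhost:9999/metrics | grep '^ceph_AsyncMessenger_Worker_msgr_connection_' | sort)
sleep 60
after=$(curl -s localhost:9999/metrics | grep '^ceph_AsyncMessenger_Worker_msgr_connection_' | sort)
if diff <(echo "$before") <(echo "$after") > /dev/null; then
    echo "no change in msgr connection timeout counters"
else
    echo "msgr connection timeout counters changed in the last 60s"
fi
```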

--- Additional comment from Sunil Angadi on 2023-09-20 05:50:36 UTC ---

Hi Build team,
can you please get a build for this bz to ON_QA soon?

--- Additional comment from Thomas Serlin on 2023-09-20 06:01:27 UTC ---

(In reply to Sunil Angadi from comment #27)
> Hi Build team,
> can you please get a build for this bz to ON_QA soon?

Sorry, I missed this, but are the commits even downstream on ceph-6.1-rhel-patches?

https://gitlab.cee.redhat.com/ceph/ceph/-/commits/ceph-6.1-rhel-patches

Thomas

--- Additional comment from Ilya Dryomov on 2023-09-20 06:41:04 UTC ---

(In reply to tserlin from comment #28)
> (In reply to Sunil Angadi from comment #27)
> > Hi Build team,
> > can you please get a build for this bz to ON_QA soon?
> 
> Sorry, I missed this, but are the commits even downstream on
> ceph-6.1-rhel-patches?
> 
> https://gitlab.cee.redhat.com/ceph/ceph/-/commits/ceph-6.1-rhel-patches

Hi Thomas,

No, the downstream MR is still being reviewed by Radoslaw (the component on this BZ is misleading -- it originated as a Regional DR requirement, hence RBD-Mirror, but the change ended up being entirely in RADOS).

--- Additional comment from Pere Diaz Bou on 2023-09-21 10:03:28 UTC ---

As an update, we found out while testing https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/379?commit_id=fa70882a9b477e74233fe8ce2fed0218e83ec09c (the MR of this bz) that important labels like `ceph_daemon` were missing from labeled perfcounters. It should be OK to merge into 6.1-patches, but it should be noted that it will be unusable until a fix for that is merged and backported: https://github.com/ceph/ceph/pull/53523.

--- Additional comment from Ilya Dryomov on 2023-09-21 12:53:16 UTC ---

Pushed to ceph-6.1-rhel-patches based on Radoslaw's approval in https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/379.

--- Additional comment from errata-xmlrpc on 2023-09-21 17:17:46 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHSA-2023:118540-05
https://errata.devel.redhat.com/advisory/118540

--- Additional comment from errata-xmlrpc on 2023-09-21 17:17:53 UTC ---

This bug has been added to advisory RHSA-2023:118540 by Thomas Serlin (tserlin)

--- Additional comment from Sunil Angadi on 2023-09-22 07:21:53 UTC ---



(In reply to Pere Diaz Bou from comment #26)
> Hi Sunil,
> 
> Currently these are only metrics exposed by ceph-exporter and not from the
> prometheus mgr module.
> 
> You can easily find them by deploying a cluster with ceph-exporter and the
> looking for metrics that include
> "ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts"  and
> "ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts".
> 
> ```
> # on vstart.sh setup. On cephadm setup let cephadm deploy ceph-exporter
> bin/ceph-exporter --sock-dir $PWD/asok --port 9999 --conf ceph.conf --addrs
> 127.0.0.1
> 
> # find new metrics
> curl localhost:9999/metrics | grep ceph_AsyncMessenger_Worker_msgr_
> ``` 
> 
> you will find something like:
> ```
> ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="1"} 0
> ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="2"} 0
> # HELP ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts Number of
> not yet ready connections declared as dead
> # TYPE ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts counter
> ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="0"} 0
> ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="1"} 0
> ```

Tested using:
ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Hi Pere,

After deploying the ceph-exporter service, it is not running on any port:

[ceph: root@ceph-rbd1-sbz-i8apyw-node1-installer /]# ceph orch ps
NAME                                                HOST                                  PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION           IMAGE ID      CONTAINER ID
alertmanager.ceph-rbd1-sbz-i8apyw-node4             ceph-rbd1-sbz-i8apyw-node4            *:9093,9094  running (28m)   100s ago  29m    27.3M        -  0.24.0            1616eb1ba330  293193d384a7
ceph-exporter.ceph-rbd1-sbz-i8apyw-node1-installer  ceph-rbd1-sbz-i8apyw-node1-installer               running (22m)   100s ago  22m    6089k        -  17.2.6-146.el9cp  e2affc76d032  4895a43f30b7
grafana.ceph-rbd1-sbz-i8apyw-node1-installer        ceph-rbd1-sbz-i8apyw-node1-installer  *:3000       running (28m)   100s ago  28m    86.8M        -  9.4.7             75cd3ecd64ca  c763681e1b08
mgr.ceph-rbd1-sbz-i8apyw-node1-installer.hksbsa     ceph-rbd1-sbz-i8apyw-node1-installer  *:9283       running (88m)   100s ago  88m     487M        -  17.2.6-146.el9cp  e2affc76d032  df120c8228a0
mgr.ceph-rbd1-sbz-i8apyw-node3.iltfsf               ceph-rbd1-sbz-i8apyw-node3            *:8443       running (86m)     8m ago  86m     407M        -  17.2.6-146.el9cp  e2affc76d032  8d30677637d3

ceph-exporter is running on the admin node, i.e. node1:
[ceph: root@ceph-rbd1-sbz-i8apyw-node1-installer /]# curl http://10.0.206.210:9999/metrics
curl: (7) Failed to connect to 10.0.206.210 port 9999: Connection refused

So I am not able to get any of the above-mentioned metrics.

Can you please check?

--- Additional comment from Sunil Angadi on 2023-09-25 07:24:38 UTC ---

Checked with Pere and Nizam,

the port on which ceph-exporter is running is 9926:

[root@ceph-rbd1-sbz-i8apyw-node1-installer cephuser]# curl http://10.0.206.210:9926/metrics | grep ceph_AsyncMessenger_Worker_msgr_
# HELP ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts Number of connections closed due to idleness
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="0"} 65
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="1"} 133
ceph_AsyncMessenger_Worker_msgr_connection_idle_timeouts{id="2"} 0
# HELP ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts Number of not yet ready connections declared as dead
# TYPE ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts counter
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="0"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="1"} 0
ceph_AsyncMessenger_Worker_msgr_connection_ready_timeouts{id="2"} 0
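
As a quick follow-up check, a sketch (assuming the same host and port as above) to list only the counters that have already incremented:

```
# Print only the msgr connection timeout counters with a non-zero value,
# i.e. the messenger workers that have actually hit idle/ready timeouts.
curl -s http://10.0.206.210:9926/metrics \
  | grep '^ceph_AsyncMessenger_Worker_msgr_connection_' \
  | awk '$NF != 0'
```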

As this PR currently only covers metrics exposed by ceph-exporter, I am able to see the mentioned network connection metrics as per the PR.

Verified using:
ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Comment 6 errata-xmlrpc 2023-12-13 15:24:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

Comment 7 Red Hat Bugzilla 2024-04-12 04:25:31 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

