Description of problem: [RFE] Improve OSD heartbeat_check log message by including host name (besides OSD numbers) When diagnosing problems in Ceph related to heartbeat we would appreciate to be able to see if all reported failures are from the same host (or from different host). That will make it easier to see where the problems are. Actual Behavior: ------------------------------ This is printed: May 3 01:17:54 ceph-osdnode1 bash: 2016-05-03 01:17:54.280170 7f63eee57700 -1 osd.10 1748 heartbeat_check: no reply from osd.24 since back 2016-05-03 01:17:34.148919 front 2016-05-03 01:17:34.148919 (cutoff 2016-05-03 01:17:34.280165) Proposed behavior: heartbeat_check: no reply from ceph-osdnode2:osd.24 Version-Release number of selected component (if applicable): Red Hat Ceph Storage 1.3.2
Upstream Tracker : http://tracker.ceph.com/issues/15762
File : src/osd/OSD.cc void OSD::heartbeat_check() { [..] else { derr << "heartbeat_check: no reply from osd." << p->first << " since back " << p->second.last_rx_back << " front " << p->second.last_rx_front << " (cutoff " << cutoff << ")" << dendl; // fail failure_queue[p->first] = MIN(p->second.last_rx_back, p->second.last_rx_front); } [..] } ^^ This is the message which needs enhancement. OSD numbers comes from map<int,HeartbeatInfo>::iterator p. for (map<int,HeartbeatInfo>::iterator p = heartbeat_peers.begin(); p != heartbeat_peers.end(); ++p) { and If we check HeartbeatInfo structure : /// information about a heartbeat peer struct HeartbeatInfo { int peer; ///< peer ConnectionRef con_front; ///< peer connection (front) ConnectionRef con_back; ///< peer connection (back) utime_t first_tx; ///< time we sent our first ping request utime_t last_tx; ///< last time we sent a ping request utime_t last_rx_front; ///< last time we got a ping reply on the front side utime_t last_rx_back; ///< last time we got a ping reply on the back side epoch_t epoch; ///< most recent epoch we wanted this peer bool is_unhealthy(utime_t cutoff) { return ! ((last_rx_front > cutoff || (last_rx_front == utime_t() && (last_tx == utime_t() || first_tx > cutoff))) && (last_rx_back > cutoff || (last_rx_back == utime_t() && (last_tx == utime_t() || first_tx > cutoff)))); } bool is_healthy(utime_t cutoff) { return last_rx_front > cutoff && last_rx_back > cutoff; } }; - I am not sure if we can get host name from any of them.
Upstream master branch patch : https://github.com/ceph/ceph/pull/9223
Upstream Jewel Backport Tracker : http://tracker.ceph.com/issues/16337 Upstream Jewel Backport PR : https://github.com/ceph/ceph/pull/9739
$ git tag --contains d5c77c8ff6ad0e4b4242669e4f218f607f433310 v10.2.4 v10.2.5 $ git tag --contains 47605a2fdde65265e3c4dc60aac206c6ae712be5 v10.2.4 v10.2.5 This is in POST and both commits are in v10.2.5 and I hope in 2.2 we are taking 10.2.5 any specific reason moving this bug to 2.3?
If this is fixed upstream in v10.2.5, it was probably an oversight to target this to 2.3. Re-targeting to 2.2.
Already discussed with Vikhyat and he already has the fix,log message is not as expected in the current build i.e. 10.2.5-7.el7cp.x86_64 2017-01-24 17:30:34.860120 7f97aa1a3700 -1 osd.3 66 heartbeat_check: no reply from 0x7f97c408ce10 osd.1 since back 2017-01-24 17:29:49.108840 front 2017-01-24 17:29:49.108840 (cutoff 2017-01-24 17:29:54.850786) Here host ip is not shown properly, hence moving this back to assigned.
I created this upstream tracker[1] for this issue. In the master branch, this enhancement is working fine, the issue is only in jewel backport. Tomorrow I will send a PR for upstream jewel branch. [1] http://tracker.ceph.com/issues/18657
Looks like it got merged. Would you please get the cherry-pick done?
(In reply to Gregory Meno from comment #14) > Looks like it got merged. > Would you please get the cherry-pick done? Hi Gregory, That was the old patch in comment#5 which has the issue. New PR[1] is still not merged in the jewel and tracker for this issue is given in comment#11. This issue is only present in jewel backport. [1] https://github.com/ceph/ceph/pull/13108 Thank you, Vikhyat
Works for IPv6 as well 2017-02-07 07:35:22.477552 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529) 2017-02-07 07:35:22.477557 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6806 osd.7 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529) 2017-02-07 07:35:23.460074 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1bec]:6807 osd.0 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073) 2017-02-07 07:35:23.460085 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073) Marking as verified.
I added doc text
Bara, Looks good with the edit I made. thank you
Great job!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0514.html