Bug 1333809
Summary: | [RFE] Improve OSD heartbeat_check log message by including host name or IP (besides OSD numbers) | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
Component: | RADOS | Assignee: | Vikhyat Umrao <vumrao> |
Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
Severity: | medium | Docs Contact: | Bara Ancincova <bancinco> |
Priority: | medium | ||
Version: | 1.3.2 | CC: | ceph-eng-bugs, dzafman, gmeno, hnallurv, jdurgin, kchai, kdreyer, sjust, uboppana, vumrao |
Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
Target Release: | 2.2 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHEL: ceph-10.2.5-16.el7cp Ubuntu: ceph_10.2.5-9redhat1xenial | Doc Type: | Enhancement |
Doc Text: |
.OSD heartbeat_check log messages now include IP addresses
The OSD `heartbeat_check` log messages now include IP addresses of the OSD nodes. This enhancement improves identification of the OSD nodes in the Ceph logs. For example, it is no longer necessary to look up which IP correlates to which OSD node (OSD.<number>) for the `heartbeat_check` message in the log.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-03-14 15:44:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1402076 | ||
Bug Blocks: | 1258382, 1412948 |
Description
Vikhyat Umrao
2016-05-06 13:11:16 UTC
Upstream Tracker : http://tracker.ceph.com/issues/15762 File : src/osd/OSD.cc void OSD::heartbeat_check() { [..] else { derr << "heartbeat_check: no reply from osd." << p->first << " since back " << p->second.last_rx_back << " front " << p->second.last_rx_front << " (cutoff " << cutoff << ")" << dendl; // fail failure_queue[p->first] = MIN(p->second.last_rx_back, p->second.last_rx_front); } [..] } ^^ This is the message which needs enhancement. OSD numbers comes from map<int,HeartbeatInfo>::iterator p. for (map<int,HeartbeatInfo>::iterator p = heartbeat_peers.begin(); p != heartbeat_peers.end(); ++p) { and If we check HeartbeatInfo structure : /// information about a heartbeat peer struct HeartbeatInfo { int peer; ///< peer ConnectionRef con_front; ///< peer connection (front) ConnectionRef con_back; ///< peer connection (back) utime_t first_tx; ///< time we sent our first ping request utime_t last_tx; ///< last time we sent a ping request utime_t last_rx_front; ///< last time we got a ping reply on the front side utime_t last_rx_back; ///< last time we got a ping reply on the back side epoch_t epoch; ///< most recent epoch we wanted this peer bool is_unhealthy(utime_t cutoff) { return ! ((last_rx_front > cutoff || (last_rx_front == utime_t() && (last_tx == utime_t() || first_tx > cutoff))) && (last_rx_back > cutoff || (last_rx_back == utime_t() && (last_tx == utime_t() || first_tx > cutoff)))); } bool is_healthy(utime_t cutoff) { return last_rx_front > cutoff && last_rx_back > cutoff; } }; - I am not sure if we can get host name from any of them. Upstream master branch patch : https://github.com/ceph/ceph/pull/9223 Upstream Jewel Backport Tracker : http://tracker.ceph.com/issues/16337 Upstream Jewel Backport PR : https://github.com/ceph/ceph/pull/9739 $ git tag --contains d5c77c8ff6ad0e4b4242669e4f218f607f433310 v10.2.4 v10.2.5 $ git tag --contains 47605a2fdde65265e3c4dc60aac206c6ae712be5 v10.2.4 v10.2.5 This is in POST and both commits are in v10.2.5 and I hope in 2.2 we are taking 10.2.5 any specific reason moving this bug to 2.3? If this is fixed upstream in v10.2.5, it was probably an oversight to target this to 2.3. Re-targeting to 2.2. Already discussed with Vikhyat and he already has the fix,log message is not as expected in the current build i.e. 10.2.5-7.el7cp.x86_64 2017-01-24 17:30:34.860120 7f97aa1a3700 -1 osd.3 66 heartbeat_check: no reply from 0x7f97c408ce10 osd.1 since back 2017-01-24 17:29:49.108840 front 2017-01-24 17:29:49.108840 (cutoff 2017-01-24 17:29:54.850786) Here host ip is not shown properly, hence moving this back to assigned. I created this upstream tracker[1] for this issue. In the master branch, this enhancement is working fine, the issue is only in jewel backport. Tomorrow I will send a PR for upstream jewel branch. [1] http://tracker.ceph.com/issues/18657 Looks like it got merged. Would you please get the cherry-pick done? (In reply to Gregory Meno from comment #14) > Looks like it got merged. > Would you please get the cherry-pick done? Hi Gregory, That was the old patch in comment#5 which has the issue. New PR[1] is still not merged in the jewel and tracker for this issue is given in comment#11. This issue is only present in jewel backport. [1] https://github.com/ceph/ceph/pull/13108 Thank you, Vikhyat Works for IPv6 as well 2017-02-07 07:35:22.477552 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529) 2017-02-07 07:35:22.477557 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6806 osd.7 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529) 2017-02-07 07:35:23.460074 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1bec]:6807 osd.0 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073) 2017-02-07 07:35:23.460085 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073) Marking as verified. I added doc text Bara, Looks good with the edit I made. thank you Great job! Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0514.html |