1333809 – [RFE] Improve OSD heartbeat_check log message by including host name or IP (besides OSD numbers)

Bug 1333809 - [RFE] Improve OSD heartbeat_check log message by including host name or IP (besides OSD numbers)

Summary: [RFE] Improve OSD heartbeat_check log message by including host name or IP (b...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	1.3.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	2.2
Assignee:	Vikhyat Umrao
QA Contact:	shylesh
Docs Contact:	Bara Ancincova
URL:
Whiteboard:
Depends On:	1402076
Blocks:	1258382 1412948
TreeView+	depends on / blocked

Reported:	2016-05-06 13:11 UTC by Vikhyat Umrao
Modified:	2019-10-10 12:04 UTC (History)
CC List:	10 users (show)
Fixed In Version:	RHEL: ceph-10.2.5-16.el7cp Ubuntu: ceph_10.2.5-9redhat1xenial
Doc Type:	Enhancement
Doc Text:	.OSD heartbeat_check log messages now include IP addresses The OSD `heartbeat_check` log messages now include IP addresses of the OSD nodes. This enhancement improves identification of the OSD nodes in the Ceph logs. For example, it is no longer necessary to look up which IP correlates to which OSD node (OSD.<number>) for the `heartbeat_check` message in the log.
Clone Of:
Environment:
Last Closed:	2017-03-14 15:44:05 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	15762	None	None	None	2016-05-06 13:17:07 UTC
Ceph Project Bug Tracker	18657	None	None	None	2017-01-25 03:13:47 UTC
Red Hat Knowledge Base (Solution)	2306041	None	None	None	2016-05-27 06:44:40 UTC
Red Hat Product Errata	RHBA-2017:0514	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.2 bug fix and enhancement update	2017-03-21 07:24:26 UTC

Description Vikhyat Umrao 2016-05-06 13:11:16 UTC

Description of problem:
[RFE] Improve OSD heartbeat_check log message by including host name (besides OSD numbers)


When diagnosing problems in Ceph related to heartbeat we would appreciate to
be able to see if all reported failures are from the same host (or from different host).

That will make it easier to see where the problems are.

Actual Behavior: 
------------------------------

This is printed:

May  3 01:17:54  ceph-osdnode1 bash: 2016-05-03 01:17:54.280170 7f63eee57700 -1 osd.10 1748 heartbeat_check: no reply from osd.24 since back 2016-05-03 01:17:34.148919 front 2016-05-03 01:17:34.148919 (cutoff 2016-05-03 01:17:34.280165)

Proposed behavior:
heartbeat_check: no reply from ceph-osdnode2:osd.24


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.2

Comment 1 Vikhyat Umrao 2016-05-06 13:17:07 UTC

Upstream Tracker : http://tracker.ceph.com/issues/15762

Comment 2 Vikhyat Umrao 2016-05-06 13:36:47 UTC

File :  src/osd/OSD.cc
void OSD::heartbeat_check()
{

[..]

else {
        derr << "heartbeat_check: no reply from osd." << p->first
             << " since back " << p->second.last_rx_back
             << " front " << p->second.last_rx_front
             << " (cutoff " << cutoff << ")" << dendl;
        // fail
        failure_queue[p->first] = MIN(p->second.last_rx_back, p->second.last_rx_front);
      }

[..]
}

^^ This is the message which needs enhancement. 

OSD numbers comes from map<int,HeartbeatInfo>::iterator p.

for (map<int,HeartbeatInfo>::iterator p = heartbeat_peers.begin();
       p != heartbeat_peers.end();
       ++p) {

and If we check HeartbeatInfo structure :

 /// information about a heartbeat peer
  struct HeartbeatInfo {
    int peer;           ///< peer
    ConnectionRef con_front;   ///< peer connection (front)
    ConnectionRef con_back;    ///< peer connection (back)
    utime_t first_tx;   ///< time we sent our first ping request
    utime_t last_tx;    ///< last time we sent a ping request
    utime_t last_rx_front;  ///< last time we got a ping reply on the front side
    utime_t last_rx_back;   ///< last time we got a ping reply on the back side
    epoch_t epoch;      ///< most recent epoch we wanted this peer

    bool is_unhealthy(utime_t cutoff) {
      return
        ! ((last_rx_front > cutoff ||
            (last_rx_front == utime_t() && (last_tx == utime_t() ||
                                            first_tx > cutoff))) &&
           (last_rx_back > cutoff ||
            (last_rx_back == utime_t() && (last_tx == utime_t() ||
                                           first_tx > cutoff))));
    }
    bool is_healthy(utime_t cutoff) {
      return last_rx_front > cutoff && last_rx_back > cutoff;
    }

  };

- I am not sure if we can get host name from any of them.

Comment 4 Vikhyat Umrao 2016-05-20 11:14:03 UTC

Upstream master branch patch : https://github.com/ceph/ceph/pull/9223

Comment 5 Vikhyat Umrao 2016-06-16 04:01:15 UTC

Upstream Jewel Backport Tracker : http://tracker.ceph.com/issues/16337
Upstream Jewel Backport PR : https://github.com/ceph/ceph/pull/9739

Comment 7 Vikhyat Umrao 2017-01-04 22:26:29 UTC

$ git tag --contains d5c77c8ff6ad0e4b4242669e4f218f607f433310
v10.2.4
v10.2.5

$ git tag --contains 47605a2fdde65265e3c4dc60aac206c6ae712be5
v10.2.4
v10.2.5

This is in POST and both commits are in v10.2.5 and I hope in 2.2 we are taking 10.2.5 any specific reason moving this bug to 2.3?

Comment 8 Ken Dreyer (Red Hat) 2017-01-05 22:23:48 UTC

If this is fixed upstream in v10.2.5, it was probably an oversight to target this to 2.3. Re-targeting to 2.2.

Comment 10 shylesh 2017-01-25 02:34:43 UTC


Already discussed with Vikhyat and he already has the fix,log message is not as expected in the current build i.e. 10.2.5-7.el7cp.x86_64

2017-01-24 17:30:34.860120 7f97aa1a3700 -1 osd.3 66 heartbeat_check: no reply from 0x7f97c408ce10 osd.1 since back 2017-01-24 17:29:49.108840 front 2017-01-24 17:29:49.108840 (cutoff 2017-01-24 17:29:54.850786)

Here host ip is not shown properly, hence moving this back to assigned.

Comment 11 Vikhyat Umrao 2017-01-25 03:13:10 UTC

I created this upstream tracker[1] for this issue. In the master branch, this enhancement is working fine, the issue is only in jewel backport.

Tomorrow I will send a PR for upstream jewel branch.

[1] http://tracker.ceph.com/issues/18657

Comment 14 Christina Meno 2017-01-25 16:11:25 UTC

Looks like it got merged.
Would you please get the cherry-pick done?

Comment 15 Vikhyat Umrao 2017-01-25 18:18:32 UTC

(In reply to Gregory Meno from comment #14)
> Looks like it got merged.
> Would you please get the cherry-pick done?

Hi Gregory, 

That was the old patch in comment#5 which has the issue. New PR[1] is still not merged in the jewel and tracker for this issue is given in comment#11.

This issue is only present in jewel backport.

[1] https://github.com/ceph/ceph/pull/13108

Thank you,
Vikhyat

Comment 27 shylesh 2017-02-07 10:04:13 UTC

Works for IPv6 as well




2017-02-07 07:35:22.477552 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529)
2017-02-07 07:35:22.477557 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6806 osd.7 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529)
2017-02-07 07:35:23.460074 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1bec]:6807 osd.0 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073)
2017-02-07 07:35:23.460085 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073)

Marking as verified.

Comment 28 Christina Meno 2017-02-21 22:19:25 UTC

I added doc text

Comment 30 Christina Meno 2017-02-24 17:21:22 UTC

Bara,

Looks good with the edit I made.

thank you

Comment 32 Christina Meno 2017-02-28 18:05:42 UTC

Great job!

Comment 34 errata-xmlrpc 2017-03-14 15:44:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html

Note You need to log in before you can comment on or make changes to this bug.