Bug 1333809

Summary: [RFE] Improve OSD heartbeat_check log message by including host name or IP (besides OSD numbers)
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Vikhyat Umrao <vumrao>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: medium
Docs Contact: Bara Ancincova <bancinco>
Priority: medium
Version: 1.3.2
CC: ceph-eng-bugs, dzafman, gmeno, hnallurv, jdurgin, kchai, kdreyer, sjust, uboppana, vumrao
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: 2.2
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.5-16.el7cp Ubuntu: ceph_10.2.5-9redhat1xenial
Doc Type: Enhancement
Doc Text:
.OSD heartbeat_check log messages now include IP addresses
The OSD `heartbeat_check` log messages now include the IP addresses of the OSD nodes. This enhancement makes it easier to identify the OSD nodes in the Ceph logs; for example, it is no longer necessary to look up which IP address corresponds to which OSD node (OSD.<number>) when reading a `heartbeat_check` message.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-14 15:44:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1402076    
Bug Blocks: 1258382, 1412948    

Description Vikhyat Umrao 2016-05-06 13:11:16 UTC
Description of problem:
[RFE] Improve OSD heartbeat_check log message by including host name (besides OSD numbers)


When diagnosing heartbeat-related problems in Ceph, it would help to be able to see whether all reported failures come from the same host or from different hosts.

That would make it easier to see where the problems are.

Actual Behavior: 
------------------------------

This is printed:

May  3 01:17:54  ceph-osdnode1 bash: 2016-05-03 01:17:54.280170 7f63eee57700 -1 osd.10 1748 heartbeat_check: no reply from osd.24 since back 2016-05-03 01:17:34.148919 front 2016-05-03 01:17:34.148919 (cutoff 2016-05-03 01:17:34.280165)

Proposed behavior:
heartbeat_check: no reply from ceph-osdnode2:osd.24


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.2

Comment 1 Vikhyat Umrao 2016-05-06 13:17:07 UTC
Upstream Tracker : http://tracker.ceph.com/issues/15762

Comment 2 Vikhyat Umrao 2016-05-06 13:36:47 UTC
File :  src/osd/OSD.cc
void OSD::heartbeat_check()
{

[..]

else {
        derr << "heartbeat_check: no reply from osd." << p->first
             << " since back " << p->second.last_rx_back
             << " front " << p->second.last_rx_front
             << " (cutoff " << cutoff << ")" << dendl;
        // fail
        failure_queue[p->first] = MIN(p->second.last_rx_back, p->second.last_rx_front);
      }

[..]
}

^^ This is the message which needs enhancement. 

The OSD numbers come from the map<int,HeartbeatInfo>::iterator p:

for (map<int,HeartbeatInfo>::iterator p = heartbeat_peers.begin();
       p != heartbeat_peers.end();
       ++p) {

If we check the HeartbeatInfo structure:

 /// information about a heartbeat peer
  struct HeartbeatInfo {
    int peer;           ///< peer
    ConnectionRef con_front;   ///< peer connection (front)
    ConnectionRef con_back;    ///< peer connection (back)
    utime_t first_tx;   ///< time we sent our first ping request
    utime_t last_tx;    ///< last time we sent a ping request
    utime_t last_rx_front;  ///< last time we got a ping reply on the front side
    utime_t last_rx_back;   ///< last time we got a ping reply on the back side
    epoch_t epoch;      ///< most recent epoch we wanted this peer

    bool is_unhealthy(utime_t cutoff) {
      return
        ! ((last_rx_front > cutoff ||
            (last_rx_front == utime_t() && (last_tx == utime_t() ||
                                            first_tx > cutoff))) &&
           (last_rx_back > cutoff ||
            (last_rx_back == utime_t() && (last_tx == utime_t() ||
                                           first_tx > cutoff))));
    }
    bool is_healthy(utime_t cutoff) {
      return last_rx_front > cutoff && last_rx_back > cutoff;
    }

  };

- I am not sure whether we can get the host name from any of these fields.
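
One way to get a usable node identifier without adding new fields would be to take the peer address from the heartbeat connection itself, since HeartbeatInfo already stores con_front/con_back. The snippet below is only a rough sketch of that idea (not the merged patch), assuming the address is reachable via Connection::get_peer_addr():

else {
        // Sketch: print the peer address from the back heartbeat connection
        // (falling back to the front one) so the log identifies the node,
        // not just the OSD id.
        const entity_addr_t peer_addr = p->second.con_back ?
          p->second.con_back->get_peer_addr() :
          p->second.con_front->get_peer_addr();
        derr << "heartbeat_check: no reply from " << peer_addr
             << " osd." << p->first
             << " since back " << p->second.last_rx_back
             << " front " << p->second.last_rx_front
             << " (cutoff " << cutoff << ")" << dendl;
        // fail
        failure_queue[p->first] = MIN(p->second.last_rx_back, p->second.last_rx_front);
      }

Printing the address in front of the existing osd.<id> would keep the rest of the message format unchanged.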

Comment 4 Vikhyat Umrao 2016-05-20 11:14:03 UTC
Upstream master branch patch : https://github.com/ceph/ceph/pull/9223

Comment 5 Vikhyat Umrao 2016-06-16 04:01:15 UTC
Upstream Jewel Backport Tracker : http://tracker.ceph.com/issues/16337
Upstream Jewel Backport PR : https://github.com/ceph/ceph/pull/9739

Comment 7 Vikhyat Umrao 2017-01-04 22:26:29 UTC
$ git tag --contains d5c77c8ff6ad0e4b4242669e4f218f607f433310
v10.2.4
v10.2.5

$ git tag --contains 47605a2fdde65265e3c4dc60aac206c6ae712be5
v10.2.4
v10.2.5

This is in POST and both commits are in v10.2.5. I believe we are taking 10.2.5 in 2.2, so is there any specific reason for moving this bug to 2.3?

Comment 8 Ken Dreyer (Red Hat) 2017-01-05 22:23:48 UTC
If this is fixed upstream in v10.2.5, it was probably an oversight to target this to 2.3. Re-targeting to 2.2.

Comment 10 shylesh 2017-01-25 02:34:43 UTC

Already discussed with Vikhyat, and he already has the fix. The log message is not as expected in the current build, i.e. 10.2.5-7.el7cp.x86_64:

2017-01-24 17:30:34.860120 7f97aa1a3700 -1 osd.3 66 heartbeat_check: no reply from 0x7f97c408ce10 osd.1 since back 2017-01-24 17:29:49.108840 front 2017-01-24 17:29:49.108840 (cutoff 2017-01-24 17:29:54.850786)

Here the host IP is not shown properly (what looks like a raw pointer value is printed instead), hence moving this back to ASSIGNED.
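
For reference, a hex value like 0x7f97c408ce10 is what gets logged when a raw pointer is streamed instead of the address it refers to; a minimal stand-alone illustration of that effect (not the actual Ceph code):

#include <iostream>
#include <memory>

struct Connection {};   // stand-in for the Ceph Connection type

int main() {
  std::shared_ptr<Connection> con = std::make_shared<Connection>();
  // Streaming the raw pointer prints something like 0x7f97c408ce10,
  // i.e. the same shape as the value seen in the 10.2.5-7 log line above,
  // rather than the peer's IP address.
  std::cout << "heartbeat_check: no reply from " << con.get()
            << " osd.1" << std::endl;
  return 0;
}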

Comment 11 Vikhyat Umrao 2017-01-25 03:13:10 UTC
I created this upstream tracker [1] for this issue. In the master branch this enhancement works fine; the issue is only in the jewel backport.

Tomorrow I will send a PR for the upstream jewel branch.

[1] http://tracker.ceph.com/issues/18657

Comment 14 Christina Meno 2017-01-25 16:11:25 UTC
Looks like it got merged.
Would you please get the cherry-pick done?

Comment 15 Vikhyat Umrao 2017-01-25 18:18:32 UTC
(In reply to Gregory Meno from comment #14)
> Looks like it got merged.
> Would you please get the cherry-pick done?

Hi Gregory, 

That was the old patch from comment #5, which has the issue. The new PR [1] is still not merged into jewel, and the tracker for this issue is given in comment #11.

This issue is present only in the jewel backport.

[1] https://github.com/ceph/ceph/pull/13108

Thank you,
Vikhyat

Comment 27 shylesh 2017-02-07 10:04:13 UTC
Works for IPv6 as well:

2017-02-07 07:35:22.477552 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529)
2017-02-07 07:35:22.477557 7fbf57e04700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6806 osd.7 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:02.477529)
2017-02-07 07:35:23.460074 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1bec]:6807 osd.0 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073)
2017-02-07 07:35:23.460085 7fbf3e438700 -1 osd.4 63 heartbeat_check: no reply from [2620:52:0:880:225:90ff:fefc:1cca]:6810 osd.5 ever on either front or back, first ping sent 2017-02-06 15:41:43.766036 (cutoff 2017-02-07 07:35:03.460073)

Marking as verified.

Comment 28 Christina Meno 2017-02-21 22:19:25 UTC
I added doc text

Comment 30 Christina Meno 2017-02-24 17:21:22 UTC
Bara,

Looks good with the edit I made.

thank you

Comment 32 Christina Meno 2017-02-28 18:05:42 UTC
Great job!

Comment 34 errata-xmlrpc 2017-03-14 15:44:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html