Bug 1340004

Summary: Seeing lots of "heartbeat_map" messages when stopping an MDS Server
Product: Red Hat Ceph Storage [Red Hat Storage]
Component: CephFS
Version: 2.0
Hardware: x86_64
OS: Linux
Severity: medium
Priority: low
Status: CLOSED ERRATA
Keywords: Rebase
Target Milestone: rc
Target Release: 2.1
Reporter: Tanay Ganguly <tganguly>
Assignee: John Spray <john.spray>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, hnallurv, john.spray, kdreyer, kurs, nlevine, pdonnell, rperiyas, uboppana
Fixed In Version: RHEL: ceph-10.2.3-2.el7cp; Ubuntu: ceph_10.2.3-3redhat1xenial
Type: Bug
Last Closed: 2016-11-22 19:26:02 UTC

Attachments: 72K Lines of the log file

Description Tanay Ganguly 2016-05-26 10:36:01 UTC
Description of problem:
While testing repeated start/stop of the MDS service, I am seeing many
"heartbeat_map is_healthy 'MDSRank' had timed out after 15" messages.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-6.el7cp

How reproducible:
3/5

Steps to Reproduce:
1. Have 3 MDS daemons configured.
2. Mount the FS and write some data; ceph df at that point showed:
ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED 
    11055G     5723G        5332G         48.23 
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS 
    cephfs_data         1       2087G     56.63         1757G      697197 
    cephfs_metadata     2      49972k         0         1757G       13972 

3. In a loop, start/stop the active MDS server (a rough loop sketch is shown below).
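For illustration only, such a loop might look like the sketch below; the instantiated unit name (ceph-mds@<id>), the iteration count, and the sleep intervals are assumptions rather than details from this report (the exact commands used are given in comment 3):

    # Illustrative sketch: repeatedly stop and start the local MDS daemon.
    # MDS_ID, the loop count and the sleep times are assumed values.
    MDS_ID=$(hostname -s)
    for i in $(seq 1 20); do
        systemctl stop ceph-mds@"$MDS_ID"
        sleep 30                              # let a standby take over the rank
        systemctl start ceph-mds@"$MDS_ID"
        sleep 30                              # let the restarted daemon rejoin as standby
    done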

Actual results:
Start/stop completes fine, but sometimes it takes noticeably longer; during those periods I see these messages:

2016-05-26 09:41:59.509772 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:04.509874 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:09.509952 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:14.510116 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:19.510420 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:24.510522 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:29.510654 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:34.510734 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15


Expected results:
These messages should not appear.

Additional info:

Comment 2 John Spray 2016-05-26 10:40:42 UTC
How are you starting and stopping the MDS (what command)?

Please set debug mds = 20 and capture the log from the point at which you ask the MDS to stop to the point where it eventually stops, so that we can see what else is going on while the heartbeat_map messages are appearing.
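For reference, on a Jewel-era cluster the MDS debug level can usually be raised either persistently in ceph.conf or at runtime; mds.<id> below is a placeholder for the daemon name:

    # In ceph.conf (takes effect after the daemon restarts):
    [mds]
        debug mds = 20

    # Or at runtime on the MDS host, via the admin socket or injectargs:
    ceph daemon mds.<id> config set debug_mds 20
    ceph tell mds.<id> injectargs '--debug-mds 20'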

Comment 3 Tanay Ganguly 2016-05-26 12:06:13 UTC
For starting/stopping I am using the commands below:

systemctl stop ceph-mds
systemctl start ceph-mds


I had debug logging enabled, but I was unable to attach the log because the file size was huge. As discussed over IRC, I chopped the file down to 100000 lines and attached it.
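For illustration, a large MDS log can be trimmed before attaching with something like the following; the log path is an assumption based on the default Ceph log location:

    # Keep only the most recent 100000 lines of the MDS log (path is an assumption).
    tail -n 100000 /var/log/ceph/ceph-mds.<id>.log > mds-log-trimmed.log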

Steps:
1. I had 3 healthy MDS daemons running.
2. Started I/O (an rsync to the mounted directory).
3. Stopped mds2 (which was the active MDS at the time) and saw the heartbeat_map messages in mds2.log. So we had 2 MDS daemons running.
4. Stopped mds0 (which was then the active MDS) and again saw the same messages in mds0.log. So we had 2 MDS daemons running.


Throughout all of this, I/O kept running.
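For reference, which MDS currently holds the active rank can be checked before stopping it with standard status commands (output format varies between releases):

    ceph mds stat    # compact view of active ranks and standby daemons
    ceph -s          # cluster summary, including the MDS map line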

Comment 4 Tanay Ganguly 2016-05-26 12:13:10 UTC
Created attachment 1161946 [details]
72K Lines of the log file

Comment 5 John Spray 2016-06-24 10:16:42 UTC
Tanay: can you confirm that the Ceph cluster in use did not have any customisation of the "ms type" config setting? We have a separate upstream report of a similar issue when "ms type = async" was set: http://tracker.ceph.com/issues/16396
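For reference, the effective messenger type on a running daemon can be checked roughly as follows; mds.<id> is a placeholder, and 'simple' was the upstream default in Jewel:

    # Check the running value via the admin socket and confirm nothing is set in ceph.conf.
    ceph daemon mds.<id> config get ms_type
    grep -i 'ms[ _]type' /etc/ceph/ceph.conf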

Comment 6 Tanay Ganguly 2016-06-29 05:57:59 UTC
John:
No, I didn't have any customization.

Comment 7 Patrick Donnelly 2016-06-30 20:49:37 UTC
Hi Tanay, I'm looking into this bug and wanted to let you know I've been able to reproduce it:

http://tracker.ceph.com/issues/16042#note-8

I'll keep you updated with our progress.

Comment 8 Patrick Donnelly 2016-07-21 16:32:40 UTC
Tanay, the fix has been merged into master. A backport of the fix to Jewel is pending.

Comment 10 Ken Dreyer (Red Hat) 2016-09-21 12:11:27 UTC
The fix will be in upstream's v10.2.3.

Comment 14 Ramakrishnan Periyasamy 2016-10-07 10:51:21 UTC
Bug verified; not seeing any issue while restarting MDS services.

ceph version:
ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)

RHEL Version:
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Kernel Version:
Linux node2 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
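For illustration, verification of this kind typically amounts to restarting the MDS and confirming that the rank becomes active again without the timeout message appearing; the unit name and log path below are assumptions:

    systemctl restart ceph-mds@$(hostname -s)
    ceph mds stat                                            # confirm the rank returns to active
    grep 'heartbeat_map is_healthy' /var/log/ceph/ceph-mds.$(hostname -s).log | tail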

Comment 16 errata-xmlrpc 2016-11-22 19:26:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html