Description of problem:
While testing multiple start/stop of the MDS service, I am seeing lots of "heartbeat_map is_healthy 'MDSRank' had timed out after 15" messages.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-6.el7cp

How reproducible:
3/5

Steps to Reproduce:
1. Have 3 MDS configured.
2. Mount the FS and write some data:

ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    11055G     5723G     5332G        48.23
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    cephfs_data         1      2087G      56.63     1757G         697197
    cephfs_metadata     2      49972k     0         1757G         13972

3. In a loop, start/stop the active MDS server (a sketch of such a loop is below).

Actual results:
Start/stop works, but sometimes it takes quite a while, and during that time I see these messages:

2016-05-26 09:41:59.509772 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:04.509874 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:09.509952 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:14.510116 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:19.510420 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:24.510522 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:29.510654 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:34.510734 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

Expected results:
These messages should not appear.

Additional info:
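For anyone trying to reproduce this, a minimal sketch of the loop meant in step 3; the unit name "ceph-mds@node2", the iteration count, and the sleep intervals are placeholders to be adjusted for whichever MDS is currently active:

# Hypothetical reproduction loop; unit name and 30s waits are assumptions, not the exact commands used.
for i in $(seq 1 10); do
    systemctl stop ceph-mds@node2     # stop the currently active MDS
    sleep 30                          # give a standby time to take over the rank
    systemctl start ceph-mds@node2    # bring the daemon back; it rejoins as a standby
    sleep 30
done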
How are you starting and stopping the MDS (what command)? Please set debug mds = 20, and capture the log from the point at which you ask the MDS to stop to the point where it eventually stops, so that we can see what else is going on while the heartbeat_map messages are appearing.
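For reference, a sketch of how the MDS debug level could be raised; the daemon id "node2" is a placeholder, and either approach should work on Jewel:

# In ceph.conf on the MDS host (takes effect after restarting the daemon):
[mds]
    debug mds = 20

# Or inject it into the running daemon without a restart ("node2" is a placeholder id):
ceph tell mds.node2 injectargs '--debug-mds 20'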
For starting/stopping I am using the commands below:

systemctl stop ceph-mds
systemctl start ceph-mds

I had debug enabled, but I was unable to attach the logs because the file was huge. As discussed over IRC, I chopped the file down to 100000 lines and attached it.

Steps:
1. I had 3 healthy MDS running.
2. Started IO (an rsync to the mounted directory).
3. Stopped mds2 (which was currently active) and saw the heartbeat_map message in mds2.log. So we have 2 MDS running.
4. Stopped mds0 (which was currently active) and again saw the same message in mds0.log. So we have 2 MDS running. IO continued throughout.
Created attachment 1161946 [details] 72K Lines of the log file
Tanay: can you confirm that the Ceph cluster in use did not have any customisation of the "ms type" config setting? We have a separate report upstream of a similar issue when "ms type = async" was set:

http://tracker.ceph.com/issues/16396
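In case it helps to confirm, one way to check whether a non-default messenger type is in effect; "mds.node2" is a placeholder daemon id, and the second command simply looks for an explicit override in ceph.conf:

# Dump the running MDS config via the admin socket and inspect the messenger type:
ceph daemon mds.node2 config show | grep ms_type
# Or check ceph.conf for an explicit "ms type" / "ms_type" line:
grep -i "ms[ _]type" /etc/ceph/ceph.conf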
John: No, I didn't have any customization.
Hi Tanay,

I'm looking into this bug and wanted to let you know I've been able to reproduce it:

http://tracker.ceph.com/issues/16042#note-8

I'll keep you updated with our progress.
Tanay, the fix has been merged into master. A backport of the fix to Jewel is pending.
The fix will be in upstream's v10.2.3.
Bug verified; not seeing any issue while restarting MDS services.

ceph version: ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)
RHEL version: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Kernel version: Linux node2 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2815.html