Bug 1340004

Summary: Seeing lots of "heartbeat_map" messages when stopping an MDS Server
Product: Red Hat Ceph Storage [Red Hat Storage]
Component: CephFS
Version: 2.0
Hardware: x86_64
OS: Linux
Severity: medium
Priority: low
Status: CLOSED ERRATA
Keywords: Rebase
Target Milestone: rc
Target Release: 2.1
Reporter: Tanay Ganguly <tganguly>
Assignee: John Spray <john.spray>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, hnallurv, john.spray, kdreyer, kurs, nlevine, pdonnell, rperiyas, uboppana
Fixed In Version: RHEL: ceph-10.2.3-2.el7cp; Ubuntu: ceph_10.2.3-3redhat1xenial
Type: Bug
Last Closed: 2016-11-22 19:26:02 UTC

Attachments: 72K Lines of the log file

Description Tanay Ganguly 2016-05-26 10:36:01 UTC
Description of problem:
While testing repeated start/stop of the MDS service, I am seeing many
"heartbeat_map is_healthy 'MDSRank' had timed out after 15" messages.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-6.el7cp

How reproducible:
3/5

Steps to Reproduce:
1. Have 3 MDS daemons configured.
2. Mount the FS and write some data; ceph df at that point showed:
ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED 
    11055G     5723G        5332G         48.23 
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS 
    cephfs_data         1       2087G     56.63         1757G      697197 
    cephfs_metadata     2      49972k         0         1757G       13972 

3. In a loop, start/stop the active MDS server (a rough loop sketch is shown below).
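For illustration only, such a loop might look like the sketch below; the instantiated unit name (ceph-mds@<id>), the iteration count, and the sleep intervals are assumptions rather than details from this report (the exact commands used are given in comment 3):

    # Illustrative sketch: repeatedly stop and start the local MDS daemon.
    # MDS_ID, the loop count and the sleep times are assumed values.
    MDS_ID=$(hostname -s)
    for i in $(seq 1 20); do
        systemctl stop ceph-mds@"$MDS_ID"
        sleep 30                              # let a standby take over the rank
        systemctl start ceph-mds@"$MDS_ID"
        sleep 30                              # let the restarted daemon rejoin as standby
    done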

Actual results:
Start/stop completes fine, but sometimes it takes noticeably longer; during those periods I see these messages:

2016-05-26 09:41:59.509772 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:04.509874 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:09.509952 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:14.510116 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:19.510420 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:24.510522 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:29.510654 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:34.510734 7f5354c77700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15


Expected results:
These messages should not appear.

Additional info:

Comment 2 John Spray 2016-05-26 10:40:42 UTC
How are you starting and stopping the MDS (what command)?

Please set debug mds = 20 and capture the log from the point at which you ask the MDS to stop to the point where it eventually stops, so that we can see what else is going on while the heartbeat_map messages are appearing.
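For reference, on a Jewel-era cluster the MDS debug level can usually be raised either persistently in ceph.conf or at runtime; mds.<id> below is a placeholder for the daemon name:

    # In ceph.conf (takes effect after the daemon restarts):
    [mds]
        debug mds = 20

    # Or at runtime on the MDS host, via the admin socket or injectargs:
    ceph daemon mds.<id> config set debug_mds 20
    ceph tell mds.<id> injectargs '--debug-mds 20'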

Comment 3 Tanay Ganguly 2016-05-26 12:06:13 UTC
For starting/stopping I am using the commands below:

systemctl stop ceph-mds
systemctl start ceph-mds


I had debug logging enabled, but I was unable to attach the log because the file size was huge. As discussed over IRC, I chopped the file down to 100000 lines and attached it.
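For illustration, a large MDS log can be trimmed before attaching with something like the following; the log path is an assumption based on the default Ceph log location:

    # Keep only the most recent 100000 lines of the MDS log (path is an assumption).
    tail -n 100000 /var/log/ceph/ceph-mds.<id>.log > mds-log-trimmed.log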

Steps:
1. I had 3 healthy MDS daemons running.
2. Started I/O (an rsync to the mounted directory).
3. Stopped mds2 (which was the active MDS at the time) and saw the heartbeat_map messages in mds2.log. So we had 2 MDS daemons running.
4. Stopped mds0 (which was then the active MDS) and again saw the same messages in mds0.log. So we had 2 MDS daemons running.


Throughout all of this, I/O kept running.
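For reference, which MDS currently holds the active rank can be checked before stopping it with standard status commands (output format varies between releases):

    ceph mds stat    # compact view of active ranks and standby daemons
    ceph -s          # cluster summary, including the MDS map line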

Comment 4 Tanay Ganguly 2016-05-26 12:13:10 UTC
Created attachment 1161946 [details]
72K Lines of the log file

Comment 5 John Spray 2016-06-24 10:16:42 UTC
Tanay: can you confirm that the Ceph cluster in use did not have any customisation of the "ms type" config setting? We have a separate upstream report of a similar issue when "ms type = async" was set: http://tracker.ceph.com/issues/16396
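For reference, the effective messenger type on a running daemon can be checked roughly as follows; mds.<id> is a placeholder, and 'simple' was the upstream default in Jewel:

    # Check the running value via the admin socket and confirm nothing is set in ceph.conf.
    ceph daemon mds.<id> config get ms_type
    grep -i 'ms[ _]type' /etc/ceph/ceph.conf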

Comment 6 Tanay Ganguly 2016-06-29 05:57:59 UTC
John:
No, I didn't have any customization.

Comment 7 Patrick Donnelly 2016-06-30 20:49:37 UTC
Hi Tanay, I'm looking into this bug and wanted to let you know I've been able to reproduce it:

http://tracker.ceph.com/issues/16042#note-8

I'll keep you updated with our progress.

Comment 8 Patrick Donnelly 2016-07-21 16:32:40 UTC
Tanay, the fix has been merged into master. A backport of the fix to Jewel is pending.

Comment 10 Ken Dreyer (Red Hat) 2016-09-21 12:11:27 UTC
The fix will be in upstream's v10.2.3.

Comment 14 Ramakrishnan Periyasamy 2016-10-07 10:51:21 UTC
Bug verified; not seeing any issue while restarting MDS services.

ceph version:
ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)

RHEL Version:
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Kernel Version:
Linux node2 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
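For illustration, verification of this kind typically amounts to restarting the MDS and confirming that the rank becomes active again without the timeout message appearing; the unit name and log path below are assumptions:

    systemctl restart ceph-mds@$(hostname -s)
    ceph mds stat                                            # confirm the rank returns to active
    grep 'heartbeat_map is_healthy' /var/log/ceph/ceph-mds.$(hostname -s).log | tail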

Comment 16 errata-xmlrpc 2016-11-22 19:26:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html