Description of problem:
While testing multiple start/stop of the MDS service, I am seeing lots of "heartbeat_map is_healthy 'MDSRank' had timed out after 15" messages.

Version-Release number of selected component (if applicable):
ceph version 10.2.1-6.el7cp

How reproducible:
3/5

Steps to Reproduce:
1. Have 3 MDS configured.
2. Mount the FS and write some data:

ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    11055G     5723G     5332G        48.23
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    cephfs_data         1      2087G      56.63     1757G         697197
    cephfs_metadata     2      49972k     0         1757G         13972

3. In a loop, start/stop the active MDS server (a sketch of such a loop is below).

Actual results:
Start/stop works, but sometimes it takes quite a while, and during that time I see these messages:

2016-05-26 09:41:59.509772 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:04.509874 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:09.509952 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:14.510116 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:19.510420 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:24.510522 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:29.510654 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-05-26 09:42:34.510734 7f5354c77700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

Expected results:
These messages should not appear.

Additional info:
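For anyone trying to reproduce this, a minimal sketch of the loop meant in step 3; the unit name "ceph-mds@node2", the iteration count, and the sleep intervals are placeholders to be adjusted for whichever MDS is currently active:

# Hypothetical reproduction loop; unit name and 30s waits are assumptions, not the exact commands used.
for i in $(seq 1 10); do
    systemctl stop ceph-mds@node2     # stop the currently active MDS
    sleep 30                          # give a standby time to take over the rank
    systemctl start ceph-mds@node2    # bring the daemon back; it rejoins as a standby
    sleep 30
done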
How are you starting and stopping the MDS (what command)? Please set debug mds = 20, and capture the log from the point at which you ask the MDS to stop to the point where it eventually stops, so that we can see what else is going on while the heartbeat_map messages are appearing.
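For reference, a sketch of how the MDS debug level could be raised; the daemon id "node2" is a placeholder, and either approach should work on Jewel:

# In ceph.conf on the MDS host (takes effect after restarting the daemon):
[mds]
    debug mds = 20

# Or inject it into the running daemon without a restart ("node2" is a placeholder id):
ceph tell mds.node2 injectargs '--debug-mds 20'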
For starting/stopping I am using the commands below:

systemctl stop ceph-mds
systemctl start ceph-mds

I had debug enabled, but I was unable to attach the logs because the file was huge. As discussed over IRC, I chopped the file down to 100000 lines and attached it.

Steps:
1. I had 3 healthy MDS running.
2. Started IO (an rsync to the mounted directory).
3. Stopped mds2 (which was currently active) and saw the heartbeat_map message in mds2.log. So we have 2 MDS running.
4. Stopped mds0 (which was currently active) and again saw the same message in mds0.log. So we have 2 MDS running. IO continued throughout.
Created attachment 1161946 [details] 72K Lines of the log file
Tanay: can you confirm that the Ceph cluster in use did not have any customisation of the "ms type" config setting? We have a separate report upstream of a similar issue when "ms type = async" was set:

http://tracker.ceph.com/issues/16396
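In case it helps to confirm, one way to check whether a non-default messenger type is in effect; "mds.node2" is a placeholder daemon id, and the second command simply looks for an explicit override in ceph.conf:

# Dump the running MDS config via the admin socket and inspect the messenger type:
ceph daemon mds.node2 config show | grep ms_type
# Or check ceph.conf for an explicit "ms type" / "ms_type" line:
grep -i "ms[ _]type" /etc/ceph/ceph.conf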
John: No, I didn't have any customization.
Hi Tanay,

I'm looking into this bug and wanted to let you know I've been able to reproduce it:

http://tracker.ceph.com/issues/16042#note-8

I'll keep you updated with our progress.
Tanay, the fix has been merged into master. A backport of the fix to Jewel is pending.
The fix will be in upstream's v10.2.3.
Bug verified; not seeing any issue while restarting MDS services.

ceph version: ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)
RHEL version: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Kernel version: Linux node2 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2815.html