Bug 1623264 - Ceph MDS hang with smallfile workload
Summary: Ceph MDS hang with smallfile workload
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 3.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: z1
Target Release: 3.1
Assignee: Yan, Zheng
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Duplicates: 1628308
Depends On:
Blocks:
 
Reported: 2018-08-28 20:52 UTC by Ben England
Modified: 2018-11-09 01:00 UTC
CC List: 8 users

Fixed In Version: RHEL: ceph-12.2.5-47.el7cp Ubuntu: ceph_12.2.5-32redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-09 00:59:32 UTC
Embargoed:




Links
System ID                                Last Updated
Ceph Project Bug Tracker 35721           2018-09-06 00:48:30 UTC
Red Hat Product Errata RHBA-2018:3530    2018-11-09 01:00:29 UTC

Description Ben England 2018-08-28 20:52:41 UTC
Description of problem:


During a smallfile run with 512 threads doing reads across 32 clients on a Linode cluster, we got a Ceph MDS hang.  Patrick says the core dump indicates that a journal flush was hanging.

Version-Release number of selected component (if applicable):

ceph-common-12.2.7-329.g57f0061.el7.x86_64
centos 7.5, kernel 3.10.0-862.2.3.el7.x86_64

The Ceph RPMs were obtained by ceph-ansible from 

ceph_dev_branch:zyan-luminous-integration

How reproducible:

Not sure yet

Steps to Reproduce:
1. configure ceph-linode cluster as described below
2. run smallfile workload as described below

Actual results:

The workload hangs forever.

Expected results:

The workload does not hang.

Additional info:


Patrick> Looks like it's hanging on a journal flush. The code shouldn't be
waiting like that. Please open a BZ.

Core dump is at:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-26-15-41/cores/mds-000/crash/ceph-mds-li1272-19.members.linode.com-12657-1535329050.core.gz

To get the stack traces: uncompress the core, install the corresponding ceph-debuginfo RPM, and run:

gdb /usr/bin/ceph-mds the.core
(gdb) thread apply all bt



In the hung MDS process I found this thread; it looks like it was hung waiting for an event while trying to evict a CephFS client.


Thread 11 (Thread 0x7f14e11c5700 (LWP 10777)): 
#0  0x00007f14ec009995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 
#1  0x0000561a5c8cdc3c in Wait (mutex=..., this=0x7f14e11c4060) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Cond.h:48 
#2  C_SaferCond::wait (this=this@entry=0x7f14e11c4000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Cond.h:194 
#3  0x0000561a5c8b3fa2 in MDSRank::__lambda10::operator() (__closure=0x561a7c0d6480) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:2787 
#4  0x0000561a5c8b6217 in operator() (r=<optimized out>, __closure=0x561a8ea398c0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:2816 
#5  boost::detail::function::void_function_obj_invoker1<MDSRank::evict_client(int64_t, bool, bool, std::stringstream&, Context*)::__lambda11::__lambda12::__lambda13, void, int>::invoke(boost::detail::function::function_buffer &, int) (function_obj_ptr=..., a0=<optimized out>) 
   at /usr/src/debug/ceph-12.2.7-329-g57f0061/build/boost/include/boost/function/function_template.hpp:159 
#6  0x0000561a5c8a745a in operator() (a0=<optimized out>, this=<optimized out>) 
   at /usr/src/debug/ceph-12.2.7-329-g57f0061/build/boost/include/boost/function/function_template.hpp:760 
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/include/Context.h:493 
#8  0x0000561a5c8a4969 in Context::complete (this=0x561a8ea39b30, r=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/include/Context.h:70 
#9  0x0000561a5cbdeae8 in Finisher::finisher_thread_entry (this=0x561a661669a0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Finisher.cc:72 
#10 0x00007f14ec005e25 in start_thread () from /lib64/libpthread.so.0 
#11 0x00007f14eb0e5bad in clone () from /lib64/libc.so.6 


and this thread:

Thread 6 (Thread 0x7f14e39ca700 (LWP 10742)): 
#0  0x00007f14ec006f97 in pthread_join () from /lib64/libpthread.so.0 
#1  0x0000561a5cd25e50 in Thread::join (this=this@entry=0x561a66166ac0, prval=prval@entry=0x0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Thread.cc:163 
#2  0x0000561a5cbde110 in Finisher::stop (this=0x561a661669a0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Finisher.cc:25 
#3  0x0000561a5c8ab4ff in MDSRankDispatcher::shutdown (this=0x561a66157000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:351 
#4  0x0000561a5c89a891 in MDSDaemon::suicide (this=this@entry=0x561a660d4000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSDaemon.cc:1075 
#5  0x0000561a5c89aa2c in MDSDaemon::handle_signal (this=0x561a660d4000, signum=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSDaemon.cc:1035 
#6  0x0000561a5cba0237 in SignalHandler::entry (this=0x561a65ef0680) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/global/signal_handler.cc:302
#7  0x00007f14ec005e25 in start_thread () from /lib64/libpthread.so.0 
#8  0x00007f14eb0e5bad in clone () from /lib64/libc.so.6

Comment 2 Venky Shankar 2018-08-30 06:37:23 UTC
I'll take a look...

Comment 3 Ben England 2018-08-30 20:45:55 UTC
Here are 2 other forced core dumps where the workload hung, but when I forced the core dump, the smallfile workload started right up again as if nothing had happened.  This is becoming a pattern - no clear reason why it is hanging.  The test run is:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-08-29-10-42/

cores/mds-000/crash/ceph-mds-li108-47.members.linode.com-23551-1535563501.core.gz

cores/mds-001/crash/ceph-mds-li496-6.members.linode.com-23340-1535563707.core.gz

This script tells you when the core dump was generated - just feed it the pathname of the uncompressed core file.

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/time-of-core-dump.py

[root@li35-54 tmp]# for n in *.core ; do ls -l $n ; ~/time-of-core-dump.py $n ; done
-rw-------. 1 root root 13488488448 Aug 30 18:22 ceph-mds-li108-47.members.linode.com-23551-1535563501.core
Wed Aug 29 17:25:01 2018
-rw-------. 1 root root 4716032000 Aug 30 18:22 ceph-mds-li496-6.members.linode.com-23340-1535563707.core
Wed Aug 29 17:28:27 2018

and it happened during this smallfile append test:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-08-29-10-42/sample-1/ops/append/smf.log

It's just a gut feeling, but my hypothesis is that Filestore is causing the MDS to hang, because it causes huge latencies due to directory splits and fsync() taking a long time. I am going to try Bluestore and see whether the same workload completes; Bluestore has lower tail latency due to direct I/O.

Comment 4 Ben England 2018-09-05 03:17:01 UTC
Changed the name of the bz because I'm not convinced of Patrick's explanation; that will come from analysis.  The main thing is that I can frequently reproduce a behavior that seems to consist of 3 phases:

- MDS is having super-long latencies on writes to cephfs_metadata
- MDS evicts clients for failure to respond
- cluster-wide throughput drops to zero

I can restore throughput by restarting MDS processes, so it's not the OSDs that are hung, or the clients, or smallfile.

Forget my Filestore hypothesis.  I made the MDS hang with Bluestore, several times.  This ceph.log shows one of them.

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-09-03-20-56/ceph.log

Comment 5 Venky Shankar 2018-09-05 03:29:01 UTC
Thanks, Ben. I'll look into this since it is not filestore/bluestore or OSD latencies.

Comment 6 Yan, Zheng 2018-09-05 14:13:42 UTC
The lambda function kill_mds_session (in MDSRank::evict_client) waited for the session to be evicted. That wait blocked the finisher thread, so the finisher could not execute later contexts (including the context that completes the session eviction).
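
For illustration, here is a minimal Python sketch (not the actual MDS C++ code) of the pattern Zheng describes: a callback running on a single-threaded finisher blocks waiting for an event that only a later callback queued on the same finisher would signal, so neither can make progress. The function names are placeholders for the lambdas visible in the backtrace above, and the timeout exists only so the demo terminates.

import concurrent.futures
import threading

# Single worker thread, playing the role of the MDS Finisher thread.
finisher = concurrent.futures.ThreadPoolExecutor(max_workers=1)
evicted = threading.Event()   # stands in for the C_SaferCond the lambda waits on

def kill_session_and_wait():
    # Queued first: occupies the only finisher thread until 'evicted' is set.
    # The real code waited with no timeout, i.e. forever.
    if evicted.wait(timeout=5):
        print("eviction observed")
    else:
        print("deadlock: the completion callback never got to run")

def session_evicted():
    # Queued second: would wake the waiter, but it cannot run while the
    # single finisher thread is blocked inside kill_session_and_wait().
    evicted.set()

finisher.submit(kill_session_and_wait)
finisher.submit(session_evicted)
finisher.shutdown(wait=True)   # the deadlock message prints after ~5 seconds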

Comment 7 Yan, Zheng 2018-09-06 00:48:30 UTC
I updated ceph-ci/zyan-luminous-integration

Comment 8 Patrick Donnelly 2018-09-17 16:47:15 UTC
*** Bug 1628308 has been marked as a duplicate of this bug. ***

Comment 16 Yan, Zheng 2018-10-22 13:04:19 UTC
The test should do client eviction frequently. It should be something like:

1. mount cephfs
2. run smallfile on cephfs
3. wait a random amount of time and evict the client
4. go to step 1

The above steps should be executed in parallel from multiple mount points; a rough sketch follows below.
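
To make those steps concrete, here is a rough Python driver sketch. The mount options, smallfile arguments, MDS rank, and the way the client ID is obtained are placeholders and assumptions on my part, not part of Zheng's instructions; the eviction command form assumed here is "ceph tell mds.<rank> client evict id=<client-id>".

import random
import subprocess
import time

MOUNTPOINT = "/mnt/cephfs"         # placeholder mount point
SMALLFILE = "./smallfile_cli.py"   # assumed path to the smallfile tool

def one_iteration(mds_rank, client_id):
    # 1. mount cephfs (auth options such as secretfile are omitted here)
    subprocess.run(["mount", "-t", "ceph", "mon-host:/", MOUNTPOINT,
                    "-o", "name=admin"], check=True)
    # 2. run smallfile against the mount
    smf = subprocess.Popen([SMALLFILE, "--operation", "create",
                            "--top", MOUNTPOINT])
    # 3. wait a random amount of time, then evict this client
    time.sleep(random.uniform(10, 120))
    subprocess.run(["ceph", "tell", "mds." + str(mds_rank),
                    "client", "evict", "id=" + str(client_id)], check=False)
    smf.wait()
    subprocess.run(["umount", "-f", MOUNTPOINT], check=False)
    # 4. the caller loops back to step 1

# Run one_iteration() in a loop on each client node, in parallel across the
# mount points.  The client ID for a given mount has to be looked up after
# each mount (e.g. from the MDS session list); that lookup is omitted here.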

Comment 17 Ramakrishnan Periyasamy 2018-10-22 13:13:04 UTC
Thanks Zheng, will try these steps and update the bug.

Comment 19 Yan, Zheng 2018-10-23 06:53:20 UTC
Yes, this bug can be marked as verified.

Comment 20 Ramakrishnan Periyasamy 2018-10-23 10:46:04 UTC
Thanks Zheng.

Moving this bug to verified.

Comment 22 errata-xmlrpc 2018-11-09 00:59:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3530

