Description of problem:

During a smallfile run with 512 threads doing reads across 32 clients on a Linode cluster, we got a Ceph MDS hang. Patrick says the core dump indicates that a journal flush was hanging.

Version-Release number of selected component (if applicable):

ceph-common-12.2.7-329.g57f0061.el7.x86_64
CentOS 7.5, kernel 3.10.0-862.2.3.el7.x86_64
The Ceph RPMs were obtained by ceph-ansible from ceph_dev_branch:zyan-luminous-integration

How reproducible:

Not sure yet.

Steps to Reproduce:
1. Configure a ceph-linode cluster as described below.
2. Run the smallfile workload as described below.

Actual results:

The workload hangs forever.

Expected results:

The workload does not hang.

Additional info:

Patrick> Looks like it's hanging on a journal flush. The code shouldn't be waiting like that. Please open a BZ.

The core dump is at:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-26-15-41/cores/mds-000/crash/ceph-mds-li1272-19.members.linode.com-12657-1535329050.core.gz

To get the stack traces: uncompress the core, install the corresponding ceph-debuginfo RPM, and run:

  gdb /usr/bin/ceph-mds the.core
  (gdb) thread apply all bt

In the hung MDS process I found this thread; it looks like it was hung waiting for an event while trying to evict a CephFS client.

Thread 11 (Thread 0x7f14e11c5700 (LWP 10777)):
#0  0x00007f14ec009995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000561a5c8cdc3c in Wait (mutex=..., this=0x7f14e11c4060) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Cond.h:48
#2  C_SaferCond::wait (this=this@entry=0x7f14e11c4000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Cond.h:194
#3  0x0000561a5c8b3fa2 in MDSRank::__lambda10::operator() (__closure=0x561a7c0d6480) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:2787
#4  0x0000561a5c8b6217 in operator() (r=<optimized out>, __closure=0x561a8ea398c0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:2816
#5  boost::detail::function::void_function_obj_invoker1<MDSRank::evict_client(int64_t, bool, bool, std::stringstream&, Context*)::__lambda11::__lambda12::__lambda13, void, int>::invoke(boost::detail::function::function_buffer &, int) (function_obj_ptr=..., a0=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/build/boost/include/boost/function/function_template.hpp:159
#6  0x0000561a5c8a745a in operator() (a0=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/build/boost/include/boost/function/function_template.hpp:760
#7  FunctionContext::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/include/Context.h:493
#8  0x0000561a5c8a4969 in Context::complete (this=0x561a8ea39b30, r=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/include/Context.h:70
#9  0x0000561a5cbdeae8 in Finisher::finisher_thread_entry (this=0x561a661669a0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Finisher.cc:72
#10 0x00007f14ec005e25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f14eb0e5bad in clone () from /lib64/libc.so.6

and this thread:

Thread 6 (Thread 0x7f14e39ca700 (LWP 10742)):
#0  0x00007f14ec006f97 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000561a5cd25e50 in Thread::join (this=this@entry=0x561a66166ac0, prval=prval@entry=0x0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Thread.cc:163
#2  0x0000561a5cbde110 in Finisher::stop (this=0x561a661669a0) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/common/Finisher.cc:25
#3  0x0000561a5c8ab4ff in MDSRankDispatcher::shutdown (this=0x561a66157000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSRank.cc:351
#4  0x0000561a5c89a891 in MDSDaemon::suicide (this=this@entry=0x561a660d4000) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSDaemon.cc:1075
#5  0x0000561a5c89aa2c in MDSDaemon::handle_signal (this=0x561a660d4000, signum=<optimized out>) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/mds/MDSDaemon.cc:1035
#6  0x0000561a5cba0237 in SignalHandler::entry (this=0x561a65ef0680) at /usr/src/debug/ceph-12.2.7-329-g57f0061/src/global/signal_handler.cc:302
#7  0x00007f14ec005e25 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f14eb0e5bad in clone () from /lib64/libc.so.6
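If you want the backtraces non-interactively, a small Python wrapper around gdb's batch mode should produce the same output as the interactive commands above (a sketch; the core filename shown is the first dump listed, and the output filename is arbitrary, so adjust paths as needed):

  import subprocess

  CORE = "ceph-mds-li1272-19.members.linode.com-12657-1535329050.core"  # uncompressed core

  # Equivalent to: gdb /usr/bin/ceph-mds <core>  followed by  (gdb) thread apply all bt
  with open("mds-backtraces.txt", "w") as out:
      subprocess.check_call(
          ["gdb", "--batch", "-ex", "thread apply all bt",
           "/usr/bin/ceph-mds", CORE],
          stdout=out)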
I'll take a look...
Here are 2 other forced core dumps where the workload hung, but when I forced the core dump, the smallfile workload started right up again as if nothing had happened. This is becoming a pattern - no clear reason why it is hanging.

The test run is:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-08-29-10-42/

cores/mds-000/crash/ceph-mds-li108-47.members.linode.com-23551-1535563501.core.gz
cores/mds-001/crash/ceph-mds-li496-6.members.linode.com-23340-1535563707.core.gz

This script tells you when a core dump was generated - just feed it the pathname of the uncompressed core file:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/time-of-core-dump.py

[root@li35-54 tmp]# for n in *.core ; do ls -l $n ; ~/time-of-core-dump.py $n ; done
-rw-------. 1 root root 13488488448 Aug 30 18:22 ceph-mds-li108-47.members.linode.com-23551-1535563501.core
Wed Aug 29 17:25:01 2018
-rw-------. 1 root root 4716032000 Aug 30 18:22 ceph-mds-li496-6.members.linode.com-23340-1535563707.core
Wed Aug 29 17:28:27 2018

It happened during this smallfile append test:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-08-29-10-42/sample-1/ops/append/smf.log

It's just a gut feeling, but my hypothesis is that Filestore is causing the MDS to hang, because it produces huge latencies from directory splits and fsync() taking a long time. I am going to try Bluestore and see whether the same workload completes; Bluestore has lower tail latency because it uses direct I/O.
Changed the name of this BZ because I'm not convinced of Patrick's explanation; that will come from analysis. The main thing is that I can frequently reproduce a behavior that seems to consist of 3 phases:
- the MDS has super-long latencies on writes to cephfs_metadata
- the MDS evicts clients for failure to respond
- cluster-wide throughput drops to zero

I can restore throughput by restarting the MDS processes, so it's not the OSDs that are hung, or the clients, or smallfile.

Forget my Filestore hypothesis. I made the MDS hang with Bluestore, several times. This ceph.log shows one of them:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/dallas.logs/smf-2018-09-03-20-56/ceph.log
Thanks, Ben. I'll look into this since it is not filestore/bluestore or OSD latencies.
The lambda kill_mds_session (in MDSRank::evict_client) waited synchronously for the session to be evicted. Because it ran on the finisher thread, it blocked the finisher, so the finisher could not execute later contexts, including the context that completes the session eviction.
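To make the deadlock pattern concrete, here is a minimal Python sketch (an analogy only, not Ceph code; the callback names are made up) of a single-threaded finisher whose first callback blocks waiting on a completion that only a later callback on the same queue would deliver:

  import queue
  import threading
  import time

  finisher_q = queue.Queue()

  def finisher_thread():
      # Drain callbacks one at a time, like Finisher::finisher_thread_entry.
      while True:
          cb = finisher_q.get()
          if cb is None:
              break
          cb()

  threading.Thread(target=finisher_thread, daemon=True).start()

  evicted = threading.Event()

  def kill_session_cb():
      # Analogous to the kill_mds_session lambda: it waits on a
      # C_SaferCond-style completion while running on the finisher thread.
      evicted.wait()

  def eviction_done_cb():
      # The completion that would wake it up is queued on the same finisher,
      # so it can never run while kill_session_cb is blocking the thread.
      evicted.set()

  finisher_q.put(kill_session_cb)
  finisher_q.put(eviction_done_cb)

  time.sleep(2)
  print("eviction completed?", evicted.is_set())  # False: the finisher is stuck

The general way out of this pattern is to not block the finisher thread on a completion that the same finisher has to deliver.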
I updated ceph-ci/zyan-luminous-integration
*** Bug 1628308 has been marked as a duplicate of this bug. ***
The test should do client eviction frequently. It should be something like:
1. mount cephfs
2. run smallfile on cephfs
3. wait a random amount of time, then evict the client
4. go to step 1

The above steps should be executed from multiple mount points in parallel (see the sketch below).
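A rough driver for that loop might look like the following Python sketch. The mount source, the smallfile path and options, and the use of "ceph tell mds.0 session ls" / "ceph tell mds.0 client evict id=<id>" for eviction are assumptions about this environment, so adjust them to the actual cluster; run one copy per mount point to get the parallelism Zheng describes:

  import json
  import random
  import subprocess
  import time

  MNT = "/mnt/cephfs"                              # assumed mount point
  MON = "mon-host:6789:/"                          # placeholder mount source
  SMALLFILE = "/root/smallfile/smallfile_cli.py"   # assumed smallfile location

  def session_ids():
      # List current client session ids on rank 0 (assumes the tell interface
      # returns JSON like the admin socket "session ls" does).
      out = subprocess.check_output(["ceph", "tell", "mds.0", "session", "ls"])
      return [s["id"] for s in json.loads(out)]

  while True:
      # 1. mount cephfs
      subprocess.check_call(["mount", "-t", "ceph", MON, MNT, "-o", "name=admin"])
      # 2. run smallfile on cephfs
      smf = subprocess.Popen(["python", SMALLFILE, "--top", MNT,
                              "--operation", "create", "--files", "10000"])
      # 3. wait a random time, then evict the client session(s)
      #    (in a multi-client run, filter for this host's session instead of
      #    evicting everything)
      time.sleep(random.uniform(10, 120))
      for sid in session_ids():
          subprocess.call(["ceph", "tell", "mds.0", "client", "evict",
                           "id={}".format(sid)])
      smf.wait()
      subprocess.call(["umount", "-f", MNT])
      # 4. go to step 1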
Thanks Zheng, will try these steps and update the bug.
Yes, this bug can be marked as verified.
Thanks Zheng. Moving this bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3530