Created attachment 1474989 [details]
tarball containing MDS log with backtrace

Description of problem:

Using a smallfile workload I can reproduce a Ceph MDS core dump. It seems to be the same one each time.

Version-Release number of selected component (if applicable):

For now: Ceph Luminous 12.2.7 on centos7.5; switching to the July 14 RHCS 3.1 build today.

How reproducible:

Every time.

Steps to Reproduce:
1. use ceph-linode to construct a cluster (details below)
2. use setup-smallfile.sh to construct mountpoints and set up smallfile
3. use run-smallfile.sh to run a metadata-intensive load from many clients simultaneously (a typical invocation is sketched below)

Actual results:

MDS core dump

Expected results:

No MDS core dump

Additional info:

The backtrace is in the attached MDS log; search for "dump_ops_in_flight".
More info is available at:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-10-00-10/
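A minimal sketch of what run-smallfile.sh presumably wraps on each client, assuming the standard smallfile_cli.py interface; the thread count, file count, file size, mount point, and host list here are illustrative assumptions, not the values the actual script uses:

```sh
# hypothetical smallfile invocation for a metadata-intensive create workload
# (counts, paths, and host names are assumptions; the real values live in run-smallfile.sh)
python smallfile_cli.py \
    --operation create \
    --threads 8 \
    --files 10000 \
    --file-size 64 \
    --top /mnt/cephfs/smallfile \
    --host-set client-000,client-001,client-002
```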
Trying to move to the RHCS 3.1 QE build. When I install [this iso image](http://download-node-02.eng.bos.redhat.com/rel-eng/RHCEPH-3.1-RHEL-7-20180614.0/compose/MON/x86_64/iso/RHCEPH-3.1-RHEL-7-20180614.0-x86_64-dvd.iso) on linodes with the centos7 image, I get this error:

```
fatal: [mon-000]: FAILED! => {
    "changed": false,
    "rc": 1,
    "results": [
        "Resolving Dependencies\n--> Running transaction check\n---> Package ceph-common.x86_64 2:12.2.5-25.el7cp will be installed\n--> ...
...
MSG:

warning: /tmp/rh-storage-repo/MON/ceph-common-12.2.5-25.el7cp.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID f21541eb: NOKEY
Public key for librados2-12.2.5-25.el7cp.x86_64.rpm is not installed
```

I installed the QE public key ending in "d4bQ" on all hosts as described in https://wiki.test.redhat.com/CEPH, but this had no effect. When I tried the same thing on a single monitor, I also got this error:

Public key for userspace-rcu-0.7.16-1.el7cp.x86_64.rpm is not installed

Falling back to the latest GA release below, which gets past this problem because it's "gold-signed":

http://download-node-02.eng.bos.redhat.com/rel-eng/RHCEPH-3.0-RHEL-7-20180806.0/compose/Tools/x86_64/iso/RHCEPH-3.0-RHEL-7-20180806.0-x86_64-dvd.iso
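For reference, this is the generic way to register a signing key with rpm on each host; the key file name below is an assumption (the actual file and "d4bQ" key come from the wiki page above), so treat it as a sketch rather than the exact procedure used:

```sh
# import the QE signing key on every host so yum stops rejecting the packages,
# then list the installed keys to confirm it was registered
rpm --import /tmp/RPM-GPG-KEY-redhat-qe        # key file path/name is an assumption
rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}  %{SUMMARY}\n'
```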
No MDS cores with the RHCS 3.0 June 18 edition, but > 10 OSD core dumps when I went up to a 64-KB file size and 100 threads doing file creates. Some of them are zero length because I ran out of space on the system disk. Also, one of the MDS servers locked up; I captured blocked_ops for mds-002, and when I restarted mds-002 the workload continued.

My point is that yes, I'm overloading the 6 OSDs in this config, but they should never core dump; they should be stable. People expect filesystems to provide fairness too.

OSD core dumps are here:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/crash/

The workload data set that goes with the core dumps will be here, though it may be a few hours before it is done and pushed:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-10-17-15

Ceph was in a bad state when creates first started, with massive numbers of slow requests. To get this thing unstuck, I tried this on the OSDs (see the sketch below):

vm.dirty_ratio=10
vm.dirty_background_ratio=5
vm.swappiness=10

The idea is to push writes to the block device more quickly and avoid filling the buffer cache with dirty pages, because when syncfs finally happens it causes a long pause. These are high-IOPS devices; they shouldn't get this backed up.
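A minimal sketch of applying those exact values on the OSD hosts and making them persistent across reboots; the sysctl.d file name is an assumption:

```sh
# apply the write-back tuning immediately on an OSD host
sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5 vm.swappiness=10

# optionally persist it (file name is an assumption)
cat > /etc/sysctl.d/90-osd-writeback.conf <<'EOF'
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
EOF
sysctl --system     # reload all sysctl.d files
```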
I pushed the core dumps to the above locations; it looks like you already fixed the core dump that started this bz. Where should I get an RHCS build that has this fix? In the meantime I may have to just use an upstream Luminous build that has the fix. Will keep trying.
When I ran the smallfile workload last night, there were no core dumps at all. These things changed:

- number of clients - I had the same number of threads but on fewer clients; I don't think this is that significant
- Linux 4.17 kernel - by accident, Linode did this; I haven't figured out how to automatically undo it
- OSDs: sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5 vm.swappiness=10
- wasn't running with "ceph mgr module enable dashboard"

I'm trying to isolate which of these variables caused the change (for the dashboard variable, see the sketch below). The dump_ops_in_flight core dump looked like it might have been caused by a combination of that and a high backlog of requests. Also, I had the wrong ceph-debuginfo; now I have the right one, /root/ceph-debuginfo-12.2.4-42.el7cp.x86_64.rpm.
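Toggling the dashboard variable between runs is straightforward; a sketch using the standard mgr module commands:

```sh
# sketch: flip the suspect mgr module between test runs and confirm its state
ceph mgr module enable dashboard     # state during the earlier (crashing) runs
ceph mgr module ls                   # confirm which modules are currently active
ceph mgr module disable dashboard    # state during last night's (clean) run
```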
Also, I had to add 'ulimit -c unlimited' to my scripts; now they generate full, readable core dumps. The core dumps above may not be readable, so I'll try to get you good ones.
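A hedged sketch of the core-dump setup, assuming the processes of interest are launched from the same shell; the core_pattern path is an assumption (for daemons managed directly by systemd, LimitCORE in the unit file would be the equivalent knob):

```sh
# make sure core dumps are not truncated before starting the run
ulimit -c unlimited                                        # remove the per-process core size limit
sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'   # name cores by executable, pid, timestamp; path is an assumption
./run-smallfile.sh                                         # launch the workload from the same shell
```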
https://github.com/ceph/ceph/pull/23584
Got this core dump last night from the RHCEPH iso above:

(gdb) bt
#0  0x00007fb6e907f59b in raise () from /lib64/libpthread.so.0
#1  0x000056098353ca56 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.4/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.4/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007fb6e808f277 in raise () from /lib64/libc.so.6
#5  0x00007fb6e8090968 in abort () from /lib64/libc.so.6
#6  0x00007fb6e899e7d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x00007fb6e899c746 in ?? () from /lib64/libstdc++.so.6
#8  0x00007fb6e899c773 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x00007fb6e899d2df in __cxa_pure_virtual () from /lib64/libstdc++.so.6
#10 0x0000560983270f64 in Message::print (this=<optimized out>, out=...) at /usr/src/debug/ceph-12.2.4/src/msg/Message.h:477
#11 0x00005609834e9d60 in TrackedOp::_gen_desc (this=0x5609abca6300) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.h:264
#12 0x00005609834e1676 in get_desc (this=0x5609abca6300) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.h:257
#13 TrackedOp::dump (this=this@entry=0x5609abca6300, now=..., now@entry=..., f=f@entry=0x5609cf102000) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.cc:447
#14 0x00005609834e1ac9 in OpTracker::dump_ops_in_flight (this=this@entry=0x56098dc2d2a8, f=f@entry=0x5609cf102000, print_only_blocked=print_only_blocked@entry=false, filters=std::set with 1 elements = {...}) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.cc:229
#15 0x000056098326db0a in MDSRankDispatcher::handle_asok_command (this=0x56098dc2d000, command="ops", cmdmap=std::map with 2 elements = {...}, f=f@entry=0x5609cf102000, ss=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSRank.cc:1902
#16 0x000056098323ddf8 in MDSDaemon::asok_command (this=0x56098dbb8000, command="ops", cmdmap=std::map with 2 elements = {...}, format="json", ss=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSDaemon.cc:145
#17 0x000056098324bc3b in MDSSocketHook::call (this=this@entry=0x56098d99c290, command="ops", cmdmap=std::map with 2 elements = {...}, format="json", out=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSDaemon.cc:124
#18 0x0000560983561205 in AdminSocket::do_accept (this=this@entry=0x56098d9da4e0) at /usr/src/debug/ceph-12.2.4/src/common/admin_socket.cc:412
#19 0x0000560983562c58 in AdminSocket::entry (this=0x56098d9da4e0) at /usr/src/debug/ceph-12.2.4/src/common/admin_socket.cc:263
#20 0x00007fb6e9077e25 in start_thread () from /lib64/libpthread.so.0
#21 0x00007fb6e8157bad in clone () from /lib64/libc.so.6

I will post the core dump and then move on. Interesting fact: I couldn't produce this crash with the Linux 4.17 kernel; it was produced with the Linux 3.10 kernel, thanks to Patrick's fix. Note that the "ceph mgr module enable dashboard" command had been run; perhaps that was what was polling the admin socket when it crashed.

Will move on to the branch with the fix next. I am also trying to get some scaling numbers for the Ceph MDS with 100 threads. If this works, I'll go up to 500 threads on 50 clients or something like that to simulate NVidia's workload. I think I can show that it does scale, as long as it doesn't crash, except perhaps for renames. Sorry this takes so long, but it is getting easier.
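For context, frames #14-#17 show the MDS dying in OpTracker::dump_ops_in_flight while answering the admin-socket "ops" command, so the same code path can be poked directly; the daemon name and socket path below are examples, not the actual ones from this cluster:

```sh
# exercise the code path from the backtrace ("mds-003" is an example daemon name)
ceph daemon mds.mds-003 ops                                    # triggers OpTracker::dump_ops_in_flight (frame #14)
ceph daemon mds.mds-003 dump_blocked_ops                       # same tracker, blocked ops only
# or point at the socket file directly (path varies by deployment):
ceph --admin-daemon /var/run/ceph/ceph-mds.mds-003.asok ops
```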
Couldn't attach it, too big, so you will be able to get it from here when it finishes copying:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-16-22-10/cores/mds-003/crash/admin_socket-li1847-65.members.linode.com-2682-1534460033.core.gz

The associated test data is in the directory above cores/.
I ran the same test with the zyan-luminous-integration build, and there has been no core dump so far. If this keeps up for a while, then this bug is fixed. The core dump above should be accessible now; I set it to world-readable on NFS.
I've been running for some time without any core dumps here using the CentOS 7.5 kernel, so we can close this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3530