Created attachment 1474989 [details]
tarball containing MDS log with backtrace

Description of problem:

Using a smallfile workload I can reproduce a Ceph MDS core dump. It seems to be the same one each time.

Version-Release number of selected component (if applicable):

For now: Ceph Luminous 12.2.7 on centos7.5; switching to the July 14 RHCS 3.1 build today.

How reproducible:

Every time.

Steps to Reproduce:
1. use ceph-linode to construct a cluster (details below)
2. use setup-smallfile.sh to construct mountpoints and set up smallfile
3. use run-smallfile.sh to run a metadata-intensive load from many clients simultaneously (a typical invocation is sketched below)

Actual results:

MDS core dump

Expected results:

No MDS core dump

Additional info:

The backtrace is in the attached MDS log; search for "dump_ops_in_flight".
More info is available at:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-10-00-10/
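A minimal sketch of what run-smallfile.sh presumably wraps on each client, assuming the standard smallfile_cli.py interface; the thread count, file count, file size, mount point, and host list here are illustrative assumptions, not the values the actual script uses:

```sh
# hypothetical smallfile invocation for a metadata-intensive create workload
# (counts, paths, and host names are assumptions; the real values live in run-smallfile.sh)
python smallfile_cli.py \
    --operation create \
    --threads 8 \
    --files 10000 \
    --file-size 64 \
    --top /mnt/cephfs/smallfile \
    --host-set client-000,client-001,client-002
```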
Trying to move to the RHCS 3.1 QE build. When I install [this iso image](http://download-node-02.eng.bos.redhat.com/rel-eng/RHCEPH-3.1-RHEL-7-20180614.0/compose/MON/x86_64/iso/RHCEPH-3.1-RHEL-7-20180614.0-x86_64-dvd.iso) on linodes with the centos7 image, I get this error:

```
fatal: [mon-000]: FAILED! => {
    "changed": false,
    "rc": 1,
    "results": [
        "Resolving Dependencies\n--> Running transaction check\n---> Package ceph-common.x86_64 2:12.2.5-25.el7cp will be installed\n--> ...
...
MSG:

warning: /tmp/rh-storage-repo/MON/ceph-common-12.2.5-25.el7cp.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID f21541eb: NOKEY
Public key for librados2-12.2.5-25.el7cp.x86_64.rpm is not installed
```

I installed the QE public key ending in "d4bQ" on all hosts as described in https://wiki.test.redhat.com/CEPH, but this had no effect. When I tried the same thing on a single monitor, I also got this error:

Public key for userspace-rcu-0.7.16-1.el7cp.x86_64.rpm is not installed

Falling back to the latest GA release below, which gets past this problem because it's "gold-signed":

http://download-node-02.eng.bos.redhat.com/rel-eng/RHCEPH-3.0-RHEL-7-20180806.0/compose/Tools/x86_64/iso/RHCEPH-3.0-RHEL-7-20180806.0-x86_64-dvd.iso
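For reference, this is the generic way to register a signing key with rpm on each host; the key file name below is an assumption (the actual file and "d4bQ" key come from the wiki page above), so treat it as a sketch rather than the exact procedure used:

```sh
# import the QE signing key on every host so yum stops rejecting the packages,
# then list the installed keys to confirm it was registered
rpm --import /tmp/RPM-GPG-KEY-redhat-qe        # key file path/name is an assumption
rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}  %{SUMMARY}\n'
```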
No MDS cores with the RHCS 3.0 June 18 edition, but > 10 OSD core dumps when I went up to a 64-KB file size and 100 threads doing file creates. Some of them are zero length because I ran out of space on the system disk. Also, one of the MDS servers locked up; I captured blocked_ops for mds-002, and when I restarted mds-002 the workload continued.

My point is that yes, I'm overloading the 6 OSDs in this config, but they should never core dump; they should be stable. People expect filesystems to provide fairness too.

OSD core dumps are here:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/crash/

The workload data set that goes with the core dumps will be here, though it may be a few hours before it is done and pushed:
http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-10-17-15

Ceph was in a bad state when creates first started, with massive numbers of slow requests. To get this thing unstuck, I tried this on the OSDs (see the sketch below):

vm.dirty_ratio=10
vm.dirty_background_ratio=5
vm.swappiness=10

The idea is to push writes to the block device more quickly and avoid filling the buffer cache with dirty pages, because when syncfs finally happens it causes a long pause. These are high-IOPS devices; they shouldn't get this backed up.
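A minimal sketch of applying those exact values on the OSD hosts and making them persistent across reboots; the sysctl.d file name is an assumption:

```sh
# apply the write-back tuning immediately on an OSD host
sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5 vm.swappiness=10

# optionally persist it (file name is an assumption)
cat > /etc/sysctl.d/90-osd-writeback.conf <<'EOF'
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 10
EOF
sysctl --system     # reload all sysctl.d files
```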
I pushed the core dumps to the above locations; it looks like you already fixed the core dump that started this bz. Where should I get an RHCS build that has this fix? In the meantime I may have to just use an upstream Luminous build that has the fix. Will keep trying.
When I ran the smallfile workload last night, there were no core dumps at all. These things changed:

- number of clients - I had the same number of threads but on fewer clients; I don't think this is that significant
- Linux 4.17 kernel - by accident, Linode did this; I haven't figured out how to automatically undo it
- OSDs: sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5 vm.swappiness=10
- wasn't running with "ceph mgr module enable dashboard"

I'm trying to isolate which of these variables caused the change (for the dashboard variable, see the sketch below). The dump_ops_in_flight core dump looked like it might have been caused by a combination of that and a high backlog of requests. Also, I had the wrong ceph-debuginfo; now I have the right one, /root/ceph-debuginfo-12.2.4-42.el7cp.x86_64.rpm.
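Toggling the dashboard variable between runs is straightforward; a sketch using the standard mgr module commands:

```sh
# sketch: flip the suspect mgr module between test runs and confirm its state
ceph mgr module enable dashboard     # state during the earlier (crashing) runs
ceph mgr module ls                   # confirm which modules are currently active
ceph mgr module disable dashboard    # state during last night's (clean) run
```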
Also, I had to add 'ulimit -c unlimited' to my scripts; now they generate full, readable core dumps. The core dumps above may not be readable, so I'll try to get you good ones.
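A hedged sketch of the core-dump setup, assuming the processes of interest are launched from the same shell; the core_pattern path is an assumption (for daemons managed directly by systemd, LimitCORE in the unit file would be the equivalent knob):

```sh
# make sure core dumps are not truncated before starting the run
ulimit -c unlimited                                        # remove the per-process core size limit
sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'   # name cores by executable, pid, timestamp; path is an assumption
./run-smallfile.sh                                         # launch the workload from the same shell
```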
https://github.com/ceph/ceph/pull/23584
Got this core dump last night from the RHCEPH iso above:

(gdb) bt
#0  0x00007fb6e907f59b in raise () from /lib64/libpthread.so.0
#1  0x000056098353ca56 in reraise_fatal (signum=6) at /usr/src/debug/ceph-12.2.4/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.2.4/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007fb6e808f277 in raise () from /lib64/libc.so.6
#5  0x00007fb6e8090968 in abort () from /lib64/libc.so.6
#6  0x00007fb6e899e7d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x00007fb6e899c746 in ?? () from /lib64/libstdc++.so.6
#8  0x00007fb6e899c773 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x00007fb6e899d2df in __cxa_pure_virtual () from /lib64/libstdc++.so.6
#10 0x0000560983270f64 in Message::print (this=<optimized out>, out=...) at /usr/src/debug/ceph-12.2.4/src/msg/Message.h:477
#11 0x00005609834e9d60 in TrackedOp::_gen_desc (this=0x5609abca6300) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.h:264
#12 0x00005609834e1676 in get_desc (this=0x5609abca6300) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.h:257
#13 TrackedOp::dump (this=this@entry=0x5609abca6300, now=..., now@entry=..., f=f@entry=0x5609cf102000) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.cc:447
#14 0x00005609834e1ac9 in OpTracker::dump_ops_in_flight (this=this@entry=0x56098dc2d2a8, f=f@entry=0x5609cf102000, print_only_blocked=print_only_blocked@entry=false, filters=std::set with 1 elements = {...}) at /usr/src/debug/ceph-12.2.4/src/common/TrackedOp.cc:229
#15 0x000056098326db0a in MDSRankDispatcher::handle_asok_command (this=0x56098dc2d000, command="ops", cmdmap=std::map with 2 elements = {...}, f=f@entry=0x5609cf102000, ss=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSRank.cc:1902
#16 0x000056098323ddf8 in MDSDaemon::asok_command (this=0x56098dbb8000, command="ops", cmdmap=std::map with 2 elements = {...}, format="json", ss=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSDaemon.cc:145
#17 0x000056098324bc3b in MDSSocketHook::call (this=this@entry=0x56098d99c290, command="ops", cmdmap=std::map with 2 elements = {...}, format="json", out=...) at /usr/src/debug/ceph-12.2.4/src/mds/MDSDaemon.cc:124
#18 0x0000560983561205 in AdminSocket::do_accept (this=this@entry=0x56098d9da4e0) at /usr/src/debug/ceph-12.2.4/src/common/admin_socket.cc:412
#19 0x0000560983562c58 in AdminSocket::entry (this=0x56098d9da4e0) at /usr/src/debug/ceph-12.2.4/src/common/admin_socket.cc:263
#20 0x00007fb6e9077e25 in start_thread () from /lib64/libpthread.so.0
#21 0x00007fb6e8157bad in clone () from /lib64/libc.so.6

I will post the core dump and then move on. Interesting fact: I couldn't produce this crash with the Linux 4.17 kernel; it was produced with the Linux 3.10 kernel, thanks to Patrick's fix. Note that the "ceph mgr module enable dashboard" command had been run; perhaps that was what was polling the admin socket when it crashed.

Will move on to the branch with the fix next. I am also trying to get some scaling numbers for the Ceph MDS with 100 threads. If this works, I'll go up to 500 threads on 50 clients or something like that to simulate NVidia's workload. I think I can show that it does scale, as long as it doesn't crash, except perhaps for renames. Sorry this takes so long, but it is getting easier.
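For context, frames #14-#17 show the MDS dying in OpTracker::dump_ops_in_flight while answering the admin-socket "ops" command, so the same code path can be poked directly; the daemon name and socket path below are examples, not the actual ones from this cluster:

```sh
# exercise the code path from the backtrace ("mds-003" is an example daemon name)
ceph daemon mds.mds-003 ops                                    # triggers OpTracker::dump_ops_in_flight (frame #14)
ceph daemon mds.mds-003 dump_blocked_ops                       # same tracker, blocked ops only
# or point at the socket file directly (path varies by deployment):
ceph --admin-daemon /var/run/ceph/ceph-mds.mds-003.asok ops
```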
Couldn't attach it, too big, so you will be able to get it from here when it finishes copying:

http://perf1.perf.lab.eng.bos.redhat.com/pub/bengland/public/ceph/linode/logs/smf-2018-08-16-22-10/cores/mds-003/crash/admin_socket-li1847-65.members.linode.com-2682-1534460033.core.gz

The associated test data is in the directory above cores/.
I ran the same test with the zyan-luminous-integration build, and there has been no core dump so far. If this keeps up for a while, then this bug is fixed. The core dump above should be accessible now; I set it to world-readable on NFS.
I've been running for some time without any core dumps here using the CentOS 7.5 kernel, so we can close this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3530