Description of problem:
running the command "rados df"

Version-Release number of selected component:
ceph-common-0.80.5-10.fc21

Additional info:
reporter:         libreport-2.2.3
backtrace_rating: 4
cmdline:          rados df
crash_function:   send
executable:       /usr/bin/rados
kernel:           3.16.2-301.fc21.x86_64
runlevel:         N 5
type:             CCpp
uid:              0

Truncated backtrace:
Thread no. 1 (4 frames)
 #0 send at /lib64/libpthread.so.0
 #1 reraise_fatal at global/signal_handler.cc:59
 #2 handle_fatal_signal at global/signal_handler.cc:105
 #3 sendto at /lib64/libpthread.so.0
Created attachment 939678 [details] File: backtrace
Created attachment 939679 [details] File: cgroup
Created attachment 939680 [details] File: core_backtrace
Created attachment 939681 [details] File: dso_list
Created attachment 939682 [details] File: environ
Created attachment 939683 [details] File: limits
Created attachment 939684 [details] File: maps
Created attachment 939685 [details] File: open_fds
Created attachment 939686 [details] File: proc_pid_status
Created attachment 939687 [details] File: var_log_messages
*** Bug 1157192 has been marked as a duplicate of this bug. ***
FYI: I suspect that this is related to bz1146967. It appears to be an Intel TSX instruction that causes the SIGILL.
This appears to be a glibc issue with Intel TSX instructions -> reassigning.

Copying from bz1157192, which seems to be related to this:

On shutdown, the program calls rados_shutdown(), which calls the appropriate destructors. In particular, it calls ~RWLock(), which issues pthread_rwlock_unlock(). This causes the program to receive a SIGILL signal. Debugging with gdb, it seems that the instruction that causes this is xend [1], which is an Intel TSX instruction.

I am no expert in this matter, but it seems that the issue in [2] is not yet fully resolved. I've looked at the patches there, and xbegin seems to be explicitly disabled -- unlike xend. Maybe we need to explicitly disable xend in the code as well?

[1] "layout asm" in gdb shows this as the crashing line:
>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19> xend |

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1146967
FYI, this is actually a bug in Ceph: it's unlocking an already unlocked pthread_rwlock_t, which invokes undefined behavior. It was reported upstream in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc devs decided to not make the crash more graceful as it would slow down the unlock path for correct programs. I'm attaching a minimal repro C file in case you care to reproduce, but afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX or glibc directly.
Created attachment 964858 [details] Reproduction of the Ceph xend crash Note the undefined behavior of unlocking an unlocked rwlock.
(In reply to David Anderson from comment #14)
> FYI, this is actually a bug in Ceph: it's unlocking an already unlocked
> pthread_rwlock_t, which invokes undefined behavior. It was reported upstream
> in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc
> devs decided to not make the crash more graceful as it would slow down the
> unlock path for correct programs.
>
> I'm attaching a minimal repro C file in case you care to reproduce, but
> afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX
> or glibc directly.

Thanks David. I'm assigning this back to ceph for them to investigate the potential double unlock.
Turns out the Ceph folks fixed it upstream 3 days ago: http://tracker.ceph.com/issues/10085 . From the bug details, it looks like they're planning to publish this in a point release, so it should eventually find its way into the Fedora package.
And if you care to patch this in Fedora ahead of the upstream release, `git diff 42c85e8 77deeaa` in the Ceph git repository will produce the minimum necessary patch to apply. It applies cleanly to the 0.87 Ceph source tree.
I added the patch David mentions to a private build of Ceph and it fixed my problem. I'm not sure when upstream is going to release a 0.87.x point release so I think that it would be good to apply the patch before that.
ceph-0.80.7-2.fc21 has been submitted as an update for Fedora 21. https://admin.fedoraproject.org/updates/ceph-0.80.7-2.fc21
Package ceph-0.80.7-2.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.80.7-2.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-16519/ceph-0.80.7-2.fc21
then log in and leave karma (feedback).
*** Bug 1170657 has been marked as a duplicate of this bug. ***
ceph-0.80.7-2.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.