Bug 1144794 - [abrt] ceph-common: send(): rados killed by SIGILL
Summary: [abrt] ceph-common: send(): rados killed by SIGILL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: ceph
Version: 21
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Boris Ranto
QA Contact: Fedora Extras Quality Assurance
URL: https://retrace.fedoraproject.org/faf...
Whiteboard: abrt_hash:b3314392b86dcc337684e419082...
Duplicates: 1157192 1170657
Depends On:
Blocks:
 
Reported: 2014-09-21 03:39 UTC by Jeffrey C. Ollie
Modified: 2014-12-23 18:32 UTC (History)
CC List: 14 users

Fixed In Version: ceph-0.80.7-2.fc21
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-23 18:32:19 UTC
Type: ---
Embargoed:


Attachments
File: backtrace (19.15 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: cgroup (190 bytes, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: core_backtrace (4.86 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: dso_list (2.08 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: environ (1.95 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: limits (1.29 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: maps (11.68 KB, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: open_fds (138 bytes, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: proc_pid_status (908 bytes, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
File: var_log_messages (706 bytes, text/plain), 2014-09-21 03:39 UTC, Jeffrey C. Ollie
Reproduction of the Ceph xend crash (454 bytes, text/x-csrc), 2014-12-04 22:06 UTC, David Anderson

Description Jeffrey C. Ollie 2014-09-21 03:39:03 UTC
Description of problem:
running the command "rados df"

Version-Release number of selected component:
ceph-common-0.80.5-10.fc21

Additional info:
reporter:       libreport-2.2.3
backtrace_rating: 4
cmdline:        rados df
crash_function: send
executable:     /usr/bin/rados
kernel:         3.16.2-301.fc21.x86_64
runlevel:       N 5
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (4 frames)
 #0 send at /lib64/libpthread.so.0
 #1 reraise_fatal at global/signal_handler.cc:59
 #2 handle_fatal_signal at global/signal_handler.cc:105
 #3 sendto at /lib64/libpthread.so.0

Comment 1 Jeffrey C. Ollie 2014-09-21 03:39:12 UTC
Created attachment 939678
File: backtrace

Comment 2 Jeffrey C. Ollie 2014-09-21 03:39:12 UTC
Created attachment 939679
File: cgroup

Comment 3 Jeffrey C. Ollie 2014-09-21 03:39:13 UTC
Created attachment 939680
File: core_backtrace

Comment 4 Jeffrey C. Ollie 2014-09-21 03:39:15 UTC
Created attachment 939681
File: dso_list

Comment 5 Jeffrey C. Ollie 2014-09-21 03:39:16 UTC
Created attachment 939682
File: environ

Comment 6 Jeffrey C. Ollie 2014-09-21 03:39:17 UTC
Created attachment 939683
File: limits

Comment 7 Jeffrey C. Ollie 2014-09-21 03:39:18 UTC
Created attachment 939684
File: maps

Comment 8 Jeffrey C. Ollie 2014-09-21 03:39:19 UTC
Created attachment 939685
File: open_fds

Comment 9 Jeffrey C. Ollie 2014-09-21 03:39:20 UTC
Created attachment 939686
File: proc_pid_status

Comment 10 Jeffrey C. Ollie 2014-09-21 03:39:21 UTC
Created attachment 939687
File: var_log_messages

Comment 11 Boris Ranto 2014-10-31 02:13:09 UTC
*** Bug 1157192 has been marked as a duplicate of this bug. ***

Comment 12 Boris Ranto 2014-10-31 12:23:54 UTC
FYI: I suspect that this is related to bz1146967. It appears to be an Intel TSX instruction that causes the SIGILL.

Comment 13 Boris Ranto 2014-10-31 13:14:34 UTC
This appears to be a glibc issue with Intel TSX instructions, so I am reassigning it. The following is copied from bz1157192, which seems to be related:

On shutdown, the program calls rados_shutdown(), which calls the appropriate destructors. In particular, it calls ~RWLock(), which issues pthread_rwlock_unlock(). This causes the program to receive a SIGILL signal. Debugging with gdb, it seems that the instruction that causes this is xend [1], which is an Intel TSX instruction.

I am no expert in this matter, but it seems that the issue in [2] is not yet fully resolved. I've looked at the patches there, and xbegin seems to be explicitly disabled, unlike xend. Maybe we need to explicitly disable xend in the code as well?

[1] layout asm in gdb shows this as the crashing line:

>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1146967

Comment 14 David Anderson 2014-12-04 22:05:24 UTC
FYI, this is actually a bug in Ceph: it's unlocking an already unlocked pthread_rwlock_t, which invokes undefined behavior. It was reported upstream in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc devs decided to not make the crash more graceful as it would slow down the unlock path for correct programs.

I'm attaching a minimal repro C file in case you care to reproduce, but afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX or glibc directly.

Comment 15 David Anderson 2014-12-04 22:06:33 UTC
Created attachment 964858
Reproduction of the Ceph xend crash

Note the undefined behavior of unlocking an unlocked rwlock.
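
For illustration only (this is neither the attached reproducer nor the upstream Ceph fix): a minimal C sketch of one way to keep a double unlock from ever reaching glibc. It assumes a single owning thread and write locks only. The `guarded_rwlock` type and the `guarded_*` helpers are hypothetical names invented for this sketch. The wrapper records whether the lock is held and refuses a second unlock instead of calling pthread_rwlock_unlock() on an unlocked lock.

```c
#define _POSIX_C_SOURCE 200809L  /* expose pthread_rwlock_t under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wrapper (not Ceph's RWLock class): remembers whether the
 * owner currently holds the lock so unlock is never issued twice.
 * Assumption: one owning thread, write locks only. */
struct guarded_rwlock {
    pthread_rwlock_t lock;
    bool held;
};

static int guarded_init(struct guarded_rwlock *g)
{
    assert(g != NULL);
    g->held = false;
    return pthread_rwlock_init(&g->lock, NULL);
}

static int guarded_wrlock(struct guarded_rwlock *g)
{
    int rc = pthread_rwlock_wrlock(&g->lock);
    if (rc == 0)
        g->held = true;
    return rc;
}

/* Returns 0 on success, or EPERM if the lock is not held. The EPERM path
 * never calls pthread_rwlock_unlock(), so no undefined behavior occurs. */
static int guarded_unlock(struct guarded_rwlock *g)
{
    if (!g->held)
        return EPERM;
    g->held = false;
    return pthread_rwlock_unlock(&g->lock);
}

/* Teardown releases the lock only if it is still held, then destroys it. */
static int guarded_destroy(struct guarded_rwlock *g)
{
    if (g->held) {
        int rc = guarded_unlock(g);
        if (rc != 0)
            return rc;
    }
    return pthread_rwlock_destroy(&g->lock);
}
```

With a wrapper like this, a stray second guarded_unlock() returns EPERM to the caller rather than executing the unlock path on a lock that is not held. The extra flag costs a branch on every unlock, which is the kind of overhead the glibc developers chose not to impose on correct programs.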

Comment 16 Carlos O'Donell 2014-12-04 22:10:49 UTC
(In reply to David Anderson from comment #14)
> FYI, this is actually a bug in Ceph: it's unlocking an already unlocked
> pthread_rwlock_t, which invokes undefined behavior. It was reported upstream
> in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc
> devs decided to not make the crash more graceful as it would slow down the
> unlock path for correct programs.
> 
> I'm attaching a minimal repro C file in case you care to reproduce, but
> afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX
> or glibc directly.

Thanks David. I'm assigning this back to ceph for them to investigate the potential double unlock.

Comment 17 David Anderson 2014-12-04 22:23:37 UTC
Turns out the Ceph folks fixed it upstream 3 days ago: http://tracker.ceph.com/issues/10085 . From the bug details, it looks like they're planning to publish this in a point release, so it should eventually find its way into the Fedora package.

Comment 18 David Anderson 2014-12-04 22:25:03 UTC
And if you care to patch this in Fedora ahead of the upstream release, `git diff 42c85e8 77deeaa` in the Ceph git repository will produce the minimum necessary patch to apply. It applies cleanly to the 0.87 Ceph source tree.

Comment 19 Jeffrey C. Ollie 2014-12-04 22:34:50 UTC
I added the patch David mentions to a private build of Ceph and it fixed my problem.  I'm not sure when upstream is going to release a 0.87.x point release so I think that it would be good to apply the patch before that.

Comment 20 Fedora Update System 2014-12-08 11:51:25 UTC
ceph-0.80.7-2.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/ceph-0.80.7-2.fc21

Comment 21 Fedora Update System 2014-12-12 04:05:04 UTC
Package ceph-0.80.7-2.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.80.7-2.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-16519/ceph-0.80.7-2.fc21
then log in and leave karma (feedback).

Comment 22 Boris Ranto 2014-12-23 12:32:01 UTC
*** Bug 1170657 has been marked as a duplicate of this bug. ***

Comment 23 Fedora Update System 2014-12-23 18:32:19 UTC
ceph-0.80.7-2.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

