Bug 1144794

Summary: [abrt] ceph-common: send(): rados killed by SIGILL
Product: [Fedora] Fedora
Reporter: Jeffrey C. Ollie <jeff>
Component: ceph
Assignee: Boris Ranto <branto>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 21
CC: arjun, branto, codonell, crobinso, dave, david, fedora, jakub, kkeithle, law, pfrankli, spoyarek, steve, zaitcev
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
URL: https://retrace.fedoraproject.org/faf/reports/bthash/5650292d87291074b49dea20c1828ea430aa9839
Whiteboard: abrt_hash:b3314392b86dcc337684e419082aea13152385ea
Fixed In Version: ceph-0.80.7-2.fc21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-23 18:32:19 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                            Flags
File: backtrace                        none
File: cgroup                           none
File: core_backtrace                   none
File: dso_list                         none
File: environ                          none
File: limits                           none
File: maps                             none
File: open_fds                         none
File: proc_pid_status                  none
File: var_log_messages                 none
Reproduction of the Ceph xend crash    none

Description Jeffrey C. Ollie 2014-09-21 03:39:03 UTC
Description of problem:
Running the command "rados df" crashes rados with SIGILL.

Version-Release number of selected component:
ceph-common-0.80.5-10.fc21

Additional info:
reporter:       libreport-2.2.3
backtrace_rating: 4
cmdline:        rados df
crash_function: send
executable:     /usr/bin/rados
kernel:         3.16.2-301.fc21.x86_64
runlevel:       N 5
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (4 frames)
 #0 send at /lib64/libpthread.so.0
 #1 reraise_fatal at global/signal_handler.cc:59
 #2 handle_fatal_signal at global/signal_handler.cc:105
 #3 sendto at /lib64/libpthread.so.0

Comment 1 Jeffrey C. Ollie 2014-09-21 03:39:12 UTC
Created attachment 939678 [details]
File: backtrace

Comment 2 Jeffrey C. Ollie 2014-09-21 03:39:12 UTC
Created attachment 939679 [details]
File: cgroup

Comment 3 Jeffrey C. Ollie 2014-09-21 03:39:13 UTC
Created attachment 939680 [details]
File: core_backtrace

Comment 4 Jeffrey C. Ollie 2014-09-21 03:39:15 UTC
Created attachment 939681 [details]
File: dso_list

Comment 5 Jeffrey C. Ollie 2014-09-21 03:39:16 UTC
Created attachment 939682 [details]
File: environ

Comment 6 Jeffrey C. Ollie 2014-09-21 03:39:17 UTC
Created attachment 939683 [details]
File: limits

Comment 7 Jeffrey C. Ollie 2014-09-21 03:39:18 UTC
Created attachment 939684 [details]
File: maps

Comment 8 Jeffrey C. Ollie 2014-09-21 03:39:19 UTC
Created attachment 939685 [details]
File: open_fds

Comment 9 Jeffrey C. Ollie 2014-09-21 03:39:20 UTC
Created attachment 939686 [details]
File: proc_pid_status

Comment 10 Jeffrey C. Ollie 2014-09-21 03:39:21 UTC
Created attachment 939687 [details]
File: var_log_messages

Comment 11 Boris Ranto 2014-10-31 02:13:09 UTC
*** Bug 1157192 has been marked as a duplicate of this bug. ***

Comment 12 Boris Ranto 2014-10-31 12:23:54 UTC
FYI: I suspect that this is related to bz1146967. It appears to be an Intel TSX instruction that causes the SIGILL.

Comment 13 Boris Ranto 2014-10-31 13:14:34 UTC
This appears to be a glibc issue with Intel TSX instructions -> reassigning. Copying from bz1157192 which seems to be related to this:

On shutdown, the program calls rados_shutdown(), which calls the appropriate destructors. In particular, it calls ~RWLock(), which issues pthread_rwlock_unlock(). This causes the program to receive a SIGILL signal. Debugging with gdb shows that the instruction responsible is xend [1], which is an Intel TSX instruction.

I am no expert in this matter, but it seems that the issue in [2] is not fully resolved yet. I've looked at the patches there, and xbegin seems to be explicitly disabled, unlike xend. Maybe we need to explicitly disable xend in the code as well?

[1] layout asm in gdb shows this as the crashing line:

>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1146967
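
To check whether a given machine advertises TSX at all, a small C probe can read CPUID leaf 7 (subleaf 0), where EBX bit 4 reports HLE and bit 11 reports RTM. This is a diagnostic sketch only, not part of glibc or Ceph. It assumes GCC or Clang on x86 (for `<cpuid.h>` and `__get_cpuid_count`), and the function names `cpuid7_ebx`, `cpu_has_hle`, `cpu_has_rtm` and `print_tsx_support` are made up for this example.

```c
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Read EBX of CPUID leaf 7, subleaf 0.
 * Returns 0 if the CPU does not support that leaf. */
static unsigned int cpuid7_ebx(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ebx;
}
#else
/* Assumption: on non-x86 targets there is no TSX, so report no bits set. */
static unsigned int cpuid7_ebx(void) { return 0; }
#endif

/* EBX bit 4: Hardware Lock Elision (HLE). */
int cpu_has_hle(void) { return (int)((cpuid7_ebx() >> 4) & 1u); }

/* EBX bit 11: Restricted Transactional Memory (RTM: xbegin/xend/xabort). */
int cpu_has_rtm(void) { return (int)((cpuid7_ebx() >> 11) & 1u); }

/* Hypothetical helper that prints both flags on one line. */
void print_tsx_support(void) {
    printf("HLE: %d, RTM: %d\n", cpu_has_hle(), cpu_has_rtm());
}
```

Calling `print_tsx_support()` from `main` prints both flags. On Linux the same information appears as `hle` and `rtm` in the flags line of /proc/cpuinfo.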

Comment 14 David Anderson 2014-12-04 22:05:24 UTC
FYI, this is actually a bug in Ceph: it's unlocking an already unlocked pthread_rwlock_t, which invokes undefined behavior. It was reported upstream in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc devs decided to not make the crash more graceful as it would slow down the unlock path for correct programs.

I'm attaching a minimal repro C file in case you care to reproduce, but afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX or glibc directly.

Comment 15 David Anderson 2014-12-04 22:06:33 UTC
Created attachment 964858 [details]
Reproduction of the Ceph xend crash

Note the undefined behavior of unlocking an unlocked rwlock.

Comment 16 Carlos O'Donell 2014-12-04 22:10:49 UTC
(In reply to David Anderson from comment #14)
> FYI, this is actually a bug in Ceph: it's unlocking an already unlocked
> pthread_rwlock_t, which invokes undefined behavior. It was reported upstream
> in https://sourceware.org/bugzilla/show_bug.cgi?id=17561 , and the glibc
> devs decided to not make the crash more graceful as it would slow down the
> unlock path for correct programs.
> 
> I'm attaching a minimal repro C file in case you care to reproduce, but
> afaict the linked bug is actually just a bug in Ceph, nothing to do with TSX
> or glibc directly.

Thanks David. I'm assigning this back to ceph for them to investigate the potential double unlock.

Comment 17 David Anderson 2014-12-04 22:23:37 UTC
Turns out the Ceph folks fixed it upstream 3 days ago: http://tracker.ceph.com/issues/10085 . From the bug details, it looks like they're planning to publish this in a point release, so it should eventually find its way into the Fedora package.

Comment 18 David Anderson 2014-12-04 22:25:03 UTC
And if you care to patch this in Fedora ahead of the upstream release, `git diff 42c85e8 77deeaa` in the Ceph git repository will produce the minimum necessary patch to apply. It applies cleanly to the 0.87 Ceph source tree.

Comment 19 Jeffrey C. Ollie 2014-12-04 22:34:50 UTC
I added the patch David mentions to a private build of Ceph and it fixed my problem.  I'm not sure when upstream is going to release a 0.87.x point release so I think that it would be good to apply the patch before that.

Comment 20 Fedora Update System 2014-12-08 11:51:25 UTC
ceph-0.80.7-2.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/ceph-0.80.7-2.fc21

Comment 21 Fedora Update System 2014-12-12 04:05:04 UTC
Package ceph-0.80.7-2.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ceph-0.80.7-2.fc21'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-16519/ceph-0.80.7-2.fc21
then log in and leave karma (feedback).

Comment 22 Boris Ranto 2014-12-23 12:32:01 UTC
*** Bug 1170657 has been marked as a duplicate of this bug. ***

Comment 23 Fedora Update System 2014-12-23 18:32:19 UTC
ceph-0.80.7-2.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.