Bug 1310354 - Ceph Monitor crashed with Caught signal (Aborted) in pthread_join
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Hardware: x86_64 Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: 1.3.3
Assigned To: Greg Farnum
Depends On:
Blocks: 1319075
Reported: 2016-02-20 16:22 EST by Mike Hackett
Modified: 2017-07-30 11:07 EDT (History)
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2016-02-22 15:02:11 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
mon log from scrash (15.88 MB, application/x-gzip)
2016-02-20 16:22 EST, Mike Hackett

External Trackers
Tracker ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 13421 None None None 2016-02-20 16:49 EST

Description Mike Hackett 2016-02-20 16:22:32 EST
Created attachment 1128864 [details]
mon log from scrash

Description of problem:
On 2/12, Cisco noticed a Monitor crash in their Tokyo production cluster. The stack trace is below:

2016-02-12 23:23:28.169430 7f88a84f97c0 -1 *** Caught signal (Aborted) **
 in thread 7f88a84f97c0

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mon() [0x903c52]
 2: (()+0xf130) [0x7f88a76ba130]
 3: (pthread_join()+0xa7) [0x7f88a76b3f27]
 4: (Thread::join(void**)+0x12) [0x7a1b22]
 5: (DispatchQueue::wait()+0x12) [0x8a39a2]
 6: (SimpleMessenger::wait()+0x9a6) [0x797e66]
 7: (main()+0x267c) [0x54e32c]
 8: (__libc_start_main()+0xf5) [0x7f88a60c0af5]
 9: /usr/bin/ceph-mon() [0x55d9c9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

I spoke with badone about this crash; it looks like it's possibly some sort of thread accounting problem, and the actual crash is in glibc's pthread_join.

Only a single monitor in this cluster encountered this issue; it occurred once, on 2/12, and has not recurred since.

Version-Release number of selected component (if applicable):

How reproducible:
Has not been reproduced

Additional info:

SOS report is located:
Comment 2 Mike Hackett 2016-02-20 16:48:50 EST
This looks to have been seen upstream on an OSD node in Giant:

Comment 4 Brad Hubbard 2016-02-21 23:01:54 EST
As the note suggests, it is not too hard to find out more information about this issue.

"NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this."

We set up the copy like so:

Install and register a RHEL 7.1 system.

$ subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-rhceph-1.3-calamari-rpms --enable=rhel-7-server-rhceph-1.3-installer-rpms --enable=rhel-7-server-rhceph-1.3-mon-rpms --enable=rhel-7-server-rhceph-1.3-osd-rpms --enable=rhel-7-server-rhceph-1.3-tools-rpms

$ yum install ceph-0.94.1-19.el7cp.x86_64 ceph-common-0.94.1-19.el7cp.x86_64 ceph-mon-0.94.1-19.el7cp.x86_64

Make sure you have the right glibc installed (based on the sosreport).

$ debuginfo-install -y glibc ceph

$ gdb `which ceph-mon`
(gdb) b main
Breakpoint 1 at 0x54bcb0: file ceph_mon.cc, line 205.
(gdb) r
Starting program: /usr/bin/ceph-mon 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, main (argc=1, argv=0x7fffffffe228) at ceph_mon.cc:205
205     {
Missing separate debuginfos, use: debuginfo-install gperftools-libs-2.4-7.el7.x86_64 leveldb-1.12.0-5.el7cp.x86_64 libunwind-1.1-5.el7_2.2.x86_64 nss-util- snappy-1.1.0-3.el7.x86_64
(gdb) set pag off

Note that we want to find pthread_join()+0xa7 and that 0xa7 in decimal (gdb default) is 167.

(gdb) disass /m pthread_join,pthread_join+167
Dump of assembler code from 0x7ffff719de80 to 0x7ffff719df27:
88          /* There is already somebody waiting for the thread.  */
89          result = EINVAL;
90        else
91          /* Wait for the child.  */
92          lll_wait_tid (pd->tid);
   0x00007ffff719df06 <pthread_join+134>:       mov    0x2d0(%rbx),%ecx
   0x00007ffff719df0c <pthread_join+140>:       test   %ecx,%ecx
   0x00007ffff719df0e <pthread_join+142>:       mov    %ecx,%edx
   0x00007ffff719df10 <pthread_join+144>:       je     0x7ffff719df2c <pthread_join+172>
   0x00007ffff719df12 <pthread_join+146>:       lea    0x2d0(%rbx),%rdi
   0x00007ffff719df19 <pthread_join+153>:       xor    %esi,%esi
   0x00007ffff719df1b <pthread_join+155>:       xor    %r10,%r10
   0x00007ffff719df1e <pthread_join+158>:       mov    $0xca,%rax
   0x00007ffff719df25 <pthread_join+165>:       syscall 
   0x00007ffff719df27 <pthread_join+167>:       cmpl   $0x0,(%rdi)  <---HERE
   0x00007ffff719df2a <pthread_join+170>:       jne    0x7ffff719df1e <pthread_join+158>

So we crashed when we tried to dereference pd->tid.

"pd" is constructed from the passed in threadid.

39      int
40      pthread_join (threadid, thread_return)
41           pthread_t threadid;
42           void **thread_return;
43      {
44        struct pthread *pd = (struct pthread *) threadid;

The key to this may be the following statement from the pthread_join(3) man page:

"If multiple threads simultaneously try to join with the same thread, the results are undefined."

Here's where we attempt the join.

133     int Thread::join(void **prval)
134     {
135       if (thread_id == 0) {
136         assert("join on thread that was never started" == 0);
137         return -EINVAL;
138       }
140       int status = pthread_join(thread_id, prval); <----------HERE
141       assert(status == 0);
142       thread_id = 0;
143       return status;
144     }

(gdb) l DispatchQueue::wait
220       dispatch_thread.create();
221       local_delivery_thread.create();
222     }
224     void DispatchQueue::wait()
225     {
226       local_delivery_thread.join();
227       dispatch_thread.join(); <----------HERE
228     }

An invalid thread_id alone shouldn't cause this, as pthread_join should return immediately (typically with ESRCH) under those circumstances.

6: (SimpleMessenger::wait()+0x9a6) [0x797e66]

(gdb) disass /m 0x797e66
537         dispatch_queue.wait();                     <--- HERE
   0x0000000000797e5a <+2458>:  lea    0x1c8(%rbx),%rdi
   0x0000000000797e61 <+2465>:  callq  0x8a3990 <DispatchQueue::wait()>

538         ldout(cct,10) << "wait: dispatch queue is stopped" << dendl;
   0x0000000000797e66 <+2470>:  mov    0xd0(%rbx),%rcx

That's this code.

    526 void SimpleMessenger::wait()
    527 {
    528   lock.Lock();
    529   if (!started) {
    530     lock.Unlock();
    531     return;
    532   }
    533   lock.Unlock();
    535   if(dispatch_queue.is_started()) {
    536     ldout(cct,10) << "wait: waiting for dispatch queue" << dendl;
    537     dispatch_queue.wait();
    538     ldout(cct,10) << "wait: dispatch queue is stopped" << dendl;
    539   }

Sam, any chance we are calling pthread_join from two threads simultaneously with the same thread_id?

The other possibility here is that we are working with a thread_id from a *failed* call to pthread_create. From that function's man page:

"On success, pthread_create() returns 0; on error, it returns an error number, and the contents of *thread are undefined."

Maybe we don't check the return code of pthread_create somewhere? AFAICT we check it in src/common/Thread.cc so perhaps the concurrency issue is more likely?
Comment 5 Greg Farnum 2016-02-22 14:09:33 EST
Okay, so the SimpleMessenger::wait() and DispatchQueue::wait() calls are pretty boring — in general, we start waiting as soon as we finish starting up. If we are in fact on the second thread join, that means we've already initiated shutdown (thanks for flagging that, Sam!), but I'm not seeing how we reached that conclusion.

My best guess at the moment is that something happened which caused the benign parent thread to get blamed for the failure — I believe we've seen that a couple of times before. If we've got a core dump we could try and dig through all the threads, but with this there's not much to go on and I think it's a false trail.
Comment 8 Greg Farnum 2016-02-22 15:02:58 EST
Looks like this was done deliberately by sending SIGABRT to the process, so this backtrace is exactly what we'd expect to see.
Comment 9 Brad Hubbard 2016-02-22 16:28:37 EST
(In reply to Greg Farnum from comment #8)
> Looks like this was done deliberately by sending SIGABRT to the process; so
> this backtrace is exactly what we'd expect to see.

Agreed, exactly what we would expect to see.
