Bug 1306242 - Monitor daemons are failing in 'void Thread::create(size_t) with assert in common/Thread.cc: 129: FAILED assert(ret == 0) followed by SIGABRT
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 1.3.3
Assignee: Kefu Chai
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Duplicates: 1310522
Depends On:
Blocks: 1319075
 
Reported: 2016-02-10 11:53 UTC by Vikhyat Umrao
Modified: 2022-03-13 14:12 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-05 11:29:35 UTC
Embargoed:


Attachments
Mon logs from mon ceph-mon.css-a-ceph1-001 (2.21 MB, application/x-gzip)
2016-02-20 21:52 UTC, Mike Hackett


Links
Red Hat Issue Tracker RHCEPH-3708 (last updated 2022-03-13 14:12:51 UTC)

Description Vikhyat Umrao 2016-02-10 11:53:16 UTC
Description of problem:
----------------------------------------

Monitor daemons are failing in 'void Thread::create(size_t)' with an assert in common/Thread.cc: 129: FAILED assert(ret == 0), followed by SIGABRT.


 0> 2016-02-04 09:56:46.290098 7f570b89b700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f570b89b700 time 2016-02-04 09:56:46.289345
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7b5fc5]
 2: (Thread::create(unsigned long)+0x8a) [0x7a1b0a]
 3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x798d7f]
 4: (Accepter::entry()+0x365) [0x893855]
 5: (()+0x7df5) [0x7f5734f4edf5]
 6: (clone()+0x6d) [0x7f5733a311ad]

..................
..................

log_file /var/log/ceph/ceph-mon.css-a-ceph1-001.log
--- end dump of recent events ---
2016-02-04 09:56:46.324485 7f570b89b700 -1 *** Caught signal (Aborted) **
 in thread 7f570b89b700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
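
For context, a minimal standalone sketch of how this kind of assert fires (this is not the Ceph source; it only assumes glibc/Linux, where pthread_create() returns EAGAIN once the process can no longer create threads, so an assert on the return value aborts with SIGABRT):

// sketch only -- not Ceph code: keep creating idle threads until
// pthread_create() fails, then assert on the return value the way
// common/Thread.cc does, which aborts the process (SIGABRT).
#include <cassert>
#include <cstdio>
#include <pthread.h>
#include <unistd.h>

static void* idle(void*) {
    for (;;)
        sleep(1);              // keep every thread alive so the count only grows
    return nullptr;
}

int main() {
    for (unsigned long n = 1; ; n++) {
        pthread_t tid;
        int ret = pthread_create(&tid, nullptr, idle, nullptr);
        if (ret != 0)
            fprintf(stderr, "thread %lu: pthread_create() returned %d\n", n, ret);
        assert(ret == 0);      // fails like common/Thread.cc: 129, raising SIGABRT
    }
    return 0;
}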


Version-Release number of selected component (if applicable):
-----------------------------------------------------------------------------------------------

Red Hat Ceph Storage 1.3 (hammer based)
ceph-mon-0.94.1-19.el7cp.x86_64   

https://access.redhat.com/support/cases/#/case/01578779?commentId=a0aA000000GAY23IAH
https://access.redhat.com/support/cases/#/case/01578779?commentId=a0aA000000GAbCtIAL
The occurrences from Dec 4 were actually on firefly:
- It was seen in firefly as well.

Comment 10 Brad Hubbard 2016-02-16 04:10:38 UTC
So far the coredump looks as expected, with the exception of the number of threads.

(gdb) info threads
  Id   Target Id         Frame 
  32635 Thread 0x7f5fd8a94700 (LWP 323831) 0x00007f662a60db7d in poll () at ../sysdeps/unix/syscall-template.S:81

32635 is a lot of threads and looks too high to me based on the previous discussions about 2*(OSDs+clients).

The following test case seems to hit a similar limit.

# cat threadmax.cc
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
#include <unistd.h>

// counter of how many threads have started (atomic, so the increments from
// the worker threads don't race)
std::atomic<size_t> threadno(1);

void sleepy() {
    std::cout << "Creating thread "
        << threadno << std::endl;
    threadno++;

    // keep the thread alive so the total only ever grows
    while(1)
        sleep(1);
}

int main(int argc, char** argv) {

    // ask for far more threads than any of the limits will allow
    size_t num_threads = 100000;

    std::vector<std::thread> threads;
    for (size_t i = 0; i < num_threads; i++)
        threads.push_back(std::thread(sleepy));
    for(auto& thread : threads)
        thread.join();

    return 0;
}

# ./threadmax
...
Creating thread 32743
Creating thread 32744
Creating thread 32745
Creating thread 32746
Creating thread 32747
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Aborted

I tried this on two RHEL7 systems and my F23 laptop with very similar results, all failing in the mid-to-high 32-thousands of threads. Looking into it further...

Comment 11 Kefu Chai 2016-02-16 05:59:45 UTC
> They are trying to understand why the pid_max value seems to trigger the behavior, as the output below shows that the total system thread count came nowhere close to exceeding the values set in sysctl:

i don't think it triggers the behaviour.

the monitor's (simple) messenger spawns a new pipe upon accepting a new connection from a client, and every pipe comes with two pthreads, one for reading and one for writing. each pthread requires two VMAs: one for the stack (8 MB by default) plus the thread's TCB, and another for the 4K guard page (red zone) placed next to the stack to detect stack overflows.
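
as an illustration of those two mappings (a sketch only, not Ceph code; it assumes glibc on Linux, where g++ already defines _GNU_SOURCE so the pthread_getattr_np() extension is available), a thread can report its own stack and guard sizes, which correspond to the [stack:<tid>] region and the adjacent ---p page in the /proc/<pid>/maps excerpt below:

// sketch only: print the stack and guard sizes glibc gives a new thread;
// the stack mapping plus its guard page are the two VMAs per thread.
#include <cstdio>
#include <pthread.h>

static void* report(void*) {
    pthread_attr_t attr;
    size_t stack = 0, guard = 0;
    pthread_getattr_np(pthread_self(), &attr);   // GNU extension
    pthread_attr_getstacksize(&attr, &stack);    // 8 MB with the default ulimit -s
    pthread_attr_getguardsize(&attr, &guard);    // 4 KB (one page)
    pthread_attr_destroy(&attr);
    printf("stack: %zu bytes, guard: %zu bytes\n", stack, guard);
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, report, nullptr);
    pthread_join(t, nullptr);
    return 0;
}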

we can observe the maps allocated by threadmax using:
$ cat /proc/<pid>/maps 

02575000-025a7000 rw-p 00000000 00:00 0                                  [heap]
7f52f25f3000-7f52f25f4000 ---p 00000000 00:00 0
7f52f25f4000-7f52f2df4000 rw-p 00000000 00:00 0                          [stack:35774]
7f52f2df4000-7f52f2df5000 ---p 00000000 00:00 0
7f52f2df5000-7f52f35f5000 rw-p 00000000 00:00 0                          [stack:35773]
7f52f35f5000-7f52f35f6000 ---p 00000000 00:00 0
7f52f35f6000-7f52f3df6000 rw-p 00000000 00:00 0                          [stack:35772]

and by default the max_map_count for rhel is 65530, see

$ cat /proc/sys/vm/max_map_count
65530

so we can have at most roughly 32765 threads, if other limits are not breached first:
$ echo $((65530/2))
32765

the "other limits" are 

$ cat /proc/sys/kernel/threads-max # the max number of threads for a process
$ ulimit -s # the stack size 
$ ulimit -v # and the virtual memory of allowed for the user
$ free # and the available virtual memory in the system, of course.

i don't think changing /proc/sys/kernel/pid_max relieves the symptom here.
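
to make that arithmetic concrete, here is a rough sketch (not Ceph code; it only assumes the Linux procfs layout and roughly two new mappings, stack plus guard page, per additional thread) that reads vm.max_map_count and the process's current mapping count to estimate how many more threads could still be created:

// rough sketch: estimate remaining thread headroom from vm.max_map_count
// and the number of mappings this process already has.
#include <fstream>
#include <iostream>
#include <string>

static long line_count(const char* path) {
    std::ifstream in(path);
    std::string line;
    long n = 0;
    while (std::getline(in, line))
        ++n;                                     // one line per mapping in /proc/self/maps
    return n;
}

int main() {
    long max_maps = 0;
    std::ifstream limit("/proc/sys/vm/max_map_count");
    limit >> max_maps;                           // 65530 by default on RHEL
    long used = line_count("/proc/self/maps");
    std::cout << "max_map_count: " << max_maps
              << ", mappings in use: " << used
              << ", ~threads still possible: " << (max_maps - used) / 2
              << std::endl;
    return 0;
}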

Comment 12 Kefu Chai 2016-02-16 07:07:40 UTC
the stack size of the pthreads created by the simple messenger is controlled by the "ms_rwthread_stack_bytes" setting (10k by default).
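
for illustration only (a sketch of how a configurable stack size like this typically takes effect, not the actual SimpleMessenger code), the requested size would be applied through pthread_attr_setstacksize() before pthread_create(); note that glibc rejects requests below PTHREAD_STACK_MIN (16 KB on x86_64), so very small values have to be clamped:

// sketch only -- not Ceph code: apply a configurable stack size to a new
// thread via pthread attributes, clamping to the glibc minimum.
#include <cstdio>
#include <limits.h>    // PTHREAD_STACK_MIN
#include <pthread.h>
#include <unistd.h>

static void* worker(void*) {
    sleep(1);
    return nullptr;
}

int main() {
    size_t requested = 10 * 1024;                 // hypothetical small stack request
    if (requested < (size_t)PTHREAD_STACK_MIN)
        requested = PTHREAD_STACK_MIN;            // clamp to the glibc minimum

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    int ret = pthread_attr_setstacksize(&attr, requested);
    printf("setstacksize(%zu) = %d\n", requested, ret);

    pthread_t t;
    ret = pthread_create(&t, &attr, worker, nullptr);
    printf("pthread_create = %d\n", ret);
    pthread_attr_destroy(&attr);
    if (ret == 0)
        pthread_join(t, nullptr);
    return 0;
}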

> So far the coredump looks as expected with the exception of number of threads.

the number of 32747 matches the math i did in the last comment: 65530/2 = 32765, and the observed count is slightly lower because the code/data/bss segments, the heap and the shared libraries of the executable also need some mapped areas.


the problem is more likely caused by a persistent (constantly reconnecting) client.

by inspecting the log (ceph-mon.css-a-ceph1-018.log.xz) attached in case 01578779, there were zillions of connections from different clients before the monitor ran out of memory mappings for the pipes serving them.

$ grep waitlisting ceph-mon.css-a-ceph1-018.log|less | grep -v ' -' | head -n1
2016-02-12 17:51:33.588797 7f0a0c1ab700  5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 27 bytes epoch 5) v1
$ grep waitlisting ceph-mon.css-a-ceph1-018.log|less | grep -v ' -' | tail -n2
2016-02-12 17:52:46.905095 7f0a0c1ab700  5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 53 bytes epoch 5) v1
-10000> 2016-02-12 17:52:43.874836 7f0a0c9ac700  5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 53 bytes epoch 5) v1
$ grep waitlisting ceph-mon.css-a-ceph1-018.log | grep -v ' -' | wc -l
   64185

so, 64185 auth messages got waitlisted in merely 1 minute while the monitor was not readable; it was still probing at that moment.

and the clients (some, if not all, of them) were initiating connections and trying to authorize themselves multiple times within a short span of time:

$ grep ' -' ceph-mon.css-a-ceph1-018.log | grep '10.202.32.42'
/// excerpt from the output 
  -229> 2016-02-12 17:52:46.887974 7f0210e93700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x2648c800 sd=16290 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622e460).accept sd=16290 10.202.32.42:43874/0
  -228> 2016-02-12 17:52:46.888168 7f0210d92700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26488000 sd=16291 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622e1a0).accept sd=16291 10.202.32.42:43875/0
  -226> 2016-02-12 17:52:46.888547 7f0210b90700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26495800 sd=16293 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622dc20).accept sd=16293 10.202.32.42:43876/0
  -225> 2016-02-12 17:52:46.888758 7f0210a8f700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264b1000 sd=16294 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d960).accept sd=16294 10.202.32.42:43877/0
  -224> 2016-02-12 17:52:46.888944 7f021098e700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264ac800 sd=16295 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d6a0).accept sd=16295 10.202.32.42:43878/0
  -221> 2016-02-12 17:52:46.889122 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106132098 10.202.32.42:0/342214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (807903665 0 0) 0x264c0000 con 0x2622f220
  -219> 2016-02-12 17:52:46.889132 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106124598 10.202.32.42:0/187214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1184137050 0 0) 0x264c8000 con 0x2622ef60
  -215> 2016-02-12 17:52:46.889151 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106128293 10.202.32.42:0/265214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1589671855 0 0) 0x264c8240 con 0x2622e460
  -209> 2016-02-12 17:52:46.889499 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106126148 10.202.32.42:0/219214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3129217306 0 0) 0x264e0000 con 0x2622e1a0
  -206> 2016-02-12 17:52:46.889514 7f021058a700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264ba000 sd=16297 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d120).accept sd=16297 10.202.32.42:43879/0
  -205> 2016-02-12 17:52:46.889617 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106173048 10.202.32.42:0/1118214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3921764447 0 0) 0x264e0240 con 0x2622d6a0
  -203> 2016-02-12 17:52:46.889713 7f020fd82700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264b5800 sd=16298 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622ce60).accept sd=16298 10.202.32.42:43880/0
  -202> 2016-02-12 17:52:46.889798 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106115318 10.202.32.42:0/1214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3801932590 0 0) 0x264e0480 con 0x2622dc20
  -200> 2016-02-12 17:52:46.889898 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106150003 10.202.32.42:0/681214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (2483536928 0 0) 0x264e06c0 con 0x2622d960
  -198> 2016-02-12 17:52:46.889901 7f020fc81700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264f9000 sd=16299 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622cba0).accept sd=16299 10.202.32.42:43881/0
  -197> 2016-02-12 17:52:46.889920 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106120598 10.202.32.42:0/106214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3759780493 0 0) 0x264e0900 con 0x2622eca0
  -195> 2016-02-12 17:52:46.890107 7f020fb80700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264f4800 sd=16300 :6789 s=0 pgs=0 cs=0 l=0 c=0x261f1600).accept sd=16300 10.202.32.42:43882/0
  -193> 2016-02-12 17:52:46.890474 7f020f97e700  1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26502000 sd=16302 :6789 s=0 pgs=0 cs=0 l=0 c=0x261f1080).accept sd=16302 10.202.32.42:43883/0
  -192> 2016-02-12 17:52:46.890929 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106121688 10.202.32.42:0/128214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1473882682 0 0) 0x264e0b40 con 0x2622cba0
  -188> 2016-02-12 17:52:46.891068 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106194383 10.202.32.42:0/1530214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3806264536 0 0) 0x264e0fc0 con 0x2622d120
  -186> 2016-02-12 17:52:46.891237 7f0a0c1ab700  1 -- 10.202.33.28:6789/0 <== client.106127858 10.202.32.42:0/256214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (2652142099 0 0) 0x264e1200 con 0x2622ce60

10 connections from 10.202.32.42 in less than 1 second (17:52:46.887974 to 17:52:46.890474), and there are more of them from the same host. the monitor read the auth messages from these connections, then waitlisted them.

so questions:

1. why did mon.css-a-ceph1-018 fail to connect to its peers after it started? i suspect it's a connectivity problem; see the log from the same log file:

2016-02-12 17:51:33.587518 7f0a15ddc7c0  2 mon.css-a-ceph1-018@-1(probing) e5 init
2016-02-12 17:51:33.587600 7f0a15ddc7c0  0 mon.css-a-ceph1-018@-1(probing) e5  my rank is now 2 (was -1)
2016-02-12 17:51:33.588029 7f0a0a9a8700  0 -- 10.202.33.28:6789/0 >> 10.202.33.11:6789/0 pipe(0x5f2c800 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x3a49600).fault
2016-02-12 17:51:33.588034 7f0a0a6a5700  0 -- 10.202.33.28:6789/0 >> 10.202.33.45:6789/0 pipe(0x5f31000 sd=17 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4aaa0).fault
2016-02-12 17:51:33.588033 7f0a0a8a7700  0 -- 10.202.33.28:6789/0 >> 10.202.33.12:6789/0 pipe(0x5f35800 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4ad60).fault
2016-02-12 17:51:33.588107 7f0a0a4a3700  0 -- 10.202.33.28:6789/0 >> 10.202.33.46:6789/0 pipe(0x5f5a000 sd=19 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4a7e0).fault

2. why did the client keep connecting to this monitor so many times in such a short period?
3. can the monitor detect and avoid this sort of DoS from its clients? well, that is a bigger problem we need to address.

thanks to gregsfortytwo for the reminder. note to self: i need to check the latest code to see whether we have fixed this issue or not.

Comment 13 Kefu Chai 2016-02-16 07:49:28 UTC
also, i found that the monitor failed to read the first bytes from the connection after accepting it.

 -1632> 2016-02-12 17:52:46.447852 7f02b118c700  2 -- 10.202.33.28:6789/0 >> 10.202.32.42:0/1847214596 pipe(0x236a4800 sd=15020 :6789 s=2 pgs=260 cs=1 l=1 c=0x23472ec0).reader couldn't read tag, (0) Success
 -1631> 2016-02-12 17:52:46.447875 7f02b118c700  2 -- 10.202.33.28:6789/0 >> 10.202.32.42:0/1847214596 pipe(0x236a4800 sd=15020 :6789 s=2 pgs=260 cs=1 l=1 c=0x23472ec0).fault (0) Success

but the errno, (0) Success, does not help here.

> 2. why the client just kept connecting to this monitor in such a short time period?

per the comment from https://access.redhat.com/support/cases/#/case/01578779?commentId=a0aA000000GXVUVIA5, 

>  * So in Cisco's case, number of clients would be RBD processes 

can we get a debug-rados=20, debug-ms=10 log from one of the tireless rbd clients? for example:
- 10.202.32.41
- 10.202.32.42
- 10.202.32.44
- 10.202.32.54

Comment 14 Kefu Chai 2016-02-16 13:23:24 UTC
vikhyat, could you help with the first and second question? thanks.

Comment 17 Brad Hubbard 2016-02-17 00:51:46 UTC
(In reply to Kefu Chai from comment #12)
> the stack size of the pthread created by simple messenger is set by the
> setting of "ms_rwthread_stack_bytes" (10k by default).
> 
> > So far the coredump looks as expected with the exception of number of threads.
> 
> the number of 32747 matches with the math i did in last comment. 65530/2 ~=
> 32747. as the code/data/bss segments and the heap of the executable also
> need some mapped areas.
> 

Confirmed that Kefu is, as always, correct :)

# sysctl -w vm.max_map_count=131060
# cat /proc/sys/vm/max_map_count
131060
# ./threadmax 2>&1|tail -5
Creating thread 65510
Creating thread 65511
Creating thread 65512
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

Comment 26 Mike Hackett 2016-02-20 21:51:42 UTC
BZ 1306242 was opened for the crash seen in Tokyo. That BZ can be referenced to debug that crash.

Comment 27 Mike Hackett 2016-02-20 21:52:48 UTC
Created attachment 1128865 [details]
Mon logs from mon ceph-mon.css-a-ceph1-001

Comment 39 Brad Hubbard 2016-02-24 13:16:47 UTC
I just got up out of bed because I suddenly realised this and bz1310522 are highly likely the same issue.

They are both crashes due to exceeding vm.max_map_count because of the number of threads created (32500+). That is why tc_new() is failing: it cannot allocate memory for the same reason that Thread::create() cannot. So, IMHO, these are two crashes with the same cause: the huge number of threads being created.

They were bound to be the same issue because I asked for them to be separated as different issues :P

This theory needs a second or even a third opinion, but I think it's sound. It also explains why they happen in quick succession, as all three of the monitors are suffering a "thread bomb" at around the same time.
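
To see why both allocation paths fail at once, here is a sketch (not tcmalloc or Ceph code; it assumes only the documented mmap(2) behaviour that, once a process reaches vm.max_map_count, creating another mapping fails with ENOMEM) that exhausts the map count directly:

// sketch only: create distinct anonymous 4 KB mappings until the kernel
// refuses. once vm.max_map_count is reached, mmap() fails with ENOMEM --
// the same underlying failure behind both tc_new() and Thread::create().
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
    unsigned long n = 0;
    for (;;) {
        // alternate the protection so adjacent mappings are not merged into
        // one VMA; every call then really consumes a map entry
        int prot = (n % 2) ? PROT_READ : PROT_READ | PROT_WRITE;
        void* p = mmap(nullptr, 4096, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("mmap #%lu failed: %s\n", n + 1, strerror(errno));
            break;
        }
        ++n;
    }
    printf("created %lu extra mappings before failing\n", n);
    return 0;
}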

Comment 40 Vikhyat Umrao 2016-02-25 06:11:55 UTC
*** Bug 1310522 has been marked as a duplicate of this bug. ***

