Bug 1306242
Summary: Monitor daemons are failing in 'void Thread::create(size_t)' with assert in common/Thread.cc: 129: FAILED assert(ret == 0), followed by SIGABRT
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RADOS
Version: 1.3.1
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Reporter: Vikhyat Umrao <vumrao>
Assignee: Kefu Chai <kchai>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: bhubbard, ceph-eng-bugs, dzafman, jbiao, jdurgin, kchai, ksquizza, linuxkidd, mbukatov, mhackett, sjust, skinjo, vumrao
Target Milestone: rc
Target Release: 1.3.3
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-07-05 11:29:35 UTC
Bug Blocks: 1319075
Description
Vikhyat Umrao
2016-02-10 11:53:16 UTC
So far the coredump looks as expected, with the exception of the number of threads.

(gdb) info threads
  Id     Target Id                           Frame
  32635  Thread 0x7f5fd8a94700 (LWP 323831) 0x00007f662a60db7d in poll () at ../sysdeps/unix/syscall-template.S:81

32635 is a lot of threads and looks too high to me based on the previous discussions about 2*(OSDs+clients). The following test case seems to hit a similar limit.

# cat threadmax.cc
#include <iostream>
#include <thread>
#include <vector>
#include <unistd.h>

volatile size_t threadno = 1;

void sleepy()
{
  std::cout << "Creating thread " << threadno << std::endl;
  threadno++;
  while(1)
    sleep(1);
}

int main(int argc, char** argv)
{
  size_t num_threads = 100000;
  std::vector<std::thread> threads;
  for (size_t i = 0; i < num_threads; i++)
    threads.push_back(std::thread(sleepy));
  for (auto& thread : threads)
    thread.join();
  return 0;
}

# ./threadmax
...
Creating thread 32743
Creating thread 32744
Creating thread 32745
Creating thread 32746
Creating thread 32747
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Aborted

I tried this on two RHEL7 systems and my F23 laptop with very similar results, around the mid to late 32K threads. Looking into it further...

> They are trying to understand why the pid_max value seems to trigger the behavior as the output below shows that the total system thread count came nowhere close to exceeding the values set in sysctl:
i don't think it triggers the behaviour.
the monitor creates a new connection (a simple messenger pipe) upon accepting each client, and every pipe comes with two pthreads, one for reading and one for writing. each pthread requires two VMAs: one for its stack (8 MB by default, as the maps below show; it also holds the thread's TCB), and a 4 KB guard page below the stack to detect stack overflows.
we can observe the maps allocated by threadmax using:
$ cat /proc/<pid>/maps
02575000-025a7000 rw-p 00000000 00:00 0 [heap]
7f52f25f3000-7f52f25f4000 ---p 00000000 00:00 0
7f52f25f4000-7f52f2df4000 rw-p 00000000 00:00 0 [stack:35774]
7f52f2df4000-7f52f2df5000 ---p 00000000 00:00 0
7f52f2df5000-7f52f35f5000 rw-p 00000000 00:00 0 [stack:35773]
7f52f35f5000-7f52f35f6000 ---p 00000000 00:00 0
7f52f35f6000-7f52f3df6000 rw-p 00000000 00:00 0 [stack:35772]
and by default the max_map_count for rhel is 65530, see
$ cat /proc/sys/vm/max_map_count
65530
so we can have less than 32765 threads if other limits are not breached
$ echo $((65530/2))
32765
the "other limits" are:
$ cat /proc/sys/kernel/threads-max  # the system-wide maximum number of threads
$ ulimit -s                         # the stack size
$ ulimit -v                         # the virtual memory allowed for the process
$ free                              # and the memory actually available in the system, of course
i don't think changing /proc/sys/kernel/pid_max will relieve the symptom here.
the stack size of the pthreads created by the simple messenger is controlled by the "ms_rwthread_stack_bytes" option (10k by default).
> So far the coredump looks as expected with the exception of number of threads.
the number 32747 matches the math in my last comment: 65530/2 = 32765, and the observed count is slightly lower because the code/data/bss segments and the heap of the executable also need some mapped areas.
the problem is more likely caused by persistent clients.
by inspecting the log (ceph-mon.css-a-ceph1-018.log.xz) attached in case 01578779, there were zillions of connections from different clients before the monitor ran out of memory mappings for creating the messenger threads.
$ grep waitlisting ceph-mon.css-a-ceph1-018.log | grep -v ' -' | head -n1
2016-02-12 17:51:33.588797 7f0a0c1ab700 5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 27 bytes epoch 5) v1
$ grep waitlisting ceph-mon.css-a-ceph1-018.log | grep -v ' -' | tail -n2
2016-02-12 17:52:46.905095 7f0a0c1ab700 5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 53 bytes epoch 5) v1
-10000> 2016-02-12 17:52:43.874836 7f0a0c9ac700 5 mon.css-a-ceph1-018@2(probing) e5 waitlisting message auth(proto 0 53 bytes epoch 5) v1
$ grep waitlisting ceph-mon.css-a-ceph1-018.log | grep -v ' -' | wc -l
64185
so, 64185 auth messages got waitlisted in merely one minute while the monitor was not readable; it was still probing at that moment.
and most (if not all) of the clients were initializing connections and trying to authorize themselves multiple times in a short span of time:
$ grep ' -' ceph-mon.css-a-ceph1-018.log | grep '10.202.32.42'
/// excerpt from the output
-229> 2016-02-12 17:52:46.887974 7f0210e93700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x2648c800 sd=16290 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622e460).accept sd=16290 10.202.32.42:43874/0
-228> 2016-02-12 17:52:46.888168 7f0210d92700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26488000 sd=16291 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622e1a0).accept sd=16291 10.202.32.42:43875/0
-226> 2016-02-12 17:52:46.888547 7f0210b90700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26495800 sd=16293 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622dc20).accept sd=16293 10.202.32.42:43876/0
-225> 2016-02-12 17:52:46.888758 7f0210a8f700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264b1000 sd=16294 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d960).accept sd=16294 10.202.32.42:43877/0
-224> 2016-02-12 17:52:46.888944 7f021098e700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264ac800 sd=16295 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d6a0).accept sd=16295 10.202.32.42:43878/0
-221> 2016-02-12 17:52:46.889122 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106132098 10.202.32.42:0/342214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (807903665 0 0) 0x264c0000 con 0x2622f220
-219> 2016-02-12 17:52:46.889132 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106124598 10.202.32.42:0/187214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1184137050 0 0) 0x264c8000 con 0x2622ef60
-215> 2016-02-12 17:52:46.889151 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106128293 10.202.32.42:0/265214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1589671855 0 0) 0x264c8240 con 0x2622e460
-209> 2016-02-12 17:52:46.889499 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106126148 10.202.32.42:0/219214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3129217306 0 0) 0x264e0000 con 0x2622e1a0
-206> 2016-02-12 17:52:46.889514 7f021058a700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264ba000 sd=16297 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622d120).accept sd=16297 10.202.32.42:43879/0
-205> 2016-02-12 17:52:46.889617 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106173048 10.202.32.42:0/1118214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3921764447 0 0) 0x264e0240 con 0x2622d6a0
-203> 2016-02-12 17:52:46.889713 7f020fd82700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264b5800 sd=16298 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622ce60).accept sd=16298 10.202.32.42:43880/0
-202> 2016-02-12 17:52:46.889798 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106115318 10.202.32.42:0/1214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3801932590 0 0) 0x264e0480 con 0x2622dc20
-200> 2016-02-12 17:52:46.889898 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106150003 10.202.32.42:0/681214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (2483536928 0 0) 0x264e06c0 con 0x2622d960
-198> 2016-02-12 17:52:46.889901 7f020fc81700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264f9000 sd=16299 :6789 s=0 pgs=0 cs=0 l=0 c=0x2622cba0).accept sd=16299 10.202.32.42:43881/0
-197> 2016-02-12 17:52:46.889920 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106120598 10.202.32.42:0/106214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3759780493 0 0) 0x264e0900 con 0x2622eca0
-195> 2016-02-12 17:52:46.890107 7f020fb80700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x264f4800 sd=16300 :6789 s=0 pgs=0 cs=0 l=0 c=0x261f1600).accept sd=16300 10.202.32.42:43882/0
-193> 2016-02-12 17:52:46.890474 7f020f97e700 1 -- 10.202.33.28:6789/0 >> :/0 pipe(0x26502000 sd=16302 :6789 s=0 pgs=0 cs=0 l=0 c=0x261f1080).accept sd=16302 10.202.32.42:43883/0
-192> 2016-02-12 17:52:46.890929 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106121688 10.202.32.42:0/128214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (1473882682 0 0) 0x264e0b40 con 0x2622cba0
-188> 2016-02-12 17:52:46.891068 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106194383 10.202.32.42:0/1530214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (3806264536 0 0) 0x264e0fc0 con 0x2622d120
-186> 2016-02-12 17:52:46.891237 7f0a0c1ab700 1 -- 10.202.33.28:6789/0 <== client.106127858 10.202.32.42:0/256214596 1 ==== auth(proto 0 53 bytes epoch 5) v1 ==== 83+0+0 (2652142099 0 0) 0x264e1200 con 0x2622ce60
10 connections from 10.202.32.42 in less than 1 second (17:52:46.887974 - 17:52:46.890474), and there are more of them from the same host. the monitor read the auth messages from these connections, then waitlisted them.
so questions:
1. why did mon.css-a-ceph1-018 fail to connect to its peers after it started? i suspect a connectivity problem; see the log from the same log file:
2016-02-12 17:51:33.587518 7f0a15ddc7c0 2 mon.css-a-ceph1-018@-1(probing) e5 init
2016-02-12 17:51:33.587600 7f0a15ddc7c0 0 mon.css-a-ceph1-018@-1(probing) e5 my rank is now 2 (was -1)
2016-02-12 17:51:33.588029 7f0a0a9a8700 0 -- 10.202.33.28:6789/0 >> 10.202.33.11:6789/0 pipe(0x5f2c800 sd=14 :0 s=1 pgs=0 cs=0 l=0 c=0x3a49600).fault
2016-02-12 17:51:33.588034 7f0a0a6a5700 0 -- 10.202.33.28:6789/0 >> 10.202.33.45:6789/0 pipe(0x5f31000 sd=17 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4aaa0).fault
2016-02-12 17:51:33.588033 7f0a0a8a7700 0 -- 10.202.33.28:6789/0 >> 10.202.33.12:6789/0 pipe(0x5f35800 sd=16 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4ad60).fault
2016-02-12 17:51:33.588107 7f0a0a4a3700 0 -- 10.202.33.28:6789/0 >> 10.202.33.46:6789/0 pipe(0x5f5a000 sd=19 :0 s=1 pgs=0 cs=0 l=0 c=0x3a4a7e0).fault
2. why did the client keep connecting to this monitor in such a short time period?
3. can the monitor detect and avoid this sort of DoS from its clients? well, this is a bigger problem we need to address.
thanks for gregsfortytwo's reminder. note to self: i need to check the latest code to see whether we have fixed this issue or not.
and also i found that the monitor failed to read the first bits from the connection after accepting it:

-1632> 2016-02-12 17:52:46.447852 7f02b118c700 2 -- 10.202.33.28:6789/0 >> 10.202.32.42:0/1847214596 pipe(0x236a4800 sd=15020 :6789 s=2 pgs=260 cs=1 l=1 c=0x23472ec0).reader couldn't read tag, (0) Success
-1631> 2016-02-12 17:52:46.447875 7f02b118c700 2 -- 10.202.33.28:6789/0 >> 10.202.32.42:0/1847214596 pipe(0x236a4800 sd=15020 :6789 s=2 pgs=260 cs=1 l=1 c=0x23472ec0).fault (0) Success

but the errno does not help here.

> 2. why the client just kept connecting to this monitor in such a short time period?

per the comment from https://access.redhat.com/support/cases/#/case/01578779?commentId=a0aA000000GXVUVIA5,

> * So in Cisco's case, number of clients would be RBD processes

can we get a debug-rados=20, debug-ms=10 log from one of the tireless rbd clients? for example:
- 10.202.32.41
- 10.202.32.42
- 10.202.32.44
- 10.202.32.54

vikhyat, could you help with the first and second question? thanks.

(In reply to Kefu Chai from comment #12)
> the stack size of the pthread created by simple messenger is set by the
> setting of "ms_rwthread_stack_bytes" (10k by default).
>
> > So far the coredump looks as expected with the exception of number of threads.
>
> the number of 32747 matches with the math i did in last comment. 65530/2 ~=
> 32747. as the code/data/bss segments and the heap of the executable also
> need some mapped areas.

Confirmed that Kefu is, as always, correct :)

# sysctl -w vm.max_map_count=131060
# cat /proc/sys/vm/max_map_count
131060
# ./threadmax 2>&1 | tail -5
Creating thread 65510
Creating thread 65511
Creating thread 65512
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

BZ 1306242 was opened for the crash seen in Tokyo. That BZ can be referenced to debug that crash.

Created attachment 1128865 [details]
Mon logs from mon ceph-mon.css-a-ceph1-001
I just got up out of bed because I suddenly realised this and bz1310522 are highly likely the same issue. They are both crashes due to exceeding vm.max_map_count because of the number of threads created (32500+). That is why tc_new() is failing: it can't allocate memory for the same reason that Thread::create() can't allocate memory. So, IMHO these are two crashes with the same cause, the huge number of threads being created. They were bound to be the same issue because I asked for them to be separated as different issues :P

This theory needs a second or even a third opinion, but I think it's sound. It also explains why they happen in quick succession, as all three of the monitors are suffering a "thread bomb" at around the same time.

*** Bug 1310522 has been marked as a duplicate of this bug. ***